Yongchang Hao, Lili Mou
Create a high-performance C++/CUDA implementation of constrained acceptance speculative sampling for common open-source LLMs. This helps developers achieve higher throughput without sacrificing distribution accuracy.
Suggested repo: cactus-decode
"Accelerate your LLM inference with flexible, distribution-aware speculative sampling."
Estimated effort: 80h