arXiv · 2d ago

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Yongchang Hao, Lili Mou

Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 8/10

Category: paper

Topics: inference, llm, sampling, speculative-decoding

Opportunity Brief

Create a high-performance C++/CUDA implementation of constrained acceptance speculative sampling for common open-source LLMs. This helps developers achieve higher throughput without sacrificing distribution accuracy.

Suggested repo: cactus-decode

"Accelerate your LLM inference with flexible, distribution-aware speculative sampling."

Estimated effort: 80h
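
For orientation, here is a minimal reference sketch of the standard (unconstrained) speculative sampling acceptance step that work in this area builds on. It is not the Cactus algorithm: the constrained acceptance rule is not described in this brief, so nothing below should be read as the authors' method. The function name `speculative_accept` and its parameters are hypothetical, chosen only for illustration, and the sketch is in Python rather than the C++/CUDA the brief proposes.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng=random):
    """Standard speculative sampling acceptance for one drafted token.

    Hypothetical helper for illustration (not from the Cactus paper).
    p_target and q_draft are probability lists over the same vocabulary;
    draft_token is an index that was sampled from q_draft.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # q > 0 because the token was sampled from q_draft

    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True

    # On rejection, resample from the residual distribution max(p - q, 0).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(pt - qt, 0.0) for pt, qt in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

With this rule the emitted token follows the target distribution exactly, which is the baseline any relaxed or constrained acceptance scheme would be compared against. A rejection can only occur when the target probability of the drafted token is below its draft probability, so the residual weights always have a positive sum at that point.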