Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, Ulf Schlichtmann
Build a caching kernel that enables recomputation-free attention: by decoupling KV caches from the specific input contexts in which they were computed, cached entries can be reused across requests, substantially reducing the latency of long-context LLM serving. A sketch of the caching idea follows below.
Suggested repo: KVPack
"True zero-recomputation context switching for high-throughput LLM inference."
Estimated effort: 120h
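
A minimal sketch of the core idea in Python, not the KVPack design: KV tensors are keyed by the content hash of a token chunk rather than by the chunk's position in one specific prompt, so a chunk shared across requests is computed once and served from cache afterwards. All names here (`ChunkKVCache`, `build_kv`, `kv_fn`) and the chunk granularity are illustrative assumptions; a real kernel would additionally need to re-apply positional encodings (e.g. RoPE re-rotation) to cached keys, which is elided.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

import torch

CHUNK = 64  # tokens per cacheable chunk (assumed granularity)

KVFn = Callable[[List[int]], Tuple[torch.Tensor, torch.Tensor]]


def chunk_key(tokens: List[int]) -> str:
    """Content hash of a token chunk -- independent of surrounding context."""
    return hashlib.sha256(str(tokens).encode()).hexdigest()


class ChunkKVCache:
    """Maps content-hashed token chunks to their (K, V) tensors."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[torch.Tensor, torch.Tensor]] = {}

    def get_or_compute(self, tokens: List[int], kv_fn: KVFn
                       ) -> Tuple[torch.Tensor, torch.Tensor]:
        key = chunk_key(tokens)
        if key not in self._store:      # cache miss: run the model once
            self._store[key] = kv_fn(tokens)
        return self._store[key]         # cache hit: zero recomputation


def build_kv(tokens: List[int], cache: ChunkKVCache, kv_fn: KVFn
             ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Assemble the full K/V for a prompt from per-chunk cache entries."""
    ks, vs = [], []
    for i in range(0, len(tokens), CHUNK):
        k, v = cache.get_or_compute(tokens[i : i + CHUNK], kv_fn)
        ks.append(k)
        vs.append(v)
    return torch.cat(ks, dim=0), torch.cat(vs, dim=0)


# Toy usage with a stand-in kv_fn that fakes per-token K/V of width 8.
def fake_kv(toks: List[int]) -> Tuple[torch.Tensor, torch.Tensor]:
    return torch.randn(len(toks), 8), torch.randn(len(toks), 8)


cache = ChunkKVCache()
k1, v1 = build_kv(list(range(200)), cache, fake_kv)  # cold: computes all chunks
k2, v2 = build_kv(list(range(200)), cache, fake_kv)  # warm: pure cache hits
assert torch.equal(k1, k2)  # identical tensors, so nothing was recomputed
```

Keying by chunk content rather than by prompt prefix is what makes the cache context-independent: the same document chunk hits the same entry even when it appears at different offsets in different prompts, which is the property the "zero-recomputation context switching" tagline refers to.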