Chen Minqi, Zhongqi Yue, Shihao Zhang, Yun Xu, Peng Wu, Kaixiang Xu, Zeyi Huang, Hanwang Zhang
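Create an optimized kernel for 2D/3D rotary position embedding (RoPE) that avoids vector-level split/merge overhead: instead of splitting each head vector into per-axis chunks, rotating each chunk separately, and concatenating the results, the kernel rotates the whole vector in a single pass. A Triton or CUDA implementation could substantially speed up long-context vision-language models.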
Suggested repo: rope-fast
"Native 3D RoPE kernels that finally stop the overhead bottleneck."
Estimated effort: 30h
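
To make the split/merge overhead concrete, here is a minimal pure-Python reference sketch, not a tuned kernel. It assumes a layout that the description does not specify: the head dimension is divided into three contiguous sections (time, height, width), each section rotates interleaved pairs with its own frequency schedule, and `sections`, `build_pair_table`, and the other names are hypothetical. Real models may order pairs and frequencies differently. The baseline slices, rotates, and concatenates; the fused version walks every pair once and looks up which position axis it belongs to, which is roughly what each thread of a Triton or CUDA kernel would do.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate interleaved pairs (vec[2i], vec[2i+1]) by pos * base**(-2i/d)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(d // 2):
        ang = pos * base ** (-2.0 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_3d_split_merge(vec, pos, sections, base=10000.0):
    """Baseline: slice one chunk per axis, rotate it, then concatenate."""
    out, start = [], 0
    for axis, d in enumerate(sections):
        chunk = vec[start:start + d]             # split (extra copy)
        out += rope_1d(chunk, pos[axis], base)   # rotate, then merge (extra copy)
        start += d
    return out

def build_pair_table(sections, base=10000.0):
    """Precompute (axis, inverse frequency) for every pair of the full vector.

    Assumed layout: contiguous sections, each with its own frequency schedule.
    """
    table = []
    for axis, d in enumerate(sections):
        for i in range(d // 2):
            table.append((axis, base ** (-2.0 * i / d)))
    return table

def rope_3d_fused(vec, pos, table):
    """Single pass over all pairs: no slicing and no concatenation."""
    out = [0.0] * len(vec)
    for p, (axis, inv_freq) in enumerate(table):
        ang = pos[axis] * inv_freq               # pick t, h, or w per pair
        c, s = math.cos(ang), math.sin(ang)
        a, b = vec[2 * p], vec[2 * p + 1]
        out[2 * p] = a * c - b * s
        out[2 * p + 1] = a * s + b * c
    return out

# Example: head dim 16 split as (4, 4, 8) across (t, h, w); all sizes are even.
sections = (4, 4, 8)
vec = [0.1 * (k + 1) for k in range(16)]
pos = (3.0, 5.0, 7.0)                            # (t, h, w) position of one token
table = build_pair_table(sections)
baseline = rope_3d_split_merge(vec, pos, sections)
fused = rope_3d_fused(vec, pos, table)
```

Both functions return the same rotated vector under this layout, so the baseline can serve as a correctness oracle when testing a real kernel. In a GPU version, the pair table would typically live in registers or shared memory, and the loop over `p` would map to the kernel's parallel lanes.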