Mengdan Zhu, Senhao Cheng, Liang Zhao
Build a lightweight plug-in for popular vision-language models (VLMs) that performs recursive latent visual decomposition before answering queries. The goal is to improve accuracy on complex spatial and multi-step visual reasoning tasks.
Suggested repo: latent-look
"Stop guessing visual context—decompose, look, and reason."
Estimated effort: 40h
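
A minimal sketch of the decompose, look, reason loop described above, under loud assumptions: every name here (gather, decompose_then_answer, the split/look/answer callables) is hypothetical and not taken from the original work, and plain strings stand in for the latent representations a real plug-in would presumably operate on. The VLM itself is replaced by toy stand-in functions so the control flow can be run in isolation.

```python
from typing import Callable, List

# Hypothetical interfaces; a real plug-in would back these with a VLM.
Split = Callable[[str], List[str]]        # query -> sub-queries ([] if atomic)
Look = Callable[[str, object], str]       # (atomic query, image) -> finding
Answer = Callable[[str, List[str]], str]  # (query, findings) -> final answer


def gather(query: str, image: object, split: Split, look: Look,
           max_depth: int = 2, depth: int = 0) -> List[str]:
    """Recursively decompose `query`; return findings in depth-first order."""
    # Stop splitting once the depth budget is used up.
    parts = split(query) if depth < max_depth else []
    if not parts:
        # Atomic query (or budget exhausted): inspect the image directly.
        return [look(query, image)]
    findings: List[str] = []
    for part in parts:
        findings.extend(gather(part, image, split, look, max_depth, depth + 1))
    return findings


def decompose_then_answer(query: str, image: object, split: Split, look: Look,
                          answer: Answer, max_depth: int = 2) -> str:
    """Collect findings for all sub-queries, then answer the original query."""
    return answer(query, gather(query, image, split, look, max_depth))


# Toy stand-ins so the sketch runs without any model.
def toy_split(q: str) -> List[str]:
    return [p.strip() for p in q.split(" and ")] if " and " in q else []


def toy_look(q: str, image: object) -> str:
    return f"seen({q})"


def toy_answer(q: str, findings: List[str]) -> str:
    return " | ".join(findings)


print(decompose_then_answer("cup left of plate and plate on table",
                            None, toy_split, toy_look, toy_answer))
# prints: seen(cup left of plate) | seen(plate on table)
```

The max_depth parameter is one simple way to keep the recursion bounded; whether the original method uses a fixed depth or some learned stopping rule is not stated in this card.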