Build a simple cross-modal retrieval system (image-to-text and text-to-image search) on top of a recent multimodal sentence-transformer model.
Suggested repo: clip-search-nano
"Find images, code, and text in one unified vector space."
Estimated effort: 20h
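
A minimal sketch of the retrieval core, assuming image and text embeddings already live in the same vector space. The source does not name a model, so the three-dimensional vectors below are hand-made placeholders; in a real build they would come from a multimodal encoder (for example a CLIP checkpoint loaded through the sentence-transformers library, which is an assumption here). The names VectorIndex, add, and search are hypothetical and exist only for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class VectorIndex:
    """Hypothetical brute-force index over items from any modality."""

    def __init__(self) -> None:
        # Each entry is (item_id, modality, unit-length embedding).
        self._items: List[Tuple[str, str, List[float]]] = []

    def add(self, item_id: str, modality: str, embedding: Sequence[float]) -> None:
        self._items.append((item_id, modality, normalize(embedding)))

    def search(
        self,
        query: Sequence[float],
        k: int = 3,
        modality: Optional[str] = None,
    ) -> List[Tuple[str, float]]:
        """Return up to k (item_id, cosine_score) pairs, best match first.

        If modality is given, only items of that modality are considered.
        """
        q = normalize(query)
        scored = [
            (item_id, sum(a * b for a, b in zip(q, vec)))
            for item_id, item_modality, vec in self._items
            if modality is None or item_modality == modality
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Placeholder embeddings (assumption: a real encoder would produce these).
index = VectorIndex()
index.add("cat.jpg", "image", [0.9, 0.1, 0.0])
index.add("dog.jpg", "image", [0.1, 0.9, 0.0])
index.add("a photo of a cat", "text", [0.8, 0.2, 0.1])
index.add("a photo of a dog", "text", [0.2, 0.8, 0.1])

# Text-to-image: a text query vector searched against image items only.
text_to_image = index.search([0.85, 0.15, 0.0], k=1, modality="image")

# Image-to-text: an image query vector searched against text items only.
image_to_text = index.search([0.1, 0.9, 0.0], k=1, modality="text")

print(text_to_image[0][0])
print(image_to_text[0][0])
```

Brute-force cosine search like this is usually adequate for a few thousand items. A larger collection would likely need an approximate nearest-neighbor index, which is outside the scope of this sketch.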