Hye Sun Yun, Geetika Kapoor, Michael Mackert, Ramez Kouzy, Wei Xu, Junyi Jessy Li, Byron C. Wallace
Create an evaluation framework that tests how sensitive medical QA systems are to prompt phrasing. Developers can use it to benchmark their existing RAG pipelines for answer consistency.
Suggested repo: med-judge
"Does your medical RAG system change answers based on phrasing? Find out."
Estimated effort: 30h
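
As a starting point for the consistency benchmark described above, here is a minimal sketch of one possible metric: the fraction of paraphrase pairs for which the system returns the same answer. Everything in it is an illustrative assumption rather than part of the original proposal: the names `normalize`, `consistency_score`, `evaluate`, and `toy_system` are hypothetical, the exact-match comparison is a deliberately crude stand-in for whatever answer-equivalence check a real framework would use, and `toy_system` is a stub in place of an actual RAG pipeline.

```python
from itertools import combinations
from typing import Callable, Dict, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def consistency_score(answers: List[str]) -> float:
    """Fraction of answer pairs that match exactly after normalization.

    Returns 1.0 when there are fewer than two answers (nothing to disagree with).
    """
    normalized = [normalize(a) for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def evaluate(
    answer_fn: Callable[[str], str],
    paraphrase_sets: Dict[str, List[str]],
) -> Dict[str, float]:
    """Query the system with every phrasing of each question and score agreement.

    `answer_fn` is any callable mapping a prompt to an answer string
    (hypothetical interface; wrap your own pipeline to match it).
    """
    return {
        question_id: consistency_score([answer_fn(p) for p in phrasings])
        for question_id, phrasings in paraphrase_sets.items()
    }


def toy_system(prompt: str) -> str:
    """Stub pipeline (assumption): answers 'Yes' only when the prompt contains 'safe'."""
    return "Yes" if "safe" in prompt.lower() else "No"


if __name__ == "__main__":
    # Three phrasings of the same hypothetical question.
    paraphrases = {
        "q1": [
            "Is drug X safe in pregnancy?",
            "Can drug X be taken while pregnant?",
            "Is it SAFE to take drug X when pregnant?",
        ]
    }
    print(evaluate(toy_system, paraphrases))
```

Free-text medical answers rarely match word for word, so a real implementation would likely replace the exact-match step in `consistency_score` with a semantic or label-based comparison; the pairwise structure can stay the same.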