arXiv · 2d ago

This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

Hye Sun Yun, Geetika Kapoor, Michael Mackert, Ramez Kouzy, Wei Xu, Junyi Jessy Li, Byron C. Wallace


Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 4/10

Category: paper

Topics

rag, evaluation

Opportunity Brief

Build an evaluation framework that tests how sensitive medical QA systems are to the way a patient phrases a question. Developers could use it to benchmark their existing RAG pipelines for answer consistency across different framings of the same question.

Suggested repo: med-judge

"Does your medical RAG system change answers based on phrasing? Find out."

Estimated effort: 30h
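
A minimal sketch of the consistency check described in the brief, under stated assumptions. It is not the paper's method. The names `normalize`, `framing_consistency`, `sycophantic_stub`, and `FRAMINGS` are hypothetical, the yes/no/unclear labeling is a deliberately crude stand-in for a real answer judge, and "drug X" / "condition Y" are placeholders rather than real data.

```python
from collections import Counter
from typing import Callable, Dict, List


def normalize(answer: str) -> str:
    """Map a free-text answer to a coarse label (hypothetical, simplistic scheme)."""
    text = answer.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return "unclear"


def framing_consistency(answer_fn: Callable[[str], str], framings: List[str]) -> Dict:
    """Ask the same clinical question under several framings and report
    the share of answers that agree with the most common label."""
    if not framings:
        raise ValueError("at least one framing is required")
    labels = [normalize(answer_fn(question)) for question in framings]
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return {
        "labels": labels,
        "majority": majority_label,
        "consistency": majority_count / len(labels),
    }


# Placeholder framings of one underlying question: neutral, positive, negative.
FRAMINGS = [
    "Does drug X reduce symptoms of condition Y?",
    "Drug X works for condition Y, right?",
    "Drug X does not really work for condition Y, does it?",
]


def sycophantic_stub(question: str) -> str:
    """Stand-in for a QA pipeline; it simply echoes the question's presupposition."""
    if "does not" in question.lower():
        return "No, the evidence does not support it."
    return "Yes, it is effective."


if __name__ == "__main__":
    print(framing_consistency(sycophantic_stub, FRAMINGS))
```

In practice `answer_fn` would wrap a call to the RAG pipeline under test, and `normalize` would likely be replaced by a stronger judge, since prefix matching misses hedged or multi-part answers. A consistency score below 1.0 flags questions whose answer shifts with phrasing.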