Create an evaluation harness specifically for the medical reasoning capabilities of open-source (OSS) models. This fills the gap left by general-purpose benchmarks, which are not aware of medical context.
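
As a starting point, the core loop of such a harness might look like the minimal sketch below. Everything in it is an assumption made for illustration, not a settled design: the multiple-choice item format, the names `MedicalCase`, `build_prompt`, `parse_choice`, and `evaluate`, and the idea that a model is any callable mapping a prompt string to an output string. Accuracy is used only as a placeholder metric.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class MedicalCase:
    """One multiple-choice item. The field layout is a hypothetical format."""
    case_id: str
    context: str              # clinical vignette shown before the question
    question: str
    options: Dict[str, str]   # option letter -> option text
    answer: str               # gold option letter


def build_prompt(case: MedicalCase) -> str:
    """Render a case as a plain-text prompt that includes its context."""
    lines = [case.context, "", case.question]
    lines += [f"{letter}. {text}" for letter, text in sorted(case.options.items())]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(output: str, valid: Iterable[str]) -> Optional[str]:
    """Return the first standalone capital letter that is a valid option.

    This is a deliberately naive parser: an output that begins with the
    article "A" would be read as option A. A real harness would likely
    need a stricter answer format.
    """
    valid_set = set(valid)
    for match in re.finditer(r"\b([A-Z])\b", output):
        if match.group(1) in valid_set:
            return match.group(1)
    return None


def evaluate(model: Callable[[str], str], cases: List[MedicalCase]) -> dict:
    """Run the model on every case and report accuracy plus per-case results."""
    results = []
    for case in cases:
        raw_output = model(build_prompt(case))
        predicted = parse_choice(raw_output, case.options.keys())
        results.append({
            "case_id": case.case_id,
            "predicted": predicted,
            "correct": predicted == case.answer,
        })
    n_correct = sum(r["correct"] for r in results)
    accuracy = n_correct / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}
```

Keeping the model behind a plain callable means any OSS model can be plugged in through a thin wrapper, and the harness itself can be unit-tested with a stub function before a real model is attached.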