Create an evaluation harness that tests an LLM's self-correction ability when it is confronted with conflicting reasoning traces. Build an open-source 'metacognition suite' that lets developers grade how well models handle their own uncertainty.
Suggested repo: metacognit
"Benchmark if your model actually thinks or just guesses."
Estimated effort: 50h
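A minimal sketch of what such a harness could look like, in Python. Everything here is an illustrative assumption rather than part of the original idea: the model is exposed as a `query_model(prompt) -> str` callable, each test case pairs a question with two conflicting reasoning traces (only one sound), and the model is asked to commit to a final answer on an `ANSWER:` line.

```python
"""Sketch of a self-correction eval harness for conflicting reasoning traces.

Assumptions (hypothetical, not from the original card): `query_model`,
`ConflictCase`, and the 'ANSWER:' output convention are all illustrative.
"""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConflictCase:
    question: str
    trace_a: str          # one reasoning trace (e.g., the sound one)
    trace_b: str          # a conflicting, flawed trace
    correct_answer: str   # gold answer implied by the sound trace


def build_prompt(case: ConflictCase) -> str:
    # Present both traces and ask the model to resolve the conflict itself.
    return (
        f"Question: {case.question}\n\n"
        f"Reasoning trace 1:\n{case.trace_a}\n\n"
        f"Reasoning trace 2:\n{case.trace_b}\n\n"
        "These traces disagree. Identify which one is flawed, explain the "
        "flaw, then give your final answer on a line starting with 'ANSWER:'."
    )


def extract_answer(response: str) -> str:
    # Pull the committed answer; an empty string means the model never committed.
    for line in response.splitlines():
        if line.strip().upper().startswith("ANSWER:"):
            return line.split(":", 1)[1].strip()
    return ""


def run_suite(query_model: Callable[[str], str],
              cases: List[ConflictCase]) -> dict:
    # Score only the final committed answer, not the critique text.
    resolved = sum(
        1 for case in cases
        if extract_answer(query_model(build_prompt(case))).lower()
        == case.correct_answer.lower()
    )
    return {
        "cases": len(cases),
        "resolved": resolved,
        "accuracy": resolved / len(cases) if cases else 0.0,
    }
```

Scoring on the final committed answer rather than the free-form critique keeps the harness model-agnostic: any chat or completion API can be wrapped as `query_model`, and "handled its uncertainty well" reduces to "picked the answer the sound trace supports."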