sixhobbits
Develop an automated evaluation suite that tests multi-turn dialogue coherence, specifically speaker attribution. The tool should identify instances where models conflate identities in complex chat logs, and its test cases should serve as a standard benchmarking dataset.
Suggested repo: speakerGuard
"Detect when your LLM forgets who said what before your users do."
Estimated effort: 40h
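
A minimal sketch of what one attribution check in such a suite might look like, assuming a chat log is a list of (speaker, utterance) pairs and the model under test is wrapped in a function that takes the log and a quote and returns a speaker name. Every name here (`Probe`, `build_probes`, `evaluate`, and the two stand-in "models") is hypothetical and for illustration only; none of it comes from an existing speakerGuard codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance); an assumed log format


@dataclass
class Probe:
    quote: str     # utterance the model is asked to attribute
    expected: str  # speaker who actually said it


def build_probes(log: List[Turn]) -> List[Probe]:
    """Create one probe per utterance that appears exactly once in the log.

    Repeated utterances are skipped because their attribution is ambiguous.
    """
    counts: Dict[str, int] = {}
    for _, text in log:
        counts[text] = counts.get(text, 0) + 1
    return [Probe(text, speaker) for speaker, text in log if counts[text] == 1]


def evaluate(log: List[Turn], answer_fn: Callable[[List[Turn], str], str]) -> dict:
    """Ask answer_fn who said each probed quote and record the mismatches."""
    probes = build_probes(log)
    conflations = []
    for probe in probes:
        got = answer_fn(log, probe.quote).strip().lower()
        if got != probe.expected.lower():
            # (quote, true speaker, speaker the model named)
            conflations.append((probe.quote, probe.expected, got))
    total = len(probes)
    accuracy = (total - len(conflations)) / total if total else 0.0
    return {"total": total, "accuracy": accuracy, "conflations": conflations}


# Hypothetical stand-ins for a real model call.
def last_speaker_baseline(log: List[Turn], quote: str) -> str:
    """Always blames the most recent speaker: a deliberately bad baseline."""
    return log[-1][0]


def oracle(log: List[Turn], quote: str) -> str:
    """Looks the quote up directly: an upper bound on attribution accuracy."""
    for speaker, text in log:
        if text == quote:
            return speaker
    return ""


log: List[Turn] = [
    ("Alice", "I booked the flight for Tuesday."),
    ("Bob", "I will bring the projector."),
    ("Alice", "The budget is capped at 500 dollars."),
    ("Carol", "I disagree with the Tuesday date."),
]

baseline_report = evaluate(log, last_speaker_baseline)
oracle_report = evaluate(log, oracle)
print(baseline_report["accuracy"], oracle_report["accuracy"])
```

In a real suite, `answer_fn` would wrap a call to the model being benchmarked, and the `conflations` list would be the raw material for the dataset: each entry records a quote, its true speaker, and the speaker the model named instead.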