Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan
Build a psychometric 'difficulty map' for current coding benchmarks like SWE-bench. Help developers understand *why* models fail specific types of tasks.
Suggested repo: BenchPsy
"Map out where your coding agent is failing and why."
Estimated effort: 15h
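
One possible starting point for the difficulty map, offered as a minimal sketch and not as the authors' method: fit a ridge-penalized Rasch (one-parameter IRT) model to a pass/fail matrix of models by tasks, so that each task gets a difficulty score and each model an ability score. The choice of the Rasch model, the function name `fit_rasch`, the hyperparameters, and the toy matrix are all assumptions made for illustration; none of them come from the project description.

```python
import math


def fit_rasch(results, n_iters=500, lr=0.1, reg=0.01):
    """Fit a ridge-penalized Rasch (1PL) model by gradient ascent.

    results[m][t] is 1 if model m solved task t, else 0.
    Returns (abilities, difficulties). Hypothetical helper for illustration.
    """
    n_models = len(results)
    n_tasks = len(results[0])
    ability = [0.0] * n_models
    difficulty = [0.0] * n_tasks

    for _ in range(n_iters):
        # Start each gradient with the ridge penalty, which keeps estimates
        # finite even when a task is solved by every model or by none.
        grad_a = [-reg * a for a in ability]
        grad_d = [-reg * d for d in difficulty]
        for m in range(n_models):
            for t in range(n_tasks):
                # Rasch model: P(solve) = sigmoid(ability - difficulty)
                p = 1.0 / (1.0 + math.exp(-(ability[m] - difficulty[t])))
                err = results[m][t] - p
                grad_a[m] += err
                grad_d[t] -= err
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]

    return ability, difficulty


# Toy pass/fail matrix (made up): rows are models, columns are tasks.
toy_results = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
abilities, difficulties = fit_rasch(toy_results)

# Rank tasks from easiest to hardest by estimated difficulty.
ranking = sorted(range(len(difficulties)), key=lambda t: difficulties[t])
print("tasks, easiest to hardest:", ranking)
```

A real difficulty map would likely need more than a single scalar per task, for example task features or categories that help explain the failures, but a per-task difficulty estimate is a plausible first layer to build on.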