arXiv, 9h ago

Agent psychometrics: Task-level performance prediction in agentic coding benchmarks

Chris Ge, Daria Kryvosheieva, Daniel Fried, Uzay Girit, Kaivalya Hariharan


Analysis

Viral velocity: low

Implementation gap: YES

Novelty: 6/10

Category: paper

Topics: agents, code-generation, evaluation

Opportunity Brief

Build a psychometric 'difficulty map' for current coding benchmarks such as SWE-bench, so that developers can see *why* models fail on specific types of tasks.
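As a rough illustration of what a first-pass difficulty map could compute, here is a minimal sketch. It assumes pass/fail outcomes are available per agent and task, and it scores each task by the smoothed log-odds of failure. This is a crude stand-in for a fitted psychometric model, not the paper's method; the function name `difficulty_map`, the input layout, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def difficulty_map(results):
    """Rank tasks from hardest to easiest.

    `results` is assumed to look like {agent_name: {task_id: passed_bool}}.
    Returns {task_id: difficulty}, where difficulty is the smoothed
    log-odds that an agent fails the task (higher means harder).
    """
    passes = defaultdict(int)
    attempts = defaultdict(int)
    for task_outcomes in results.values():
        for task_id, passed in task_outcomes.items():
            attempts[task_id] += 1
            passes[task_id] += int(passed)

    difficulty = {}
    for task_id, n in attempts.items():
        # Add-one smoothing keeps the log-odds finite when every agent
        # passes or every agent fails a task.
        fail_rate = (n - passes[task_id] + 1) / (n + 2)
        difficulty[task_id] = math.log(fail_rate / (1 - fail_rate))

    # Hardest tasks first.
    return dict(sorted(difficulty.items(), key=lambda kv: kv[1], reverse=True))


# Toy data, invented purely for illustration.
toy_results = {
    "agent_a": {"task_1": True, "task_2": False, "task_3": False},
    "agent_b": {"task_1": True, "task_2": True, "task_3": False},
}
ranking = difficulty_map(toy_results)
```

Explaining *why* a task is hard would need more than this, for example relating these scores to task features, which the sketch does not attempt.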

Suggested repo: BenchPsy

"Map out where your coding agent is failing and why."

Estimated effort: 15h