mackbrowne
Build a benchmark harness that measures how agent performance degrades as a function of compute cost, so researchers can quantify the 'intelligence-per-dollar' trade-off in complex reasoning loops.
Suggested repo: gambling-agent-bench
"When models go broke, how smart do they stay?"
Estimated effort: 20h
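
A minimal sketch of what the core of such a harness might look like, assuming the agent can be wrapped as a function that takes a dollar budget and returns a task score plus the amount actually spent. All names here (`BudgetResult`, `sweep_budgets`, `degradation`, `toy_agent`) are hypothetical and only illustrate the idea; the toy agent is a stand-in, not a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class BudgetResult:
    budget_usd: float  # spending cap given to the agent for this run
    score: float       # task score, assumed to lie in [0, 1]
    spent_usd: float   # what the agent reported spending

    @property
    def score_per_dollar(self) -> float:
        # "Intelligence per dollar"; zero if nothing was spent.
        return self.score / self.spent_usd if self.spent_usd > 0 else 0.0


def sweep_budgets(
    run_agent: Callable[[float], Tuple[float, float]],
    budgets: List[float],
) -> List[BudgetResult]:
    """Run the agent once per budget, from largest to smallest."""
    results = []
    for budget in sorted(budgets, reverse=True):
        score, spent = run_agent(budget)
        results.append(BudgetResult(budget, score, spent))
    return results


def degradation(results: List[BudgetResult]) -> List[float]:
    """Score lost by each run relative to the largest-budget run."""
    baseline = max(results, key=lambda r: r.budget_usd).score
    return [baseline - r.score for r in results]


def toy_agent(budget_usd: float) -> Tuple[float, float]:
    # Hypothetical stand-in for a real agent: the score saturates as the
    # budget grows, and the whole budget is always spent.
    score = budget_usd / (budget_usd + 1.0)
    return score, budget_usd


if __name__ == "__main__":
    runs = sweep_budgets(toy_agent, [0.5, 1.0, 3.0])
    for run, drop in zip(runs, degradation(runs)):
        print(f"${run.budget_usd:.2f}: score={run.score:.2f} "
              f"drop={drop:.2f} per-dollar={run.score_per_dollar:.2f}")
```

A real harness would replace `toy_agent` with a call into the agent under test and repeat each budget level several times, since a single run per budget says little about variance.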