Han Jiang, Susu Zhang, Xiaoyuan Yi, Xing Xie, Ziang Xiao
Build a data-centric evaluation platform that allows developers to drill down into item-level performance of models. Moving beyond aggregated scores is critical for high-stakes AI deployment.
Suggested repo: item-eval
"Go beyond aggregate scores: diagnostic AI evaluation."
Estimated effort: 50h
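
To illustrate what drilling down to item level could look like, here is a minimal Python sketch. The records, model names, and the helpers `aggregate_scores`, `item_breakdown`, and `disagreements` are all hypothetical and chosen for illustration only; they are not part of any existing `item-eval` code. The sketch shows two models with identical aggregate accuracy that nonetheless fail on different items.

```python
from collections import defaultdict

# Hypothetical item-level results: (model, item_id, correct), where correct is 0 or 1.
RESULTS = [
    ("model_a", "q1", 1), ("model_a", "q2", 1), ("model_a", "q3", 0), ("model_a", "q4", 0),
    ("model_b", "q1", 0), ("model_b", "q2", 1), ("model_b", "q3", 1), ("model_b", "q4", 0),
]


def aggregate_scores(results):
    """Return mean accuracy per model (the usual leaderboard number)."""
    per_model = defaultdict(list)
    for model, _item, correct in results:
        per_model[model].append(correct)
    return {model: sum(vals) / len(vals) for model, vals in per_model.items()}


def item_breakdown(results):
    """Return {item_id: {model: correct}} so each item can be inspected on its own."""
    table = defaultdict(dict)
    for model, item, correct in results:
        table[item][model] = correct
    return dict(table)


def disagreements(results):
    """Return the sorted item ids on which the models do not all agree."""
    return sorted(
        item
        for item, by_model in item_breakdown(results).items()
        if len(set(by_model.values())) > 1
    )


if __name__ == "__main__":
    print(aggregate_scores(RESULTS))  # both models score 0.5 overall
    print(disagreements(RESULTS))     # items where the models behave differently
```

In this toy data both models score 0.5 overall, yet they disagree on `q1` and `q3`, which is the kind of difference an aggregate score hides.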