Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu
Build a standardized rubric-scoring library that wraps LLM evaluations in a transparent validation framework, so that developers can define complex business-logic rubrics for their own benchmarks.
Suggested repo: rubric-eval
"Stop guessing your LLM's performance with fuzzy benchmarks."
Estimated effort: 60h
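To make the idea concrete, here is a minimal sketch of what a rubric-scoring core might look like. Nothing here comes from an existing `rubric-eval` package: the names `Criterion`, `Rubric`, `RubricReport`, and `judge` are hypothetical, the weighted-mean aggregation is an assumption, and the judges are deterministic string checks standing in for what would presumably be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name, a weight, and a judge returning a score in [0, 1]."""
    name: str
    weight: float
    judge: Callable[[str], float]  # hypothetical hook; could wrap an LLM call


@dataclass
class RubricReport:
    per_criterion: Dict[str, float]  # score per criterion, kept for transparency
    total: float                     # weighted mean of the scores, in [0, 1]


class Rubric:
    def __init__(self, criteria: List[Criterion]) -> None:
        # Validate the rubric definition up front.
        if not criteria:
            raise ValueError("a rubric needs at least one criterion")
        if any(c.weight <= 0 for c in criteria):
            raise ValueError("criterion weights must be positive")
        names = [c.name for c in criteria]
        if len(set(names)) != len(names):
            raise ValueError("criterion names must be unique")
        self.criteria = criteria

    def score(self, response: str) -> RubricReport:
        per_criterion: Dict[str, float] = {}
        weighted_sum = 0.0
        for c in self.criteria:
            s = float(c.judge(response))
            # Reject out-of-range judge output instead of silently clamping it.
            if not 0.0 <= s <= 1.0:
                raise ValueError(f"judge for {c.name!r} returned {s}, expected [0, 1]")
            per_criterion[c.name] = s
            weighted_sum += c.weight * s
        total = weighted_sum / sum(c.weight for c in self.criteria)
        return RubricReport(per_criterion, total)


# Illustrative business-logic rubric for a refund-policy answer (assumed example).
refund_rubric = Rubric([
    Criterion("mentions_refund_window", 2.0, lambda r: 1.0 if "30 days" in r else 0.0),
    Criterion("no_legal_advice", 1.0, lambda r: 0.0 if "lawsuit" in r.lower() else 1.0),
    Criterion("concise", 1.0, lambda r: 1.0 if len(r.split()) <= 50 else 0.5),
])

report = refund_rubric.score("You can return the item within 30 days for a full refund.")
print(report.per_criterion)
print(report.total)
```

Returning the per-criterion scores alongside the total is what would keep the evaluation transparent: a low total can be traced to the specific rubric line that failed rather than read off as a single opaque number.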