Cristina Garbacea, Heran Wang, Chenhao Tan
Develop a framework that re-ranks standard LLM benchmarks based on user-provided preference weights. This lets researchers view model performance through the lens of specific user archetypes rather than a single flat average across tasks.
Suggested repo: pref-eval
"Benchmarks that actually care about your specific preferences."
Estimated effort: 40h
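The core re-ranking step could be sketched as a preference-weighted average over per-benchmark scores. The model names, benchmark names, and scores below are illustrative placeholders, not real results, and the function is an assumed minimal design rather than any existing pref-eval API:

```python
def rerank(scores, weights):
    """Rank models by the preference-weighted average of benchmark scores.

    scores:  {model: {benchmark: score}} -- per-benchmark results.
    weights: {benchmark: weight} -- user archetype preferences,
             normalized internally so only relative magnitudes matter.
    """
    total = sum(weights.values())
    norm = {b: w / total for b, w in weights.items()}
    agg = {
        model: sum(norm.get(bench, 0.0) * score
                   for bench, score in per_bench.items())
        for model, per_bench in scores.items()
    }
    # Highest weighted score first.
    return sorted(agg, key=agg.get, reverse=True)


# Hypothetical scores for two models on two benchmarks.
scores = {
    "model_a": {"reasoning": 0.9, "coding": 0.4},
    "model_b": {"reasoning": 0.6, "coding": 0.8},
}

# Flat average: model_b leads (0.70 vs 0.65).
print(rerank(scores, {"reasoning": 1.0, "coding": 1.0}))
# A reasoning-heavy archetype flips the ranking (0.80 vs 0.64).
print(rerank(scores, {"reasoning": 0.8, "coding": 0.2}))
```

The point of the example is that the same leaderboard data yields different orderings under different archetypes, which is the behavior the framework would surface.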