Yousra Fettach, Guillaume Bied, Hannu Toivonen, Tijl De Bie
Build a benchmark suite that evaluates LLMs on subjective, culturally grounded tasks such as humor. Developers should create an extensible framework that ingests varied datasets (e.g., jokes, puns, sarcasm) and measures "humor-alignment" across different models.
Suggested repo: nanoEval-Humor
"Is your model actually funny or just regurgitating datasets?"
Estimated effort: 20h
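A minimal sketch of what the extensible harness could look like: a dataset abstraction, a model interface, and an alignment score comparing model ratings to human ratings. All names here (`HumorDataset`, `Item`, `alignment`, the toy data, and the trivial length-based baseline) are hypothetical illustrations, not part of any existing nanoEval-Humor codebase; a real suite would plug in actual LLM API calls and curated, human-annotated datasets.

```python
# Hypothetical sketch of an extensible humor-alignment benchmark.
# Real implementations would swap the toy dataset and baseline model
# for curated corpora and actual LLM calls.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Item:
    text: str            # the joke / pun / sarcastic remark
    human_score: float   # mean human funniness rating in [0, 1]


class HumorDataset(ABC):
    """Base class: subclass per corpus (jokes, puns, sarcasm, ...)."""
    name: str

    @abstractmethod
    def items(self) -> list[Item]: ...


class ToyJokesDataset(HumorDataset):
    # Invented examples for illustration only.
    name = "toy-jokes"

    def items(self) -> list[Item]:
        return [
            Item("I told my wife she was drawing her eyebrows too high. "
                 "She looked surprised.", 0.8),
            Item("The quarterly report is due on Friday.", 0.1),
        ]


class Model(ABC):
    """Base class: wrap each LLM behind a uniform rating interface."""
    name: str

    @abstractmethod
    def rate(self, text: str) -> float:
        """Return the model's funniness rating in [0, 1]."""


class LengthBaseline(Model):
    # Trivial stand-in: longer text -> "funnier". Replace with an LLM wrapper.
    name = "length-baseline"

    def rate(self, text: str) -> float:
        return min(len(text) / 100.0, 1.0)


def alignment(model: Model, dataset: HumorDataset) -> float:
    """Humor-alignment = 1 - mean absolute error vs. human ratings."""
    items = dataset.items()
    mae = sum(abs(model.rate(i.text) - i.human_score) for i in items) / len(items)
    return 1.0 - mae


score = alignment(LengthBaseline(), ToyJokesDataset())
print(f"{LengthBaseline.name}: humor-alignment = {score:.2f}")
```

Keeping models and datasets behind abstract interfaces is what makes the suite extensible: adding a new corpus or a new LLM means writing one subclass, and every model/dataset pair can then be scored on a shared leaderboard.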