Evaluation + Benchmarking

10.0

Create an open-source evaluation suite that probes LLM bias using a hierarchical taxonomy across multiple task types. This tool should demonstrate how alignment wrappers can be bypassed using task-switching, helping developers audit models more effectively.

emergingimplementation gap

llmevaluationbenchmarkingethicsalignment

Signals (2)

arXiv2h ago

Position: Science of AI Evaluation Requires Item-level Benchmark Data

arXiv1d ago

Evaluation + Benchmarking

Signals (2)

Position: Science of AI Evaluation Requires Item-level Benchmark Data

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

Evaluation + Benchmarking

Signals (2)

Position: Science of AI Evaluation Requires Item-level Benchmark Data

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments