Leen AlQadi, Ahmed Alzubaidi, Mohammed Alyafeai, Hamza Alobeidli, Maitha Alhammadi, Shaikha Alsuwaidi, Omar Alkaabi, Basma El Amel Boussaha, Hakim Hacid
Create an automated Arabic LLM evaluation pipeline that performs multi-model verification, setting a new standard for localized language benchmarks.
Suggested repo: QIMMA-eval
"Finally, an Arabic benchmark that actually measures quality, not just scale."
Estimated effort: 60h
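The multi-model verification step described above can be sketched as a majority vote across several judge models. This is a minimal illustration, not the authors' implementation: the `Judge` type, the stand-in judge functions, and the majority-vote rule are all hypothetical placeholders for real LLM calls.

```python
from typing import Callable, List

# A judge takes (question, answer) and returns True if it approves the answer.
Judge = Callable[[str, str], bool]

def multi_model_verify(question: str, answer: str, judges: List[Judge]) -> bool:
    """Accept an answer only if a strict majority of judge models approve it."""
    votes = sum(judge(question, answer) for judge in judges)
    return votes * 2 > len(judges)

# Stand-in judges: in a real pipeline each would query a different LLM.
lenient = lambda q, a: True           # hypothetical: always approves
strict = lambda q, a: len(a) > 10     # hypothetical: requires a substantive answer
moderate = lambda q, a: a != ""       # hypothetical: rejects empty answers

print(multi_model_verify("ما عاصمة فرنسا؟", "باريس هي عاصمة فرنسا.", [lenient, strict, moderate]))
```

In a production pipeline, each judge would wrap an API call to a distinct model, and disagreements could be routed to human review rather than resolved by simple majority.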