hypedar

AI trend radar for developers. Catch emerging papers, repos, and discussions before the hype peaks.

LLM + Benchmarking

Create a sensitivity-testing suite for LLM-as-a-Judge evaluators. The tool identifies where LLM evaluators fail by systematically inserting "semantic needles" (subtle perturbations such as negations) into comparison documents.
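As a rough illustration of the idea, the sketch below inserts one negation per sentence and records how far a judge's similarity score drops for each perturbed copy. Everything in it is an assumption made for illustration: the names `NEGATION_RULES`, `insert_needle`, `perturbed_variants`, `surface_judge`, and `sensitivity_report` are hypothetical, the regex rules are deliberately naive, and `surface_judge` is a character-overlap stand-in for a real LLM call rather than the method of any listed paper.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

# Hypothetical, deliberately naive negation rules (pattern, replacement).
NEGATION_RULES = [
    (r"\bis\b", "is not"),
    (r"\bare\b", "are not"),
    (r"\bcan\b", "cannot"),
    (r"\bwill\b", "will not"),
]


def insert_needle(sentence: str) -> Optional[str]:
    """Negate the first matching verb, or return None if no rule applies."""
    for pattern, replacement in NEGATION_RULES:
        perturbed, count = re.subn(pattern, replacement, sentence, count=1)
        if count:
            return perturbed
    return None


def perturbed_variants(document: str) -> List[Tuple[int, str]]:
    """Return (sentence_index, document copy with that one sentence negated)."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    variants = []
    for i, sentence in enumerate(sentences):
        needle = insert_needle(sentence)
        if needle is not None:
            copy = sentences[:i] + [needle] + sentences[i + 1:]
            variants.append((i, " ".join(copy)))
    return variants


def surface_judge(a: str, b: str) -> float:
    """Stand-in judge: character-level overlap in [0, 1].

    A real suite would replace this with a call to an LLM evaluator.
    """
    return SequenceMatcher(None, a, b).ratio()


def sensitivity_report(
    document: str, judge: Callable[[str, str], float]
) -> List[Tuple[int, float]]:
    """Score drop caused by each needle, relative to the unperturbed pair."""
    baseline = judge(document, document)
    return [
        (i, baseline - judge(document, variant))
        for i, variant in perturbed_variants(document)
    ]


if __name__ == "__main__":
    doc = "The drug is safe. Patients are monitored weekly. Results arrived."
    for index, drop in sensitivity_report(doc, surface_judge):
        print(f"sentence {index}: score drop {drop:.3f}")
```

A small score drop for a needle that reverses a sentence's meaning marks a position where the judge is insensitive. The overlap-based stand-in barely reacts to a negation, which is the kind of failure such a suite is meant to surface in a real evaluator.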

emerging · implementation gap

evaluation · research · benchmarking · llm

Signals (6)

arXiv · 10h ago

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

arXiv · 1d ago

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

arXiv · 1d ago

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

arXiv · 10h ago

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

arXiv · 10h ago

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

HuggingFace · 1d ago

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
