hypedar

AI trend radar for developers. Catch emerging papers, repos, and discussions before the hype peaks.

Trending now

Agents + Workflow + Automation (67) · Math + Games (56) · MCP + Agents (53)



LLM + Benchmarking

Create a sensitivity testing suite for 'LLM-as-a-Judge'. This tool identifies where LLM evaluators fail by systematically inserting 'semantic needles'—subtle perturbations like negations—into comparison documents.

emerging · implementation gap
evaluation · research · benchmarking · llm
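
The core loop is simple to prototype: score a document against itself, insert one meaning-flipping needle, re-score, and check whether the judge's similarity rating actually drops. Below is a minimal sketch in Python, assuming a judge that returns a score in [0, 1]. All names here (negate_first_claim, score_similarity, needle_sensitivity) are hypothetical, and the token-overlap scorer is only a toy stand-in for a real LLM judge call; conveniently, it is exactly the kind of surface-level judge this test is meant to flag.

import re

def negate_first_claim(text: str) -> str:
    """Needle: flip the first 'is'/'are' to 'is not'/'are not'."""
    return re.sub(r"\b(is|are)\b", r"\1 not", text, count=1)

def score_similarity(doc_a: str, doc_b: str) -> float:
    """Toy stand-in judge: Jaccard overlap of lowercase tokens.
    In practice, replace this with a real LLM-as-a-Judge call."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def needle_sensitivity(doc: str, perturb=negate_first_claim,
                       min_drop: float = 0.2) -> dict:
    """Score (doc, doc) vs. (doc, perturbed doc). A meaning-flipping
    needle should cut the similarity score noticeably; if it does
    not, the judge is insensitive to semantics."""
    baseline = score_similarity(doc, doc)
    perturbed = score_similarity(doc, perturb(doc))
    drop = baseline - perturbed
    return {"baseline": baseline, "perturbed": perturbed,
            "drop": drop, "judge_failed": drop < min_drop}

if __name__ == "__main__":
    doc = "The proposed method is faster than the baseline on long documents."
    print(needle_sensitivity(doc))
    # The overlap judge barely reacts to the negation (drop ~0.09),
    # so it is flagged as failed: a semantically faithful judge
    # should penalize a flipped claim far more.

A full suite would sweep many needle types (negations, entity swaps, number edits, hedges) across a corpus and report per-perturbation failure rates rather than a single pass/fail.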

Signals (6)

arXiv · 10h ago

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

arXiv · 1d ago

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

arXiv · 1d ago

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

arXiv · 10h ago

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

arXiv · 10h ago

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

HuggingFace · 1d ago

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard