Evaluation + Reasoning | hypedar

Evaluation + Reasoning

Develop a robustness benchmark tool that tests whether an LLM solves a math problem correctly regardless of how the problem is expressed. This will expose fragility that current reasoning benchmarks do not measure.

emerging · implementation gap

reasoning · evaluation · geometry · agents
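
One way such a robustness check could be scored is sketched below. It is a minimal illustration, not an existing tool: the `Problem` type, the `robustness_report` function, the two metric names, and the `toy_solver` stand-in are all hypothetical, and a real harness would replace `toy_solver` with a call to the model under test. The idea is to compare plain per-variant accuracy with a stricter score that only credits a problem when every rewording of it is answered correctly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Problem:
    """One math problem written in several equivalent surface forms (hypothetical type)."""
    variants: Sequence[str]  # rewordings or alternative notations of the same problem
    answer: str              # canonical expected answer


def normalize(text: str) -> str:
    """Crude answer normalization: trim whitespace, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".")


def robustness_report(
    problems: Sequence[Problem], solve: Callable[[str], str]
) -> Dict[str, float]:
    """Score a solver on every variant of every problem.

    variant_accuracy: fraction of individual variants answered correctly.
    robust_accuracy:  fraction of problems whose variants are ALL answered correctly.
    """
    if not problems:
        raise ValueError("need at least one problem")
    total_variants = 0
    correct_variants = 0
    robust_problems = 0
    for problem in problems:
        expected = normalize(problem.answer)
        hits = [normalize(solve(v)) == expected for v in problem.variants]
        total_variants += len(hits)
        correct_variants += sum(hits)
        robust_problems += all(hits)
    return {
        "variant_accuracy": correct_variants / total_variants,
        "robust_accuracy": robust_problems / len(problems),
    }


def toy_solver(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): it only recognizes symbolic forms."""
    if "2 + 2" in prompt:
        return "4"
    if "3 * 3" in prompt:
        return "9"
    return "unknown"


problems = [
    Problem(("What is 2 + 2?", "What is the sum of two and two?"), "4"),
    Problem(("What is 3 * 3?", "Compute 3 * 3, that is, three squared."), "9"),
]
report = robustness_report(problems, toy_solver)
print(report)  # {'variant_accuracy': 0.75, 'robust_accuracy': 0.5}
```

The gap between the two numbers is the signal of interest: the toy solver gets three of four variants right, yet only one of the two problems survives every rewording.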

Signals (18)

arXiv · 7h ago

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

arXiv · 1d ago

Spotlights and Blindspots: Evaluation Machine-Generated Text Detection

arXiv · 2d ago

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

arXiv · 1d ago

Measuring Representation Robustness in Large Language Models for Geometry

HN · 6d ago

AI-assisted cognition endangers human development?

arXiv · 1d ago

SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future

arXiv · 1d ago

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

GitHub · 9h ago

langfuse/langfuse

arXiv · 7h ago

Disparities In Negation Understanding Across Languages In Vision-Language Models

arXiv · 1d ago

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

arXiv · 7d ago

Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem Generation

arXiv · 1d ago

Towards Rigorous Explainability by Feature Attribution

arXiv · 7h ago

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring

arXiv · 1d ago

Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

GitHub · 5d ago

datawhalechina/hello-agents

arXiv · 7h ago

Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

arXiv · 5d ago

Listen, Correct, and Feed Back: Spoken Pedagogical Feedback Generation

arXiv · 1d ago

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams