Evaluation + LLM + Reasoning

29.0

Build a benchmark framework that tests 'continuity': the ability of an agent system to keep facts established earlier in a session correct and consistent over long sessions. Move beyond simple RAG retrieval metrics to 'persistence' testing.
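As a rough illustration of what a 'persistence' test could look like, here is a minimal Python sketch. It is not taken from ATANT or any of the signals below. The names `Agent`, `Probe`, `persistence_score`, and `make_windowed_agent` are all hypothetical, the agent is assumed to be a plain function from message history to reply, and substring matching is a deliberately crude stand-in for a real answer grader.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical agent interface (an assumption, not a real library API):
# the agent receives the full message history and returns a reply string.
Agent = Callable[[List[str]], str]


@dataclass
class Probe:
    fact: str      # statement planted at the start of the session
    question: str  # asked after the distractor turns
    expected: str  # substring that a correct answer must contain


def persistence_score(agent: Agent, probes: List[Probe], distractor_turns: int) -> float:
    """Fraction of planted facts still answered correctly after the distractors."""
    if not probes:
        return 0.0
    correct = 0
    for probe in probes:
        # Plant the fact, pad the session with unrelated turns, then ask.
        history = [probe.fact]
        history += [f"Unrelated message {i}" for i in range(distractor_turns)]
        history.append(probe.question)
        reply = agent(history)
        # Crude grading: case-insensitive substring match (assumption).
        if probe.expected.lower() in reply.lower():
            correct += 1
    return correct / len(probes)


def make_windowed_agent(window: int) -> Agent:
    """Toy stand-in for a real agent: it only sees the last `window` messages."""
    def agent(history: List[str]) -> str:
        visible = history[-window:]
        # Echo everything visible except the final question.
        return " ".join(visible[:-1])
    return agent


probes = [Probe("The deploy key is ALPHA-7.", "What is the deploy key?", "ALPHA-7")]
toy_agent = make_windowed_agent(window=5)
print(persistence_score(toy_agent, probes, distractor_turns=2))   # 1.0
print(persistence_score(toy_agent, probes, distractor_turns=10))  # 0.0
```

The toy agent answers correctly while the planted fact is still inside its window and fails once enough distractor turns push it out. That drop as session length grows is the kind of degradation a retrieval-only metric would not surface.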

+22

emerging, implementation gap

math, reasoning, evaluation, disinformation, llm, quality, rag, safety, agents

Signals (11)

arXiv, 12h ago

CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

arXiv, 12h ago

Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

ATANT: An Evaluation Framework for AI Continuity

Scientists invented a fake disease. AI told people it was real

Riemann-Bench: A Benchmark for Moonshot Mathematics

Claude mixes up who said what and that's not OK

Google's AI Overviews spew false answers per hour, bombshell study reveals

Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

ConvApparel: Measuring and bridging the realism gap in user simulators

AMD AI director says Claude Code is becoming dumber and lazier since update