feed trends discover showcase archive

Feed Trends Discover Showcase Archive Dashboard

Submit Showcase

Trending now

Workflow + Code Generation + Automation62 Security + Llm56 Policy + Ethics53

View all trends →

hypedar

AI trend radar for developers. Catch emerging papers, repos, and discussions before the hype peaks.

About GitHub Discord

By the makers of hypedar

Codepawl

Open-source tools for developers.

Explore our tools →

About Privacy Terms X

© 2026 Codepawl

Built by Codepawl·© 2026

About·Terms·Privacy·Security

GitHub·Discord·X

feed trends discover showcase archive

Evaluation + Agents + Llm | hypedar

Evaluation + Agents + Llm

66.0

Develop an automated evaluation harness that stress-tests AI agents against adversarial benchmark inputs to identify common failure modes. This tool would allow researchers to standardize robustness testing across different LLM backends.

+66

activeimplementation gap

roboticsllmreasoningbenchmarksevaluationalignmentagents

Signals (46)

Medical Reasoning with Large Language Models: A Survey and MR-Bench

SAGE: A Service Agent Graph-guided Evaluation Benchmark

Artifacts as Memory Beyond the Agent Boundary

Exploiting the most prominent AI agent benchmarks

Sustained Impact of Agentic Personalisation in Marketing: A Longitudinal Case Study

CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

firecrawl/firecrawl

TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

DRAFT: Task Decoupled Latent Reasoning for Agent Safety

RAGEN-2: Reasoning Collapse in Agentic RL

Claude Opus 4.6 accuracy on BridgeBench hallucination test drops from 83% to 68%

The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Structural Rigidity and the 57-Token Predictive Window: A Physical Framework for Inference-Layer Governability in Large Language Models

Neural networks for Text-to-Speech evaluation

Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

ATANT: An Evaluation Framework for AI Continuity

A better method for identifying overconfident large language models

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

Scientists invented a fake disease. AI told people it was real

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning

Robust Reasoning Benchmark

Riemann-Bench: A Benchmark for Moonshot Mathematics

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Claude mixes up who said what and that's not OK

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

StaRPO: Stability-Augmented Reinforcement Policy Optimization

Google's AI Overviews spew false answers per hour, bombshell study reveals

Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs

Google AI4d ago

ConvApparel: Measuring and bridging the realism gap in user simulators

Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents

Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

AMD AI director says Claude Code is becoming dumber and lazier since update

GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean

SynDocDis: A Metadata-Driven Framework for Generating Synthetic Physician Discussions Using Large Language Models

Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models

Show HN: I Built Paul Graham's Intellectual Captcha Idea

A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation