hypedar

AI trend radar for developers. Catch emerging papers, repos, and discussions before the hype peaks.

Trending now

- Code Generation + Inference + Agents (67)
- Ethics + Transparency (46)
- Inference + Hardware + Optimization (40)



Safety + RL

Score: 20.0

Create an agent-agnostic 'recovery' layer that intercepts dangerous system states. Implement a mechanism that guides an agent back to a safe baseline after a harmful action.

+0
emerging · implementation gap
Tags: rl, training, reasoning, jailbreak, llm, security, verification, safety, agents
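
The trend description above gestures at an architecture rather than an API, so here is a minimal, hypothetical Python sketch of what such a recovery layer could look like: it wraps an arbitrary agent, checks every resulting environment state against a danger predicate, and restores the last known-safe baseline when a harmful transition is detected. All names here (Agent, Environment, is_dangerous, RecoveryLayer) are illustrative assumptions, not interfaces from any of the signals listed below.

```python
# Minimal sketch of an agent-agnostic recovery layer (illustrative only).
# Assumes the environment can snapshot and restore its own state and that the
# caller supplies an `is_dangerous` predicate over states; none of these names
# come from the cited papers.

import copy
from dataclasses import dataclass
from typing import Any, Callable, Protocol


class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...


class Environment(Protocol):
    def observe(self) -> Any: ...
    def step(self, action: Any) -> Any: ...   # returns the resulting state
    def snapshot(self) -> Any: ...            # copy of the current state
    def restore(self, state: Any) -> None: ...


@dataclass
class RecoveryLayer:
    """Wraps any agent/environment pair and reverts harmful transitions."""
    env: Environment
    is_dangerous: Callable[[Any], bool]  # predicate over environment states
    recoveries: int = 0

    def run(self, agent: Agent, max_steps: int = 100) -> int:
        baseline = self.env.snapshot()  # last known-safe state
        for _ in range(max_steps):
            action = agent.act(self.env.observe())
            new_state = self.env.step(action)
            if self.is_dangerous(new_state):
                # Harmful action detected: guide the system back to the safe baseline.
                self.env.restore(copy.deepcopy(baseline))
                self.recoveries += 1
            else:
                # Transition was safe: advance the baseline.
                baseline = self.env.snapshot()
        return self.recoveries
```

The hard part is everything the sketch leaves abstract: the danger predicate and the rollback mechanism, which is roughly where the verification, jailbreak-detection, and human-guided recovery signals below come in.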

Signals (7)

- The Cost of Relaxation: Evaluating the Error in Convex Neural Network Verification (arXiv, 1d ago)
- An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models (arXiv, 1d ago)
- Peer-Preservation in Frontier Models (arXiv, 5h ago)
- Human-Guided Harm Recovery for Computer Use Agents (arXiv, 5h ago)
- Can We Locate and Prevent Stereotypes in LLMs? (arXiv, 5h ago)
- Reasoning Structure Matters for Safety Alignment of Reasoning Models (arXiv, 5h ago)
- ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System (arXiv, 5h ago)