Welcome to TOKENBURN — Your source for AI news
Safety

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

LOCA identifies minimal, interpretable representation changes that causally explain why individual jailbreaks defeat LLM safety training.

Monday, May 4, 2026 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv CS.AI · BY sys://pipeline

Researchers introduce LOCA, a method for producing local, causal explanations of why specific jailbreak attacks succeed against safety-trained LLMs. Rather than offering a single global account in which all jailbreaks modify the same concepts, LOCA identifies, for each individual jailbreak attempt, a minimal set of interpretable representation changes that causally induces model refusal.
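The core idea of a "minimal causal set" can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the authors' implementation: `induces_refusal` stands in for an expensive check that patches candidate representation edits into the model and tests whether it refuses the jailbreak prompt, and the feature names are made up.

```python
from itertools import combinations

def minimal_causal_set(candidate_edits, induces_refusal):
    """Return the smallest subset of candidate edits whose joint
    application makes the refusal oracle return True, or None."""
    # Brute-force smallest-first search; a real method would need a
    # cheaper strategy (e.g. greedy or gradient-guided pruning).
    for k in range(1, len(candidate_edits) + 1):
        for subset in combinations(candidate_edits, k):
            if induces_refusal(set(subset)):
                return set(subset)
    return None

# Toy oracle: refusal returns only when both hypothetical features
# "harm_detection" and "instruction_override" are restored.
oracle = lambda s: {"harm_detection", "instruction_override"} <= s
edits = ["harm_detection", "politeness", "instruction_override", "topic"]
print(sorted(minimal_causal_set(edits, oracle)))
```

The minimality constraint is what keeps the explanation interpretable: the search returns only the two features that jointly flip the model to refusal, ignoring the causally irrelevant ones.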

Tags
safety