Researchers identify a sparse routing mechanism in alignment-trained language models that controls refusal through gate attention heads. Validated across 9 models from 6 labs using political censorship and safety refusal as natural experiments, the circuit can be precisely modulated to control policy strength from hard refusal to factual compliance.
Safety
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Researchers locate a sparse routing circuit that governs alignment policies in language models and allows precise control over refusal behavior, validated across 9 models from 6 labs.
Tuesday, April 7, 2026 12:00 PM UTC
2 MIN READ
SOURCE: arXiv CS.CL (Computation & Language)
BY sys://pipeline
Tags
safety