mechanistic interpretability
8 mentions across all digests
Mechanistic interpretability is a research field focused on reverse-engineering the internal computations of neural networks. Active threads include sparse autoencoders, knowledge editing via circuit identification, and understanding how models internally represent truthfulness.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs
A weight-patching technique lets researchers pinpoint where within an LLM's architecture specific behaviors originate, advancing the mechanistic interpretability of neural networks.
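For intuition, here is a minimal sketch of the general recipe on a toy model, assuming the technique swaps individual weight tensors from a "source" model that exhibits a behavior into a "base" model that does not, then measures how much of the behavior transfers. The model sizes, behavior metric, and names are illustrative, not the paper's actual setup.

```python
# Toy weight-patching sketch: patch one parameter at a time from `source`
# into `base` and see which swap most closes the behavioral gap.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

base, source = make_mlp(), make_mlp()
x = torch.randn(64, 16)            # synthetic probe inputs
target = source(x).detach()        # "behavior" = the source model's outputs

def behavior_gap(model):
    return (model(x) - target).pow(2).mean().item()

print(f"gap before patching: {behavior_gap(base):.4f}")

base_params = dict(base.named_parameters())
for name, src_param in source.named_parameters():
    saved = base_params[name].data.clone()
    base_params[name].data.copy_(src_param.data)   # patch in source weights
    print(f"gap with {name} patched: {behavior_gap(base):.4f}")
    base_params[name].data.copy_(saved)            # restore base weights
```

The parameter whose swap most reduces the gap is a candidate locus for the behavior, which is the localization logic the summary describes.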
Mechanistic Circuit-Based Knowledge Editing in Large Language Models
Researchers identify the specific internal circuits in LLMs that encode factual knowledge, enabling surgical edits to stored facts with full interpretability of the mechanism-level changes.
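As a hedged illustration of what a mechanism-level edit can look like once a fact has been localized to a single linear map, the sketch below applies a closed-form rank-one update in the style of ROME; the paper's actual circuit-based procedure may differ, and all tensors here are stand-ins.

```python
# Rank-one fact edit: given a localized weight matrix W, rewrite the stored
# key -> value association so that W_edited @ k == v_new exactly, while
# minimally disturbing directions orthogonal to k.
import torch

torch.manual_seed(0)
d = 8
W = torch.randn(d, d)    # localized weight matrix (stand-in)
k = torch.randn(d)       # key activation for the fact's subject
v_new = torch.randn(d)   # desired value activation (the edited fact)

delta = torch.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + delta

print(torch.allclose(W_edited @ k, v_new, atol=1e-5))  # True: fact rewritten
```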
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
A decomposability penalty during sparse autoencoder training produces more isolated, interpretable features—advancing mechanistic interpretability by reducing representation entanglement.
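A loose sketch of what joint training with such a penalty could look like, under the assumption that a meta-SAE is trained on the base SAE's decoder directions and the penalty is the L1 sparsity of the meta-codes, pushing each latent toward a single atomic meta-feature. The paper's exact penalty and architecture are not reproduced here, and all dimensions are toy values.

```python
# Joint SAE + meta-SAE training with an assumed decomposability penalty.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_latents, n_meta = 32, 128, 64

W_enc = torch.randn(d_model, n_latents, requires_grad=True)
W_dec = torch.randn(n_latents, d_model, requires_grad=True)
M_enc = torch.randn(d_model, n_meta, requires_grad=True)
M_dec = torch.randn(n_meta, d_model, requires_grad=True)
opt = torch.optim.Adam([W_enc, W_dec, M_enc, M_dec], lr=1e-3)

x = torch.randn(1024, d_model)  # stand-in model activations

for step in range(200):
    z = F.relu(x @ W_enc)                  # base SAE codes
    recon = (z @ W_dec - x).pow(2).mean()  # reconstruction loss
    sparsity = z.mean()                    # standard L1 sparsity (z >= 0)

    meta_z = F.relu(W_dec @ M_enc)         # meta-SAE codes of decoder dirs
    meta_recon = (meta_z @ M_dec - W_dec).pow(2).mean()
    decomposability = meta_z.mean()        # few meta-features per latent

    loss = recon + 1e-3 * sparsity + meta_recon + 1e-3 * decomposability
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The intuition: a latent whose decoder direction needs many meta-latents to reconstruct is not atomic, so penalizing the meta-code mass discourages entangled features.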
Testing the Limits of Truth Directions in LLMs
A study probes where LLMs' internal truth directions break down, revealing mechanistic limits on how language models encode truthfulness.
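For context, truth-direction work typically fits a linear direction, such as a difference of class means over hidden states for true and false statements, and then tests whether the projection generalizes. The minimal sketch below uses synthetic stand-in activations rather than real model states.

```python
# Difference-of-means "truth direction" probe on synthetic activations.
import torch

torch.manual_seed(0)
d = 64
true_axis = torch.randn(d)                         # hidden ground-truth axis
acts_true = torch.randn(500, d) + 0.5 * true_axis  # "true" statement states
acts_false = torch.randn(500, d) - 0.5 * true_axis # "false" statement states

# Fit the direction on a training split.
direction = acts_true[:400].mean(0) - acts_false[:400].mean(0)
direction = direction / direction.norm()

# Held-out accuracy: project and threshold at the midpoint of the means.
mid = 0.5 * (acts_true[:400].mean(0) + acts_false[:400].mean(0)) @ direction
preds_t = (acts_true[400:] @ direction) > mid
preds_f = (acts_false[400:] @ direction) <= mid
acc = torch.cat([preds_t, preds_f]).float().mean()
print(f"held-out accuracy: {acc:.2f}")
```

The study's question is where probes like this stop generalizing, e.g. across statement types or distributions.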
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Researchers locate a sparse routing circuit governing alignment policies in language models, enabling precise control over refusal behavior—validated across 9 models from 6 major labs.
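A simplified sketch of the control step, assuming the familiar directional-ablation approach of projecting a "refusal direction" out of the residual stream. The paper's sparse routing circuit is more fine-grained, so this only illustrates the general style of intervention, with all tensors synthetic.

```python
# Directional ablation: remove the component of activations h along a
# candidate refusal direction r, i.e. h - (h @ r) r^T.
import torch

torch.manual_seed(0)
d = 128
r = torch.randn(d)
r = r / r.norm()          # candidate refusal direction (stand-in)
h = torch.randn(16, d)    # toy residual-stream activations

def ablate(h, r):
    """Project the r-component out of each row of h."""
    return h - torch.outer(h @ r, r)

h_controlled = ablate(h, r)
print(torch.allclose(h_controlled @ r, torch.zeros(16), atol=1e-5))  # True
```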