Interpretability
7 mentions across all digests
Interpretability is the field concerned with understanding and explaining how machine learning models make decisions; the mentions in this digest span geometric frameworks for transformer latent spaces, dimension selection in vision-language reward models, and self-explaining clustering methods.
Emotion concepts and their function in a large language model
Anthropic researchers found that Claude Sonnet 4.5 develops emotion-like internal representations that are causally effective, measurably influencing its behavior and challenging the notion that emotional language is merely surface-level output.
From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
Researchers propose using interpretability analysis to identify which training examples most influence LLM behavior, cutting training costs while maintaining model quality.
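The paper's exact method isn't described here, but influence-style data selection is commonly sketched as ranking training examples by how well their per-example gradients align with a gradient from a held-out reference set, then keeping the top fraction. A minimal sketch, assuming cosine similarity as the influence proxy (the function name and setup are hypothetical):

```python
import numpy as np

def select_influential(train_grads, ref_grad, keep_frac=0.2):
    """Rank training examples by cosine similarity between their
    per-example gradients and a reference (e.g. validation) gradient,
    and return indices of the top fraction."""
    ref = ref_grad / np.linalg.norm(ref_grad)
    norms = np.linalg.norm(train_grads, axis=1, keepdims=True)
    scores = (train_grads / norms) @ ref      # cosine similarity per example
    k = max(1, int(keep_frac * len(scores)))
    return np.argsort(scores)[::-1][:k]       # most influential first

# toy example: 5 "gradients" in 3-D; example 0 is built to match the reference
rng = np.random.default_rng(0)
grads = rng.normal(size=(5, 3))
ref = grads[0] + 0.1 * rng.normal(size=3)
idx = select_influential(grads, ref, keep_frac=0.4)
```

Training on only the selected subset is what yields the claimed cost reduction; the quality claim rests on the selected examples carrying most of the useful signal.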
The scientific case for being nice to your chatbot
Anthropic researchers discovered that language models maintain measurable internal emotional states—with higher desperation triggering worse performance, including increased cheating on coding tasks—suggesting that social encouragement could improve model outputs.
Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling
A dynamic feature-selection technique exposes which visual and linguistic dimensions actually drive decisions in vision-language reward models, improving the interpretability of multimodal AI systems.
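One plausible reading of "dynamic dimension selection and aggregation" (a sketch under my own assumptions, not the paper's implementation) is a softmax gate over learned per-dimension importance logits: the gates weight each feature dimension when aggregating to a scalar reward, and the gate values double as a per-dimension explanation of the score.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_reward(features, importance_logits, temperature=1.0):
    """Weight each feature dimension by a softmax gate over learned
    importance logits, then aggregate to a scalar reward. The returned
    gates show which dimensions drove the score."""
    gates = softmax(importance_logits / temperature)
    reward = float(gates @ features)
    return reward, gates

# toy example: 4 fused vision-language feature dimensions
features = np.array([0.9, 0.1, -0.3, 0.5])
logits = np.array([2.0, -1.0, 0.0, 1.0])   # hypothetical learned logits
reward, gates = gated_reward(features, logits)
top_dim = int(np.argmax(gates))            # the most decision-relevant dimension
```

The temperature parameter would let the gate interpolate between hard selection (low temperature) and soft aggregation (high temperature), which is one way a model can be "dynamic" per input.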
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
LAG-XAI uses Lie algebra-inspired geometry to decode how transformers manipulate text in latent space, revealing the mathematical structure behind neural network paraphrasing operations.
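The core Lie-algebraic idea can be illustrated without the paper's details: an edit direction lives in the Lie algebra as a generator matrix A, and the actual latent transformation is the one-parameter group exp(tA), where t controls edit strength. A minimal sketch with a 2-D skew-symmetric generator (the generator and latent vector are toy assumptions, not LAG-XAI's):

```python
import numpy as np

def expm_series(A, terms=30):
    """Matrix exponential via truncated power series: exp(A) = sum A^k / k!."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

# Hypothetical "paraphrase generator" in the Lie algebra: a skew-symmetric
# matrix whose exponential rotates the 2-D latent plane.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])

def apply_edit(z, t):
    """Move latent vector z along the one-parameter group exp(t*A)."""
    return expm_series(t * A) @ z

z = np.array([1.0, 0.0])
z1 = apply_edit(z, 0.3)
z2 = apply_edit(apply_edit(z, 0.15), 0.15)  # two half-strength edits compose
# group property: exp(0.15*A) @ exp(0.15*A) == exp(0.3*A)
```

The group property in the last line is what makes the framework interpretable: edits along one generator compose additively in t, so a "paraphrase amount" behaves like a continuous, reversible dial rather than an opaque vector offset.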