Research

Applied Explainability for Large Language Models: A Comparative Study

A comparative study of three explainability techniques on DistilBERT finds gradient-based attribution the most reliable for understanding transformer predictions, while attention-based methods trade faithfulness for speed, a critical tradeoff for debugging NLP systems.

Monday, April 20, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv cs.CL (Computation & Language) · BY sys://pipeline

This comparative empirical study evaluates three explainability techniques (Integrated Gradients, Attention Rollout, and SHAP) on a fine-tuned DistilBERT model for sentiment classification. Gradient-based attribution provides stable, intuitive explanations; attention-based methods are computationally efficient but less aligned with prediction-relevant features; and model-agnostic approaches offer flexibility at a higher computational cost. The findings position explainability methods as diagnostic tools for debugging transformer-based NLP systems and for building trust in their predictions. The sketches below illustrate how each technique is typically applied.
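To ground the comparison, here is a minimal sketch of gradient-based attribution with Integrated Gradients on a DistilBERT sentiment classifier, using Hugging Face Transformers and Captum. It illustrates the technique, not the paper's code: the checkpoint name, the positive-class index, and the [PAD] baseline are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerIntegratedGradients

# Assumed stand-in for the paper's fine-tuned model.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def positive_logit(input_ids, attention_mask):
    # Scalar output per example: the logit of the positive class (index 1 here).
    return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

enc = tokenizer("The movie was surprisingly good.", return_tensors="pt")
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]

# Baseline input: [CLS] ... [PAD] ... [SEP], i.e. the sentence "erased".
baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
baseline_ids[0, 0] = tokenizer.cls_token_id
baseline_ids[0, -1] = tokenizer.sep_token_id

# Integrate gradients along the path from the baseline to the input embeddings.
lig = LayerIntegratedGradients(positive_logit, model.distilbert.embeddings)
attributions = lig.attribute(
    inputs=input_ids,
    baselines=baseline_ids,
    additional_forward_args=(attention_mask,),
    n_steps=50,
)

# Sum over the embedding dimension to get one relevance score per token.
scores = attributions.sum(dim=-1).squeeze(0)
for token, score in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), scores):
    print(f"{token:>12s}  {score.item():+.4f}")
```

Attention Rollout needs no gradients: it folds per-layer attention maps into a single token-to-token influence estimate, which is why it is cheap but can drift from what the prediction actually depends on. Below is a sketch of the standard recipe (average heads, add the residual connection as an identity term, renormalize, multiply across layers), assuming a model that returns per-layer attentions; it is not necessarily the variant the paper used.

```python
import torch

def attention_rollout(attentions):
    """attentions: tuple of (batch, heads, seq, seq) tensors, one per layer."""
    rollout = None
    for layer_attn in attentions:
        a = layer_attn.mean(dim=1)              # average over heads
        a = a + torch.eye(a.size(-1))           # residual connection as identity
        a = a / a.sum(dim=-1, keepdim=True)     # renormalize rows
        rollout = a if rollout is None else torch.bmm(a, rollout)
    return rollout  # (batch, seq, seq): cumulative attention across layers

# Usage with the model from the previous sketch:
# out = model(**enc, output_attentions=True)
# rollout = attention_rollout(out.attentions)
# cls_scores = rollout[0, 0]  # how strongly [CLS] draws on each input token
```

For the model-agnostic route, the shap library can wrap a Transformers text-classification pipeline directly; that is one common setup, shown here as an assumption about the study's configuration rather than a quote of it.

```python
import shap
from transformers import pipeline

# Same assumed checkpoint as above; top_k=None returns scores for all classes.
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)
explainer = shap.Explainer(clf)  # builds a text masker around the pipeline
shap_values = explainer(["The movie was surprisingly good."])
print(shap_values.values[0])     # per-token contributions to the class scores
```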

Tags
research