Large Language Models achieve strong performance in text classification, but their internal decision-making process remains largely opaque. What concepts do they actually rely on when assigning a label? Can we extract features that are both interpretable and causally influential on the decision?
In this work, we introduce ClassifSAE, a supervised Sparse Autoencoder (SAE) architecture designed for explainability in sentence classification. We align the SAE behavior with the classification task by jointly training the SAE and a lightweight classifier, encouraging task-relevant information to concentrate into a small set of interpretable latent features. We also introduce an activation-rate sparsity mechanism to promote monosemantic and less correlated concepts.
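To make the joint objective concrete, here is a minimal forward-pass sketch in NumPy. It is an illustration under stated assumptions, not the ClassifSAE implementation: the ReLU encoder, the linear classifier on the latent code, the hinge-style activation-rate penalty, and the names `init_params`, `joint_loss`, `alpha`, `beta` and `target_rate` are all hypothetical choices made for this example.

```python
import numpy as np

def init_params(d_model, n_latent, n_classes, seed=0):
    """Small random weights for a toy SAE plus linear classifier (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    return (rng.normal(0, 0.1, (d_model, n_latent)), np.zeros(n_latent),   # encoder
            rng.normal(0, 0.1, (n_latent, d_model)), np.zeros(d_model),    # decoder
            rng.normal(0, 0.1, (n_latent, n_classes)), np.zeros(n_classes))  # classifier

def joint_loss(h, y, params, alpha=1.0, beta=0.1, target_rate=0.1):
    """Combined loss for a batch.

    h: (batch, d_model) hidden states taken from the LLM.
    y: (batch,) integer class labels.
    """
    W_enc, b_enc, W_dec, b_dec, W_clf, b_clf = params

    # Encode to a non-negative latent code, then reconstruct the hidden state.
    z = np.maximum(h @ W_enc + b_enc, 0.0)
    h_hat = z @ W_dec + b_dec
    recon = np.mean((h - h_hat) ** 2)

    # Lightweight classifier reading the latent code (softmax cross-entropy).
    logits = z @ W_clf + b_clf
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(y)), y])

    # Activation rate: fraction of the batch on which each feature fires.
    # Assumed penalty: only features firing more often than target_rate are penalised.
    rate = np.mean(z > 0, axis=0)
    rate_penalty = np.mean(np.maximum(rate - target_rate, 0.0))

    total = recon + alpha * ce + beta * rate_penalty
    return {"recon": recon, "ce": ce, "rate_penalty": rate_penalty, "total": total}

# Toy usage: random vectors stand in for real LLM hidden states.
rng = np.random.default_rng(1)
h = rng.normal(size=(16, 8))
y = rng.integers(0, 3, size=16)
losses = joint_loss(h, y, init_params(d_model=8, n_latent=32, n_classes=3))
print({k: round(float(v), 4) for k, v in losses.items()})
```

The rate term above is computed from a hard threshold (`z > 0`), so it only serves to show what is being measured; a trainable version would need a differentiable surrogate and an autodiff framework.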
Beyond the model itself, we propose two new evaluation metrics, ConceptSim and SentenceSim, to quantify the precision and coherence of the extracted concepts using an external sentence encoder. This allows us to move beyond qualitative inspection and systematically evaluate interpretability.
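The exact definitions of ConceptSim and SentenceSim are given in the paper and are not reproduced here. As a rough illustration of the general idea of scoring coherence with an external sentence encoder, the sketch below computes the mean pairwise cosine similarity of a set of embeddings, for example those of the sentences that most strongly activate one feature. The function name `mean_pairwise_cosine` and the toy vectors are assumptions made for this example; real embeddings would come from whichever encoder is used.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of rows.

    embeddings: (n, d) array with n >= 2 and no all-zero rows.
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalise each row
    sims = E @ E.T                                    # cosine similarity matrix
    n = len(E)
    # Drop the diagonal (self-similarity) before averaging.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

# Toy usage: hand-written vectors stand in for sentence-encoder outputs.
coherent = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_cosine(coherent) > mean_pairwise_cosine(scattered))
```

A higher score means the sentences grouped under a feature sit closer together in the encoder's embedding space, which is one possible proxy for a coherent concept.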
Across multiple datasets and seven backbone LLMs, ClassifSAE:
Our results highlight a possible trade-off between causality and interpretability and show that task-aware sparse representations can substantially improve concept extraction for LLM classifiers.
This work was done jointly with Jérémie Dentan, Davide Buscaldi and Sonia Vanier, as part of the "Responsible and Trustworthy AI" research chair between École Polytechnique and Groupe Crédit Agricole.
Looking forward to presenting it at EACL 2026!