Our paper "Unveiling Decision-Making in LLMs for Text Classification: Extraction of Influential and Interpretable concepts with Sparse Autoencoders" has been accepted at Findings of EACL 2026.

Article date: February 24th, 2026

Large Language Models achieve strong performance in text classification, but their internal decision-making process remains largely opaque. What concepts do they actually rely on when assigning a label? Can we extract features that are both interpretable and causally influential on the decision?

In this work, we introduce ClassifSAE, a supervised Sparse Autoencoder (SAE) architecture designed for explainability in sentence classification. We align the SAE behavior with the classification task by jointly training the SAE and a lightweight classifier, encouraging task-relevant information to concentrate into a small set of interpretable latent features. We also introduce an activation-rate sparsity mechanism to promote monosemantic and less correlated concepts.
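
To make the joint objective described above more concrete, here is a rough sketch that combines a reconstruction term, a classification term computed from a small subset of latent features, and a penalty on how often each feature fires across a batch. It is plain Python written for illustration, not the ClassifSAE implementation: the function names, the ReLU encoder, the choice of the first `n_task` latents as classifier inputs, the target activation rate and the loss weights are all assumptions. The count-based rate penalty is also not differentiable, so actual training would need a smooth surrogate.

```python
import math

def matvec(W, x, b):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def joint_loss(batch, labels, params, n_task=2, target_rate=0.5,
               w_clf=1.0, w_rate=0.1):
    """Hypothetical objective: reconstruction + classification + activation-rate penalty."""
    W_enc, b_enc, W_dec, b_dec, W_clf, b_clf = params
    n_latent = len(W_enc)
    recon, clf = 0.0, 0.0
    fire_counts = [0] * n_latent
    for x, y in zip(batch, labels):
        z = relu(matvec(W_enc, x, b_enc))      # latent code
        x_hat = matvec(W_dec, z, b_dec)        # reconstruction of the input vector
        recon += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # Assumption: the classifier only sees the first n_task latents.
        probs = softmax(matvec(W_clf, z[:n_task], b_clf))
        clf += -math.log(probs[y] + 1e-12)
        for j, a in enumerate(z):
            fire_counts[j] += int(a > 0)
    n = len(batch)
    rates = [c / n for c in fire_counts]       # fraction of inputs on which each latent fires
    # Penalize latents that fire more often than the target rate.
    rate_pen = sum(max(0.0, r - target_rate) for r in rates) / n_latent
    total = recon / n + w_clf * clf / n + w_rate * rate_pen
    return total, rates

# Toy example: 2-dimensional inputs, 3 latents, 2 classes.
params = (
    [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0],  # encoder
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0],           # decoder
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],                     # classifier on 2 latents
)
batch = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
total, rates = joint_loss(batch, labels, params)
```

In this toy setup the reconstruction is exact, so the loss reduces to the classification term, and the third latent never fires. Lowering `target_rate` makes the penalty active for the two latents that fire on half of the batch.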

Beyond the model itself, we propose two new evaluation metrics, ConceptSim and SentenceSim, to quantify the precision and coherence of the extracted concepts using an external sentence encoder. This allows us to move beyond qualitative inspection and systematically evaluate interpretability.
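
The exact definitions of ConceptSim and SentenceSim are given in the paper. The sketch below only illustrates the general idea of scoring a concept with an external sentence encoder: if the sentences that activate a concept have similar embeddings, the concept is likely coherent. The function `mean_pairwise_cosine` and the toy 2-dimensional "embeddings" are hypothetical stand-ins, not the metrics or the encoder used in the paper.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all pairs of sentence embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two sentences: no pair to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings of the sentences that most activate two hypothetical concepts.
coherent_concept = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
mixed_concept = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

coherent_score = mean_pairwise_cosine(coherent_concept)
mixed_score = mean_pairwise_cosine(mixed_concept)
```

Here the first concept scores close to 1 because its sentences point in nearly the same direction, while the second scores below zero because its sentences are unrelated or opposed.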

Across multiple datasets and seven backbone LLMs, ClassifSAE:

Produces significantly more interpretable and sparser concepts
Discovers concepts with strong causal influence on predictions (second-best overall)
Requires up to 83% less training time than prior competitive approaches on the largest models

Our results highlight a possible trade-off between causality and interpretability and show that task-aware sparse representations can substantially improve concept extraction for LLM classifiers.

This work was done jointly with Jérémie Dentan, Davide Buscaldi and Sonia Vanier, as part of the "Responsible and Trustworthy AI" research chair between École Polytechnique and Groupe Crédit Agricole.

Looking forward to presenting it at EACL 2026!

Paper: https://arxiv.org/abs/2506.23951
Code: https://github.com/orailix/ClassifSAE
