Interpretability
Using dictionary learning features as classifiers
Oct 16, 2024
At the link above, we report some early-stage work from the Anthropic interpretability team on feature-based classifiers, which may be of interest to researchers actively working in this space. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.