Using dictionary learning features as classifiers

Oct 16, 2024
Read Transformer Circuits

At the link above, we share some preliminary work from the Anthropic interpretability team on building feature-based classifiers, which may interest researchers working actively in this space. Please treat these results like those of a colleague sharing thoughts or preliminary experiments for a few minutes at a lab meeting, rather than as a mature paper.
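To make the idea concrete, here is a minimal, hypothetical sketch of the general technique: treating dictionary-learning (e.g. sparse autoencoder) feature activations as inputs to a simple linear classifier. All names, shapes, and the synthetic data below are illustrative assumptions, not the actual setup described in the linked report.

```python
# Hypothetical sketch: a linear classifier on top of dictionary-learning
# feature activations. The data here is synthetic; in practice X would be
# SAE feature activations collected on labeled prompts.
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 200, 64  # assumed sizes, for illustration only

# Stand-in for sparse, non-negative feature activations.
X = np.maximum(rng.normal(size=(n_samples, n_features)), 0.0)

# Stand-in binary labels (e.g. whether a prompt has some property).
w_true = rng.normal(size=n_features)
y = (X @ w_true > 0).astype(float)

# Fit logistic regression by plain gradient descent (no external deps).
w = np.zeros(n_features)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    w -= lr * (X.T @ (p - y) / n_samples)   # gradient of log loss w.r.t. w
    b -= lr * np.mean(p - y)                # gradient of log loss w.r.t. b

acc = np.mean(((X @ w + b) > 0) == (y > 0.5))
print(f"train accuracy: {acc:.2f}")
```

The appeal of this setup is that the learned weights are directly attributable to individual dictionary features, which can make the resulting classifier easier to interpret than one trained on raw activations.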
