Toy Models of Superposition

Sep 14, 2022
Abstract

In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.
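The setup is simple enough to sketch in a few lines. The PyTorch snippet below is a minimal illustration rather than the paper's exact training recipe: it assumes uniform feature values on [0, 1], equal feature importances (the paper weights each feature's reconstruction error by an importance term), and arbitrary hyperparameters. The model compresses `n_features` inputs into `n_hidden < n_features` dimensions via `h = W x`, then reconstructs with a nonlinear readout `x' = ReLU(Wᵀ h + b)`.

```python
import torch

torch.manual_seed(0)
n_features, n_hidden, sparsity = 20, 5, 0.95  # illustrative values

W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

def sample_batch(batch_size=1024):
    # Each feature is zero with probability `sparsity`, else uniform on [0, 1].
    x = torch.rand(batch_size, n_features)
    mask = torch.rand_like(x) < sparsity
    return x.masked_fill(mask, 0.0)

for step in range(10_000):
    x = sample_batch()
    h = x @ W.T                    # compress into n_hidden dimensions
    x_hat = torch.relu(h @ W + b)  # ReLU readout filters interference
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With sparsity this high, inspecting the trained `W` typically shows more than `n_hidden` columns with non-negligible norm: the model stores more features than it has dimensions, tolerating interference between them that the ReLU suppresses.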
