1  Introduction


In the field of AI Alignment, ‘Interpretability’ is the study of understanding neural networks: what they learn from training data and how they form their predictions.

The first step in interpretability is typically to understand which features a neuron responds to. There would be little mystery if each neuron corresponded to a verifiable input feature, for example a neuron that fires on dog tails, or on Korean poetry. Because neural networks incorporate non-linearities, this is not always the case, and we will see how small a fraction of features we are able to extract relative to the number of neurons in a network. This phenomenon is known as ‘superposition’1.

Let’s consider the canonical MNIST machine learning example. MNIST is a dataset of 60,000 training images of handwritten digits, 0-9 (10 classes). Each image is 28x28 pixels, so 784 pixels in total.
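
The sketch below, which assumes torchvision is available (it is not required for this course), simply loads MNIST and confirms the numbers above.

```python
from torchvision import datasets

# Download the training split of MNIST (60,000 images of handwritten digits).
train_set = datasets.MNIST(root="data", train=True, download=True)

print(len(train_set))        # 60000 training images
image, label = train_set[0]  # a PIL image and its digit label
print(image.size)            # (28, 28) -> 784 pixels when flattened
print(label)                 # the digit this image depicts
```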

In the most interpretable scenario, each neuron in a neural network would correspond to a specific feature of the input: one neuron might respond to the crossing point of an eight, another to the curve of a five.

In the image below, we see the original handwritten digits and a heatmap showing the regions of each digit that had high predictive value. Notice, for example, how the criss-cross of the eight, which is unique to that digit, is highlighted in blue (strong correlation), and similarly the curve of the five.

Feature attribution in MNIST
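
As a rough illustration of how such a heatmap could be produced (the figure above was not necessarily made this way), the sketch below fits a linear classifier with scikit-learn and reshapes its per-class weights back into a 28x28 image; pixels with large positive weights are those that push the prediction towards that digit.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# Load MNIST as flat 784-pixel vectors, labels are the strings '0'-'9'.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

# A plain linear classifier: its weights live in pixel space, so they can be
# read directly as a crude per-pixel importance map for each class.
clf = LogisticRegression(max_iter=500).fit(X[:10000], y[:10000])

# Reshape the weight vector for the digit '8' back into a 28x28 image.
weights_for_8 = clf.coef_[list(clf.classes_).index("8")].reshape(28, 28)

plt.imshow(weights_for_8, cmap="coolwarm")
plt.title("Pixels pushing the prediction towards '8'")
plt.colorbar()
plt.show()
```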

In practice, however, neural nets don’t learn clean, one-to-one mappings between neurons and features.

1.0.1 Why does superposition occur?

  • Efficiency: superposition lets a network represent more complex patterns and relationships without necessarily increasing its size, allocating capacity according to the complexity of the task.

  • Generalization: overlapping representations can help the network generalize to new data.

  • Non-linearity: non-linear activations allow complex, overlapping representations. For example, combining features held in different neurons, sushi in one and recipe quantities in another, can help the network formulate accurate predictions.

We will see in this short course that neural networks often represent more features than they have dimensions, and mix different, unrelated concepts in single neurons. For example, a neuron in a language model could fire in response to inputs as varied as code in Haskell, Spanish poetry and vehicle descriptions.
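
To build intuition for how more features than dimensions can coexist, the sketch below packs 1,000 random ‘feature directions’ into a 100-dimensional space (the numbers are arbitrary, chosen only for illustration) and measures how much they interfere with one another.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_features = 100, 1000  # ten times more feature directions than dimensions

# Random unit vectors standing in for feature directions in a 100-d space.
features = rng.normal(size=(n_features, n_dims))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Interference between two features is the cosine similarity of their directions.
cosines = features @ features.T
np.fill_diagonal(cosines, 0.0)

print(f"max |cos| between distinct features:  {np.abs(cosines).max():.3f}")
print(f"mean |cos| between distinct features: {np.abs(cosines).mean():.3f}")
# Despite the 10x dimension deficit, the largest overlap stays well below 1:
# the directions are nearly orthogonal, so many features can share few
# dimensions at the cost of a little interference.
```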

Let’s move on to the next chapter to examine this fascinating field and look closely at what we can currently see neural networks doing, and what we cannot yet.

  1. The phenomenon is also known as ‘polysemanticity’. We will use the term ‘superposition’ in this course.