# Decomposing Language Models Into Understandable Components - Anthropic

Synced: [[2023_11_30]] 6:03 AM
Last Highlighted: [[2023_10_16]]
Tags: [[AI]] [[Explainer]] [[Model]]

![rw-book-cover](https://efficient-manatee.transforms.svdcdn.com/production/images/Untitled-Artwork-11.png?w=1200&h=630&q=82&auto=format&fit=crop&dm=1696477668&s=fe41beb80074843426e455ad571ac77f)

## Highlights

[[2023_10_16]] [View Highlight](https://read.readwise.io/read/01hcw03gspztkb051rz29nj15b)

> Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, [a single neuron](https://transformer-circuits.pub/2023/monosemantic-features/vis/a-neurons.html#feature-83) in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a [single neuron](https://distill.pub/2017/feature-visualization/#diversity) responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.

[[2023_10_16]] [View Highlight](https://read.readwise.io/read/01hcw03v5vx4wdakkpak7yd9er)

> In our latest paper, *[Towards Monosemanticity: Decomposing Language Models With Dictionary Learning](https://transformer-circuits.pub/2023/monosemantic-features/index.html)*, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics.
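
Note: the second highlight describes features as sparse linear combinations of neuron activations found with dictionary learning. Below is a minimal sketch of that idea using a sparse autoencoder over recorded activations; it is not the paper's code, and the class name, dimensions, and L1 coefficient are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of decomposing neuron
# activations into sparse "features" via dictionary learning with a
# sparse autoencoder. All names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        # Encoder maps neuron activations to (many more) feature activations.
        self.encoder = nn.Linear(n_neurons, n_features)
        # Decoder weights form the "dictionary": each feature corresponds to a
        # linear combination (direction) in neuron-activation space.
        self.decoder = nn.Linear(n_features, n_neurons, bias=False)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def train_step(model, activations, optimizer, l1_coeff=1e-3):
    """One step: reconstruct activations while keeping feature activations sparse."""
    features, reconstruction = model(activations)
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().mean()  # L1 penalty pushes most features to zero
    loss = mse + l1_coeff * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: decompose 512 MLP neurons into 4096 candidate features.
model = SparseAutoencoder(n_neurons=512, n_features=4096)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch = torch.randn(64, 512)  # stand-in for neuron activations recorded from a model
loss = train_step(model, batch, optimizer)
```

After training, each decoder column can be inspected as a candidate feature direction, which is one way to read the highlight's claim that features, rather than individual neurons, are the more consistent units of analysis.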