Week of October 2, 2023
Comparing Anthropic’s Dictionary Learning to Ours • Readers may have noticed many similarities between Anthropic’s recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team’s recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I compare our techniques, highlighting where we did things similarly and where we diverged, in the hope of helping readers understand both papers and perhaps laying the groundwork for a future synthesis of the two approaches. • (LessWrong, Robert Aizi) / October 7
Tiny Language Models Come of Age • Learning English is no easy task, as countless students well know. But when the student is a computer, one approach works surprisingly well: Simply feed mountains of text from the internet to a giant mathematical model called a neural network. That’s the operating principle behind generative language models like OpenAI’s ChatGPT, whose ability to converse coherently (if not always truthfully) on a wide range of topics has surprised researchers and the public over the past year. But the approach has its drawbacks. For one thing, the “training” procedure required to transmute vast text archives into state-of-the-art language models is costly and time-intensive. For another, even the people who train large language models find it hard to understand their inner workings; that, in turn, makes it hard to predict the many ways they can fail. Faced with these difficulties, some researchers have opted to train smaller models on smaller data sets and then study their behavior. Now, in a paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories. • (Quanta Magazine, Ben Brubaker) / October 5
Decomposing Language Models Into Understandable Components • Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don’t understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe. In our latest paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics. • (Anthropic) / October 5
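A minimal sketch may make the "dictionary learning" setup more concrete: a sparse autoencoder is trained to reconstruct a model's MLP activations as a sparse combination of learned feature directions. The PyTorch code below is an illustrative sketch under assumed dimensions and hyperparameters, not Anthropic's implementation; `d_model`, `n_features`, and `l1_coeff` are placeholders.

```python
# Illustrative sparse-autoencoder sketch of dictionary learning on MLP activations.
# Sizes and hyperparameters are assumptions, not Anthropic's actual settings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)   # feature directions live in the decoder weights

    def forward(self, x):
        f = torch.relu(self.encoder(x))      # sparse, non-negative feature activations
        x_hat = self.decoder(f)              # reconstruction of the original activation vector
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature activations to zero.
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity

# Usage: `acts` would be a batch of MLP activations collected from the language model.
sae = SparseAutoencoder()
acts = torch.randn(64, 512)                 # stand-in for real activations
x_hat, f = sae(acts)
loss = loss_fn(acts, x_hat, f)
loss.backward()
```

After training, each column of the decoder weight matrix is a candidate "feature" direction (a linear combination of neurons), and the sparse coefficients indicate how strongly each feature is active on a given input.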
Train a language model from scratch • The vast majority of the time, fine-tuning an LLM yields the best results. But when making significant changes to the structure of a model, training from scratch is often required. Examples of significant changes are: (1) changing the vocabulary size; (2) changing the number of hidden dimensions; and (3) changing the number of attention heads or layers. This article shows how to build a new tokenizer and train a small language model from scratch, known as a micromodel: a model with fewer than 1M parameters, less than 5MB in size, that can be trained on a single GPU in hours. • (NeuML, David Mezzetti) / January 12
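As a rough illustration of the workflow the article describes, the sketch below trains a new byte-pair-encoding tokenizer and configures a GPT-2-style model small enough to land well under the ~1M-parameter micromodel budget. It uses the Hugging Face tokenizers and transformers libraries; the corpus, vocabulary size, and model dimensions are placeholder assumptions rather than the article's actual settings, and the training loop itself is omitted.

```python
# Sketch: build a small tokenizer and a sub-1M-parameter GPT-2-style model.
# Corpus, vocab size, and dimensions are illustrative placeholders.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from transformers import GPT2Config, GPT2LMHeadModel

# 1. Train a BPE tokenizer with a deliberately small vocabulary.
corpus = ["tiny example sentence one.", "tiny example sentence two."]  # replace with real text
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# 2. Configure a very small GPT-2-style model to match the new vocabulary.
config = GPT2Config(
    vocab_size=tokenizer.get_vocab_size(),
    n_positions=256,   # short context keeps the position embeddings small
    n_embd=64,         # hidden size
    n_layer=2,         # transformer blocks
    n_head=2,          # attention heads
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # well under 1M parameters
```

With these placeholder dimensions, the parameter count stays around 150K, leaving plenty of headroom under the micromodel ceiling; from here, the model would be trained on the tokenized corpus as the article describes.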