Attention Heads
Attention heads are key components of the Transformer architecture. Within each "multi-head attention" block, there are multiple attention heads operating in parallel. Each head learns to perform a specific type of weighted sum over the input token representations (or representations from the previous layer).
Essentially, each attention head learns to "attend" to different parts of the sequence, calculating query, key, and value vectors for each token. The compatibility between a query (from one token position) and keys (from other token positions) determines attention scores, which then weight the value vectors to produce the head's output. These outputs are then combined and fed into the next part of the Transformer layer.
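As a rough illustration (a minimal numpy sketch with made-up dimensions, random weights, and a causal mask; not our model's actual parameters), a single attention head can be written as:

```python
import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """One attention head: a weighted sum of value vectors, weighted by query-key compatibility."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                      # per-token query, key, value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])                  # compatibility between queries and keys
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)   # causal mask: no attending to later tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over key positions -> attention scores
    return weights @ V                                       # each position: weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))                                 # 5 tokens, 32-dimensional representations
W_q, W_k, W_v = (rng.normal(size=(32, 8)) for _ in range(3))
out = attention_head(X, W_q, W_k, W_v)                       # shape (5, 8): one head output per position
```

In a multi-head block, several such heads run in parallel and their outputs are combined and projected back into the residual stream.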
In our experiment, specialized attention heads are found to be crucial for routing information across token positions in the residual stream, enabling the model to track variable assignments and perform dereferencing. The Attention Heads Patching page visualizes the causal effects of these heads.
Causal Interventions
Causal interventions are experimental techniques used in mechanistic interpretability to determine the causal role of specific internal components or activations within a neural network. This involves actively manipulating parts of the model's computation (e.g., activations of neurons or attention heads) and observing the effect on its subsequent internal states or final output.
An interchange intervention (or activation patching) involves running the model on a base input and a counterfactual input (which differs in a specific way). Activations from a specific component during the counterfactual run are then "patched" into the corresponding component during the base input run. If this patch changes the model's output in a way consistent with the counterfactual input, it suggests the patched component is causally involved in processing the information that differs between the inputs.
Causal tracing is a specific application of this, often used to track the flow of information related to a particular input feature through the network by systematically patching activations at different layers and token positions.
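As a schematic sketch of an interchange intervention (assuming a PyTorch model whose relevant component is exposed as a submodule that returns a single tensor; this is illustrative, not our actual experimental code):

```python
import torch

def interchange_intervention(model, module, base_input, cf_input):
    """Cache one module's activation on the counterfactual input,
    then patch it into the same module during the base-input run."""
    cache = {}

    def save_hook(mod, inputs, output):
        cache["act"] = output.detach()          # store the counterfactual activation

    def patch_hook(mod, inputs, output):
        return cache["act"]                     # replace the module's output with the cached one

    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(cf_input)                         # counterfactual run: fill the cache
    handle.remove()

    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(base_input)      # base run with the patch applied
    handle.remove()
    return patched_logits
```

Comparing the patched logits against the logits of an unpatched base run indicates whether the patched component carries the information that differs between the two inputs.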
Counterfactual Input
In the context of causal interventions and mechanistic interpretability, a counterfactual input is a modified version of an original (or "base") input. It is carefully constructed to differ from the base input in one or a few specific, controlled ways.
For example, in our experiment, if a base program has an assignment such as a = 1, a counterfactual input might change this to a = 2 while keeping all other lines of the program identical. By comparing the model's internal activations and output on the base input with those obtained when activations from the counterfactual run are patched in, we can isolate the causal effects of the specific information that was changed (e.g., the value 1 vs. 2). This allows for precise hypothesis testing about how different parts of the input are processed.
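A toy base/counterfactual pair might look like the following (hypothetical programs, shown only to illustrate the single controlled difference):

```python
base_program = ["a = 1", "b = a", "#b:"]   # correct answer: 1
cf_program   = ["a = 2", "b = a", "#b:"]   # identical except the constant on line 1; answer: 2

# If patching a component's activations from the counterfactual run into the base run
# moves the prediction from 1 toward 2, that component carries the changed information.
```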
Dereferencing
The computation that involves accessing the value or content associated with a variable or memory address. In classical computer architectures, information is often stored and organized using indirect addressing, where one memory location contains a pointer (address) that refers to another memory location containing the actual data of interest. The process follows a chain of references to retrieve the ultimate value, creating a flexible and efficient organization of information.
For example, when dereferencing a variable x that points to y, which contains the value 5, the system first accesses the memory location associated with x to obtain the pointer to y, then uses that pointer to access the actual value 5. This two-step (or two-'hop') process allows the same value to be accessed through different variables, and values can be modified without changing the variables that refer to them. In cognitive systems, dereferencing plays a crucial role in accessing stored information about the world, allowing organisms to retrieve and use previously acquired knowledge in a flexible manner.
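A minimal sketch of this chain-following process, using a Python dictionary as a stand-in for memory:

```python
# Toy memory: each name maps either to another name (a pointer) or to a value.
memory = {"x": "y", "y": 5}

def dereference(name):
    """Follow pointers until a non-pointer value is reached."""
    while memory.get(name) in memory:   # current entry points to another location
        name = memory[name]
    return memory[name]

print(dereference("x"))                 # 5: x -> y (first hop), y -> 5 (second hop)
```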
Developmental Trajectory
A developmental trajectory, in the context of neural network training, refers to the sequence of changes in a model's behavior, internal representations, and learned mechanisms as it progresses through training steps. It describes how a model's capabilities evolve from an initial random state to its final trained state.
Our paper identifies a distinct three-phase developmental trajectory for the Transformer as it learns variable dereferencing:
- Phase 1: Random prediction of numerical constants.
- Phase 2: Learning of shallow heuristics, such as prioritizing early variable assignments (e.g., the "line-1 heuristic").
- Phase 3: Emergence of a systematic mechanism for dereferencing assignment chains.
Direct Copy
An alternative approach to variable binding where values are directly copied rather than referenced through pointers. When a variable needs to be bound to a value, the system creates a complete copy of that value and associates it directly with the variable, eliminating the need for pointers and indirect addressing. However, this approach becomes impractical in real-world applications due to substantial memory overhead, as each variable binding requires a complete copy of the value. Additionally, maintaining consistency becomes challenging when values need to be updated, as the system must locate and modify all copies across all variable bindings.
The limitations become particularly apparent in systems that need to implement productive symbol manipulation, where new relationships between variables and values must be created dynamically. The rigid nature of direct copying makes it unsuitable for representing hierarchical or recursive structures, where variables often need to reference other variables.
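The contrast can be sketched in a few lines of Python (a toy illustration of the consistency problem, not a claim about any particular system):

```python
import copy

value = {"data": [1, 2, 3]}

# Direct copy: every binding gets its own complete copy of the value.
bindings_copy = {"x": copy.deepcopy(value), "y": copy.deepcopy(value)}
bindings_copy["x"]["data"].append(4)
print(bindings_copy["y"]["data"])   # [1, 2, 3] -- the update is not visible through y

# Reference/pointer style: bindings share one stored value.
bindings_ref = {"x": value, "y": value}
bindings_ref["x"]["data"].append(4)
print(bindings_ref["y"]["data"])    # [1, 2, 3, 4] -- one update, consistent everywhere
```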
Directed Graph
A mathematical structure that can be used to represent programs in our experiment, where nodes represent variables and constants, and directed edges represent assignment relationships. Each variable assignment in a program can be represented as an edge pointing from the variable being assigned to the value or variable it's being assigned to. For example, the assignment a = b creates an edge from node a to node b, while c = 5 creates an edge from node c to a constant node 5. You can visualize these relationships using our Program Graph Visualizer.
The variable binding task can be understood as a graph traversal problem. To find a variable's value, one must follow the edges (assignments) from the starting variable node until reaching a constant node, always taking the most recent assignment when multiple assignments to the same variable exist. The presence of distractor chains in the program creates additional paths in the graph that must be ignored during traversal. This graph-theoretic view provides insight into the computational structure of variable binding.
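A minimal sketch of this traversal, assuming the one-assignment-per-line format used throughout this glossary (illustrative code, not our evaluation pipeline):

```python
def resolve(assignment_lines, query):
    """Traverse assignment edges from the queried variable until a constant node is reached."""
    edges = {}
    for line in assignment_lines:
        lhs, rhs = (s.strip() for s in line.split("="))
        edges[lhs] = rhs                  # later lines overwrite: the most recent assignment wins
    node = query
    while node in edges:                  # follow edges until a constant node is reached
        node = edges[node]
    return node

program = ["a = b", "b = 3", "c = b", "c = 7"]   # the 'c = ...' lines form a distractor chain
print(resolve(program, "a"))                      # '3': a -> b -> 3
```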
The graph representation also helps visualize important program features like referential depth (length of the path from queried variable to final value) and distractor chains (paths that branch from or merge with the main path). Understanding programs as directed graphs helps explain why the task requires genuine variable binding: the system must maintain the distinct identity of nodes while following edges in a systematic way.
Distractor Chains
Variable assignment chains that are irrelevant to determining the queried variable's value in our experiment. These chains come in two forms: those completely independent of the queried variable chain, and those that branch out from – or merge with – the queried variable chain. The presence of distractor chains in programs tests the model's ability to selectively process relevant information while ignoring irrelevant assignments. The ability to distinguish between relevant chains and distractors demonstrates the model's capacity for maintaining the distinct identity of variables and their bindings. You can analyze how the model handles distractor chains using our Program Analysis tools.
Grokking
Grokking is a phenomenon observed during the training of some neural networks, particularly on algorithmic or synthetic datasets. It refers to a situation where a model, after an initial period of fitting the training data (often through memorization and achieving low training loss), suddenly and sharply improves its generalization performance on unseen test data. This improvement can occur well after the model has already achieved near-perfect training accuracy.
Our findings add nuance to the traditional narrative about grokking, where models are thought to discard superficial heuristics in favor of more systematic solutions. In contrast, the model in this experiment builds its systematic solution upon, rather than entirely replacing, earlier learned heuristics.
Heuristics
In the context of our experiment, heuristics are simplified decision rules or strategies that the Transformer model learns during training. These strategies often provide approximate solutions to the variable dereferencing task and may be effective for a subset of program types or specific conditions, but they do not represent a complete or general solution.
Our analysis identified "early line heuristics" such as a "line-1 heuristic" (predicting the numerical constant from the first line of the program). These heuristics emerge in Phase 2 of the model's developmental trajectory. A key finding is that the model's final systematic solution builds upon, rather than entirely replacing, these earlier heuristics.
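For illustration, a "line-1 heuristic" amounts to something like the following (a toy sketch, assuming the program format used elsewhere in this glossary):

```python
def line1_heuristic(program_lines):
    """Shallow strategy: ignore the query and return the constant assigned on the first line."""
    return program_lines[0].split("=")[1].strip()

program_ok   = ["x = 4", "y = x", "#y:"]   # true answer 4 -> the heuristic happens to be right
program_fail = ["x = 4", "y = 2", "#y:"]   # true answer 2 -> the heuristic wrongly predicts 4
print(line1_heuristic(program_ok), line1_heuristic(program_fail))   # 4 4
```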
Indirect Addressing
A mechanism in computing systems that enables access to stored information through multiple levels of reference. Instead of directly specifying the location where a value is stored, the system first accesses an intermediate address that contains a pointer to the final location of the desired value. This creates a chain of references that must be followed to reach the final value, involving an initial address as the entry point, the pointer stored at that initial address, and the final address where the target value resides.
The power of indirect addressing comes from its ability to separate the identity of a variable from its current value while maintaining their connection through pointer relationships. This separation enables dynamic updating of variable bindings and the construction of arbitrary data structures. It is often taken to be a fundamental requirement for any system that needs to flexibly store and manipulate structured information, contrasting with the limitations of direct copy approaches.
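A toy sketch of indirect addressing with a flat array as memory (hypothetical layout, purely illustrative):

```python
# Toy memory: a flat array of cells addressed by index.
memory = [0] * 8
memory[5] = 42                   # the target value lives at address 5
memory[2] = 5                    # address 2 holds a pointer: the address of the value
entry = 2                        # the variable's identity is tied to address 2

print(memory[memory[entry]])     # 42: follow the pointer (memory[2] -> 5, memory[5] -> 42)

memory[6] = 99
memory[2] = 6                    # rebind the variable by changing its pointer, not its identity
print(memory[memory[entry]])     # 99
```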
Logits
Logits are the raw, unnormalized scores output by the final layer of a neural network before a normalization function such as softmax is applied (or before the argmax is simply taken for greedy prediction). In the context of our Transformer model, which predicts the next token (a character, such as a digit or variable name), the logits represent the model's confidence or evidence for each possible token in the vocabulary.
A higher logit value for a particular token indicates a higher likelihood that the model will predict that token. Analyzing logits (e.g., their evolution across layers or training steps) can provide insights into the model's decision-making process. The Logit Analysis page visualizes these values.
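For example, with a hypothetical three-token vocabulary, converting logits into probabilities and a greedy prediction looks like:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])       # hypothetical logits over a 3-token vocabulary

probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax: normalize logits into probabilities
prediction = int(np.argmax(logits))       # greedy prediction: index of the largest logit

print(probs.round(3), prediction)         # [0.786 0.175 0.039] 0
```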
Mechanistic Interpretability
Mechanistic interpretability is a research field within AI safety and machine learning that aims to understand the internal workings and algorithms learned by neural networks. Instead of treating models as black boxes, it seeks to reverse-engineer their components (e.g., neurons, attention heads) and circuits (interacting groups of components) to explain how they perform specific computations or achieve particular capabilities.
Techniques in mechanistic interpretability often involve causal interventions, activation analysis, and studying model behavior on carefully constructed inputs. The goal is to build human-understandable models of what the neural network has learned, which can be crucial for verifying model behavior, identifying failure modes, and ensuring alignment with intended goals. Our paper employs mechanistic interpretability techniques to uncover how the Transformer learns to perform variable dereferencing.
Queried Variable
The variable whose value needs to be determined in our experiment, specified in the final line of each program using the format:
#variable:
The queried variable represents the target of the dereferencing task, requiring the model to traverse a chain of variable assignments to determine its final value. This traversal tests the model's ability to maintain and manipulate structured information, as it must track variable bindings across multiple steps while ignoring irrelevant distractor chains. The ability to correctly resolve the queried variable's value demonstrates the model's capacity for implementing symbolic computation and variable binding mechanisms. You can visualize how the model processes queried variables using our Program Graph Visualizer.
Read/Write Memory
In Transformer architectures, read/write memory is implemented through the residual stream, which functions as a high-dimensional communication channel between different components of the model. This provides an addressable space where different components can read from and write to specific subspaces through learned linear projections. The residual stream accumulates layer outputs additively, allowing information to flow through the model while maintaining the ability for different layers to communicate through distinct subspaces of this shared channel. Attention heads are key to how information is selectively read from and written to these subspaces.
The high dimensionality of the residual stream enables multiple pieces of information to be stored and transmitted simultaneously in different subspaces, which can be thought of as independent communication channels. Information written to a subspace persists until actively overwritten or modified by another component, similar to traditional computer memory. This architecture enables Transformers to maintain and manipulate information over long sequences and implement sophisticated algorithms through compositions of attention heads, providing many of the same core capabilities as traditional computer memory systems – addressable storage, selective access, and persistence of information.
Referential Depth
The number of assignment steps between the queried variable and its final value. This metric quantifies the complexity of variable dereferencing required to solve a particular program. For example, in the sequence:
    x = y
    y = z
    z = 3
    #x:
The referential depth is 3, as it takes three steps (or 'hops') to get from x to the final value 3. Referential depth is an important aspect of program complexity, testing the model's ability to maintain and follow longer chains of variable references. In our experiment, we trained our neural networks on programs with referential depth up to 4, to ensure perfect test set performance reflects a capacity to systematically track and resolve multiple levels of indirect addressing. You can analyze how the model performs with different referential depths using our Checkpoint Stats page.
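Computing this metric for a given assignment chain is straightforward; a minimal sketch (not our analysis code):

```python
def referential_depth(assignments, query):
    """Count the hops from the queried variable to its final constant value."""
    hops, node = 0, query
    while node in assignments:
        node = assignments[node]
        hops += 1
    return hops

chain = {"x": "y", "y": "z", "z": "3"}    # x = y, y = z, z = 3
print(referential_depth(chain, "x"))       # 3
```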
Residual Stream
In a Transformer architecture, the residual stream is a high-dimensional vector space that serves as the backbone of information flow through the model. It stores token embeddings at each position and acts as a communication channel between different layers, accumulating layer outputs additively. This structure allows information to flow through the model while enabling different layers to communicate through distinct subspaces of the shared channel.
The residual stream has a deeply linear structure where each layer reads information through linear transformations at the start, processes that information, and writes its output back through another linear transformation. This architecture implements a form of read/write memory, where different components can access and modify specific subspaces through learned projections. The lack of a privileged basis means the specific subspaces used for communication are learned during training and can be rotated without changing the model's behavior, differing from traditional computer memory with fixed addressing schemes while maintaining similar core capabilities. Attention heads and MLP blocks in each layer dynamically access and modify information within this stream. You can analyze how the model's residual stream evolves during training using our Logit Analysis page.
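Schematically, the read-process-write pattern looks like the following numpy sketch (random stand-in weights and dimensions, not our model's parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 64, 10
stream = rng.normal(size=(n_tokens, d_model))       # residual stream: one vector per position

def component(stream, W_read, W_write):
    """Generic layer component: read a subspace linearly, compute, write back additively."""
    read = stream @ W_read                          # linear read from the shared stream
    processed = np.tanh(read)                       # stand-in for attention/MLP computation
    return stream + processed @ W_write             # additive write into (another) subspace

for _ in range(4):                                  # several components share one stream
    W_read = rng.normal(size=(d_model, 16)) / np.sqrt(d_model)
    W_write = rng.normal(size=(16, d_model)) / np.sqrt(16)
    stream = component(stream, W_read, W_write)     # earlier writes persist unless overwritten
```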
Subspaces
In the context of our experiment, numerical and variable subspaces refer to specific, lower-dimensional regions (subspaces) identified within the Transformer's high-dimensional residual stream activations. These subspaces are found to selectively and systematically encode information about numerical constants (0-9) and variable names (a-z), respectively.
We find that as training progresses, the model's representations for different numbers and different variables form distinct, separable clusters within these identified subspaces. Causal interventions further show that these subspaces play a causal role in how the model processes and predicts numbers and variables. The emergence of such structured subspaces is evidence of the model learning to organize and differentiate symbolic information. You can see this evolution on the Subspace Visualization page.
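As a schematic sketch of the kind of clustering analysis involved (using synthetic stand-in activations and a simple PCA projection; this is not our actual procedure):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for residual-stream activations collected at digit-token ('0'-'9')
# versus variable-token ('a'-'z') positions.
rng = np.random.default_rng(0)
digit_acts = rng.normal(loc=+1.0, size=(200, 64))
var_acts   = rng.normal(loc=-1.0, size=(200, 64))

acts = np.vstack([digit_acts, var_acts])
proj = PCA(n_components=2).fit_transform(acts)      # project onto a low-dimensional subspace

# If the model has learned separate numerical and variable subspaces, the two groups
# of points form distinct, separable clusters in the projection.
print(proj[:200].mean(axis=0), proj[200:].mean(axis=0))
```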
Transformer Architecture
The Transformer is a neural network architecture that has revolutionized sequence processing tasks, particularly in natural language processing. Introduced by Vaswani et al. (2017), its core innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing information at each position.
Key components include multi-head self-attention, positional encodings (to provide sequence order information, like RoPE used in our model), feed-forward networks, layer normalization, and the residual stream. Unlike recurrent neural networks (RNNs), Transformers can process all tokens in a sequence in parallel, leading to better training efficiency. Our experiment uses a Transformer model trained from scratch to investigate its capacity for learning variable binding.
Variable Binding
A fundamental computational mechanism that creates associations between variables (abstract roles) and their values (specific instances) in a way that makes these associations computationally accessible. Variable binding requires two essential components: a mechanism for storing variable-value associations and a mechanism for accessing these associations. This separation enables the construction of general-purpose computational procedures that can operate on arbitrary inputs.
The computational significance of variable binding lies in its ability to separate the machinery that performs computations from the specific values involved in those computations. When a computational process references a variable, it can operate on whatever value is currently bound to that variable, allowing the same computational machinery to work with different values without modification. This capacity is important for representing structured information, where complex structures can be represented as sets of variable bindings, with each variable representing a role or position within the structure, and the bound values representing the elements filling those roles. You can explore how our model learns variable binding through the Training Trajectory page.
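A toy illustration of this separation between computational machinery and bound values:

```python
# A binding environment: associations between variables (roles) and values (fillers).
def greet(env):
    """Generic procedure: operates on whatever values are currently bound."""
    return f"{env['speaker']} greets {env['listener']}"

print(greet({"speaker": "Alice", "listener": "Bob"}))    # same machinery ...
print(greet({"speaker": "Bob", "listener": "Carol"}))    # ... different bound values
```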
Variable Chain
A sequence of variable assignments that must be followed to find a variable's final value. Variable chains represent the core structure that the model must learn to process in order to implement genuine variable binding and dereferencing. For example, in the sequence:
    a = b
    b = c
    c = 5
The variable chain for 'a' would be: a → b → c → 5. These chains essentially create a directed graph structure where nodes are variables and edges are assignment relationships. The ability to correctly traverse these chains while maintaining relevant variable bindings and ignoring distractors demonstrates the model's capacity for implementing symbolic computation through learned mechanisms rather than built-in architectural features. You can visualize these chains using our Program Graph Visualizer.