Background & Motivation
Variable binding, the computational process of associating abstract variables with specific values, is fundamental to both computation and cognition. Consider a variable x: at one moment it might be bound to 1, then rebound to 2, or even to another variable y. When we write x = 1 and y = x + 1, the variable x acts as a reference, and the computation of y depends on whatever value x currently holds. This separation between variables (as stable references) and their potentially changing values enables general-purpose computations that work with arbitrary inputs and the construction of complex data structures.
In classical symbol processing systems, variable binding is implemented through addressable read/write memory, where variables name memory locations and values are the data stored at these locations. In cognition, variable binding is central to the structure-sensitive properties of language and thought such as compositionality and systematicity; if we understand "John loves Mary," we can readily understand "Mary loves John" by rebinding abstract roles to different individuals. More broadly, variable binding is thought to be essential for episodic memory (binding unique details of life events), spatial navigation (binding landmarks to locations in cognitive maps), and constructing the rich internal representations that enable reasoning and planning.
Despite its importance in classical computation and cognition, variable binding has long posed a challenge for artificial neural networks, known as the "binding problem." Unlike classical architectures with discrete addressable memory, neural networks encode information as distributed patterns of activation without built-in separation between processor and memory. When multiple variable-value associations must be maintained simultaneously, these distributed encodings can interfere with one another, making it difficult to isolate specific values or maintain binding independence. This apparent limitation has led skeptics to argue that neural networks cannot achieve genuine variable binding and must rely instead on pattern matching or memorization that fails to capture the systematicity of symbolic computation.
Yet modern neural networks, particularly Transformers, demonstrate significant success on structure-sensitive tasks like code generation and logical reasoning that seem to require sophisticated variable binding. This raises an intriguing question that motivates our investigation: if traditional neural networks struggle with the binding problem, how do these models achieve such capabilities? Are they implementing a form of variable binding through emergent mechanisms induced during training, or achieving success through alternative computational strategies?
We seek to understand whether and how variable binding mechanisms might emerge in Transformer architectures in a controlled setting. We designed a synthetic dereferencing task using symbolic programs that contain chains of variable assignments. To be successful at this task, one needs to systematically track relevant variable assignments (and ignore irrelevant ones) to find out the value of a queried variable. In other words, the task is not solvable through memorization or shallow heuristics alone. We find that even a small Transformer trained on this task can learn a general mechanism to solve it for arbitrary held-out programs, and we uncover the developmental trajectory that leads the model to transition from random behavior to the induction of this general mechanism.
Experimental Design
We train a Transformer model (12 layers, 8 attention heads, 512-dimensional residual stream) on synthetic symbolic programs. The model is trained from scratch using a standard next-token prediction objective on 450,000 programs, with no prior exposure to natural language or programming.
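As a rough illustration, this architecture corresponds to a configuration like the following sketch (transformer_lens is used here only for illustration; the vocabulary size, context length, and activation function are assumptions rather than reported details).

```python
# Minimal configuration sketch, assuming the transformer_lens library.
# Only n_layers, n_heads, and d_model come from the text above; d_vocab,
# n_ctx, and act_fn are illustrative assumptions.
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=12,        # 12 Transformer blocks (from the text)
    n_heads=8,          # 8 attention heads per block (from the text)
    d_model=512,        # 512-dimensional residual stream (from the text)
    d_head=64,          # 512 / 8
    n_ctx=128,          # assumed context length; programs are ~70 characters
    d_vocab=40,         # assumed: a-z, 0-9, '=', '#', ':', newline
    act_fn="gelu",      # assumed activation function
)
model = HookedTransformer(cfg)  # trained from scratch with next-token prediction
```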
Program Structure
Each program follows a consistent 17-line format:
- Lines 1-16: Variable assignments in the format variable = value
  - variable: a single lowercase letter (a–z)
  - value: either a numerical constant (0–9) or another variable name (a–z)
- Line 17: Query in the format #variable:
  - The # symbol marks the start of the query
  - Followed by the variable to be dereferenced
  - Ends with a : symbol

Each character (including =, #, and :) is tokenized separately. Programs vary in their "referential depth" – the number of assignment steps or "hops" needed to trace from the queried variable to its final numerical value.
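To make the format concrete, here is a minimal sketch of a program generator; it reproduces the surface format above but not the authors' weighted sampling strategy (discussed below), and the 40% constant probability is an arbitrary choice for illustration.

```python
import random
import string

def make_program(n_lines: int = 16) -> str:
    """Sketch of a generator for the 17-line program format described above.
    Assignments reference only earlier variables, so every chain terminates
    in a numerical constant."""
    variables = random.sample(string.ascii_lowercase, n_lines)
    lines = []
    for i, var in enumerate(variables):
        if i == 0 or random.random() < 0.4:
            value = str(random.randint(0, 9))      # numerical constant
        else:
            value = random.choice(variables[:i])   # reference to an earlier variable
        lines.append(f"{var}={value}")
    query_var = random.choice(variables)           # variable to dereference
    lines.append(f"#{query_var}:")
    return "\n".join(lines)

print(make_program())
```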
Example Programs
The interactive version of this page shows 1-hop, 2-hop, and 4-hop example programs; the 4-hop program is reproduced below.
a=8
b=a
c=6 ← example distractor
d=c ← example distractor
e=d ← example distractor
f=2
g=b
h=f
i=h
j=7
k=g
l=j
m=i
n=l
o=k
p=e ← example distractor
#o:
The most complex case, the 4-hop program shown above, requires following a reference chain through several intermediate assignments. In this example, finding the value of the queried variable o requires tracing the chain o → k → g → b → a → 8. Programs also contain irrelevant distractor chains that must be ignored, such as p → e → d → c → 6 (marked above as example distractors).
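The task itself has a simple reference implementation; the sketch below makes explicit the dereferencing computation the model must learn to perform.

```python
def dereference(program: str) -> str:
    """Resolve the queried variable by following the assignment chain,
    ignoring distractor assignments the chain never touches."""
    *assignments, query = program.strip().split("\n")
    bindings = dict(line.split("=") for line in assignments)  # e.g. {'a': '8', 'b': 'a', ...}
    current = query[1:-1]                                     # strip leading '#' and trailing ':'
    while not current.isdigit():                              # follow the chain hop by hop
        current = bindings[current]
    return current

program = "a=8\nb=a\nc=6\nd=c\ne=d\nf=2\ng=b\nh=f\ni=h\nj=7\nk=g\nl=j\nm=i\nn=l\no=k\np=e\n#o:"
assert dereference(program) == "8"   # o -> k -> g -> b -> a -> 8
```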
Our design uses programs with 16 assignment lines to create sufficient complexity for interesting reference patterns while remaining computationally tractable. The chain depth of 1-4 hops tests increasing levels of indirection within learnable bounds. The sampling strategy is weighted to create long distractor chains, preventing simple heuristics like "follow the longest chain" from succeeding.
Three Phases of Learning
Our analysis reveals a clear developmental trajectory with three distinct phases. The model exhibits sharp transitions between phases at steps 800 and 14,000, marking fundamental shifts in how it approaches the task. This trajectory exemplifies grokking – the phenomenon where neural networks suddenly transition from memorization to generalization – though as we'll see, our findings challenge traditional accounts of how this transition occurs.
Phase 1: Random Numerical Prediction (Steps 0-800)
The model achieves only ~12% accuracy, essentially chance level for 10 possible digits. During this phase, the model learns that outputs should be numerical constants (0-9) rather than variables or special tokens, but the prediction distribution is notably imbalanced and unstable. The model shows no understanding of variable assignments or reference relationships – it simply learns the output format without any systematic strategy for finding the correct value.
Phase 2: Heuristic Discovery (Steps 1200-14000)
A sharp performance jump occurs as accuracy rises from 12% to 56%. The model develops position-based heuristics – simplified decision rules that work well for many programs. For example, a line-1 heuristic selects the numerical constant from the first program line and achieves over 90% accuracy when the answer appears on line 1. Likewise, a line-2 heuristic selects the numerical constant from the second line when available, achieving approximately 65% accuracy when applicable. These heuristics prove particularly effective for multi-hop programs because our sampling process tends to place the root values of longer chains in early lines. By contrast, 1-hop programs show slower convergence during this phase because their answers can appear on any line, making position-based heuristics less reliable.
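To make these heuristics concrete, here is a minimal sketch of the line-1 and line-2 rules as standalone baselines; the accuracy figures above come from the behavioral analysis of the trained model, not from this code.

```python
from typing import Optional

def constant_on_line(program: str, line_idx: int) -> Optional[str]:
    """Return the numerical constant on a given assignment line, if it has one."""
    value = program.strip().split("\n")[line_idx].split("=")[1]
    return value if value.isdigit() else None

def line_1_heuristic(program: str) -> Optional[str]:
    # Phase 2 shortcut: answer with whatever constant sits on the first line.
    return constant_on_line(program, 0)

def line_2_heuristic(program: str) -> Optional[str]:
    # Companion shortcut: the constant on the second line, when one exists.
    return constant_on_line(program, 1)
```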
Phase 3: Systematic Solution (Steps 14000-105400)
Another sharp transition occurs as accuracy jumps from 56% to over 99.9%. The model develops a general mechanism capable of tracing variable chains of any depth while appropriately ignoring distractor assignments. This phase demonstrates robust performance regardless of query variable chain length (1-4 hops) and accurate predictions independent of where the correct answer appears (lines 1-16). The model systematically tracks variable bindings through the residual stream and properly handles distractor chains that branch from the main reference path. The near-perfect performance indicates the model has learned a true variable dereferencing algorithm rather than relying on statistical patterns.
Tracing the Emergence of a General Mechanism
Interactive Example: Residual Stream Patching Analysis (explore how causal interventions reveal information flow through the model's layers)
Understanding that our model achieves near-perfect accuracy tells us what it can do, but not how it does it. To uncover the actual strategy the model discovered, we used mechanistic interpretability techniques to reverse-engineer the model's internal computations. Our primary tool was causal interventions with counterfactual inputs. The approach is straightforward: we take a program and create a modified version where we change the root value of the queried variable chain (e.g., replacing the constant 8 in the example above with a different digit). We then run the model on both programs and cache the internal activations from the counterfactual (modified) program. Next, we run the model again on the original program, but this time we selectively replace the activation of some component (e.g., the residual stream at a given layer or individual attention heads) with the cached value from the counterfactual run. If replacing a particular component's activations causes the model to switch its output from the original answer to the counterfactual one, this provides evidence that the component is causally implicated in propagating the answer.
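In code, such an intervention amounts to caching activations from one forward pass and overwriting them during another. The sketch below illustrates the idea with transformer_lens-style hooks; `model`, `original_tokens`, and `counterfactual_tokens` are assumed to be the trained checkpoint and two tokenized programs that differ only in the root value of the queried chain.

```python
# Sketch of residual-stream patching (assumes a HookedTransformer `model`
# and pre-tokenized `original_tokens` / `counterfactual_tokens`).
_, cf_cache = model.run_with_cache(counterfactual_tokens)

def patch_resid(resid, hook, pos=-1):
    # Overwrite the residual stream at this layer and token position with the
    # activation cached from the counterfactual run.
    resid[:, pos, :] = cf_cache[hook.name][:, pos, :]
    return resid

layer = 6  # example layer; in practice the sweep covers all layers and positions
patched_logits = model.run_with_hooks(
    original_tokens,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_resid)],
)
# If the top prediction flips from the original answer to the counterfactual
# one, the residual stream at this layer/position carries the answer.
```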
Traditional accounts of grokking suggest that models develop general mechanisms that replace earlier shallow strategies like memorization or heuristics. Our causal analysis challenges this narrative. We find that rather than abandoning its heuristics from Phase 2, the model builds upon them when developing its systematic solution in Phase 3. For instance, when the correct answer appears on line 1, the model continues to rely on its line-1 heuristic instead of tracing the full variable chain through attention mechanisms. The systematic dereferencing mechanism becomes causally influential primarily when these simpler heuristics would produce incorrect results. All inputs still flow through every layer of the network, but the causal pathways that determine the final output vary depending on the program structure.
The systematic dereferencing mechanism relies on a specialized circuit of attention heads that work together to route information across layers and token positions. Early-layer heads handle the first "hop" – they identify when a variable points to another variable or value and begin moving that information forward. Mid-layer heads continue the chain for programs requiring multiple hops, with some heads being particularly versatile in handling the same operation for different referential depths. Finally, late-layer heads aggregate the information at the query position to produce the final answer. Our causal analysis indicates that these heads coordinate by reading from and writing to specific subspaces within the residual stream, effectively implementing a form of addressable memory where variable-value associations can be selectively accessed and updated across processing steps. This division of labor allows the model to handle variable chains of different depths using a flexible, composable mechanism rather than hard-coding separate pathways for each possible chain length.
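The same intervention can be scoped to a single attention head by patching only that head's output vector. The sketch below reuses `model`, `cf_cache`, and `original_tokens` from the previous snippet; head 6.5 (layer 6, head 5) is used purely as an example of the (layer, head) notation that appears in the interactive tools.

```python
def patch_head(z, hook, head_idx, pos=-1):
    # z has shape [batch, position, head, d_head]; replace only one head's
    # output at one position with its counterfactual value.
    z[:, pos, head_idx, :] = cf_cache[hook.name][:, pos, head_idx, :]
    return z

layer, head = 6, 5  # e.g. head 6.5
logits = model.run_with_hooks(
    original_tokens,
    fwd_hooks=[(f"blocks.{layer}.attn.hook_z",
                lambda z, hook: patch_head(z, hook, head_idx=head))],
)
```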
Having established that attention heads coordinate through specific subspaces within the residual stream, we investigated how these subspaces are structured to enable the storage and retrieval of variable-value associations. Our analysis of activation patterns across thousands of programs revealed that the model dedicates distinct subspaces of its 512-dimensional residual stream to encoding variable identities (letters a–z) and numerical constants (digits 0–9). We validated these findings through causal interventions: when we swap just the variable subspace between two programs, we can change which variable the model looks up with 87% success. Similarly, swapping the numerical subspace changes which value the model outputs with 92% success. This separation of variables and values is reminiscent of the corresponding separation in classical symbol processing systems.
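Schematically, a subspace-swap intervention projects the residual stream onto an estimated subspace and exchanges only that component between two runs. In the sketch below, `V` is an assumed orthonormal basis for the variable-identity subspace and `cache_b` is an activation cache from a second program; the basis-estimation step itself is not shown.

```python
# `V`: assumed (512 x k) orthonormal basis for the variable-identity subspace.
# `cache_b`: activation cache from program B, obtained as in the patching sketch.
P = V @ V.T                                   # projector onto the variable subspace

def swap_variable_subspace(resid, hook, pos=-1):
    a = resid[:, pos, :]                      # program A's residual stream
    b = cache_b[hook.name][:, pos, :]         # program B's residual stream
    # Keep A's component outside the subspace, but replace the component
    # inside the subspace with B's; this should change which variable is
    # looked up while leaving everything else intact.
    resid[:, pos, :] = a - a @ P + b @ P
    return resid
```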
Using Variable Scope to Explore Our Findings
Variable Scope provides interactive tools to systematically explore the research findings presented in this paper. The platform is organized to guide researchers through the experimental results, from initial observations to mechanistic analysis.
Basic Overview
Begin with an overview of the model's learning dynamics and task structure.
Program Graph Visualizer
Description: Visualize programs as interactive directed graphs.
Observation: Computational complexity varies with chain depth and distractor chains.
Recommended action: Use the shuffle function to examine programs of varying complexity, particularly 4-hop programs with distractors.
Training Trajectory
Description: Model accuracy over 105,000 training steps with multiple performance metrics.
Key finding: Three distinct phases with transitions around steps 800 and 14,000.
Key observation: Plateau at 56% accuracy (Phase 2) before transition to 99.9% (Phase 3).
Understanding Each Learning Phase
Examine the model's behavior during each developmental phase.
Checkpoint Stats
Description: Detailed performance analysis at specific training checkpoints.
Key finding: Phase 2 checkpoints use line-specific heuristics (see "Accuracy by Correct Line").
Recommended checkpoints: Step 1200 (early Phase 2), Step 12000 (late Phase 2), Step 105400 (final).
Checkpoint Comparison
Description: Side-by-side comparison of multiple checkpoints.
Key finding: Progressive development from random predictions to heuristics to systematic solution.
Analysis focus: Changes in prediction distributions between phases.
Checkpoint Analysis
Description: Flexible analysis tool for exploring model outputs with custom queries.
Application: Testing specific hypotheses about model behavior.
Note: Data can be exported as CSV for further analysis.
Program Analysis
Description: Heatmap visualization of model predictions across programs and checkpoints.
Key finding: Certain programs are solved by Phase 2 heuristics while others require the systematic mechanism.
Comparison: Analyze "hardest" versus "easiest" programs to identify patterns in learning progression.
Mechanistic Understanding
Investigate the internal mechanisms that implement variable binding.
Logit Analysis
Description: Evolution of logit values through the model's 12 layers.
Key finding: Information about the correct answer emerges progressively across layers.
Observation: Note the competition between the "first line constant" signal (red) and the correct answer signal (green).
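The layer-by-layer logit trajectories shown in this tool can be approximated with a logit-lens-style readout, applying the final layer norm and unembedding to the residual stream after each block. The sketch assumes the transformer_lens interface used above, with `tokens` a tokenized program.

```python
# Logit-lens sketch: project each layer's residual stream through the final
# layer norm and unembedding to see when the correct answer becomes dominant.
_, cache = model.run_with_cache(tokens)
for layer in range(model.cfg.n_layers):
    resid = cache[f"blocks.{layer}.hook_resid_post"][:, -1, :]  # query position
    logits = model.ln_final(resid) @ model.W_U                  # [batch, d_vocab]
    print(layer, logits.argmax(dim=-1))
```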
Attention Heads Patching
Description: Causal importance of attention heads determined via intervention experiments.
Key finding: Specific heads specialize in different referential depths (e.g., heads 6.5 and 7.7 for 1-hop chains).
Analysis: Examine how patching effects vary with referential depth (1-4 hops).
Subspace Visualization
Description: UMAP projections showing organization of variables (a-z) and numbers (0-9) in the residual stream.
Key finding: Distinct subspaces emerge for variables versus values, facilitating systematic binding.
Notable pattern: Transition from random distribution to organized clusters.
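A projection of this kind can be reproduced with umap-learn on residual-stream vectors collected at variable and digit token positions; the choice of layer and positions here is our assumption rather than the tool's exact settings.

```python
import umap

# `activations`: (n_tokens x 512) array of residual-stream vectors collected
# at variable / digit token positions; `labels` marks each token's identity.
reducer = umap.UMAP(n_components=2, random_state=0)
embedding = reducer.fit_transform(activations)   # (n_tokens x 2) for plotting

# Coloring `embedding` by `labels` should show separate clusters for letters
# (a-z) and digits (0-9) at the final checkpoint, but not at early checkpoints.
```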
Additional Resources
- Definitions of technical terms used throughout the platform.
- Complete technical details and supplementary experiments.
Cite This Work
@inproceedings{wuHowTransformersLearn2025,
title = {How Do Transformers Learn Variable Binding in Symbolic Programs?},
booktitle = {Forty-Second International Conference on Machine Learning},
author = {Wu, Yiwei and Geiger, Atticus and Milli{\`e}re, Rapha{\"e}l},
year = {2025},
}