Background & Motivation
This experiment investigates whether and how a Transformer-based neural network can learn to solve a task that involves variable binding and dereferencing.
In classical computer architectures, variable binding is implemented with addressable read/write memory: variable identities and values are maintained separately and linked through indirect addressing. In the Transformer architecture, the residual stream provides a high-dimensional vector space where information can be stored and accessed in distinct subspaces through learned linear projections. This raises the question of whether Transformer-based neural networks can learn to use the residual stream to implement variable binding operations without explicit architectural support for symbol manipulation.
Our experiment focuses on whether a Transformer-based neural network trained from scratch on a synthetic variable binding task learns a systematic binding mechanism to solve it. We use causal intervention methods such as activation patching to understand how the network comes to solve the task over the course of training. Specifically, we investigate whether and how the network eventually learns a general mechanism for tracking variable assignments and resolving reference chains. This work contributes to our understanding of how neural architectures might implement symbolic computation.
Variable Binding Task
We train a Transformer model on a synthetic task involving variable assignment programs. Each program presents a sequence of variable bindings followed by a query that requires dereferencing a specific variable through a chain of assignments. The task requires the network to track variable bindings across multiple steps and maintain this information in a form that enables accurate dereferencing.
Example Program
k=2
j=k
t=8
b=j
s=6
g=8
w=j
r=8
d=r
j=b
s=j
s=5
v=j
z=b
d=w
z=2
#d:
In this example, dereferencing variable d requires following the chain d → w → j → k → 2, tracing the relevant assignments backwards from the query to reach the final value. All other assignments are distractors that must be ignored. The correct output is 2, requiring the model to track and traverse multiple variable references while maintaining their correct temporal order.
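To make the chain-following semantics concrete, the sketch below resolves a query by repeatedly looking up the most recent assignment to the current variable that precedes the point being resolved. The function name resolve and the list-of-strings program representation are illustrative choices, not part of the task definition.

```python
def resolve(program_lines, query_var):
    """Follow the assignment chain backwards from the query.

    Each step finds the most recent assignment to the current variable
    that occurs before the point being resolved, matching standard
    eager assignment semantics. Returns the constant value reached and
    the chain of names traversed on the way to it.
    """
    assignments = [tuple(line.split("=")) for line in program_lines]

    chain = [query_var]
    var, position = query_var, len(assignments)  # start just after the last line
    while True:
        for idx in range(position - 1, -1, -1):  # scan backwards for the binding
            lhs, rhs = assignments[idx]
            if lhs == var:
                chain.append(rhs)
                if rhs.isdigit():                # reached a constant: done
                    return int(rhs), chain
                var, position = rhs, idx         # keep following, earlier lines only
                break
        else:
            raise ValueError(f"{var} is never assigned")


program = ["k=2", "j=k", "t=8", "b=j", "s=6", "g=8", "w=j", "r=8",
           "d=r", "j=b", "s=j", "s=5", "v=j", "z=b", "d=w", "z=2"]
value, chain = resolve(program, "d")   # value == 2, chain == ['d', 'w', 'j', 'k', '2']
```

Applied to the example above, the chain has four assignment steps, which corresponds to the referential depth defined under Key Task Components below.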
Key Task Components
- Variable Chain: A sequence of assignments that must be followed to find a variable's value, forming a directed graph structure
- Queried Variable: The variable to be dereferenced, specified in the final line
- Referential Depth: The number of assignment steps between the queried variable and its final value
- Distractor Chains: Assignments irrelevant to the queried variable, testing selective processing
Program Structure
Each program follows a consistent format:
- First 16 lines: Variable assignments in the format variable=value, where value is either a constant (digit 0-9) or another variable name
- Final line: Query in the format #variable:
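For concreteness, a minimal generator sketch for this format is shown below. It samples assignments whose right-hand sides are digits or previously bound variables, so every query resolves to a constant; it does not control the distribution of referential depths or distractor chains, which the actual training data may do. The names make_program, num_lines, and seed are illustrative.

```python
import random
import string


def make_program(num_lines=16, seed=None):
    """Sample one program: `num_lines` assignment lines followed by a query.

    Right-hand sides are digits or previously bound variables, so the
    queried variable always resolves to a constant.
    """
    rng = random.Random(seed)
    names = string.ascii_lowercase
    lines, bound = [], []                     # `bound`: variables with a value
    for _ in range(num_lines):
        lhs = rng.choice(names)
        reusable = [v for v in bound if v != lhs]
        if reusable and rng.random() < 0.5:
            rhs = rng.choice(reusable)        # bind to another variable
        else:
            rhs = str(rng.randint(0, 9))      # bind to a constant digit
        lines.append(f"{lhs}={rhs}")
        if lhs not in bound:
            bound.append(lhs)
    lines.append(f"#{rng.choice(bound)}:")    # query a variable that is bound
    return lines
```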
Research Questions
Through this task, we investigate:
- How the model represents and maintains variable bindings in its residual stream
- What computational strategies emerge for tracking assignment chains
- How the model distinguishes between relevant and irrelevant assignments
- Whether systematic patterns develop in the model's handling of variable references
- How the model's internal representations relate to traditional memory systems
Analysis Approach
Our investigation employs mechanistic interpretability techniques to:
- Map how information flows through the model's attention mechanisms
- Identify specific functions of different attention heads
- Track how variable bindings are encoded and updated
- Examine the development of computational strategies during training
- Compare the learned mechanisms with theoretical predictions
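As one illustration of these techniques, below is a minimal activation-patching sketch using plain PyTorch forward hooks. It assumes the trained model exposes its Transformer blocks as model.blocks[i] and that each block returns a single residual-stream tensor; both are assumptions about the implementation, and the module path would need to be adapted to the actual codebase.

```python
import torch


def run_with_patch(model, layer, clean_tokens, corrupted_tokens):
    """Activation patching with PyTorch forward hooks (minimal sketch).

    1. Run the model on the clean prompt and cache the residual-stream
       output of one Transformer block.
    2. Re-run on the corrupted prompt with that cached activation patched
       in, and return the resulting logits.
    Assumes model.blocks[layer] is an nn.Module returning a single tensor.
    """
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["clean"]          # returning a value overrides the output

    block = model.blocks[layer]        # assumed module layout; adapt as needed

    handle = block.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_tokens)            # clean run: fill the cache
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupted_tokens)   # corrupted run, patched
    handle.remove()

    return patched_logits
```

Comparing the patched logits against clean and corrupted baselines (for example, the logit assigned to the correct answer token) indicates which layers and token positions carry the binding information.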