name: transformerlens-answer-tracing-and-logit-attribution description: Trace where a model writes answer-relevant features into the residual stream and which components push the logits toward a target token such as " 2".
TransformerLens Answer Tracing and Logit Attribution
Use this skill when the task is to explain why a model predicts a target token, such as why "what is 1 + 1" points toward " 2".
What this skill is for
This skill answers questions like:
- which layers and positions make
" 2"more likely? - which heads or MLPs write the decisive features?
- when does the answer become linearly readable in the residual stream?
- does the answer exist early and get refined, or only appear late?
For your project, this is the baseline clean-prompt analysis that must be done before introducing OOD tokens.
Main TransformerLens tools
model.run_with_cache(...)ActivationCache.accumulated_resid(...)ActivationCache.decompose_resid(...)ActivationCache.get_full_resid_decomposition(...)ActivationCache.logit_attrs(...)model.tokens_to_residual_directions(...)cache.stack_head_results(...)cache.apply_ln_to_stack(...)
Conceptual framing
The central question is not just "what token is predicted," but "what vectors inside the residual stream are pointing in the answer direction, and which components wrote them?"
TransformerLens is especially strong here because it gives you:
- the full cached residual stream
- decompositions by block or by finer component
- direct logit attribution utilities
- logit-lens style readouts at many layers
Recommended workflow for what is 1 + 1 -> 2
1. Pin down the answer token direction
First identify whether the answer is a single token under your tokenizer. Then get the residual direction associated with that token.
This is your target readout direction.
2. Measure the final logit margin
Before decomposing anything, record:
- the answer token logit
- top competing logits
- the logit difference between
" 2"and distractors such as" 1"," 3", or other arithmetic completions
This gives a stable target metric for later comparisons.
3. Run accumulated-residual analysis
Use accumulated_resid(..., apply_ln=True) at the final prediction position to ask: after each layer, how much of the answer direction is already present?
This is the cleanest way to see when the answer becomes readable.
Interpretation:
- early rise means the model forms the answer direction quickly
- late rise means the model delays committing until later blocks
- oscillation means some layers add and others partially erase or rotate the signal
4. Run direct logit attribution over components
Use decompose_resid(...) first for coarse attribution:
- embed
- pos_embed
- each attention block output
- each MLP output
Then use get_full_resid_decomposition() when you want more granular breakdowns into individual heads or MLP-neuron-level components.
This tells you which components actually write toward the answer logit.
5. Drill down to per-head analysis
If a layer looks important, compute head results and attribute them individually. In many tasks, only a few heads carry the decisive evidence.
Good questions:
- which head most strongly pushes toward
" 2"? - does that head write at the final position, or route information from earlier positions?
- are there both positive and negative contributor heads?
6. Inspect position-specific writing
For arithmetic and short prompts, position matters a lot. The key question is whether the answer-relevant feature is written:
- at the final token position directly,
- copied from an earlier operand token,
- or assembled gradually across positions.
A useful habit is to inspect residual decompositions at every position, not just the last one.
Minimal code skeleton
import torch
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("qwen2.5-1.5b")
prompt = "what is 1 + 1"
answer = " 2"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)
answer_dir = model.tokens_to_residual_directions(model.to_tokens(answer, prepend_bos=False))
accum, labels = cache.accumulated_resid(apply_ln=True, return_labels=True)
resid_stack, comp_labels = cache.decompose_resid(return_labels=True)
logit_attrs = cache.logit_attrs(resid_stack, answer)
How to interpret the outputs
accumulated_resid
Use this to answer: when is the answer readable?
decompose_resid
Use this to answer: which blocks wrote the answer-relevant direction?
get_full_resid_decomposition
Use this to answer: which exact heads or finer units wrote the signal?
logit_attrs
Use this to answer: which components positively or negatively contribute to the target token’s logit?
What to save for later OOD analysis
For each clean prompt, save:
- answer token(s)
- final-position residual stream
- layerwise accumulated readability curve
- component-wise attribution scores
- per-head positive and negative contributors
- the top few heads and MLPs you suspect encode the arithmetic feature
These become your "clean reference features" for later robustness tests.
Good sanity checks
- replace
1 + 1with1 + 2and confirm the attribution pattern changes meaningfully - swap wording while preserving semantics and see whether the same heads are reused
- compare base vs instruct Qwen models to test whether chat tuning changes where arithmetic evidence is written
Common failure modes
- forgetting that direct logit attribution is only meaningful after the correct normalization/readout treatment
- over-interpreting a coarse sublayer contribution without drilling down to heads or neurons
- treating a readable feature as necessarily causal; readability and causal necessity are different
- ignoring negative contributors that suppress distractor answers
Best next step
After you know where the answer signal is written in the clean prompt, move to activation patching under OOD token injection. That is the right way to test whether those answer-relevant features remain present, get damaged, or become unreadable.
References
- TransformerLens
ActivationCachedocs and examples fordecompose_resid,accumulated_resid, andlogit_attrs - TransformerLens exploratory-analysis demo for per-head residual attribution and attention analysis
- TransformerLens
HookedTransformerdocs for token and residual-direction helpers