sparse-autoencoder-brain-llm-topography - SKILL.md Agent Skill

name: sparse-autoencoder-brain-llm-topography
description: "Sparse Autoencoders (SAEs) from mechanistic interpretability bridge LLM internal representations with cortical semantic topography in human brains. Decomposes LLMs (GPT-2 XL, Llama-3.1-8B) into 16K-32K interpretable features per layer; semantic features recover 94% of brain encoding performance. Five a priori semantic subcategories map onto distinct brain regions via formal convergence testing. Validated across English, Chinese, French. Accepted at CoNLL 2026. Activation: sparse autoencoder, brain-LLM alignment, cortical topography, SAE brain mapping, semantic encoding, computational neurolinguistics."
arxiv_id: "2605.23035"
published: "2026-05-21"
authors: "Dongxin Guo, Jikun Wu, Siu Ming Yiu"
tags: [brain-llm-alignment, sparse-autoencoders, cortical-topography, mechanistic-interpretability, neurolinguistics]

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

This paper connects mechanistic interpretability with computational neurolinguistics. It uses sparse autoencoders (SAEs) to decompose LLM internal representations into interpretable features, then shows that semantic features alone recover 94% of peak neural encoding performance and map onto known cortical semantic topography.

Source: arXiv:2605.23035 | Accepted at CoNLL 2026

Core Methodology

Key Innovation

The paper addresses an open question in computational neurolinguistics: why do the intermediate layers of LLMs best predict human brain responses to language? It approaches this question by using sparse autoencoders (SAEs) to isolate interpretable features within LLMs and testing which of them align with brain activity.

Technical Framework

SAE Training on LLMs: Train sparse autoencoders on GPT-2 XL and Llama-3.1-8B activations to extract 16K-32K interpretable features per layer
Human-Validated Feature Taxonomy: A structured taxonomy categorizes SAE-discovered features into semantic, syntactic, positional, and other types
Encoding Model Framework: Use SAE features as predictors in brain encoding models to predict fMRI responses to natural language stimuli
Feature-Importance Analysis: Quantify each feature category's unique contribution to brain prediction performance
Cortical Topography Prediction: Test whether five a priori semantic subcategories (derived from neuroscience) map onto distinct brain regions
Cross-Lingual Validation: Validate across English, Chinese, and French
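
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the authors' implementation: the ReLU encoder with an L1 penalty, the closed-form ridge regression, the Pearson-based score, and the ablation-style "unique contribution" are all assumptions, as are every function name, array shape, and the `semantic_cols` labeling.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: activations (n, d_model) -> sparse features (n, n_feat)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus L1 sparsity penalty (assumed objective)."""
    f = sae_encode(x, W_enc, b_enc)
    x_hat = f @ W_dec + b_dec
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights mapping features (n, p) to voxels (n, v)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ Y)

def mean_voxel_corr(Y_true, Y_pred):
    """Mean Pearson correlation across voxels (columns)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return float(np.mean(np.sum(yt * yp, axis=0) / denom))

def encoding_score(F_tr, Y_tr, F_te, Y_te, cols, alpha=1.0):
    """Fit on the chosen feature columns, score on held-out data."""
    W = ridge_fit(F_tr[:, cols], Y_tr, alpha)
    return mean_voxel_corr(Y_te, F_te[:, cols] @ W)

def unique_contribution(F_tr, Y_tr, F_te, Y_te, category_cols, alpha=1.0):
    """Score drop when one feature category is removed from the full model."""
    all_cols = np.arange(F_tr.shape[1])
    rest = np.setdiff1d(all_cols, category_cols)
    full = encoding_score(F_tr, Y_tr, F_te, Y_te, all_cols, alpha)
    ablated = encoding_score(F_tr, Y_tr, F_te, Y_te, rest, alpha)
    return full - ablated

# Synthetic stand-ins for LLM activations and fMRI responses (not real data).
rng = np.random.default_rng(0)
d_model, n_feat, n_vox, n = 16, 64, 5, 300
acts = rng.normal(size=(n, d_model))
W_enc = rng.normal(size=(d_model, n_feat)) / np.sqrt(d_model)
b_enc = np.zeros(n_feat)
W_dec, b_dec = W_enc.T.copy(), np.zeros(d_model)

feats = sae_encode(acts, W_enc, b_enc)
loss = sae_loss(acts, W_enc, b_enc, W_dec, b_dec)

semantic_cols = np.arange(20)  # hypothetical "semantic" feature indices
true_W = rng.normal(size=(20, n_vox))
bold = feats[:, semantic_cols] @ true_W + 0.1 * rng.normal(size=(n, n_vox))

tr, te = slice(0, 200), slice(200, n)
score = encoding_score(feats[tr], bold[tr], feats[te], bold[te], semantic_cols)
unique = unique_contribution(feats[tr], bold[tr], feats[te], bold[te], semantic_cols)
print(f"semantic-only score: {score:.3f}, unique contribution: {unique:.3f}")
```

In this toy setup the synthetic responses are generated from the "semantic" columns, so those columns score well on their own and removing them lowers the full-model score. With real data, the regularization strength would normally be chosen by cross-validation and the category labels would come from the human-validated taxonomy.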

Key Results

94% recovery of peak encoding performance using semantic features alone (ρ=.254 vs. peak ρ=.270)
Semantic features substantially exceed the variance-matched baselines (ρ=.238 and ρ=.193)
Formal convergence test confirms cortical topography alignment (Spearman ρ=1.0, hypergeometric p<.001)
Reading time prediction: SAE features predict human reading times beyond lexical controls (ΔR²=.083, p<.01)
Prediction-error analysis: Preliminary evidence that the brain encodes unexpected semantic content
Generalizes across English, Chinese, and French
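
The 94% figure follows directly from the reported correlations. The short check below reproduces it and, as an added comparison not stated in the source, expresses the two variance-matched baselines as fractions of the same peak. The variable names are illustrative.

```python
peak_rho = 0.270                 # peak encoding performance
semantic_rho = 0.254             # semantic features alone
baseline_rhos = [0.238, 0.193]   # variance-matched baselines

def recovery(rho, peak):
    """Fraction of peak encoding performance recovered."""
    return rho / peak

semantic_recovery = recovery(semantic_rho, peak_rho)
baseline_recoveries = [recovery(r, peak_rho) for r in baseline_rhos]

print(f"semantic: {semantic_recovery:.1%}")                 # semantic: 94.1%
print([f"{r:.1%}" for r in baseline_recoveries])            # ['88.1%', '71.5%']
```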

Applications

Brain-LLM alignment analysis: Use SAEs to map which types of LLM features drive neural prediction
Language model interpretability: Understand what LLM layers encode in terms of brain-relevant features
Cross-lingual neuroscience: Study how semantic representations organize across languages in the brain
Reading time modeling: Build better models of human language processing using semantically-informed features
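
For the reading time application, one common way to quantify what features add beyond lexical controls is a nested regression that compares R² with and without the extra predictors. The sketch below assumes ordinary least squares and fully synthetic data; the control variables, feature counts, and coefficients are hypothetical and do not reproduce the paper's analysis.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R² of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def delta_r2(controls, extra, y):
    """Gain in R² from adding `extra` predictors to the control model."""
    full = np.column_stack([controls, extra])
    return r_squared(full, y) - r_squared(controls, y)

# Synthetic example: two lexical controls and three SAE-derived predictors.
rng = np.random.default_rng(0)
n = 500
controls = rng.normal(size=(n, 2))    # stand-ins for e.g. length, frequency
sae_feats = rng.normal(size=(n, 3))   # stand-ins for SAE feature activations
reading_time = (
    controls @ np.array([1.0, -0.5])
    + sae_feats @ np.array([0.4, 0.0, 0.3])
    + rng.normal(size=n)
)

gain = delta_r2(controls, sae_feats, reading_time)
print(f"delta R^2 = {gain:.3f}")
```

In-sample ΔR² for nested least-squares models can never be negative, so a significance test or held-out evaluation is needed before interpreting a gain as real.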

Related Skills

brain-llm-key-neurons-grammar
fcn-llm-brain-network-understanding
representation-steering
mllm-brain-alignment-task-probing