whisper-ecog-alignment-neural-encoding - SKILL.md Agent Skill

name: whisper-ecog-alignment-neural-encoding
description: A framework for aligning the Whisper speech model with human ECoG neural responses. It uses an interpretable time-resolved neural encoder, finds the strongest brain alignment in intermediate layers, and checks for anatomically consistent organization through attention mapping and phoneme analysis.
tags:
- whisper
- ecog
- neural-encoding
- speech-processing
- brain-alignment
- time-resolved-modeling
- interpretability
- phoneme-analysis
activation_keywords:
- whisper
- ecog
- neural encoding
- speech brain
- whisper brain alignment
- speech foundation model
- ECoG neural
- cortical speech processing
related_skills:
- whisper-ecog-alignment
- neural-encoding-evaluation-meeg
- brain-llm-alignment
paper:
  arxiv_id: "2606.02305v1"
  title: "Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding"
  authors:
  - "Matteo Ciferri"
  - "Tommaso Boccato"
  - "Michal Olak"
  published: "2026-06-01"
  categories:
  - "q-bio.NC"
  - "cs.HC"
  link: "https://arxiv.org/abs/2606.02305v1"

Whisper-ECoG Alignment Neural Encoding

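Core Contribution

A first systematic study of how the Whisper speech foundation model aligns with human cortical ECoG responses. Intermediate layers give the strongest brain alignment, and an interpretable time-resolved neural encoder reveals anatomically consistent phoneme organization.

Key Findings

Layer-wise alignment: intermediate layers correspond most strongly to neural activity
Temporal structure: high-resolution ECoG requires temporal modeling
Phoneme organization: encoding-relevant electrodes show anatomically consistent organization by phoneme category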

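Methodology

1. Time-Resolved Neural Encoder

Architecture:

Input: Whisper speech embeddings + ECoG neural signals
Model: recurrent temporal model + soft attention

Components:
- Speech embeddings (from each Whisper layer)
- Recurrent temporal model (handles temporal dependencies)
- Soft attention mechanism (weights the most informative time points)
- Output layer (predicts the neural response)

Core module: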

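import torch
import torch.nn as nn

class TimeResolvedNeuralEncoder(nn.Module):
    """Time-resolved neural encoder: Whisper embeddings -> predicted ECoG response."""

    def __init__(self, embed_dim=512, neural_dim=128, hidden_dim=256):
        super().__init__()

        # Project speech embeddings to the hidden size
        self.embed_processor = nn.Linear(embed_dim, hidden_dim)

        # Recurrent temporal model (bidirectional, so outputs have 2 * hidden_dim features)
        self.temporal_rnn = nn.LSTM(
            hidden_dim, hidden_dim,
            num_layers=2,
            batch_first=True,
            bidirectional=True
        )

        # Soft attention: one score per time step, normalized over the time axis (dim=1)
        self.attention = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softmax(dim=1)
        )

        # Output layer (predicts the neural response)
        self.output = nn.Linear(hidden_dim * 2, neural_dim)

    def forward(self, speech_embeds, neural_data=None):
        """Forward pass.

        speech_embeds: (batch, time, embed_dim)
        neural_data: accepted for interface compatibility; not used in the forward pass
        Returns (prediction of shape (batch, neural_dim), attention of shape (batch, time, 1)).
        """

        # Project the speech embeddings
        processed_embeds = self.embed_processor(speech_embeds)

        # Temporal modeling
        temporal_features, (h_n, c_n) = self.temporal_rnn(processed_embeds)

        # Attention weights over time
        attention_weights = self.attention(temporal_features)

        # Attention-weighted sum over the time axis
        weighted_features = torch.sum(
            temporal_features * attention_weights,
            dim=1
        )

        # Predict the neural response
        neural_prediction = self.output(weighted_features)

        return neural_prediction, attention_weights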
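
In equation form, the forward pass above amounts to the following. The symbols ($x_t$ for the embedding at frame $t$, $h_t$ for the recurrent state, $\alpha_t$ for the attention weight, $c$ for the pooled vector, $\hat{y}$ for the predicted response) are notation introduced here for illustration and are not taken from the paper:

```latex
\begin{aligned}
z_t &= W_e x_t + b_e, \\
h_{1:T} &= \mathrm{BiLSTM}(z_{1:T}), \qquad h_t \in \mathbb{R}^{2H}, \\
s_t &= w_2^{\top}\,\mathrm{ReLU}(W_1 h_t + b_1) + b_2, \\
\alpha_t &= \frac{\exp(s_t)}{\sum_{\tau=1}^{T} \exp(s_\tau)}, \\
c &= \sum_{t=1}^{T} \alpha_t\, h_t, \\
\hat{y} &= W_o\, c + b_o .
\end{aligned}
```

Because the softmax runs over the time index, the weights $\alpha_t$ sum to one within each sequence, which is what makes them readable as a distribution over time points.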

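2. Whisper Layer-wise Analysis

Layer-wise feature extraction:

Whisper model structure:
- Encoder layers: 1-12 (the depth of the `small` checkpoint; `base` has 6)
- Decoder layers: 1-12

Experiments:
- Test every encoder layer
- Test every decoder layer
- Compare early, intermediate, and late layers

Results:
- Intermediate layers (encoder 5-8) show the strongest brain alignment
- Consistent with the hierarchy hypothesis: model representations match stages of cortical processing

Layer-wise alignment metric: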

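def compute_layer_alignment(whisper_layers, ecog_data):
    """Compute brain alignment for each Whisper layer.

    compute_pearson_correlation and compute_accuracy are placeholder helpers
    that are not defined in this file.
    """

    alignments = {}

    for layer_idx, embeds in enumerate(whisper_layers):
        # One time-resolved encoder per layer
        encoder = TimeResolvedNeuralEncoder()

        # The training loop is omitted here: fit the encoder before scoring it.
        # forward() returns (prediction, attention_weights), so unpack the tuple.
        predictions, _attention = encoder(embeds, ecog_data)

        # Correlation between predicted and recorded responses
        correlation = compute_pearson_correlation(predictions, ecog_data)

        alignments[layer_idx] = {
            'correlation': correlation,
            'encoding_accuracy': compute_accuracy(predictions, ecog_data)
        }

    return alignments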
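
The correlation helper called above is left undefined. One possible definition, shown here as a NumPy sketch rather than the authors' implementation, computes Pearson's r separately for each electrode. The function name `pearson_per_electrode` and the `(n_samples, n_electrodes)` layout are assumptions made for this example.

```python
import numpy as np

def pearson_per_electrode(predicted, recorded):
    """Pearson r between predicted and recorded responses, one value per electrode.

    Both inputs are assumed to have shape (n_samples, n_electrodes).
    """
    predicted = np.asarray(predicted, dtype=float)
    recorded = np.asarray(recorded, dtype=float)
    if predicted.shape != recorded.shape:
        raise ValueError("predicted and recorded must have the same shape")

    # Center each electrode's responses.
    p = predicted - predicted.mean(axis=0)
    r = recorded - recorded.mean(axis=0)

    # Covariance divided by the product of norms.
    numerator = (p * r).sum(axis=0)
    denominator = np.sqrt((p ** 2).sum(axis=0) * (r ** 2).sum(axis=0))

    # Constant channels have zero variance; report r = 0 for them instead of NaN.
    safe = np.where(denominator > 0, denominator, 1.0)
    return np.where(denominator > 0, numerator / safe, 0.0)
```

Scores should be computed on held-out data; correlating predictions with the same samples used for fitting overstates alignment.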

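3. Phoneme Interpretability Analysis

Phoneme category mapping:

Steps:
1. Identify encoding-relevant electrodes (high correlation)
2. Analyze each electrode's phoneme selectivity
3. Map phoneme categories to anatomical locations
4. Verify anatomical consistency

Results:
- Auditory cortex: vowels
- Motor cortex: consonants
- Frontal lobe: higher-order phoneme organization

Phoneme selectivity analysis: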

import numpy as np

class PhonemeInterpretabilityAnalyzer:
    """Phoneme interpretability analysis.

    get_activations_for_phoneme and get_anatomical_location are placeholders
    for dataset-specific lookups and are not defined here.
    """

    def __init__(self, electrode_data):
        self.electrodes = electrode_data

    def identify_encoding_electrodes(self, encoding_performance):
        """Identify encoding-relevant electrodes.

        encoding_performance maps electrode id -> encoding score.
        """

        # Keep electrodes above the 75th percentile of encoding scores
        threshold = np.percentile(list(encoding_performance.values()), 75)
        informative_electrodes = [
            e for e, perf in encoding_performance.items()
            if perf > threshold
        ]

        return informative_electrodes

    def analyze_phoneme_selectivity(self, electrode, phoneme_labels):
        """Return the (phoneme, mean activation) pair with the highest mean activation."""

        # Mean electrode activation for each phoneme
        phoneme_activations = {}
        for phoneme in set(phoneme_labels):
            activations = self.get_activations_for_phoneme(electrode, phoneme)
            phoneme_activations[phoneme] = np.mean(activations)

        # Dominant phoneme
        dominant_phoneme = max(phoneme_activations.items(), key=lambda x: x[1])

        return dominant_phoneme

    def map_to_anatomy(self, phoneme_categories):
        """Map phoneme categories to anatomical locations."""

        anatomical_mapping = {}

        for electrode, phoneme in phoneme_categories.items():
            # Look up the electrode's anatomical location
            location = self.get_anatomical_location(electrode)

            anatomical_mapping[electrode] = {
                'phoneme': phoneme,
                'location': location
            }

        return anatomical_mapping
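
Step 4 of the procedure (verifying anatomical consistency) has no code above. A minimal sketch, assuming the dictionary layout returned by `map_to_anatomy`, reports for each region the most common phoneme category and the fraction of electrodes that share it. The function name `regional_consistency` and this particular score are illustrative choices, not the paper's metric.

```python
from collections import Counter, defaultdict

def regional_consistency(anatomical_mapping):
    """Summarize how uniform the preferred phoneme category is within each region.

    anatomical_mapping: electrode id -> {'phoneme': category, 'location': region}
    Returns: region -> (most common category, fraction of electrodes sharing it)
    """
    # Group preferred categories by anatomical region.
    by_region = defaultdict(list)
    for info in anatomical_mapping.values():
        by_region[info['location']].append(info['phoneme'])

    # For each region, find the modal category and its share.
    summary = {}
    for region, categories in by_region.items():
        category, count = Counter(categories).most_common(1)[0]
        summary[region] = (category, count / len(categories))
    return summary
```

A fraction near 1.0 means the electrodes in a region agree on one category. Comparing against a shuffled-label baseline would show whether the agreement exceeds chance.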

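Core Insights

1. Strongest Alignment in Intermediate Layers

Testing the hierarchy hypothesis:

Hypothesis:

The hierarchy of model representations mirrors the hierarchy of cortical processing
Intermediate layers capture abstract speech features
Early layers are too low-level; late layers are too abstract

Results:

Layer-wise alignment (encoder):
- Layers 1-4: low-level features, weak alignment
- Layers 5-8: mid-level features, strong alignment (peak)
- Layers 9-12: high-level features, moderate alignment

Cortical correspondence:
- Early layers ~ primary auditory cortex (A1)
- Intermediate layers ~ mid-level auditory regions (belt/parabelt)
- Late layers ~ higher-order speech regions (STG, IFG)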
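
Given one alignment score per layer, the early / intermediate / late comparison can be reproduced with a few lines of plain Python. This is a sketch under the assumption that scores are ordered from the first to the last layer; the function name `summarize_layer_profile` is hypothetical, and the numbers in the usage lines are toy values, not results from the paper.

```python
def summarize_layer_profile(scores):
    """Return the best layer (1-indexed) and the mean score of each depth third.

    scores: alignment values ordered from the first to the last layer.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three layers to form early/middle/late groups")

    # Layer with the highest score, reported with 1-based numbering.
    best_layer = max(range(n), key=lambda i: scores[i]) + 1

    # Split the layers into three contiguous groups (1-4, 5-8, 9-12 for 12 layers).
    bounds = [0, n // 3, (2 * n) // 3, n]
    means = {}
    for name, lo, hi in zip(("early", "middle", "late"), bounds[:-1], bounds[1:]):
        group = scores[lo:hi]
        means[name] = sum(group) / len(group)
    return best_layer, means

# Toy profile for illustration only (not measured values).
toy_scores = [0.1, 0.1, 0.2, 0.2, 0.4, 0.5, 0.5, 0.4, 0.3, 0.3, 0.3, 0.3]
best_layer, group_means = summarize_layer_profile(toy_scores)
```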

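Theoretical implications:

Supports hierarchical theories of speech processing
Suggests that the model's layer hierarchy is biologically plausible
Provides evidence for brain alignment in speech AI

2. Importance of Temporal Structure

Advantages with high-resolution ECoG:

Comparison:
- Conventional approach: linear mapping that ignores temporal structure
- This approach: recurrence + attention, capturing temporal dependencies

Advantages:
- ECoG temporal resolution: millisecond scale
- Temporal modeling is needed to exploit it fully
- Attention reveals temporally local alignment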
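
For reference, the conventional linear mapping in the comparison above is often implemented as frame-wise ridge regression from embeddings to electrode responses. The sketch below is a generic baseline, not code from the paper. The optional lag stacking in `lagged_features` is a common way to give a linear model a fixed window of past context and is an addition made for this example; both function names are hypothetical.

```python
import numpy as np

def lagged_features(embeddings, n_lags):
    """Stack each frame with its n_lags - 1 predecessors (zero-padded at the start).

    embeddings: (n_frames, n_features) -> (n_frames, n_features * n_lags)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    n_frames, n_features = embeddings.shape
    lagged = np.zeros((n_frames, n_features * n_lags))
    for lag in range(n_lags):
        # Column block `lag` holds the embedding from `lag` frames earlier.
        lagged[lag:, lag * n_features:(lag + 1) * n_features] = embeddings[:n_frames - lag]
    return lagged

def fit_ridge(features, responses, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    features = np.asarray(features, dtype=float)
    responses = np.asarray(responses, dtype=float)
    gram = features.T @ features + alpha * np.eye(features.shape[1])
    return np.linalg.solve(gram, features.T @ responses)
```

With `n_lags=1` the model sees one frame at a time, which matches the "ignores temporal structure" description. Predictions are `lagged_features(X, n_lags) @ W`.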

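Temporally local alignment:

def analyze_temporal_alignment(attention_weights, speech_events):
    """Match attention peaks to speech events.

    find_attention_peaks and speech_events.find_event_at_time are placeholder
    helpers that are not defined in this file.
    """

    # Time points where attention peaks
    peak_times = find_attention_peaks(attention_weights)

    # Look up the speech event at each peak
    aligned_events = []
    for peak_time in peak_times:
        event = speech_events.find_event_at_time(peak_time)
        aligned_events.append({
            'time': peak_time,
            'event': event,
            'attention': attention_weights[peak_time]
        })

    return aligned_events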
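
The peak finder used above is not defined in this file. A minimal pure-Python version, offered as one possible definition rather than the authors' method, treats a frame as a peak when its weight is strictly larger than both neighbours. The `min_height` parameter is an assumption added for this sketch.

```python
def find_attention_peaks(attention_weights, min_height=0.0):
    """Return indices of strict local maxima in a 1-D attention profile.

    attention_weights: sequence of floats for a single utterance.
    min_height: peaks with a weight below this value are ignored.
    """
    weights = [float(w) for w in attention_weights]
    peaks = []
    # The first and last frames have only one neighbour and are never reported.
    for t in range(1, len(weights) - 1):
        is_peak = weights[t] > weights[t - 1] and weights[t] > weights[t + 1]
        if is_peak and weights[t] >= min_height:
            peaks.append(t)
    return peaks
```

The encoder returns attention with shape `(batch, time, 1)`, so select one sequence and drop the trailing axis before calling this. Flat plateaus are not detected by the strict comparison.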

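Results:

Attention peaks coincide with speech boundaries
Temporally local alignment reveals dynamic processing
Consistent with a neural mechanism for speech segmentation

3. Anatomical Organization of Phonemes

Anatomical consistency findings:

Phoneme category organization:

Vowels:
- Location: auditory cortex (A1, belt)
- Properties: frequency coding, sustained activation

Consonants:
- Location: motor cortex (PMC, SMA)
- Properties: articulation-related, dynamic activation

Higher-order phoneme organization:
- Location: IFG (frontal lobe) and STG (temporal lobe)
- Properties: phoneme sequences, syntactic processing

Significance:
- Anatomically consistent organization
- Supports the phoneme category hypothesis
- Points to motor involvement in speech perception

Implementation Guide

Whisper Embedding Extraction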

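import torch
import whisper

class WhisperEmbeddingExtractor:
    """Extract per-layer encoder embeddings from Whisper."""

    def __init__(self, model_name="base"):
        self.model = whisper.load_model(model_name)
        self.model.eval()

    def extract_layer_embeddings(self, audio):
        """Return {'encoder_<i>': tensor} for every encoder block.

        audio: path to an audio file.
        """

        # Load the audio, pad or trim it to Whisper's 30 s window, and build the log-mel input
        audio_tensor = whisper.pad_or_trim(whisper.load_audio(audio))
        audio_features = whisper.log_mel_spectrogram(
            audio_tensor, n_mels=self.model.dims.n_mels
        ).unsqueeze(0).to(self.model.device)

        embeddings = {}

        # Encoder blocks are stored in model.encoder.blocks
        for layer_idx in range(len(self.model.encoder.blocks)):
            layer_embed = self.extract_encoder_layer(audio_features, layer_idx)
            embeddings[f'encoder_{layer_idx}'] = layer_embed

        return embeddings

    def extract_encoder_layer(self, audio_features, layer_idx):
        """Capture the output of one encoder block with a forward hook."""

        captured = {}

        def hook_fn(module, inputs, output):
            # Store the block output; returning None leaves the forward pass unchanged
            captured['output'] = output.detach()

        hook = self.model.encoder.blocks[layer_idx].register_forward_hook(hook_fn)

        # Run the encoder once; the hook records the requested block
        try:
            with torch.no_grad():
                self.model.encoder(audio_features)
        finally:
            hook.remove()

        return captured['output']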
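
Whisper's encoder emits one embedding per 20 ms of audio (1500 positions for the 30 s input window), while ECoG is typically sampled much faster. To pair the two, the neural signal can be averaged within each embedding frame. The sketch below assumes a `(n_channels, n_samples)` array and the 1000 Hz default used by the preprocessing class in this document; `bin_to_frames` is a hypothetical helper name.

```python
import numpy as np

def bin_to_frames(ecog, fs=1000, frame_sec=0.02):
    """Average ECoG samples inside consecutive, non-overlapping frames.

    ecog: (n_channels, n_samples) -> (n_channels, n_frames)
    fs: ECoG sampling rate in Hz
    frame_sec: frame duration in seconds (0.02 matches Whisper encoder frames)
    """
    ecog = np.asarray(ecog, dtype=float)
    samples_per_frame = int(round(fs * frame_sec))
    if samples_per_frame < 1:
        raise ValueError("frame_sec is shorter than one sample")

    # Drop the incomplete frame at the end, if any.
    n_frames = ecog.shape[1] // samples_per_frame
    trimmed = ecog[:, :n_frames * samples_per_frame]

    # Reshape to (channels, frames, samples_per_frame) and average the last axis.
    return trimmed.reshape(ecog.shape[0], n_frames, samples_per_frame).mean(axis=2)
```

Averaging acts as a crude low-pass filter. If fine temporal detail matters, keep the native sampling rate and upsample the embeddings instead.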

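ECoG Data Preprocessing

class ECoGDataProcessor:
    """ECoG preprocessing.

    bandpass_filter, remove_line_noise, and zscore_normalize are placeholder
    methods that are not defined here.
    """

    def __init__(self, sampling_rate=1000):
        self.fs = sampling_rate

    def preprocess_ecog(self, raw_ecog):
        """Preprocess raw ECoG of shape (n_channels, n_samples)."""

        # 1. Denoise with a 0.5-200 Hz band-pass filter
        filtered = self.bandpass_filter(raw_ecog, 0.5, 200)

        # 2. Remove power-line noise
        line_removed = self.remove_line_noise(filtered)

        # 3. Z-score normalization
        normalized = self.zscore_normalize(line_removed)

        return normalized

    def align_to_speech(self, ecog_data, speech_timestamps):
        """Cut ECoG into segments that match speech timestamps.

        Timestamps are assumed to be in seconds; if they are already sample
        indices, skip the conversion below.
        """

        aligned_segments = []

        for timestamp in speech_timestamps:
            # Convert seconds to sample indices
            start_sample = int(round(timestamp['start'] * self.fs))
            end_sample = int(round(timestamp['end'] * self.fs))

            # Extract the segment along the time axis
            segment = ecog_data[:, start_sample:end_sample]
            aligned_segments.append(segment)

        return aligned_segments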
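
The z-score step in the preprocessing class is left as a placeholder. One way to fill it in, sketched here with NumPy, is to estimate the per-channel mean and standard deviation on training data only and reuse them for held-out data, so that test statistics do not leak into the model. The names `fit_zscore` and `apply_zscore` and the `eps` guard are assumptions made for this example.

```python
import numpy as np

def fit_zscore(train_ecog, eps=1e-8):
    """Estimate per-channel mean and standard deviation.

    train_ecog: (n_channels, n_samples)
    eps: small constant added to the standard deviation to avoid division by zero
    """
    train = np.asarray(train_ecog, dtype=float)
    mean = train.mean(axis=1, keepdims=True)
    std = train.std(axis=1, keepdims=True) + eps
    return mean, std

def apply_zscore(ecog, mean, std):
    """Normalize ECoG with previously estimated statistics."""
    return (np.asarray(ecog, dtype=float) - mean) / std
```

Usage: call `fit_zscore` once on the training split, then pass the returned `mean` and `std` to `apply_zscore` for every split.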

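Use Cases

1. Neural Encoding Research

Suitable for:

Studying the neural mechanisms of speech perception
Developing neural encoding models of speech
Testing the model hierarchy hypothesis

Triggers:

neural encoding research
speech perception research
model brain alignment

2. Neuromorphic Speech Processing

Suitable for:

Developing biologically inspired speech processing
Designing hierarchical speech architectures
Implementing temporal speech encoding

Triggers:

neuromorphic speech systems
biologically plausible speech models
temporal speech processing

3. BCI Speech Decoding

Suitable for:

Speech BCI development
Neural speech decoding
Speech reconstruction systems

Triggers:

speech BCI
neural speech decoding
speech reconstruction from brain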

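Pitfalls & Caveats

1. ECoG Data Limitations

Problems:

Electrode coverage is limited to specific regions
Large differences between individuals
Temporal resolution is high, but spatial resolution is limited

Mitigations:

Aggregate data across subjects
Normalize to a common anatomical space
Combine with fMRI for complementary spatial information

2. Model Selection

Problems:

Whisper comes in several sizes (base, small, medium, large)
Different sizes have different numbers of layers
The choice affects alignment results

Mitigations:

Compare several sizes
Choose an appropriate scale (base is usually sufficient, but it has only 6 encoder blocks, so layer indices do not carry over from a 12-layer model)
State the model size and layer configuration explicitly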
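
Because layer counts differ between checkpoints, it helps to express "intermediate layers" as a relative depth rather than fixed indices. The sketch below uses the encoder depths of the standard openai-whisper checkpoints. The middle-third rule and the name `middle_layers` are illustrative assumptions, not a definition from the paper.

```python
# Encoder block counts of the standard Whisper checkpoints.
WHISPER_ENCODER_BLOCKS = {"tiny": 4, "base": 6, "small": 12, "medium": 24, "large": 32}

def middle_layers(model_name):
    """Return the 1-indexed encoder layers in the middle third of the stack."""
    n = WHISPER_ENCODER_BLOCKS[model_name]
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    first = n // 3 + 1
    last = (2 * n) // 3
    return list(range(first, last + 1))
```

For a 12-layer encoder this selects layers 5 to 8, the range reported as the alignment peak in this document. Whether the peak sits at the same relative depth in other sizes is an empirical question.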

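3. Cost of Temporal Modeling

Problems:

LSTM/RNN training is expensive
Long sequences are slow to process
The attention mechanism uses a lot of memory

Mitigations:

Process the data in segments
Tune the sequence length
Consider replacing the LSTM with a Transformer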
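
Segment-wise processing can be as simple as cutting each recording into fixed-length, overlapping windows before feeding them to the encoder. The helper below is a sketch; `sliding_windows` and its parameters are hypothetical names, and the window and hop sizes are left to the user.

```python
def sliding_windows(n_frames, window, hop):
    """Return (start, end) frame indices of fixed-length windows covering a sequence.

    n_frames: total number of frames in the recording
    window: frames per window
    hop: step between consecutive window starts
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    # A sequence shorter than one window is returned as a single segment.
    if n_frames <= window:
        return [(0, n_frames)]

    starts = list(range(0, n_frames - window + 1, hop))
    # Add a final window flush with the end if the hop left a tail uncovered.
    if starts[-1] + window < n_frames:
        starts.append(n_frames - window)
    return [(s, s + window) for s in starts]
```

Shorter windows bound the memory used by the LSTM and attention, at the price of losing context that spans window boundaries. Overlap (`hop < window`) softens that loss.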

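Extensions

1. Other Speech Models

Neural alignment of Wav2Vec 2.0
Brain-response analysis of HuBERT
Layer-wise study of SpeechT5

2. Multimodal Extensions

Joint visual-speech alignment
Alignment of multilingual speech models
Study of semantic hierarchy

3. Real-Time Applications

Real-time neural encoding of speech
Online BCI decoding
Adaptive neural encoders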
References

Original paper:
- Ciferri, Boccato, Olak (2026). arXiv:2606.02305v1
Whisper model:
- Radford et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision"
Neural encoding of speech:
- Mesgarani et al. (2014). "Encoding of speech in human auditory cortex"
- Norman-Haignere et al. (2015). "Speech selectivity in human auditory cortex"

Key Citation

@article{ciferri2026whisper,
  title={Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding},
  author={Ciferri, Matteo and Boccato, Tommaso and Olak, Michal},
  journal={arXiv preprint arXiv:2606.02305v1},
  year={2026},
  categories={q-bio.NC, cs.HC},
  note={Presented at ICLR 2026 Workshop on Representational Alignment}
}

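Quick-Start Checklist

Understand the time-resolved neural encoder architecture
Understand the layer-wise alignment hypothesis (peak in intermediate layers)
Understand the phoneme interpretability analysis
Install the Whisper model
Implement Whisper embedding extraction
Implement ECoG data preprocessing
Implement the time-resolved encoder
Analyze layer-wise alignment
Run the phoneme analysis
Verify anatomical consistency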

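Summary

This work presents a first systematic study of the alignment between the Whisper speech model and human ECoG responses. Intermediate layers correspond most strongly to neural activity, and interpretable analysis supports an anatomically consistent phoneme organization. The time-resolved neural encoder exploits the high temporal resolution of ECoG, and its attention mechanism exposes temporally local alignment.

Core value:

Evidence for layer-wise alignment: the model hierarchy matches stages of cortical processing
Need for temporal modeling: high-resolution neural data calls for models of temporal structure
Interpretable phoneme organization: phoneme categories are organized in an anatomically consistent way
A framework for speech AI alignment: a method for studying the neural basis of speech models

Use cases: neural encoding research, neuromorphic speech processing, BCI speech decoding.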