name: iphoneme-brain-to-text-als-conformerxl
description: "iPhoneme brain-to-text communication system for ALS using ConformerXL phoneme decoder with gaze-assisted interface. Achieves 92.14% phoneme accuracy (7.86% PER) and 73.39% word accuracy on T15 intracranial EEG dataset. 180ms latency on CPU. Activation: brain-to-text, speech BCI, phoneme decoding, Conformer, ALS, intracranial EEG, iEEG."
iPhoneme: Brain-to-Text Communication for ALS Using ConformerXL Decoding
arXiv: 2604.16441
Published: 2026-04-07
Authors: Yoonmin Cha, Dawit Chun, Sung Park
Categories: cs.SD, cs.AI, cs.CL
Problem
Speech BCIs for ALS face two critical challenges:
- Neural decoding accuracy limits practical deployment
- Input interface design suffers from the Midas touch problem (unintended selections in eye-tracking)
Despite their transformative potential for the 173,000-232,500 ALS patients worldwide, high-performance speech BCIs have been demonstrated in only 22-31 patients globally.
Core System: iPhoneme
Component 1: ConformerXL Phoneme Decoder (192.9M parameters)
Architecture
- Temporal Prenet: Multi-scale dilated convolutions + bidirectional GRU
  - Handles neural jitter correction across temporal scales
  - Dilated convolutions capture long-range temporal dependencies
- Temporal Subsampling: Reduces sequence length for CTC training stability
- 12 Encoder Blocks with Pre-RMSNorm stabilization
  - Conformer architecture combining CNN + self-attention
  - Pre-RMSNorm instead of Post-LayerNorm for training stability
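The Pre-RMSNorm placement named above can be illustrated with a minimal sketch. This is plain Python for readability, not the paper's implementation; the function names, the `eps` value, and the toy sublayer are assumptions made for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Divide x by its root mean square, then apply a per-feature gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]


def pre_norm_residual(
    x: Vector, sublayer: Callable[[Vector], Vector], gain: Vector
) -> Vector:
    """Pre-norm block: normalize, run the sublayer, add the untouched input."""
    normed = rms_norm(x, gain)
    return [xi + yi for xi, yi in zip(x, sublayer(normed))]


# Toy usage: the sublayer here is a hypothetical stand-in for attention or conv.
x = [3.0, 4.0]
gain = [1.0, 1.0]
y = pre_norm_residual(x, lambda v: [0.5 * t for t in v], gain)
```

In the pre-norm arrangement the residual path carries `x` through unchanged, which is commonly cited as the reason deep stacks train more stably than with post-norm, where normalization is applied after the addition.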
Training
- Optimizer: AdamW with cosine scheduling
- Loss: CTC (Connectionist Temporal Classification) for alignment-free phoneme prediction
- 6-gram phoneme language model trained on 3.1M sequences
- WFST beam search (beam=128) for decoding
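To show what "alignment-free" means for CTC outputs, here is a minimal greedy decoder that collapses a per-frame best path into a label sequence. It is a simplification: the system described here decodes with WFST beam search and a language model, not greedy collapse. The blank index of 0 is an assumption.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    log_probs: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in log_probs
    ]
    decoded: List[int] = []
    prev: Optional[int] = None
    for label in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank; a blank between repeats keeps both copies.
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded
```

Because repeated frames are merged and blanks are removed, many frame-level paths map to the same phoneme sequence, so training needs no frame-by-frame alignment labels.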
Component 2: Gaze-Assisted Phoneme Input Interface
Chorded Gaze-Plus-Silent-Speech Paradigm
- Replaces traditional dwell-time selection
- Chorded input: Combines gaze direction with silent speech attempt
- Mitigates Midas touch problem through multi-modal verification
- Enables a higher phoneme input rate than dwell-time selection
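The dual-verification idea can be sketched as a selection rule that fires only when both modalities agree. The key layout, the `min_prob` threshold, and the function name below are hypothetical; the source does not specify how the two signals are fused.

```python
from typing import Dict, Optional, Set


def chorded_select(
    gazed_key: Optional[str],
    key_phonemes: Dict[str, Set[str]],
    decoder_probs: Dict[str, float],
    min_prob: float = 0.6,  # assumed confidence threshold
) -> Optional[str]:
    """Return a phoneme only if gaze and silent-speech decoding agree."""
    if gazed_key is None or gazed_key not in key_phonemes:
        return None  # no gaze target, nothing to confirm
    if not decoder_probs:
        return None  # gaze alone never selects (avoids Midas touch)
    best = max(decoder_probs, key=decoder_probs.get)
    if best in key_phonemes[gazed_key] and decoder_probs[best] >= min_prob:
        return best
    return None


# Hypothetical layout: each on-screen key groups a few phonemes.
layout = {"key_a": {"AH", "AE"}, "key_b": {"B", "P"}}
choice = chorded_select("key_b", layout, {"P": 0.8, "B": 0.15, "AH": 0.05})
```

Under this rule, looking at a key without attempting speech produces no input, and a confident decode that contradicts the gaze target is rejected as well.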
Key Results
T15 Dataset (256-channel intracranial EEG)
| Metric | Score |
|---|---|
| Phoneme Accuracy | 92.14% |
| Phoneme Error Rate (PER) | 7.86% |
| Word Accuracy | 73.39% |
| Word Error Rate (WER) | 26.61% |
| Inference Latency | 180 ms (CPU) |
- ~3% above prior state-of-the-art
- Real-time operation on standard CPU hardware
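In the table, each accuracy and its error rate sum to 100% (92.14 + 7.86 and 73.39 + 26.61), which is consistent with accuracy being reported as 100% minus the error rate. Error rates of this kind are conventionally computed as edit distance divided by reference length; the sketch below assumes that convention and is not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edits over total reference tokens (PER for phonemes, WER for words)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


per = error_rate([["HH", "AH", "L", "OW"]], [["HH", "EH", "L", "OW"]])
accuracy = 1.0 - per
```

The same two functions give WER when the sequences hold words instead of phonemes.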
Technical Details
Data
- T15 dataset: 45 sessions, 8,071 trials
- 256-channel intracranial EEG from speech motor cortex regions
- Intracranial (iEEG/ECoG) signals — higher SNR than scalp EEG
Phoneme Language Model
- 6-gram model trained on 3.1M phoneme sequences
- Integrated via Weighted Finite-State Transducer (WFST)
- Beam search with beam width = 128 for efficient decoding
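A count-based n-gram model gives a feel for what the phoneme language model contributes during decoding. The sketch below uses add-k smoothing for brevity; this is a simplification, and the source does not state which smoothing or toolkit was used. The class name and the `add_k` value are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


class NGramPhonemeLM:
    """Toy n-gram phoneme model with add-k smoothing (illustrative only)."""

    def __init__(self, order: int = 6, add_k: float = 0.1) -> None:
        if order < 2:
            raise ValueError("order must be at least 2")
        self.order = order
        self.add_k = add_k
        self.context_counts: Counter = Counter()
        self.ngram_counts: Counter = Counter()
        self.vocab: set = set()

    def fit(self, sequences: Iterable[Sequence[str]]) -> None:
        pad = ["<s>"] * (self.order - 1)
        for seq in sequences:
            padded: List[str] = pad + list(seq) + ["</s>"]
            self.vocab.update(padded[self.order - 1:])
            for i in range(self.order - 1, len(padded)):
                ctx = tuple(padded[i - self.order + 1:i])
                self.context_counts[ctx] += 1
                self.ngram_counts[ctx + (padded[i],)] += 1

    def log_prob(self, context: Sequence[str], phoneme: str) -> float:
        """log P(phoneme | last order-1 phonemes of context)."""
        padded = ["<s>"] * (self.order - 1) + list(context)
        ctx = tuple(padded[-(self.order - 1):])
        num = self.ngram_counts[ctx + (phoneme,)] + self.add_k
        den = self.context_counts[ctx] + self.add_k * len(self.vocab)
        return math.log(num / den)


lm = NGramPhonemeLM(order=3)
lm.fit([["HH", "AH", "L", "OW"]] * 3 + [["HH", "AH", "L", "P"]])
```

During beam search, a score like `lm.log_prob(history, candidate)` would typically be added, with a weight, to the decoder's log-probability for each hypothesis; in a WFST setup that combination is expressed through graph composition instead of an explicit function call.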
Neural Jitter Correction
- Temporal prenet with multi-scale dilated convolutions handles timing variability
- Bidirectional GRU captures forward/backward temporal context
- Critical for handling non-deterministic neural response timing
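The effect of dilation on temporal coverage can be shown with a small 1-D example. A kernel of size k with dilation d spans (k - 1) * d + 1 samples, so parallel branches with different dilations see the same event at several time scales. The kernel, the dilation set, and the function names are illustrative assumptions; the GRU stage is omitted.

```python
from typing import List, Sequence, Tuple


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d_same(
    signal: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """Zero-padded dilated cross-correlation; odd kernel sizes keep length."""
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def multi_scale_features(
    signal: Sequence[float],
    kernel: Sequence[float],
    dilations: Sequence[int] = (1, 2, 4),  # assumed set of scales
) -> List[Tuple[float, ...]]:
    """Stack the branch outputs per time step, one value per dilation."""
    branches = [dilated_conv1d_same(signal, kernel, d) for d in dilations]
    return list(zip(*branches))


# An impulse at t=5 is seen by the dilation-2 branch at t=3, 5 and 7.
impulse = [0.0] * 11
impulse[5] = 1.0
wide = dilated_conv1d_same(impulse, [1.0, 1.0, 1.0], dilation=2)
```

A response that arrives a few samples early or late still falls inside the wider branches' span, which is one plausible reading of how multi-scale dilation helps with timing variability.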
Reusable Methodology
1. ConformerXL for Neural Signal Decoding
# Architecture pattern
Input → TemporalPrenet(dilated_conv + BiGRU)
→ Subsampling
→ 12x ConformerBlock(Pre-RMSNorm)
→ CTC Loss
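The Subsampling stage in the pattern above interacts with CTC through a length constraint: the encoder must emit at least as many frames as the target has phonemes, plus one extra frame for each adjacent repeat (a blank has to separate them). The sketch below uses frame stacking as the subsampling mechanism, which is an assumption; the source does not say whether stacking or strided convolution is used.

```python
from typing import List, Sequence


def stack_and_stride(
    frames: Sequence[Sequence[float]], stack: int = 4, stride: int = 4
) -> List[List[float]]:
    """Concatenate `stack` consecutive frames, advancing by `stride`."""
    out: List[List[float]] = []
    for start in range(0, len(frames) - stack + 1, stride):
        merged: List[float] = []
        for frame in frames[start:start + stack]:
            merged.extend(frame)
        out.append(merged)
    return out


def min_ctc_frames(target: Sequence[str]) -> int:
    """Shortest input length for which CTC can emit `target`."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def ctc_feasible(num_frames: int, target: Sequence[str]) -> bool:
    return num_frames >= min_ctc_frames(target)


# Ten 2-feature frames become two 8-feature frames.
frames = [[float(t), t + 0.5] for t in range(10)]
sub = stack_and_stride(frames, stack=4, stride=4)
```

Checking `ctc_feasible` after subsampling is a cheap guard against picking a stride so large that some training targets become unreachable.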
2. Gaze-Assisted Interface Design
- Chorded paradigm: gaze_direction + silent_speech → phoneme selection
- Dual verification prevents unintended inputs
- Applicable to other BCI modalities
3. Phoneme-Level Brain-to-Text Pipeline
- Record iEEG from speech motor cortex
- Temporal preprocessing with jitter correction
- ConformerXL phoneme prediction
- WFST beam search with language model
- Phoneme-to-text conversion
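The final phoneme-to-text step can be illustrated with a dictionary lookup over a decoded phoneme sequence. This is a simplified stand-in: in a WFST decoder the pronunciation lexicon is normally composed into the search graph. The tiny lexicon and the fewest-words tie-break are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Lexicon = Dict[Tuple[str, ...], str]


def phonemes_to_words(
    phonemes: Sequence[str], lexicon: Lexicon
) -> Optional[List[str]]:
    """Segment phonemes into lexicon words, preferring fewer words."""
    n = len(phonemes)
    if not lexicon:
        return [] if n == 0 else None
    max_len = max(len(pron) for pron in lexicon)
    # best[i] holds the best word list covering phonemes[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            if prefix is None:
                continue
            word = lexicon.get(tuple(phonemes[start:end]))
            if word is None:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Hypothetical ARPAbet-style lexicon, stress markers omitted.
lexicon: Lexicon = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("HH", "AH"): "huh",
    ("L", "OW"): "low",
}
words = phonemes_to_words(
    ["HH", "AH", "L", "OW", "W", "ER", "L", "D"], lexicon
)
```

The function returns `None` when no segmentation exists, which is where a real system would fall back on the language model or an error-tolerant search.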
Applications
- ALS communication: Primary target for speech restoration
- Locked-in syndrome: Brain-to-text for completely paralyzed patients
- Speech neuroprosthetics: General speech BCI applications
- Real-time BCI: 180ms latency enables conversational use
Datasets
- T15: 256-channel intracranial EEG
  - 45 recording sessions
  - 8,071 trials total
  - Speech motor cortex coverage
Key Innovations
- ConformerXL adaptation for neural signal phoneme decoding (192.9M params)
- Multi-scale temporal prenet for neural jitter correction
- Chorded gaze-plus-silent-speech interface replacing dwell-time
- CPU real-time operation at 180ms latency
- State-of-the-art phoneme (92.14%) and word (73.39%) accuracy
Limitations
- Requires intracranial EEG (invasive) — not applicable to non-invasive BCI
- Performance evaluated on a limited patient population
- Language model trained on English phonemes only
- 192.9M parameters — large model size
Related Skills
- brain-to-speech-prosody-feature-engineering: Brain-to-speech synthesis
- brain-to-speech-transformer-reconstruction: Speech reconstruction from brain signals
- eeg-foundation-model-adapters: EEG foundation models with adaptation
- neural-population-decoding: Neural population decoding methods