name: Multi-Modal Alignment for Shared Embedding Space
description: >-
  This skill should be used when the user asks to "align modality representations", "multi-modal contrastive learning", "cross-modal alignment", "CLIP-style training", "modality gap reduction", "shared embedding space", "cross-attention alignment", "vision-language alignment", "audio-text alignment", "representation binding", "cross-modal retrieval", "modality projection heads", "alignment loss function", "modal invariance", "add contrastive loss", "implement projection heads", "fix modality gap", "add uniformity loss", "implement cross-modal retrieval", "add hard negative mining", "implement curriculum alignment", "add alignment metrics", "fix representation collapse", "add gap regularization", "implement SigLIP loss", "add temporal binding", "fix spatial correspondence", or mentions InfoNCE, contrastive alignment, projection head architecture, modality gap phenomenon, cross-modal similarity, alignment temperature, or shared semantic space in the cognitive pipeline.
version: 0.1.0
Multi-Modal Alignment for Shared Embedding Space
Purpose
This skill ensures that encoder outputs from different modalities (vision, text, audio, sensors) occupy a shared semantic space where representations can be meaningfully compared, retrieved, and fused. Without alignment, the Global Workspace receives representations from different modalities that cluster in disjoint regions of embedding space, making cross-modal competition, retrieval, and transfer unreliable. Multi-modal alignment is the critical bridge between modality-specific encoders and the unified workspace.
Key Files
| Target Module | Template Asset | Purpose |
|---|---|---|
| brain_ai/alignment/aligner.py | assets/modality_aligner_template.py | ModalityAligner with projection heads, forward pass |
| brain_ai/alignment/losses.py | assets/contrastive_losses_template.py | InfoNCE, SigLIP, uniformity + alignment losses |
| brain_ai/alignment/retrieval.py | assets/cross_modal_retriever_template.py | CrossModalRetriever for retrieval evaluation |
| brain_ai/alignment/metrics.py | assets/alignment_metrics_template.py | AlignmentMetrics computing gap, recall, uniformity |
| brain_ai/alignment/config.py | assets/alignment_config_template.py | AlignmentConfig dataclass with validation |
Public Contract
All alignment interactions use the ModalityAligner interface:
forward(
modality_a: str,
feats_a: Tensor, # (B, T_a, D_enc)
modality_b: str,
feats_b: Tensor, # (B, T_b, D_enc)
mask_a: Optional[Tensor] = None, # (B, T_a) bool
mask_b: Optional[Tensor] = None, # (B, T_b) bool
) -> AlignmentOutput
The AlignmentOutput contains the projected representations, a cross-modal similarity matrix, an optional loss term, and alignment diagnostics:
| Field | Shape | Dtype | Description |
|---|---|---|---|
| proj_a | (B, D_align) | float | Projected + pooled representation for modality A |
| proj_b | (B, D_align) | float | Projected + pooled representation for modality B |
| similarity | (B, B) | float | Cross-modal similarity matrix |
| loss | () | float | Contrastive alignment loss (when training) |
| metrics | Dict | -- | Alignment diagnostics (gap, uniformity, etc.) |
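
The contract above maps token-level features of shape (B, T, D) to pooled outputs of shape (B, D_align), with optional boolean masks. The pooling operator is not specified here, so the following is only a sketch that assumes masked mean pooling; `masked_mean_pool` is a hypothetical helper name, not part of the ModalityAligner API.

```python
from typing import Optional

import numpy as np


def masked_mean_pool(feats: np.ndarray, mask: Optional[np.ndarray] = None) -> np.ndarray:
    """Pool (B, T, D) token features to (B, D), ignoring positions where mask is False.

    Assumption: mean pooling. The actual aligner may pool differently.
    """
    if mask is None:
        return feats.mean(axis=1)
    weights = mask.astype(feats.dtype)[:, :, None]   # (B, T, 1)
    counts = np.maximum(weights.sum(axis=1), 1.0)    # (B, 1); guards fully masked rows
    return (feats * weights).sum(axis=1) / counts


feats_a = np.arange(12, dtype=np.float64).reshape(1, 3, 4)  # B=1, T_a=3, D=4
mask_a = np.array([[True, True, False]])                     # last token is padding
pooled_a = masked_mean_pool(feats_a, mask_a)                 # averages the first two tokens only
```

Padding positions contribute nothing to the pooled vector, which keeps variable-length inputs from diluting the representation.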
Contrastive Alignment (CLIP-Style)
The primary alignment mechanism uses contrastive learning with InfoNCE loss over paired representations from different modalities.
InfoNCE Loss
Given a batch of B paired representations (z_a, z_b):
sim(i, j) = cos(z_a_i, z_b_j) / temperature
loss = 0.5 * (CE(sim, I) + CE(sim^T, I))
Where CE is cross-entropy with identity labels (diagonal = positive pairs), applied once per retrieval direction (A to B and B to A).
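As a concrete reading of the formula, here is a minimal NumPy sketch of the symmetric loss. It assumes the inputs are already L2-normalized (so the dot product is the cosine) and treats cross-entropy as the usual non-negative negative log-likelihood. The function names are illustrative and are not the ones in the losses template.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over B paired, unit-norm embeddings of shape (B, D)."""
    sim = (z_a @ z_b.T) / temperature                          # (B, B)
    diag = np.arange(sim.shape[0])
    ce_a_to_b = -log_softmax(sim, axis=1)[diag, diag].mean()   # CE(sim, I)
    ce_b_to_a = -log_softmax(sim, axis=0)[diag, diag].mean()   # CE(sim^T, I)
    return float(0.5 * (ce_a_to_b + ce_b_to_a))


rng = np.random.default_rng(0)
z = l2_normalize(rng.normal(size=(8, 16)))
aligned_loss = info_nce(z, z)           # every pair matches itself: loss near zero
mismatched_loss = info_nce(z, z[::-1])  # pairs deliberately misassigned: loss is large
```

When all similarities are equal, the loss equals log(B), which is a useful chance-level reference when reading training curves.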
Temperature controls sharpness: lower temperature produces harder contrastive
signals. Typical range: 0.01 to 0.1 (learnable or fixed).
SigLIP Variant
Replaces softmax cross-entropy with pairwise sigmoid binary cross-entropy:
loss = -mean(log_sigmoid(t_ij * (2 * y_ij - 1)))
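Where t_ij = cos(z_a_i, z_b_j) / temperature and y_ij = 1 if i == j else 0. The original SigLIP formulation also adds a learnable bias to t_ij to offset the heavy imbalance toward negative pairs.
SigLIP scales better with batch size because each pair contributes an independent term, with no softmax normalization across the whole batch.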
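The pairwise form can be sketched in a few lines of NumPy. This is an illustration of the formula above with an optional scalar bias, not the losses template itself. It assumes unit-norm inputs, and the default `temperature` and `bias` values are placeholders.

```python
import numpy as np


def log_sigmoid(x: np.ndarray) -> np.ndarray:
    # log(sigmoid(x)) = -log(1 + exp(-x)), computed stably with logaddexp
    return -np.logaddexp(0.0, -x)


def siglip_loss(z_a: np.ndarray, z_b: np.ndarray,
                temperature: float = 0.1, bias: float = 0.0) -> float:
    """Mean pairwise sigmoid loss over all B x B pairs of unit-norm embeddings."""
    logits = (z_a @ z_b.T) / temperature + bias      # t_ij (plus optional bias)
    signs = 2.0 * np.eye(logits.shape[0]) - 1.0      # 2 * y_ij - 1: +1 on the diagonal, -1 elsewhere
    return float(-log_sigmoid(signs * logits).mean())


z = np.eye(4)                     # four orthonormal embeddings, perfectly paired
loss_no_bias = siglip_loss(z, z)  # negatives sit at logit 0, so each costs log(2)
loss_with_bias = siglip_loss(z, z, bias=-5.0)  # a negative bias cheapens the many negatives
```

With B = 4 there are 12 negatives and only 4 positives, which is why a negative bias lowers the loss in this toy case.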
Batch Size Effects
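Contrastive alignment quality generally improves with batch size: each positive pair is contrasted against B - 1 in-batch negatives, so larger batches give a sharper training signal for the similarity function. Effective strategies for small-batch regimes include memory banks, momentum encoders, and accumulating embeddings across micro-batches so that the similarity matrix covers the full effective batch (plain gradient accumulation alone does not add negatives).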
Projection Heads
Learnable projectors map from each encoder's output space to the shared alignment space. Projection heads are critical for decoupling the encoder representation (optimized for modality-specific tasks) from the alignment representation (optimized for cross-modal similarity).
Architecture Options
| Variant | Architecture | When to Use |
|---|---|---|
| Linear | Linear(D_enc, D_align) | Baseline, fast, fewer params |
| MLP-1 | Linear -> LayerNorm -> GELU -> Linear | Default, good accuracy/cost tradeoff |
| MLP-2 | Linear -> LN -> GELU -> Linear -> LN -> GELU -> Linear | Large-scale, high-capacity |
All projectors output L2-normalized embeddings. The normalization happens after projection, ensuring that cosine similarity is well-behaved.
Modality-Specific vs Shared
Each modality gets its own projection head. Sharing projectors across modalities is an anti-pattern because encoder output distributions differ substantially between modalities (vision features from patch embeddings vs text features from token embeddings have very different statistics).
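To make the MLP-1 variant and the one-head-per-modality rule concrete, here is a forward-pass-only NumPy sketch. It is an illustration under stated assumptions: random weights, a LayerNorm without learned scale and shift, the tanh approximation of GELU, and a hidden width equal to D_enc. The class name `MLP1Projector` is hypothetical; a trainable version would live in a deep learning framework.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalizes the last axis; learned gain and bias omitted for brevity
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLP1Projector:
    """Linear -> LayerNorm -> GELU -> Linear, followed by L2 normalization."""

    def __init__(self, d_enc: int, d_align: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=d_enc ** -0.5, size=(d_enc, d_enc))
        self.b1 = np.zeros(d_enc)
        self.w2 = rng.normal(scale=d_enc ** -0.5, size=(d_enc, d_align))
        self.b2 = np.zeros(d_align)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        hidden = gelu(layer_norm(x @ self.w1 + self.b1))
        out = hidden @ self.w2 + self.b2
        # Normalize after projection so dot products are cosine similarities
        return out / np.maximum(np.linalg.norm(out, axis=-1, keepdims=True), 1e-12)


# Separate heads per modality; encoder widths may differ, D_align is shared
vision_head = MLP1Projector(d_enc=32, d_align=8, seed=0)
text_head = MLP1Projector(d_enc=24, d_align=8, seed=1)

rng = np.random.default_rng(42)
proj_vision = vision_head(rng.normal(size=(4, 32)))
proj_text = text_head(rng.normal(size=(4, 24)))
similarity = proj_vision @ proj_text.T   # (4, 4) cosine similarity matrix
```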
Modality Gap Problem
Even after contrastive training, representations from different modalities often cluster in separate regions of the shared space. This "modality gap" means that a vision embedding and a text embedding describing the same concept may have high cosine similarity relative to other pairs but still occupy distinct neighborhoods in absolute terms.
Causes
- Initialization bias -- random projector init creates modality-specific clusters that training may not fully dissolve.
- Cone effect -- L2-normalized embeddings lie on a hypersphere; modalities occupy different cones on that sphere.
- Training dynamics -- contrastive loss only requires relative ordering (positive > negative), not absolute proximity.
Solutions
| Technique | Mechanism | Config Field |
|---|---|---|
| Uniformity loss | Penalizes clustering on the hypersphere | use_uniformity_loss |
| Centering | Subtract running mean per modality | use_centering |
| Gap regularization | Penalize distance between modality centroids | gap_reg_weight |
| Shared init | Initialize projectors from common weights | shared_init |
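
The gap and two of the remedies in the table can be measured with a few lines of NumPy. This is a sketch under assumptions: the gap is taken as the Euclidean distance between modality centroids, and uniformity follows the common log-mean-exp of pairwise squared distances, where more negative means more uniform. The exact definitions used by AlignmentMetrics are not given here, and all function names are illustrative.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def centroid_gap(z_a: np.ndarray, z_b: np.ndarray) -> float:
    """Distance between the two modality centroids (assumed gap definition)."""
    return float(np.linalg.norm(z_a.mean(axis=0) - z_b.mean(axis=0)))


def uniformity(z: np.ndarray, t: float = 2.0) -> float:
    """log mean exp(-t * ||z_i - z_j||^2) over distinct pairs; 0 means full collapse."""
    sq_dists = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    pairs = np.triu_indices(z.shape[0], k=1)
    return float(np.log(np.exp(-t * sq_dists[pairs]).mean()))


def update_centroid(running: np.ndarray, batch: np.ndarray, momentum: float = 0.9) -> np.ndarray:
    """EMA update of a running modality centroid, as used for centering."""
    return momentum * running + (1.0 - momentum) * batch.mean(axis=0)


# Toy data: two modalities pushed into opposite cones of the hypersphere
rng = np.random.default_rng(0)
offset = np.zeros(16)
offset[0] = 2.0
z_vision = l2_normalize(rng.normal(size=(256, 16)) + offset)
z_text = l2_normalize(rng.normal(size=(256, 16)) - offset)
gap_before = centroid_gap(z_vision, z_text)

# Centering: subtract each modality's centroid, then re-normalize
z_vision_c = l2_normalize(z_vision - z_vision.mean(axis=0))
z_text_c = l2_normalize(z_text - z_text.mean(axis=0))
gap_after = centroid_gap(z_vision_c, z_text_c)
```

In this toy setup centering shrinks the centroid distance and also spreads each modality more evenly over the sphere.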
Cross-Attention Alignment
Complementary to contrastive alignment, cross-attention computes soft alignment scores between token-level representations from two modalities. This is useful when fine-grained correspondence matters (e.g., which image patch corresponds to which text token).
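# Cross-attention: Q from modality A, K/V from modality B
attn_weights = softmax(Q_a @ K_b^T / sqrt(d_k))   # (T_a, T_b) soft alignment map
attended_a = attn_weights @ V_b                   # modality-B content gathered per A token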
Cross-attention alignment produces token-level alignment maps that can be supervised (when ground-truth correspondences exist) or used unsupervised as auxiliary alignment signals.
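A batched, mask-aware version of this computation might look as follows in NumPy. It is a single-head sketch without learned query/key/value projections; the function name and the large negative fill value for padded keys are illustrative choices.

```python
from typing import Optional, Tuple

import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(q_a: np.ndarray, k_b: np.ndarray, v_b: np.ndarray,
                    mask_b: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
    """q_a: (B, T_a, d_k), k_b: (B, T_b, d_k), v_b: (B, T_b, d_v), mask_b: (B, T_b) bool.

    Returns the attended output (B, T_a, d_v) and the alignment map (B, T_a, T_b).
    """
    d_k = q_a.shape[-1]
    scores = q_a @ np.swapaxes(k_b, -1, -2) / np.sqrt(d_k)    # (B, T_a, T_b)
    if mask_b is not None:
        # Padded modality-B tokens get a very low score so they receive zero weight
        scores = np.where(mask_b[:, None, :], scores, -1e9)
    weights = softmax(scores, axis=-1)                         # each row sums to 1
    return weights @ v_b, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 4))    # e.g. 3 text tokens
k = rng.normal(size=(2, 5, 4))    # e.g. 5 image patches
v = rng.normal(size=(2, 5, 6))
mask_b = np.ones((2, 5), dtype=bool)
mask_b[:, -1] = False             # last patch is padding
attended, alignment_map = cross_attention(q, k, v, mask_b)
```

The returned `alignment_map` is the token-level map that can be supervised or used as an auxiliary signal.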
Hard Negative Mining
Standard contrastive learning uses all in-batch pairs as negatives. Hard negative mining selects the most informative negatives -- those that are similar but not matching. Strategies:
- In-batch hard negatives -- for each positive pair, identify the highest-scoring negative within the batch.
- Semi-hard negatives -- negatives that score below the positive but within the margin of it, so they still contribute a non-zero loss.
- Cross-modal hard negatives -- negatives from different modalities that share partial semantic overlap.
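
The first two strategies can be sketched directly on the (B, B) similarity matrix. The semi-hard rule below uses the common triplet-loss definition expressed in similarity terms (positive score minus margin < negative score < positive score). The function names and the margin value are assumptions for illustration.

```python
import numpy as np


def hardest_in_batch_negatives(sim: np.ndarray) -> np.ndarray:
    """For each row, index of the highest-scoring column that is not the positive."""
    masked = sim.astype(float)           # astype returns a copy
    np.fill_diagonal(masked, -np.inf)    # exclude positives on the diagonal
    return masked.argmax(axis=1)


def semi_hard_negative_mask(sim: np.ndarray, margin: float = 0.2) -> np.ndarray:
    """Boolean (B, B) mask of negatives scoring below the positive but within the margin."""
    pos = np.diag(sim)[:, None]                     # (B, 1) positive score per row
    mask = (sim < pos) & (sim > pos - margin)
    np.fill_diagonal(mask, False)
    return mask


sim = np.array([
    [0.90, 0.80, 0.10],
    [0.20, 0.70, 0.60],
    [0.95, 0.25, 0.50],   # column 0 outranks the positive: a hard negative, not semi-hard
])
hardest = hardest_in_batch_negatives(sim)
semi_hard = semi_hard_negative_mask(sim, margin=0.2)
```

Selected negatives can then be up-weighted in the loss or used to build the next batch; which of the two is appropriate depends on the training loop.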
Curriculum Alignment
Progressive training strategy:
| Phase | Epoch Range | Pairs | Difficulty |
|---|---|---|---|
| Warm-up | 0 -- 5 | Easy, high-similarity pairs | Higher temperature (softer signal) |
| Standard | 5 -- 20 | All pairs, uniform sampling | Medium temperature |
| Hard | 20+ | Hard negative emphasis | Lower temperature (sharper signal) |
Curriculum alignment prevents early representation collapse by starting with clearly distinct pairs before introducing ambiguous cases.
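The phase boundaries from the table can be encoded as a small lookup. This sketch assumes boundary epochs (5 and 20) belong to the later phase, since the table ranges overlap at those values. `CurriculumPhase`, `PHASES`, and the `pair_sampling` labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurriculumPhase:
    name: str
    start_epoch: int
    pair_sampling: str   # illustrative label for how pairs are drawn


# Boundaries taken from the table; assigning epochs 5 and 20 to the later phase is an assumption
PHASES = (
    CurriculumPhase("warm-up", 0, "easy"),
    CurriculumPhase("standard", 5, "uniform"),
    CurriculumPhase("hard", 20, "hard_negative"),
)


def phase_for_epoch(epoch: int) -> CurriculumPhase:
    """Return the latest phase whose start epoch has been reached."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    current = PHASES[0]
    for phase in PHASES:
        if epoch >= phase.start_epoch:
            current = phase
    return current


schedule = [phase_for_epoch(e).name for e in (0, 4, 5, 19, 20, 50)]
```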
Integration with Global Workspace
Alignment quality directly affects workspace competition. When modality representations are well-aligned:
- Competition scoring -- cross-modal similarity scores are meaningful, enabling fair comparison across modalities.
- Workspace slots -- fused slot representations combine information from multiple modalities coherently.
- Broadcast quality -- broadcast packets carry semantically rich content that downstream modules can interpret regardless of source modality.
The ModalityAligner runs as a pre-processing step before workspace competition. Encoder outputs pass through their respective projection heads to produce aligned representations before entering the workspace.
Binding Problem
Ensuring correct feature binding across modalities -- the right visual features associate with the right text tokens -- requires explicit mechanisms:
- Temporal synchrony -- features arriving at the same timestamp from different modalities are assumed to correspond.
- Spatial correspondence -- for vision-text pairs, spatial attention maps indicate which image regions correspond to which text spans.
- Feature binding tags -- optional learned binding vectors that modulate cross-modal attention based on semantic type.
Configuration Surface
AlignmentConfig
| Field | Default | Purpose |
|---|---|---|
| alignment_dim | 512 | Shared embedding dimension |
| temperature | 0.07 | Contrastive loss temperature |
| learnable_temperature | True | Make temperature a learnable parameter |
| loss_type | "infonce" | Loss: infonce, siglip, or combined |
| projector_type | "mlp_1" | Projector: linear, mlp_1, mlp_2 |
| projector_dropout | 0.1 | Dropout in projection heads |
| use_uniformity_loss | True | Add uniformity regularization |
| uniformity_weight | 0.1 | Weight for uniformity loss |
| use_centering | True | Subtract running modality centroids |
| centering_momentum | 0.9 | EMA momentum for centroid tracking |
| gap_reg_weight | 0.01 | Weight for modality gap regularization |
| hard_negative_mining | False | Enable hard negative selection |
| curriculum_phases | 3 | Number of curriculum difficulty phases |
Presets: AlignmentConfig.minimal(), AlignmentConfig.dev(),
AlignmentConfig.production().
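A dataclass mirroring the table might look like the sketch below. Field names and defaults come from the table. The validation rules are assumptions (the temperature bounds reuse the range from the done-when gates), and the presets are omitted because their values are not stated here. The class is named `AlignmentConfigSketch` to mark it as illustrative rather than the template's actual AlignmentConfig.

```python
from dataclasses import dataclass

_LOSS_TYPES = ("infonce", "siglip", "combined")
_PROJECTOR_TYPES = ("linear", "mlp_1", "mlp_2")


@dataclass
class AlignmentConfigSketch:
    alignment_dim: int = 512
    temperature: float = 0.07
    learnable_temperature: bool = True
    loss_type: str = "infonce"
    projector_type: str = "mlp_1"
    projector_dropout: float = 0.1
    use_uniformity_loss: bool = True
    uniformity_weight: float = 0.1
    use_centering: bool = True
    centering_momentum: float = 0.9
    gap_reg_weight: float = 0.01
    hard_negative_mining: bool = False
    curriculum_phases: int = 3

    def __post_init__(self) -> None:
        # Hypothetical validation rules; the real template may check different things
        if self.alignment_dim <= 0:
            raise ValueError("alignment_dim must be positive")
        if not 0.001 <= self.temperature <= 1.0:
            raise ValueError("temperature must lie in [0.001, 1.0]")
        if self.loss_type not in _LOSS_TYPES:
            raise ValueError(f"loss_type must be one of {_LOSS_TYPES}")
        if self.projector_type not in _PROJECTOR_TYPES:
            raise ValueError(f"projector_type must be one of {_PROJECTOR_TYPES}")
        if not 0.0 <= self.projector_dropout < 1.0:
            raise ValueError("projector_dropout must lie in [0, 1)")
        if not 0.0 <= self.centering_momentum < 1.0:
            raise ValueError("centering_momentum must lie in [0, 1)")


default_cfg = AlignmentConfigSketch()
siglip_cfg = AlignmentConfigSketch(loss_type="siglip", projector_type="mlp_2")
```

Validating in `__post_init__` means a bad value fails at construction time rather than deep inside a training run.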
Done-When Gates
| Gate | Criterion |
|---|---|
| Contrastive loss converges | Loss < 2.0 for B >= 64 after 1K steps |
| Modality gap < threshold | Mean centroid distance < 0.3 in shared space |
| Cross-modal retrieval R@1 | > 50% on validation pairs |
| Uniformity score | > -2.0 (log uniformity on hypersphere) |
| No representation collapse | Singular value ratio > 0.01 |
| Gradient flows end-to-end | Non-zero grad on all projection parameters |
| Temperature in valid range | 0.001 <= temperature <= 1.0 after training |
| Alignment metrics stable | Variance of metrics < 5% over 100 steps |
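
Two of these gates are easy to check from arrays already produced by the aligner. The sketch below assumes R@K is measured on the (B, B) similarity matrix with positives on the diagonal, and that "singular value ratio" means smallest over largest singular value of the centered embedding matrix. The source does not define the ratio, so treat that definition and the function names as assumptions.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Fraction of rows whose paired column (same index) ranks in the top k."""
    top_k = np.argsort(-sim, axis=1)[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())


def singular_value_ratio(z: np.ndarray) -> float:
    """Smallest / largest singular value of the centered (N, D) embeddings.

    Values near zero indicate collapsed or rank-deficient embeddings.
    """
    s = np.linalg.svd(z - z.mean(axis=0), compute_uv=False)  # sorted descending
    return float(s[-1] / s[0]) if s[0] > 0 else 0.0


sim = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],   # row 2 retrieves column 0 first: a miss at K=1
])
r_at_1 = recall_at_k(sim, k=1)
r_at_2 = recall_at_k(sim, k=2)

rng = np.random.default_rng(0)
healthy_ratio = singular_value_ratio(rng.normal(size=(50, 4)))
collapsed_ratio = singular_value_ratio(np.ones((10, 4)))   # all embeddings identical
```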
Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| All embeddings collapse to a point | Temperature too low or no uniformity loss | Raise temperature, enable uniformity loss |
| Modality gap persists after training | No gap regularization or centering | Enable centering + gap_reg_weight |
| Cross-modal retrieval near random | Projection heads too small or mismatched | Use MLP projector, check D_align |
| Loss oscillates without converging | Learning rate too high for projectors | Reduce projector LR, use warmup |
| Temperature diverges | Unconstrained learnable temperature | Clamp temperature to [0.001, 1.0] |
| Gradient explosion in similarity matrix | Large batch + low temperature | Gradient clipping, increase temperature |
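
The temperature clamp in the table can be applied wherever the learnable scalar is read. This sketch assumes the parameter is stored in log space, a common choice that keeps the temperature positive, and clamps to the [0.001, 1.0] range given above. The names are illustrative.

```python
import math

T_MIN, T_MAX = 0.001, 1.0


def effective_temperature(log_temperature: float) -> float:
    """Map an unconstrained learnable scalar to a temperature clamped to [T_MIN, T_MAX]."""
    return min(max(math.exp(log_temperature), T_MIN), T_MAX)


init_log_t = math.log(0.07)                        # start from the configured default
t_start = effective_temperature(init_log_t)
t_after_drift_low = effective_temperature(-50.0)   # clamped at T_MIN
t_after_drift_high = effective_temperature(5.0)    # clamped at T_MAX
```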
Anti-Patterns
- Sharing projectors across modalities -- each modality needs its own head
- Skipping L2 normalization -- the dot product equals cosine similarity only for unit-norm vectors
- Using MSE instead of contrastive loss -- MSE on positive pairs alone has no repulsive term, so a collapsed solution minimizes it
- Ignoring the modality gap -- leads to unfair workspace competition
- Training alignment separately from the pipeline -- must backprop through workspace
- Using fp16 for temperature -- temperature needs fp32 precision
- Batch size < 32 without memory bank -- too few negatives for contrastive learning
Additional Resources
Reference Files
- references/contrastive-theory.md -- InfoNCE math, temperature tuning, CLIP/SigLIP variants, batch size effects
- references/projection-architectures.md -- Linear vs MLP projectors, normalization, dimension choices
- references/modality-gap.md -- Gap phenomenon, uniformity loss, centering, gap regularization
- references/binding-mechanisms.md -- Temporal binding, spatial correspondence, feature binding
- references/testing-matrix.md -- All test cases, pytest patterns, done-when checklist
Asset Templates
- assets/modality_aligner_template.py -- ModalityAligner, ProjectionHead, AlignmentOutput, self-test
- assets/contrastive_losses_template.py -- InfoNCE, SigLIP, uniformity + alignment, combined losses, self-test
- assets/cross_modal_retriever_template.py -- CrossModalRetriever, retrieval evaluation, self-test
- assets/alignment_metrics_template.py -- AlignmentMetrics, gap measurement, uniformity, SVD health, self-test
- assets/alignment_config_template.py -- AlignmentConfig dataclass, validation, presets, serialization, self-test
Scripts
- scripts/validate_alignment.py -- Runtime contract validation (projection shapes, loss values, metric ranges)
- scripts/gen_alignment_tests.py -- Generates tests/test_alignment.py (~70+ test cases)
- scripts/alignment_benchmark.py -- Performance benchmarking (throughput, memory, scaling)