name: vdjdb-harmonize
description: Harmonize antigen.gene and antigen.species fields in a VDJdb chunk to canonical VDJdb naming. Detects spurious gene/species names, resolves inconsistencies (same epitope → multiple names), and warns about epitopes that are exact substrings of longer epitopes. Invoked standalone or from /proofread when spurious values are detected.
/harmonize — VDJdb Antigen Gene/Species Harmonization Skill
Purpose
Normalize antigen.gene and antigen.species values in a chunk to the VDJdb canonical vocabulary. This skill uses a unified, layered lookup system drawing on three lookup files:
- `patches/antigen_epitope_species_gene.dict`: epitope-keyed authority that maps known epitopes directly to their canonical species and gene. This is the highest-confidence source.
- `proofreading/gene_aliases.tsv`: free-text gene name aliases → VDJdb gene name.
- `proofreading/species_aliases.tsv`: species substring fragments → VDJdb species name. Checked in order; first match wins.
Invocation
/harmonize [path-to-tsv]
Or called automatically from /proofread (Step 6a) when spurious antigen fields are detected.
Step 1 — Load Lookup Tables
Load all three sources at the start of the session:
```python
import csv
import re

# 1a. Epitope → (species, gene) from patches/
epitope_dict = {}  # {epitope: (species, gene)}
with open('patches/antigen_epitope_species_gene.dict') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        ep = row['antigen.epitope'].strip()
        epitope_dict[ep] = (row['antigen.species'].strip(), row['antigen.gene'].strip())

# 1b. Gene alias table
gene_map = {}  # {raw_name: canonical}
with open('proofreading/gene_aliases.tsv') as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split('\t')
        # Skip the header row (first column 'source_name' or 'vdjdb_name')
        if len(parts) >= 2 and parts[0] not in ('source_name', 'vdjdb_name'):
            gene_map[parts[0].strip()] = parts[1].strip()

# 1c. Species fragment list (order-sensitive)
species_fragments = []  # [(fragment_lower, canonical), ...]
with open('proofreading/species_aliases.tsv') as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split('\t')
        # Skip the header row (first column 'fragment')
        if len(parts) >= 2 and parts[0] != 'fragment':
            species_fragments.append((parts[0].strip().lower(), parts[1].strip()))
```
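Because the 1a loop assigns `epitope_dict[ep]` unconditionally, a later row for the same epitope silently replaces an earlier one. The following is a minimal sketch of a check that surfaces such conflicts before loading, assuming the three column names used above. `find_conflicting_epitopes` is a hypothetical helper, not part of the existing pipeline, and the sample rows are placeholders rather than real VDJdb entries.

```python
import csv
import io
from collections import defaultdict

def find_conflicting_epitopes(handle):
    """Return {epitope: sorted list of distinct (species, gene) pairs} for
    epitopes that appear with more than one mapping in the dict file."""
    seen = defaultdict(set)
    for row in csv.DictReader(handle, delimiter='\t'):
        ep = (row.get('antigen.epitope') or '').strip()
        if not ep:
            continue
        pair = ((row.get('antigen.species') or '').strip(),
                (row.get('antigen.gene') or '').strip())
        seen[ep].add(pair)
    return {ep: sorted(pairs) for ep, pairs in seen.items() if len(pairs) > 1}

# Illustrative in-memory data only; epitopes and values are placeholders.
sample = io.StringIO(
    'antigen.epitope\tantigen.species\tantigen.gene\n'
    'AAAAAAAAA\tSpeciesX\tgeneA\n'
    'AAAAAAAAA\tSpeciesX\tgeneB\n'
    'CCCCCCCCC\tSpeciesY\tgeneC\n'
)
conflicts = find_conflicting_epitopes(sample)
for ep, pairs in conflicts.items():
    print(f'CONFLICT in epitope dict for {ep!r}: {pairs}')
```

In practice the handle would be the opened `patches/antigen_epitope_species_gene.dict` file; any reported epitope should be resolved by hand before the dict is trusted as the Priority 1 source.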
Step 2 — Harmonization Functions
Use this unified pipeline for every row:
```python
GENE_STRIP_SUFFIXES = (' protein', ' glycoprotein', ' polyprotein', ' precursor')

def harmonize_gene(raw: str) -> str:
    """Map raw antigen.gene to VDJdb canonical name."""
    if not raw:
        return raw
    # Strip embedded [species] annotation (e.g. "pp65 [CMV]")
    s = re.sub(r'\s*\[[^\]]+\]\s*$', '', raw).strip()
    # Exact match
    if s in gene_map:
        return gene_map[s]
    # Strip common trailing suffixes and retry
    for suffix in GENE_STRIP_SUFFIXES:
        if s.lower().endswith(suffix) and len(s) > len(suffix):
            stripped = s[:-len(suffix)].strip()
            if stripped in gene_map:
                return gene_map[stripped]
            if len(stripped) >= 2:
                return stripped  # cleaned fallback
    return s

def harmonize_species(raw: str) -> str:
    """Map raw antigen.species to VDJdb canonical name via substring fragments."""
    if not raw:
        return raw
    low = raw.lower()
    for fragment, canonical in species_fragments:
        if fragment in low:
            return canonical  # first match wins
    return raw  # unchanged if no match

def harmonize_row(row: dict) -> dict:
    """Apply full harmonization pipeline to a single row."""
    ep = row.get('antigen.epitope', '').strip()
    # Priority 1: epitope dict (most authoritative)
    if ep in epitope_dict:
        canon_species, canon_gene = epitope_dict[ep]
        row['antigen.species'] = canon_species
        row['antigen.gene'] = canon_gene
        return row
    # Priority 2: alias tables
    raw_gene = row.get('antigen.gene', '').strip()
    raw_species = row.get('antigen.species', '').strip()
    new_gene = harmonize_gene(raw_gene)
    new_species = harmonize_species(raw_species)
    if new_gene != raw_gene:
        row['antigen.gene'] = new_gene
    if new_species != raw_species:
        row['antigen.species'] = new_species
    return row
```
Step 3 — Detect Spurious Values
Before harmonizing, scan the chunk and flag any rows where antigen.gene or antigen.species looks suspicious. These are the patterns that trigger automatic /harmonize invocation from /proofread:
3a. Spurious antigen.gene indicators
| Pattern | Example | Action |
|---|---|---|
| Contains `[...]` species annotation | `pp65 [CMV]` | Strip annotation, re-lookup |
| Ends with `protein`, `glycoprotein`, `polyprotein`, `precursor` | `Spike glycoprotein` | Strip suffix, canonical lookup |
| Starts with `Probable `, `Putative `, `Chain [A-Z], `, `MULTISPECIES: ` | `Probable ATP-dependent RNA helicase DDX5` | Strip prefix, canonical lookup |
| Length > 30 characters | `RNA-directed RNA polymerase catalytic subunit` | Alias lookup; convert to HGNC symbol or abbreviation |
| Contains `,` (comma) | `Sterol-4-alpha-carboxylate 3-dehydrogenase, decarboxylating` | Full description: look up in proofreading/gene_aliases.tsv, convert to gene symbol |
| Contains spaces AND is not a known multi-word canonical name | `Nucleoprotein M1` | Flag for manual review |
| Matches `Polyprotein` or `polyprotein` exactly | | Flag: epitopes from a polyprotein should have a specific gene assigned; look up by epitope in dict |
| Null / empty | | Flag `no.antigen.gene` |
VDJdb gene naming conventions:
- Human/mouse genes: HGNC uppercase symbol (e.g., `PABPC1`, `SMC1A`, `COL18A1`)
- Viral genes: use established short names from literature (e.g., `pp65`, `BMLF1`, `Gag`, `Pol`, `Tax`)
- Bacterial/parasitic genes: use standard gene symbol or protein abbreviation (e.g., `glnA`, `GRA6`, `yeiH`)
- Avoid: full UniProt protein descriptions, parenthetical qualifiers, `Chain [X], ` prefixes from PDB entries
- Special cases: `P protein` (HBV) → `Pol`; LCMV `Gp33(variant)` → `GPC`; `MULTISPECIES:` prefix → strip prefix then look up
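```python
KNOWN_MULTIWORD_GENES = {'PB1-F2', 'non-structural', 'Large T antigen', 'HLA-DRB1',
                         'HLA-DQB1', 'HLA-DPB1', 'NY-ESO-1', 'CORT_0A05310'}  # extend as needed

def is_spurious_gene(gene: str) -> bool:
    if not gene:
        return True  # null / empty
    if re.search(r'\[.+\]', gene):
        return True  # embedded [species] annotation
    if any(gene.lower().endswith(s) for s in GENE_STRIP_SUFFIXES):
        return True  # descriptive suffix
    if re.match(r'(Probable |Putative |Chain [A-Z], |MULTISPECIES: )', gene):
        return True  # annotation prefix from the table above
    if ',' in gene:
        return True  # full protein description
    known = gene in KNOWN_MULTIWORD_GENES
    if len(gene) > 30 and not known:
        return True  # too long for a gene symbol
    if ' ' in gene and not known:
        return True  # multi-word name that is not a known canonical one
    if gene.lower() in ('polyprotein', 'unknown', 'na', 'n/a'):
        return True
    return False
```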
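The 3a table lists prefix stripping as the action for names that start with `Probable `, `Putative `, `Chain [A-Z], ` or `MULTISPECIES: `, while `harmonize_gene` in Step 2 only strips the bracketed annotation and trailing suffixes. A minimal sketch of a prefix-stripping helper follows, assuming exactly the four prefixes from the table. `strip_gene_prefix` and `LEADING_PREFIX_RE` are hypothetical names, and the regex is one possible reading of the table, not an existing part of the skill.

```python
import re

# Prefixes taken from the 3a table; the exact pattern is an assumption.
# The trailing '+' lets stacked prefixes be removed in one pass.
LEADING_PREFIX_RE = re.compile(r'^(?:Probable |Putative |Chain [A-Z], |MULTISPECIES: )+')

def strip_gene_prefix(raw: str) -> str:
    """Remove leading annotation prefixes; leave the rest of the name untouched."""
    return LEADING_PREFIX_RE.sub('', raw).strip()

cleaned = strip_gene_prefix('Probable ATP-dependent RNA helicase DDX5')
print(cleaned)
```

The cleaned string would then go through the alias lookup as usual, for example `harmonize_gene(strip_gene_prefix(raw))`; a name that still has no alias entry should be added to proofreading/gene_aliases.tsv.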
3b. Spurious antigen.species indicators
| Pattern | Example | Action |
|---|---|---|
| Contains full scientific name with spaces | `Human herpesvirus 4` | Fragment match lookup in proofreading/species_aliases.tsv |
| Contains parenthetical common name | `Columba livia (carrier pigeon)` | Strip parens, apply CamelCase: `ColumbaLivia` |
| Known alias variants | `HIV`, `HTLV`, `Influenza`, `IAV` | Map to canonical |
| Capitalization mismatch | `Epstein barr virus`, `influenzaA` | Normalise |
| Not in known canonical set | anything not in the list below | Flag for review |
VDJdb species naming conventions:
- Two-word binomial names → CamelCase with no space: `Homo sapiens` → `HomoSapiens`, `Bacillus subtilis` → `BacillusSubtilis`
- Common abbreviations for well-known pathogens: `EBV`, `CMV`, `HIV-1`, `SARS-CoV-2`, `InfluenzaA`, etc.
- Strip parenthetical common names and former names: `Columba livia (carrier pigeon)` → `ColumbaLivia`; `Schinkia azotoformans (Bacillus azotiformans)` → `SchinkiaAzotoformans`
- Genus-only entries (when species is unknown): keep as a single CamelCase word: `Bacillus [genus]` → `Bacillus`
- `PseudomonasFluorescens`, `PseudomonasAeruginosa`: already CamelCase, with no space between genus and species (both are acceptable as-is if already in the database)
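The binomial rules above can be sketched as a small function. This is an illustration under stated assumptions, not the skill's actual implementation: `to_camel_case_species` is a hypothetical helper, it treats anything in parentheses or square brackets as a qualifier to drop, and it is only meant for binomial or genus-only names. Pathogen abbreviations such as `EBV` still come from the alias table.

```python
import re

def to_camel_case_species(raw: str) -> str:
    """Convert a binomial name to the CamelCase form used for antigen.species."""
    # Drop qualifiers such as "(carrier pigeon)" or "[genus]"
    core = re.sub(r'\s*[\(\[][^\)\]]*[\)\]]', '', raw).strip()
    # Upper-case the first letter of each word and join without spaces
    return ''.join(word[:1].upper() + word[1:] for word in core.split())

for name in ('Homo sapiens',
             'Columba livia (carrier pigeon)',
             'Schinkia azotoformans (Bacillus azotiformans)',
             'Bacillus [genus]'):
    print(f'{name!r} -> {to_camel_case_species(name)!r}')
```

Only the first letter of each word is changed, so a value that is already CamelCase passes through unchanged.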
Known canonical antigen.species values (derive from existing chunks):
EBV, CMV, MCMV, HSV-1, HSV-2, VZV, InfluenzaA, InfluenzaB, SARS-CoV-2, SARS-CoV,
HCoV-OC43, HCoV-HKU1, HIV-1, HCV, HBV, YFV, DENV, DENV1, DENV2, DENV3, DENV3/4,
LCMV, HTLV-1, MLV, RSV, MCPyV, HomoSapiens, MusMusculus, RattusNorvegicus,
MacacaMulatta, GallusGallus, M.tuberculosis, PlasmodiumFalciparum, PlasmodiumBerghei,
Trypanosoma cruzi, SIV, CoxsackievirusB, Wheat, ManducaSexta
```python
CANONICAL_SPECIES = {
    'EBV', 'CMV', 'MCMV', 'HSV-1', 'HSV-2', 'VZV', 'InfluenzaA', 'InfluenzaB',
    'SARS-CoV-2', 'SARS-CoV', 'HCoV-OC43', 'HCoV-HKU1', 'HIV-1', 'HCV', 'HBV',
    'YFV', 'DENV', 'DENV1', 'DENV2', 'DENV3', 'DENV3/4', 'LCMV', 'HTLV-1',
    'MLV', 'RSV', 'MCPyV', 'HomoSapiens', 'MusMusculus', 'RattusNorvegicus',
    'MacacaMulatta', 'GallusGallus', 'M.tuberculosis', 'PlasmodiumFalciparum',
    'PlasmodiumBerghei', 'Trypanosoma cruzi', 'SIV', 'CoxsackievirusB',
    'Wheat', 'ManducaSexta',
}

def is_spurious_species(species: str) -> bool:
    if not species:
        return True
    # Anything outside the canonical set is flagged: a multi-word value is
    # likely not normalised, and an unknown single token needs review.
    return species not in CANONICAL_SPECIES
```
Step 4 — Cross-Chunk Consistency Check
After harmonizing individual rows, check for same-epitope inconsistencies across the chunk (and optionally across all of chunks/):
```python
from collections import defaultdict

def check_consistency(rows: list[dict]) -> list[str]:
    warnings = []
    ep_genes = defaultdict(set)
    ep_species = defaultdict(set)
    for r in rows:
        ep = r.get('antigen.epitope', '').strip()
        g = r.get('antigen.gene', '').strip()
        s = r.get('antigen.species', '').strip()
        if ep:
            if g:
                ep_genes[ep].add(g)
            if s:
                ep_species[ep].add(s)
    # Values are sorted so the report is stable from run to run
    for ep, genes in ep_genes.items():
        if len(genes) > 1:
            warnings.append(f'INCONSISTENT antigen.gene for epitope {ep!r}: {sorted(genes)}; '
                            f'check patches/antigen_epitope_species_gene.dict')
    for ep, sps in ep_species.items():
        if len(sps) > 1:
            warnings.append(f'INCONSISTENT antigen.species for epitope {ep!r}: {sorted(sps)}; '
                            f'check patches/antigen_epitope_species_gene.dict')
    return warnings
```
If the epitope dict has a definitive entry, the inconsistency will be resolved by Step 2. Report remaining inconsistencies (those the dict does not cover) for manual curation.
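To separate the inconsistencies that still need a curator from those the dict already settles, one option is to list only the epitopes that carry conflicting values and have no dict entry. The sketch below assumes rows are plain dicts with the three antigen columns and that `epitope_dict` has the shape built in Step 1. `unresolved_epitopes` is a hypothetical helper and the sample data are placeholders.

```python
from collections import defaultdict

def unresolved_epitopes(rows, epitope_dict):
    """Return epitopes with more than one gene or species value in `rows`
    and no entry in `epitope_dict`, sorted for stable reporting."""
    values = defaultdict(lambda: {'antigen.gene': set(), 'antigen.species': set()})
    for r in rows:
        ep = (r.get('antigen.epitope') or '').strip()
        if not ep:
            continue
        for field in ('antigen.gene', 'antigen.species'):
            v = (r.get(field) or '').strip()
            if v:
                values[ep][field].add(v)
    return sorted(
        ep for ep, fields in values.items()
        if ep not in epitope_dict and any(len(vals) > 1 for vals in fields.values())
    )

# Placeholder rows: both epitopes are inconsistent, but only one is in the dict.
sample_rows = [
    {'antigen.epitope': 'AAAAAAAAA', 'antigen.gene': 'g1', 'antigen.species': 'S'},
    {'antigen.epitope': 'AAAAAAAAA', 'antigen.gene': 'g2', 'antigen.species': 'S'},
    {'antigen.epitope': 'CCCCCCCCC', 'antigen.gene': 'g3', 'antigen.species': 'S'},
    {'antigen.epitope': 'CCCCCCCCC', 'antigen.gene': 'g4', 'antigen.species': 'S'},
]
sample_dict = {'CCCCCCCCC': ('S', 'g3')}
todo = unresolved_epitopes(sample_rows, sample_dict)
print(todo)
```

Each epitope returned this way is a candidate for a new line in patches/antigen_epitope_species_gene.dict once the correct species and gene have been confirmed.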
Step 5 — Epitope Substring Warning
Check whether any epitope in the chunk is an exact substring of a longer epitope in the same chunk, or in the global VDJdb database. This detects truncation artefacts (e.g., EPLPQGQLTAY being a substring of GPEPLPQGQLTAY).
```python
from typing import Optional

def check_epitope_substrings(rows: list[dict],
                             global_epitopes: Optional[set[str]] = None) -> list[str]:
    """
    Warn when epitope A is an exact substring of longer epitope B.
    Checks within the chunk and against global_epitopes if provided.
    """
    warnings = []
    chunk_epitopes = {r.get('antigen.epitope', '').strip() for r in rows if r.get('antigen.epitope')}
    all_epitopes = chunk_epitopes | (global_epitopes or set())
    for ep in sorted(chunk_epitopes):
        if len(ep) < 4:
            continue  # too short to be informative
        for longer in sorted(all_epitopes):
            if longer != ep and ep in longer:
                src = 'chunk' if longer in chunk_epitopes else 'global VDJdb'
                warnings.append(
                    f'SUBSTRING WARNING: {ep!r} is contained in longer epitope {longer!r} ({src}); '
                    f'possible truncation artefact; verify correct epitope boundaries'
                )
    return warnings
```
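When reviewing a substring warning it can help to know which end of the shorter epitope is missing. The sketch below classifies a pair by position, assuming sequences are written N-terminus first as usual. `classify_containment` and its return labels are hypothetical and are not produced by the function above.

```python
def classify_containment(short: str, longer: str) -> str:
    """Describe where `short` sits inside `longer` (sequences written N to C)."""
    if short == longer or short not in longer:
        return 'not contained'
    if longer.startswith(short):
        return 'missing C-terminal residues'   # longer extends past the end of short
    if longer.endswith(short):
        return 'missing N-terminal residues'   # longer has extra residues in front
    return 'missing residues on both sides'

# The example pair from the text above
kind = classify_containment('EPLPQGQLTAY', 'GPEPLPQGQLTAY')
print(kind)
```

For the pair from the text, the longer epitope carries two extra residues in front, so the shorter one is reported as missing N-terminal residues.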
To load global VDJdb epitopes for the cross-chunk check:
```python
import glob

def load_global_epitopes() -> set[str]:
    eps = set()
    for f in glob.glob('chunks/*.txt'):
        with open(f) as fh:
            for row in csv.DictReader(fh, delimiter='\t'):
                ep = row.get('antigen.epitope', '').strip()
                if ep:
                    eps.add(ep)
    return eps
```
Step 6 — Apply and Report
Run the full harmonization on the chunk, report all changes and warnings:
```python
def harmonize_chunk(path: str, check_global: bool = True) -> None:
    with open(path, newline='') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # Take the header from the reader so it survives even if the chunk has no rows
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    changes = []
    for i, row in enumerate(rows):
        orig_gene = row.get('antigen.gene', '')
        orig_species = row.get('antigen.species', '')
        row = harmonize_row(row)
        row_id = row.get('chunk.id', i + 1)  # fall back to the 1-based row index
        if row.get('antigen.gene', '') != orig_gene:
            changes.append(f'  ROW {row_id}: antigen.gene {orig_gene!r} → {row["antigen.gene"]!r}')
        if row.get('antigen.species', '') != orig_species:
            changes.append(f'  ROW {row_id}: antigen.species {orig_species!r} → {row["antigen.species"]!r}')
        rows[i] = row

    # Consistency check
    consistency_warnings = check_consistency(rows)

    # Substring check
    global_eps = load_global_epitopes() if check_global else None
    substring_warnings = check_epitope_substrings(rows, global_eps)

    # Write back
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)

    # Report
    print(f'=== HARMONIZATION REPORT: {path} ===')
    print(f'Rows processed: {len(rows)}')
    print(f'Fields changed: {len(changes)}')
    for c in changes:
        print(c)
    if consistency_warnings:
        print(f'\nConsistency warnings ({len(consistency_warnings)}):')
        for w in consistency_warnings:
            print(f'  {w}')
    if substring_warnings:
        print(f'\nSubstring warnings ({len(substring_warnings)}):')
        for w in substring_warnings:
            print(f'  {w}')
    if not changes and not consistency_warnings and not substring_warnings:
        print('No issues found.')
```
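The write-back step overwrites the chunk in place, so an interruption in the middle of the write could leave a truncated file. One possible safeguard, sketched here as an assumption and not as part of the existing skill, is to write to a temporary file in the same directory and then swap it in with `os.replace`. `write_rows_atomically` is a hypothetical helper.

```python
import csv
import os
import tempfile

def write_rows_atomically(path, fieldnames, rows):
    """Write a tab-separated file via a temp file, then replace `path` in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t')
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_path, path)  # the original stays intact until this call
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Note that `tempfile.mkstemp` creates the file readable and writable only by the current user, so the replaced chunk inherits those permissions; restore the original mode afterwards if that matters for the repository.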
Integration with /proofread
When /proofread is running Step 6 (MHC Consistency) or scanning antigen.gene/antigen.species fields, it should invoke harmonization automatically if any of these are detected:
- `is_spurious_gene(row['antigen.gene'])` returns True for ≥1 row
- `is_spurious_species(row['antigen.species'])` returns True for ≥1 row
- `check_consistency(rows)` returns any warnings
From /proofread Step 6, add:
Step 6a — Antigen Harmonization Trigger
After running ChunkQC, scan all antigen.gene and antigen.species values using the spurious-value detectors from /harmonize. If any are flagged:
- Report the count and examples to the user.
- Ask: "Run /harmonize to fix these automatically? [y/n]"
- If yes: run the full harmonization pipeline, then re-run ChunkQC to verify no regressions.
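The "report the count and examples" step could look roughly like the sketch below. It takes the two detectors as arguments so that it runs on its own. `summarize_flags` is a hypothetical helper, and the lambdas in the usage are simplified stand-ins for `is_spurious_gene` and `is_spurious_species`, not the real detectors.

```python
def summarize_flags(rows, gene_check, species_check, max_examples=3):
    """Count flagged rows per field and collect a few distinct example values."""
    summary = {}
    for field, check in (('antigen.gene', gene_check),
                         ('antigen.species', species_check)):
        values = [(r.get(field) or '').strip() for r in rows]
        flagged = [v for v in values if check(v)]
        summary[field] = {
            'count': len(flagged),                            # number of flagged rows
            'examples': sorted(set(flagged))[:max_examples],  # distinct values, sorted
        }
    return summary

# Stand-in detectors and placeholder rows, for illustration only.
rows = [
    {'antigen.gene': 'pp65', 'antigen.species': 'CMV'},
    {'antigen.gene': 'Spike glycoprotein', 'antigen.species': 'CMV'},
    {'antigen.gene': '', 'antigen.species': 'CMV'},
    {'antigen.gene': 'Spike glycoprotein', 'antigen.species': 'Human herpesvirus 5'},
]
report = summarize_flags(
    rows,
    gene_check=lambda g: not g or g.endswith('protein'),
    species_check=lambda s: s not in {'CMV'},
)
for field, info in report.items():
    print(f"{field}: {info['count']} flagged, e.g. {info['examples']}")
```

If every count is zero, /proofread can skip the prompt entirely.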
Patch File Maintenance
When harmonization fails for a value (no alias match, no epitope dict entry), add the mapping to the appropriate patch file rather than leaving it unfixed:
- New epitope → species/gene mapping → append to `patches/antigen_epitope_species_gene.dict`
- New gene alias (free-text → canonical) → append to `proofreading/gene_aliases.tsv`
- New species fragment (substring → canonical) → append to `proofreading/species_aliases.tsv` (put more-specific fragments before less-specific ones)
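Because species fragments are matched in order and the first match wins, a fragment can never fire if an earlier fragment is a substring of it: every input containing the later fragment also contains the earlier one. A minimal sketch of an ordering check follows, assuming the `(fragment_lower, canonical)` list shape from Step 1. `find_shadowed_fragments` is a hypothetical helper, and the sample fragments are illustrative rather than taken from the real alias file.

```python
def find_shadowed_fragments(fragments):
    """Return (later_fragment, earlier_fragment) pairs where the later fragment is
    unreachable because an earlier fragment is a substring of it and maps to a
    different canonical name."""
    shadowed = []
    for j, (later, later_canon) in enumerate(fragments):
        for earlier, earlier_canon in fragments[:j]:
            if earlier in later and earlier_canon != later_canon:
                shadowed.append((later, earlier))
                break  # one blocking fragment is enough to report
    return shadowed

# Illustrative ordering: the generic fragment comes first and blocks the specific one.
bad_order = [('influenza', 'InfluenzaA'), ('influenza b', 'InfluenzaB')]
good_order = [('influenza b', 'InfluenzaB'), ('influenza', 'InfluenzaA')]
print(find_shadowed_fragments(bad_order))
print(find_shadowed_fragments(good_order))
```

Running such a check after each edit to proofreading/species_aliases.tsv would catch a misplaced fragment before it maps a species to the wrong canonical name.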
Reference Files
| File | Role |
|---|---|
| `patches/antigen_epitope_species_gene.dict` | Epitope-keyed authority: epitope → (species, gene) |
| `proofreading/gene_aliases.tsv` | Free-text gene alias table → VDJdb gene name |
| `proofreading/species_aliases.tsv` | Species fragment → VDJdb canonical (order-sensitive) |
| `py_src/ChunkQC.py` | Run after harmonization to verify no regressions |