name: vdjdb-format
description: Standardise a raw or partially-formatted VDJdb TSV chunk (normalising species names, IMGT V/D/J gene IDs, IMGT-HLA MHC alleles, and method vocabulary) and produce a properly-named chunk file ready for proofreading.
/format — VDJdb Chunk Formatting Skill
Purpose
Take the output of /extract (or any TSV resembling a chunk file) and standardise all controlled-vocabulary fields to the VDJdb / IMGT specification. Cross-check naming conventions against existing chunks/ files to ensure internal consistency. This is the second stage: extract → format → proofread.
Invocation
/format [path-to-tsv]
The TSV must have a VDJdb-compatible header (see canonical column order in /extract). If the file has structural problems (missing columns, wrong separator), halt and report — this is a job for /proofread Step 1, not format.
Standardisation Rules
1. Species Names
Normalise to the exact VDJdb-accepted values (case-sensitive):
| Normalise FROM | Normalise TO |
|---|---|
| `Homo sapiens`, `human`, `H. sapiens`, `hs`, `Human` | `HomoSapiens` |
| `Mus musculus`, `mouse`, `M. musculus`, `mm`, `Mouse` | `MusMusculus` |
| `Rattus norvegicus`, `rat`, `Rattus` | `RattusNorvegicus` |
| `Macaca mulatta`, `rhesus`, `macaque`, `NHP` | `MacacaMulatta` |
If a species is not in the above list:
- Log it as a candidate for extending `speciesList` in `py_src/ChunkQC.py`
- Ask the user whether to include or exclude those rows
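The species rule above can be sketched as a small lookup. This is a minimal illustration, not the repository's implementation: the names `SPECIES_MAP` and `normalise_species` are hypothetical, and matching the input case-insensitively is an assumption (the table only lists a few spellings explicitly).

```python
from typing import Optional

# Synonyms copied from the table above, lower-cased for lookup.
# Case-insensitive matching is an assumption of this sketch.
SPECIES_MAP = {
    "homo sapiens": "HomoSapiens",
    "human": "HomoSapiens",
    "h. sapiens": "HomoSapiens",
    "hs": "HomoSapiens",
    "mus musculus": "MusMusculus",
    "mouse": "MusMusculus",
    "m. musculus": "MusMusculus",
    "mm": "MusMusculus",
    "rattus norvegicus": "RattusNorvegicus",
    "rat": "RattusNorvegicus",
    "rattus": "RattusNorvegicus",
    "macaca mulatta": "MacacaMulatta",
    "rhesus": "MacacaMulatta",
    "macaque": "MacacaMulatta",
    "nhp": "MacacaMulatta",
}
CANONICAL = set(SPECIES_MAP.values())


def normalise_species(raw: str) -> Optional[str]:
    """Return the VDJdb species name, or None if the value is not in the table."""
    value = raw.strip()
    if value in CANONICAL:  # already in the exact, case-sensitive target form
        return value
    return SPECIES_MAP.get(value.lower())
```

A `None` result corresponds to the "not in the above list" case: log the value and ask the user rather than guessing.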
2. IMGT V/D/J Gene IDs
Primary authority: proofreading/imgt_alleles.tsv.gz (column imgt_gene_id)
Conversion table: patches/nomenclature.conversions
Secondary fallback: patches/IGM_nomenclature_table.tsv
Rules (apply in order):
1. Strip whitespace: remove all spaces within the gene name (`TRBV 7` → `TRBV7`, `TRAV 12-2` → `TRAV12-2`)
2. Detect and convert Adaptive Biotech ImmunoSEQ names (see `proofreading/imgt.md` §9.2 for full details):
   - Full Adaptive prefix (`TCRB`, `TCRA`, `TCRG`, `TCRD`): replace with `TRB`, `TRA`, `TRG`, `TRD`
     - `TCRBV06-05*01` → strip `TCR` prefix → `TRBV06-05*01`
   - Zero-padded subgroup: strip leading zeros from subgroup number
     - `TRBV06-5` → `TRBV6-5`
   - Zero-padded cluster: strip leading zeros from cluster number
     - `TRBV7-06` → `TRBV7-6`, `TRBV4-01` → `TRBV4-1`
   - Verify result in `imgt_alleles.tsv.gz`: if `gene-cluster` is not found, try the bare gene name (Adaptive always appends `-01` to single-cluster genes that IMGT names without a cluster)
     - `TRBV19-01` → `TRBV19-1` not found → `TRBV19` found ✓
     - `TRBV11-02` → `TRBV11-2` found ✓
   - When source is Adaptive, note in format log: `ADAPTIVE_NAME → IMGT_NAME (Adaptive ImmunoSEQ normalisation)`
3. Look up in `imgt_alleles.tsv.gz` (strip allele suffix `*NN` before lookup):
   - If found → keep (or correct capitalisation to match)
   - If not found → check `patches/nomenclature.conversions` for a mapping
   - If found in conversions → apply the conversion and log `old_name → new_name`
   - If not found in either → flag as unresolvable; ask user
4. Validate allele (if present, e.g., `TRBV12-3*02`):
   - Look up the full allele name in the `imgt_allele_id` column: `gzip -dc proofreading/imgt_alleles.tsv.gz | awk -F'\t' '$3=="TRBV12-3*02"'`
   - If not found as a complete allele: flag as invalid; check whether the gene itself exists (gene-level lookup)
5. Check functionality: if `functionality` is `P` (pseudogene) in `imgt_alleles.tsv.gz`: flag as biologically suspicious
6. Consistency check against existing chunks: `grep -h "" chunks/*.txt | cut -f3 | sort -u | grep "^TRAV"` (checks `v.alpha` values). If the same gene appears with different notation in existing chunks (e.g., `TRAV13-1` vs `TRAV13`), standardise to the IMGT-canonical form.
7. Multiple gene possibilities (comma-separated ambiguous assignments): check each against `imgt_alleles.tsv.gz`, keep all valid candidates comma-separated without spaces (e.g., `TRBV7-2,TRBV7-3`)
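The whitespace and Adaptive ImmunoSEQ rules above can be sketched as follows. This is a hedged illustration only: `adaptive_candidates`, `resolve_gene`, and `KNOWN_GENES` are hypothetical names, and `KNOWN_GENES` is a toy stand-in for the `imgt_gene_id` column of `proofreading/imgt_alleles.tsv.gz`. The sketch assumes upper-case input and does not handle composite names such as those containing `/`.

```python
import re

# Toy stand-in for the imgt_gene_id column; a real run would load the
# authority file instead (assumption of this sketch).
KNOWN_GENES = {"TRBV6-5", "TRBV19", "TRBV11-2"}

# prefix (TCR or TR), chain, segment, subgroup, optional cluster, optional allele
GENE_RE = re.compile(r"^(?:TCR|TR)([ABGD])([VDJ])0*(\d+)(?:-0*(\d+))?(\*\d+)?$")


def adaptive_candidates(name):
    """Return IMGT-style candidate names, most specific first."""
    m = GENE_RE.match(name.replace(" ", ""))  # rule 1: strip inner whitespace
    if m is None:
        return []
    chain, segment, subgroup, cluster, allele = m.groups()
    base = f"TR{chain}{segment}{subgroup}"
    allele = allele or ""
    candidates = []
    if cluster is not None:
        candidates.append(f"{base}-{cluster}{allele}")
    # Fall back to the bare gene only when there is no cluster or the
    # cluster is 1 (Adaptive's "-01" on single-cluster genes).
    if cluster in (None, "1"):
        candidates.append(f"{base}{allele}")
    return candidates


def resolve_gene(name, known_genes=KNOWN_GENES):
    """Return the first candidate whose gene part is known, else None."""
    for candidate in adaptive_candidates(name):
        if candidate.split("*")[0] in known_genes:
            return candidate
    return None
```

A `None` result means the name still has to go through `patches/nomenclature.conversions` or be flagged as unresolvable, as described in rule 3.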
3. MHC Alleles
Primary authority: proofreading/mhc_alleles.tsv.gz (column allele_name)
Reference: proofreading/mhc.md
Human MHC (HLA)
Target format: HLA-<GENE>*<FIELD1>:<FIELD2> (e.g., HLA-A*02:01)
| Problem | Fix |
|---|---|
| `A02`, `A0201` (old serological) | → `HLA-A*02:01` if unambiguous; flag if ambiguous |
| `A*0201` (old format, no colon) | → `HLA-A*02:01` (add prefix + insert colon) |
| `HLA-A*02` (low resolution, 1-field) | Keep as-is; note in log that higher resolution preferred |
| `HLA-A*02:01:01` or `*02:01:01:01` (high-res) | Keep full string as-is |
| `HLA-A 02:01` (space) | → `HLA-A*02:01` |
| `HLA-A*02:01N`, `*02:01L` (expression suffix) | Keep suffix; note in log |
| Confirmed status from `mhc_alleles.tsv.gz` | `gzip -dc proofreading/mhc_alleles.tsv.gz \| awk -F'\t' '$2=="HLA-A*02:01"{print $3}'` |
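The mechanical rows of the table above (missing colon, space instead of `*`, values that are already well-formed) could be handled roughly as below. This is a sketch under assumptions: `normalise_hla` and the three regular expressions are hypothetical, the field widths (2 to 3 digits) are a simplification, and serological forms such as `A02` are deliberately left unresolved so that they get flagged for a manual decision.

```python
import re

# Already in target form: 1 to 4 colon-separated fields, optional expression suffix.
CANONICAL_RE = re.compile(r"^HLA-[A-Z]+\d*\*\d{2,3}(?::\d{2,3}){0,3}[A-Z]?$")
# Old format without a colon, with or without the HLA- prefix: A*0201
NO_COLON_RE = re.compile(r"^(?:HLA-)?([A-Z]+\d*)\*(\d{2})(\d{2})$")
# Space where the asterisk should be: HLA-A 02:01
SPACE_RE = re.compile(r"^HLA-([A-Z]+\d*) (\d{2,3}(?::\d{2,3}){0,3}[A-Z]?)$")


def normalise_hla(raw):
    """Return the normalised allele, or None if it needs manual review."""
    value = raw.strip()
    m = NO_COLON_RE.match(value)
    if m:
        return f"HLA-{m.group(1)}*{m.group(2)}:{m.group(3)}"
    m = SPACE_RE.match(value)
    if m:
        return f"HLA-{m.group(1)}*{m.group(2)}"
    if CANONICAL_RE.match(value):
        return value  # low-res, high-res and suffixed alleles are kept as-is
    return None
```

Whatever this returns should still be checked against the `allele_name` column of `mhc_alleles.tsv.gz`; the regular expressions only test the shape of the string, not whether the allele exists.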
MHC-I second chain: always normalise to literal B2M — never beta-2-microglobulin, β2m, b2m, B2M*01, etc.
mhc.class cross-check:
| If `mhc.a` starts with... | `mhc.class` must be | `mhc.b` must be |
|---|---|---|
| `HLA-A`, `HLA-B`, `HLA-C`, `HLA-E`, `HLA-F`, `HLA-G` | `MHCI` | `B2M` |
| `HLA-DR`, `HLA-DQ`, `HLA-DP`, `HLA-DO` | `MHCII` | HLA β-chain allele |
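The cross-check above can be expressed as a small validation function. This is an illustrative sketch: `check_mhc_row` is a hypothetical name, and testing the class II β chain only by its `HLA-D` prefix is an assumption made to keep the example short.

```python
CLASS_I_PREFIXES = ("HLA-A", "HLA-B", "HLA-C", "HLA-E", "HLA-F", "HLA-G")
CLASS_II_PREFIXES = ("HLA-DR", "HLA-DQ", "HLA-DP", "HLA-DO")


def check_mhc_row(mhc_a, mhc_b, mhc_class):
    """Return a list of problems found in one row's human MHC fields."""
    problems = []
    if mhc_a.startswith(CLASS_I_PREFIXES):
        if mhc_class != "MHCI":
            problems.append(f"mhc.class should be MHCI, found {mhc_class}")
        if mhc_b != "B2M":
            problems.append(f"mhc.b should be B2M, found {mhc_b}")
    elif mhc_a.startswith(CLASS_II_PREFIXES):
        if mhc_class != "MHCII":
            problems.append(f"mhc.class should be MHCII, found {mhc_class}")
        # Assumption: a class II beta chain is itself an HLA-D* allele.
        if not mhc_b.startswith("HLA-D"):
            problems.append(f"mhc.b should be an HLA beta-chain allele, found {mhc_b}")
    else:
        problems.append(f"mhc.a not recognised as a human allele: {mhc_a}")
    return problems
```

An empty list means the three fields are mutually consistent; anything else belongs in the format log.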
Mouse MHC (H-2)
| Normalise FROM | Normalise TO |
|---|---|
| `H2-Db`, `H2Db` | `H-2Db` |
| `IAb`, `I-Ab`, `IA-b` | `I-Ab` |
| `H-2D^b` | `H-2Db` |
For mouse class I: mhc.b = B2M
For mouse class II: mhc.b = the β-chain name (e.g., I-Ab)
4. Method Vocabulary
Normalise method.identification and method.verification to VDJdb-recognised terms.
The governing rule: use what the source says. Do not upgrade or downgrade based on prevalence in VDJdb.
| Author writes | Normalise to | Reasoning |
|---|---|---|
| `tetramer`, `pMHC tetramer`, `tetramer sort` | `tetramer-sort` | Source specifies tetramers |
| `dextramer`, `dextramer sort` | `dextramer-sort` | Source specifies dextramers |
| `pentamer`, `pentamer sort` | `pentamer-sort` | Source specifies pentamers |
| `multimer`, `pMHC multimer`, `multimer sort` | `multimer-sort` | Source gives no more specific reagent type |
| Reagent type not stated (only "sort" or "FACS") | `multimer-sort` | Cannot assume tetramer; use generic |
| `ELISpot` | Do NOT map automatically | Log and ask user |
| `51Cr release assay` | Do NOT map automatically | Log and ask user |
Example: A readme that says only "tetramer-sort" → `tetramer-sort`. A readme that says only "multimer-sort" with no other information → `multimer-sort`, even if tetramers are the most common reagent in VDJdb. Never infer the reagent type from context or database prevalence.
Rule: If an identification or verification method has no close equivalent in the current VDJdb vocabulary, do NOT force it. Instead:
- Leave a descriptive string in the field (for reference)
- Document it under "Vocabulary gaps" in the format log
- Suggest adding it as a new term via a note to the database maintainers
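A table-driven sketch of the method rules above follows. The names `METHOD_MAP`, `ASK_USER`, and `normalise_method` are hypothetical, as are the status labels; treating hyphens and spaces as equivalent and ignoring case are assumptions of this example.

```python
# Author wordings from the table above, lower-cased with hyphens as spaces.
METHOD_MAP = {
    "tetramer": "tetramer-sort",
    "pmhc tetramer": "tetramer-sort",
    "tetramer sort": "tetramer-sort",
    "dextramer": "dextramer-sort",
    "dextramer sort": "dextramer-sort",
    "pentamer": "pentamer-sort",
    "pentamer sort": "pentamer-sort",
    "multimer": "multimer-sort",
    "pmhc multimer": "multimer-sort",
    "multimer sort": "multimer-sort",
    # Reagent type not stated: use the generic term, never assume tetramer.
    "sort": "multimer-sort",
    "facs": "multimer-sort",
}
# Terms that must not be mapped automatically.
ASK_USER = {"elispot", "51cr release assay"}


def normalise_method(raw):
    """Return (value, status); status is 'mapped', 'ask-user' or 'vocabulary-gap'."""
    key = raw.strip().lower().replace("-", " ")
    if key in METHOD_MAP:
        return METHOD_MAP[key], "mapped"
    if key in ASK_USER:
        return raw, "ask-user"
    # No close equivalent: keep the descriptive string and log the gap.
    return raw, "vocabulary-gap"
```

The lookup is an exact match on the author's wording, so nothing is ever upgraded from `multimer-sort` to a more specific reagent.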
5. Reference IDs
Enforce correct format:
| Problem | Fix |
|---|---|
| `https://doi.org/10.1016/...` | → `doi:10.1016/...` (remove URL prefix) |
| `http://dx.doi.org/10.1016/...` | → `doi:10.1016/...` |
| `doi: 10.1016/...` (space after colon) | → `doi:10.1016/...` |
| `pubmed:12345678` or `PubMed:12345678` | → `PMID:12345678` |
| Bare number `12345678` | Ask if it is a PMID; if confirmed → `PMID:12345678` |
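The reference ID fixes above are simple enough to sketch with two regular expressions. `normalise_reference` is a hypothetical helper, and `10.1016/example` in the usage below is a placeholder rather than a real DOI. Bare numbers deliberately return `None` so the user is asked.

```python
import re

DOI_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)(10\..+)$", re.IGNORECASE)
PMID_RE = re.compile(r"^(?:pubmed|pmid):\s*(\d+)$", re.IGNORECASE)


def normalise_reference(raw):
    """Return 'doi:...' or 'PMID:...', or None if the value needs confirmation."""
    value = raw.strip()
    m = DOI_RE.match(value)
    if m:
        return "doi:" + m.group(1)
    m = PMID_RE.match(value)
    if m:
        return "PMID:" + m.group(1)
    return None  # e.g. a bare number: ask whether it is a PMID
```

For instance, `normalise_reference("doi: 10.1016/example")` gives `doi:10.1016/example`, while `normalise_reference("12345678")` gives `None`.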
6. Antigen Fields
Cross-reference patches/antigen_epitope_species_gene.dict:
- If `antigen.epitope` exists in the dict → use the dict's `antigen.gene` and `antigen.species` (this ensures consistency with the full database)
- If the epitope is new → keep author-provided gene/species values, note in log
7. Chunk ID
After all formatting changes, renumber chunk.id sequentially from 1 (integer, no leading zeros).
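Renumbering can be sketched as below. `renumber_chunk_ids` is a hypothetical helper that works on a list of tab-separated lines with the header first; it assumes only that a column named `chunk.id` exists. Blank lines are skipped.

```python
def renumber_chunk_ids(lines):
    """Return the lines with chunk.id rewritten as 1, 2, 3, ... in row order."""
    header = lines[0].rstrip("\n").split("\t")
    id_col = header.index("chunk.id")  # raises ValueError if the column is missing
    out = ["\t".join(header)]
    counter = 0
    for line in lines[1:]:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        counter += 1
        fields[id_col] = str(counter)  # plain integer, no leading zeros
        out.append("\t".join(fields))
    return out
```

Splitting on tabs directly, rather than using a CSV parser, leaves quote characters inside fields untouched.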
Output Filename
Prefer PMID_<pubmed_id>.txt (e.g., PMID_28975614.txt).
If no PMID is available:
- Ask the user for the preferred name
- Alternatives: `doi_<mangled_doi>.txt`, submitter-date format
- Check that the chosen name does not duplicate an existing file in `chunks/`
Suggested placement: chunks_unformatted/ if uncertain about QC status; chunks/ only after /proofread passes.
Format Log
Write <output_basename>_format_log.txt containing:
- Changes made: for each change — field name, old value, new value, source of normalisation (imgt_alleles.tsv.gz / mhc_alleles.tsv.gz / nomenclature.conversions / manual)
- Unresolvable fields: fields that could not be normalised and why
- Vocabulary gaps: novel method/verification terms encountered
- Allele resolution notes: alleles that exist in `mhc_alleles.tsv.gz` at low resolution only
- Consistency discrepancies: naming differences found vs existing `chunks/` files
- Pseudogene warnings: gene names whose `functionality = P` in `imgt_alleles.tsv.gz`
- Unconfirmed HLA alleles: alleles present in `mhc_alleles.tsv.gz` with `confirmed = Unconfirmed`
Reference Files
| File | Role |
|---|---|
| `proofreading/imgt_alleles.tsv.gz` | Primary IMGT V/D/J gene authority |
| `proofreading/imgt.md` | IMGT nomenclature rules |
| `proofreading/mhc_alleles.tsv.gz` | Primary HLA allele authority |
| `proofreading/mhc.md` | MHC/HLA naming rules, class I vs II, non-human |
| `patches/nomenclature.conversions` | Old → current IMGT gene name mappings |
| `patches/IGM_nomenclature_table.tsv` | Secondary IMGT fallback (existing repo file) |
| `patches/antigen_epitope_species_gene.dict` | Known epitope → gene/species mappings |
| `py_src/ScoreFactory.py` | Method vocabulary and scoring logic |
| `py_src/ChunkQC.py` | `ALL_COLS` definition (canonical column list) |
| `chunks/*.txt` | Reference for consistency checks |
| `README.md` | Full VDJdb specification |
Next Step
After formatting, run /proofread [path-to-formatted-tsv] to validate against py_src/ChunkQC.py and all other QC checks.