vdjdb-format - SKILL.md Agent Skill

name: vdjdb-format
description: Standardise a raw or partially formatted VDJdb TSV chunk by normalising species names, IMGT V/D/J gene IDs, IMGT-HLA MHC alleles, and method vocabulary, then produce a properly named chunk file ready for proofreading.

/format — VDJdb Chunk Formatting Skill

Purpose

Take the output of /extract (or any TSV resembling a chunk file) and standardise all controlled-vocabulary fields to the VDJdb / IMGT specification. Cross-check naming conventions against existing chunks/ files to ensure internal consistency. This is the second stage of the pipeline: extract → format → proofread.

Invocation

/format [path-to-tsv]

The TSV must have a VDJdb-compatible header (see canonical column order in /extract). If the file has structural problems (missing columns, wrong separator), halt and report — this is a job for /proofread Step 1, not format.

Standardisation Rules

1. Species Names

Normalise to the exact VDJdb-accepted values (case-sensitive):

Normalise FROM	Normalise TO
`Homo sapiens`, `human`, `H. sapiens`, `hs`, `Human`	`HomoSapiens`
`Mus musculus`, `mouse`, `M. musculus`, `mm`, `Mouse`	`MusMusculus`
`Rattus norvegicus`, `rat`, `Rattus`	`RattusNorvegicus`
`Macaca mulatta`, `rhesus`, `macaque`, `NHP`	`MacacaMulatta`

If a species is not in the above list:

Log it as a candidate for extending speciesList in py_src/ChunkQC.py
Ask the user whether to include or exclude those rows
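
The lookup can be sketched in Python as below. This is a minimal illustration, not code from py_src/ChunkQC.py: `SPECIES_MAP` and `normalise_species` are hypothetical names, and matching the listed spellings case-insensitively is an assumption (the table only lists specific spellings).

```python
# Spellings transcribed from the table above, keyed in lower case.
SPECIES_MAP = {
    "homo sapiens": "HomoSapiens",
    "human": "HomoSapiens",
    "h. sapiens": "HomoSapiens",
    "hs": "HomoSapiens",
    "mus musculus": "MusMusculus",
    "mouse": "MusMusculus",
    "m. musculus": "MusMusculus",
    "mm": "MusMusculus",
    "rattus norvegicus": "RattusNorvegicus",
    "rat": "RattusNorvegicus",
    "rattus": "RattusNorvegicus",
    "macaca mulatta": "MacacaMulatta",
    "rhesus": "MacacaMulatta",
    "macaque": "MacacaMulatta",
    "nhp": "MacacaMulatta",
}
CANONICAL_SPECIES = set(SPECIES_MAP.values())


def normalise_species(raw):
    """Return the canonical species name, or None if the value is unknown."""
    value = raw.strip()
    if value in CANONICAL_SPECIES:  # already canonical (case-sensitive check)
        return value
    # Assumption: listed spellings are matched case-insensitively.
    return SPECIES_MAP.get(value.lower())
```

A `None` result means the species is not in the list, so it should be logged and raised with the user as described above.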

2. IMGT V/D/J Gene IDs

Primary authority: proofreading/imgt_alleles.tsv.gz (column imgt_gene_id)
Conversion table: patches/nomenclature.conversions
Secondary fallback: patches/IGM_nomenclature_table.tsv

Rules (apply in order):

Strip whitespace: remove all spaces within the gene name (TRBV 7 → TRBV7, TRAV 12-2 → TRAV12-2)
Detect and convert Adaptive Biotech ImmunoSEQ names (see proofreading/imgt.md §9.2 for full details):
- Full Adaptive prefix (TCRB, TCRA, TCRG, TCRD): replace with TRB, TRA, TRG, TRD
  - TCRBV06-05*01 → replace TCR with TR → TRBV06-05*01
- Zero-padded subgroup: strip leading zeros from subgroup number
  - TRBV06-5 → TRBV6-5
- Zero-padded cluster: strip leading zeros from cluster number
  - TRBV7-06 → TRBV7-6, TRBV4-01 → TRBV4-1
- Verify result in imgt_alleles.tsv.gz: if gene-cluster is not found, try the bare gene name (Adaptive always appends -01 to single-cluster genes that IMGT names without a cluster)
  - TRBV19-01 → TRBV19-1 not found → TRBV19 found ✓
  - TRBV11-02 → TRBV11-2 found ✓
- When source is Adaptive, note in format log: ADAPTIVE_NAME → IMGT_NAME (Adaptive ImmunoSEQ normalisation)
Look up in imgt_alleles.tsv.gz (strip allele suffix *NN before lookup):
- If found → keep (or correct capitalisation to match)
- If not found → check patches/nomenclature.conversions for a mapping
- If found in conversions → apply the conversion and log old_name → new_name
- If not found in either → flag as unresolvable; ask user
Validate allele (if present, e.g., TRBV12-3*02):
- Look up the full allele name in imgt_allele_id column: gzip -dc proofreading/imgt_alleles.tsv.gz | awk -F'\t' '$3=="TRBV12-3*02"'
- If not found as a complete allele: flag as invalid; check whether the gene itself exists (gene-level lookup)
Check functionality: if the functionality value is P (pseudogene) in imgt_alleles.tsv.gz, flag the gene as biologically suspicious
Consistency check against existing chunks:
```
# List the distinct v.alpha values already used in existing chunks
cut -f3 chunks/*.txt | sort -u | grep "^TRAV"
```
If the same gene appears with different notation in existing chunks (e.g., TRAV13-1 vs TRAV13), standardise to the IMGT-canonical form.
Multiple gene possibilities (comma-separated ambiguous assignments): check each against imgt_alleles.tsv.gz, keep all valid candidates comma-separated without spaces (e.g., TRBV7-2,TRBV7-3)
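
The whitespace and Adaptive ImmunoSEQ steps above can be sketched as follows. This is an illustration under stated assumptions: `adaptive_to_imgt` is a hypothetical helper, `known_genes` stands in for the set of values in the imgt_gene_id column (here a small in-memory set rather than the real file), and only plain `locus + subgroup[-cluster]` names are parsed; anything else is looked up unchanged.

```python
import re

# Locus prefix, subgroup number, optional cluster number; leading zeros are skipped.
_GENE = re.compile(r"(TR[ABGD][VDJ])0*(\d+)(?:-0*(\d+))?")


def adaptive_to_imgt(name, known_genes):
    """Return (imgt_candidate, found) for an Adaptive-style or IMGT-style name."""
    name = re.sub(r"\s+", "", name)                # TRBV 7 -> TRBV7
    gene, star, allele = name.partition("*")       # set the *NN suffix aside
    gene = re.sub(r"^TCR([ABGD])", r"TR\1", gene)  # TCRBV -> TRBV
    m = _GENE.fullmatch(gene)
    if m is None:                                  # unusual name: look up as-is
        candidates = [gene]
    else:
        locus, subgroup, cluster = m.groups()
        candidates = [f"{locus}{subgroup}-{cluster}"] if cluster else []
        candidates.append(f"{locus}{subgroup}")    # bare-gene fallback
    for candidate in candidates:
        if candidate in known_genes:
            return candidate + star + allele, True
    return candidates[0] + star + allele, False


# Hypothetical stand-in for the imgt_gene_id column.
known = {"TRBV19", "TRBV11-2", "TRBV6-5"}
print(adaptive_to_imgt("TCRBV06-05*01", known))
print(adaptive_to_imgt("TRBV19-01", known))
```

When `found` is false, the remaining rules still apply: check patches/nomenclature.conversions, then flag the name as unresolvable and ask the user.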

3. MHC Alleles

Primary authority: proofreading/mhc_alleles.tsv.gz (column allele_name)
Reference: proofreading/mhc.md

Human MHC (HLA)

Target format: HLA-<GENE>*<FIELD1>:<FIELD2> (e.g., HLA-A*02:01)

Problem	Fix
`A02`, `A0201` (old serological)	→ `HLA-A*02:01` if unambiguous; flag if ambiguous
`A*0201` (old format, no colon)	→ `HLA-A*02:01` (add prefix + insert colon)
`HLA-A*02` (low resolution, 1-field)	Keep as-is; note in log that higher resolution preferred
`HLA-A*02:01:01` or `HLA-A*02:01:01:01` (high-res)	Keep full string as-is
`HLA-A 02:01` (space)	→ `HLA-A*02:01`
`HLA-A*02:01N`, `HLA-A*02:01L` (expression suffix)	Keep suffix; note in log
Confirmed status from `mhc_alleles.tsv.gz`	`gzip -dc proofreading/mhc_alleles.tsv.gz | awk -F'\t' '$2=="HLA-A*02:01"{print $3}'`
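
The mechanical fixes in this table can be sketched with a few regular expressions. This is a hedged illustration: `normalise_hla` is a hypothetical helper, it does not consult mhc_alleles.tsv.gz, and it returns `None` for anything it cannot resolve mechanically (including a bare two-digit serological name such as `A02`, which needs a human decision).

```python
import re

_SEPARATED = re.compile(r"^([A-Z]+\d?)[*\s]+(\S+)$")   # A*0201, DRB1*15:01, A 02:01
_COMPACT_CLASS_I = re.compile(r"^([ABCEFG])(\d\S*)$")  # A0201 (no separator)
_COLON_FIELDS = re.compile(r"^\d{2,3}(?::\d{2,3}){1,3}[A-Z]?$")
_FOUR_DIGITS = re.compile(r"^(\d{2})(\d{2})([A-Z]?)$")


def normalise_hla(raw):
    """Return an HLA-<GENE>*<FIELDS> string, or None if the value needs review."""
    text = raw.strip()
    if text.upper().startswith("HLA-"):
        text = text[4:]
    explicit = _SEPARATED.match(text)
    m = explicit or _COMPACT_CLASS_I.match(text)
    if m is None:
        return None
    gene, fields = m.groups()
    if _COLON_FIELDS.match(fields):          # already colon-delimited: keep as-is
        return f"HLA-{gene}*{fields}"
    old = _FOUR_DIGITS.match(fields)         # old format without colon
    if old:
        return f"HLA-{gene}*{old.group(1)}:{old.group(2)}{old.group(3)}"
    if explicit and re.fullmatch(r"\d{2,3}", fields):
        return f"HLA-{gene}*{fields}"        # 1-field allele: keep low resolution
    return None                              # e.g. bare A02: flag for the user
```

Every non-`None` result should still be confirmed against mhc_alleles.tsv.gz before it is written to the chunk.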

MHC-I second chain: always normalise to literal B2M — never beta-2-microglobulin, β2m, b2m, B2M*01, etc.

mhc.class cross-check:

If `mhc.a` starts with...	`mhc.class` must be	`mhc.b` must be
`HLA-A`, `HLA-B`, `HLA-C`, `HLA-E`, `HLA-F`, `HLA-G`	`MHCI`	`B2M`
`HLA-DR`, `HLA-DQ`, `HLA-DP`, `HLA-DO`	`MHCII`	HLA β-chain allele
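
A small sketch of this cross-check for human rows. `check_mhc_row` is a hypothetical helper; testing that a class II `mhc.b` starts with `HLA-D` is only an approximation of "HLA β-chain allele" and is an assumption of this example.

```python
CLASS_I_PREFIXES = ("HLA-A", "HLA-B", "HLA-C", "HLA-E", "HLA-F", "HLA-G")
CLASS_II_PREFIXES = ("HLA-DR", "HLA-DQ", "HLA-DP", "HLA-DO")


def check_mhc_row(mhc_a, mhc_b, mhc_class):
    """Return a list of problems found in one row's MHC fields (empty if none)."""
    problems = []
    if mhc_a.startswith(CLASS_I_PREFIXES):
        if mhc_class != "MHCI":
            problems.append(f"mhc.class should be MHCI, got {mhc_class}")
        if mhc_b != "B2M":
            problems.append(f"mhc.b should be B2M, got {mhc_b}")
    elif mhc_a.startswith(CLASS_II_PREFIXES):
        if mhc_class != "MHCII":
            problems.append(f"mhc.class should be MHCII, got {mhc_class}")
        # Assumption: a class II beta chain is recognised by its HLA-D prefix.
        if not mhc_b.startswith("HLA-D"):
            problems.append(f"mhc.b should be an HLA beta-chain allele, got {mhc_b}")
    else:
        problems.append(f"mhc.a not recognised as an HLA gene: {mhc_a}")
    return problems
```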

Mouse MHC (H-2)

Normalise FROM	Normalise TO
`H2-Db`, `H2Db`	`H-2Db`
`IAb`, `I-Ab`, `IA-b`	`I-Ab`
`H-2D^b`	`H-2Db`

For mouse class I: mhc.b = B2M
For mouse class II: mhc.b = the β-chain name (e.g., I-Ab)

4. Method Vocabulary

Normalise method.identification and method.verification to VDJdb-recognised terms.

The governing rule: use what the source says. Do not upgrade or downgrade based on prevalence in VDJdb.

Author writes	Normalise to	Reasoning
`tetramer`, `pMHC tetramer`, `tetramer sort`	`tetramer-sort`	Source specifies tetramers
`dextramer`, `dextramer sort`	`dextramer-sort`	Source specifies dextramers
`pentamer`, `pentamer sort`	`pentamer-sort`	Source specifies pentamers
`multimer`, `pMHC multimer`, `multimer sort`	`multimer-sort`	Source gives no more specific reagent type
Reagent type not stated (only "sort" or "FACS")	`multimer-sort`	Cannot assume tetramer; use generic
`ELISpot`	Do NOT map automatically	Log and ask user
`51Cr release assay`	Do NOT map automatically	Log and ask user

Example: A readme that says only "tetramer-sort" → tetramer-sort. A readme that says only "multimer-sort" with no other information → multimer-sort, even if tetramers are the most common reagent in VDJdb. Never infer the reagent type from context or database prevalence.

Rule: If an identification or verification method has no close equivalent in the current VDJdb vocabulary, do NOT force it. Instead:

Leave a descriptive string in the field (for reference)
Document it under "Vocabulary gaps" in the format log
Suggest adding it as a new term via a note to the database maintainers
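
The table and the vocabulary-gap rule can be sketched as an exact-match lookup. `METHOD_MAP` and `normalise_method` are hypothetical names; exact matching (rather than substring matching) is a deliberate choice here so that nothing is inferred beyond what the source states, and lower-casing the input is an assumption.

```python
# Author spellings from the table above, keyed in lower case.
METHOD_MAP = {
    "tetramer": "tetramer-sort",
    "pmhc tetramer": "tetramer-sort",
    "tetramer sort": "tetramer-sort",
    "dextramer": "dextramer-sort",
    "dextramer sort": "dextramer-sort",
    "pentamer": "pentamer-sort",
    "pentamer sort": "pentamer-sort",
    "multimer": "multimer-sort",
    "pmhc multimer": "multimer-sort",
    "multimer sort": "multimer-sort",
    "sort": "multimer-sort",   # reagent type not stated
    "facs": "multimer-sort",   # reagent type not stated
}
CANONICAL_METHODS = set(METHOD_MAP.values())


def normalise_method(raw):
    """Return (value, needs_user); unmapped terms are left as written."""
    text = raw.strip()
    if text in CANONICAL_METHODS:
        return text, False
    mapped = METHOD_MAP.get(text.lower())
    if mapped is not None:
        return mapped, False
    return text, True  # e.g. ELISpot: log under "Vocabulary gaps" and ask
```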

5. Reference IDs

Enforce correct format:

Problem	Fix
`https://doi.org/10.1016/...`	→ `doi:10.1016/...` (remove URL prefix)
`http://dx.doi.org/10.1016/...`	→ `doi:10.1016/...`
`doi: 10.1016/...` (space after colon)	→ `doi:10.1016/...`
`pubmed:12345678` or `PubMed:12345678`	→ `PMID:12345678`
Bare number `12345678`	Ask if it is a PMID; if confirmed → `PMID:12345678`
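
These rewrites can be sketched with two regular expressions. `normalise_reference` is a hypothetical helper, and the DOI used in the usage line is a placeholder, not a real reference.

```python
import re

_DOI = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)(10\.\S+)$", re.IGNORECASE)
_PMID = re.compile(r"^(?:pubmed|pmid):\s*(\d+)$", re.IGNORECASE)


def normalise_reference(raw):
    """Return (value, needs_user) for a reference.id field."""
    text = raw.strip()
    m = _DOI.match(text)
    if m:
        return "doi:" + m.group(1), False
    m = _PMID.match(text)
    if m:
        return "PMID:" + m.group(1), False
    # Bare numbers and anything else are left untouched for the user to confirm.
    return text, True


print(normalise_reference("https://doi.org/10.1016/placeholder"))
```

A bare number is returned unchanged with `needs_user` set, because only the user can confirm that it is a PMID.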

6. Antigen Fields

Cross-reference patches/antigen_epitope_species_gene.dict:

If antigen.epitope exists in the dict → use the dict's antigen.gene and antigen.species (this ensures consistency with the full database)
If the epitope is new → keep author-provided gene/species values, note in log
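
A sketch of the two cases above. The on-disk layout of patches/antigen_epitope_species_gene.dict is not described here, so this example assumes it has already been loaded into a plain mapping of epitope to `(gene, species)`; `apply_antigen_dict` and that mapping shape are hypothetical.

```python
def apply_antigen_dict(row, known_epitopes):
    """Return (updated_row, note); note is None when nothing needs logging."""
    epitope = row["antigen.epitope"]
    if epitope not in known_epitopes:
        # New epitope: keep the author-provided values and note it in the log.
        return dict(row), f"new epitope {epitope}: kept author-provided gene/species"
    gene, species = known_epitopes[epitope]
    updated = {**row, "antigen.gene": gene, "antigen.species": species}
    note = None if updated == row else f"{epitope}: gene/species taken from dict"
    return updated, note
```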

7. Chunk ID

After all formatting changes, renumber chunk.id sequentially from 1 (integer, no leading zeros).

Output Filename

Prefer PMID_<pubmed_id>.txt (e.g., PMID_28975614.txt).

If no PMID is available:

Ask the user for the preferred name
Alternatives: doi_<mangled_doi>.txt, submitter-date format
Check that the chosen name does not duplicate an existing file in chunks/
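
The naming rule can be sketched as below. `choose_chunk_name` is a hypothetical helper; `existing_names` stands in for the file names already present in chunks/, and replacing every non-alphanumeric run in the DOI with `_` is an assumed mangling scheme, since the exact one is not specified.

```python
import re


def choose_chunk_name(reference_id, existing_names):
    """Return a chunk file name, or None if the user must be asked for one."""
    pmid = re.fullmatch(r"PMID:(\d+)", reference_id)
    if pmid:
        name = f"PMID_{pmid.group(1)}.txt"
    elif reference_id.startswith("doi:"):
        # Assumption: mangle the DOI by replacing non-alphanumeric runs with "_".
        name = "doi_" + re.sub(r"[^A-Za-z0-9]+", "_", reference_id[4:]) + ".txt"
    else:
        return None
    if name in existing_names:
        raise FileExistsError(f"{name} already exists in chunks/")
    return name
```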

Suggested placement: chunks_unformatted/ if uncertain about QC status; chunks/ only after /proofread passes.

Format Log

Write <output_basename>_format_log.txt containing:

Changes made: for each change — field name, old value, new value, source of normalisation (imgt_alleles.tsv.gz / mhc_alleles.tsv.gz / nomenclature.conversions / manual)
Unresolvable fields: fields that could not be normalised and why
Vocabulary gaps: novel method/verification terms encountered
Allele resolution notes: alleles that exist in mhc_alleles.tsv.gz at low resolution only
Consistency discrepancies: naming differences found vs existing chunks/ files
Pseudogene warnings: gene names whose functionality = P in imgt_alleles.tsv.gz
Unconfirmed HLA alleles: alleles present in mhc_alleles.tsv.gz with confirmed = Unconfirmed
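
One possible way to assemble the log text is sketched below. The section names come from the list above; the plain-text layout, `render_format_log`, and `format_change` are assumptions of this example, since no exact log format is prescribed.

```python
LOG_SECTIONS = (
    "Changes made",
    "Unresolvable fields",
    "Vocabulary gaps",
    "Allele resolution notes",
    "Consistency discrepancies",
    "Pseudogene warnings",
    "Unconfirmed HLA alleles",
)


def format_change(field, old, new, source):
    """Format one 'Changes made' entry."""
    return f"{field}: {old} -> {new} (source: {source})"


def render_format_log(entries):
    """Render a dict of section name -> list of strings as plain text."""
    unknown = set(entries) - set(LOG_SECTIONS)
    if unknown:
        raise KeyError(f"unknown log sections: {sorted(unknown)}")
    lines = []
    for section in LOG_SECTIONS:
        lines.append(f"{section}:")
        items = entries.get(section, [])
        lines.extend(f"  - {item}" for item in items)
        if not items:
            lines.append("  (none)")  # keep empty sections visible
        lines.append("")
    return "\n".join(lines)
```

Writing every section, even when empty, makes it clear that a check was run and found nothing.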

Reference Files

File	Role
`proofreading/imgt_alleles.tsv.gz`	Primary IMGT V/D/J gene authority
`proofreading/imgt.md`	IMGT nomenclature rules
`proofreading/mhc_alleles.tsv.gz`	Primary HLA allele authority
`proofreading/mhc.md`	MHC/HLA naming rules, class I vs II, non-human
`patches/nomenclature.conversions`	Old → current IMGT gene name mappings
`patches/IGM_nomenclature_table.tsv`	Secondary IMGT fallback (existing repo file)
`patches/antigen_epitope_species_gene.dict`	Known epitope → gene/species mappings
`py_src/ScoreFactory.py`	Method vocabulary and scoring logic
`py_src/ChunkQC.py`	ALL_COLS definition (canonical column list)
`chunks/*.txt`	Reference for consistency checks
`README.md`	Full VDJdb specification

Next Step

After formatting, run /proofread [path-to-formatted-tsv] to validate against py_src/ChunkQC.py and all other QC checks.