vdjdb-duplicates - SKILL.md Agent Skill

name: vdjdb-duplicates
description: Identify and classify duplicate TCR records across all VDJdb chunks at three resolution levels (beta-only, paired, and same-epitope multi-MHC), categorise them by publication source and author overlap, and flag spurious high-frequency records.

/vdjdb-duplicates — VDJdb Duplicate & Consistency Audit Skill

Purpose

Scan all chunks/ files to:

Find duplicate TCRs at two CDR3 resolution levels (beta-only and paired).
Classify duplicates as within-publication, cross-publication same-lab, or genuinely independent.
Check author overlap between publications sharing many TCR sequences.
Flag extremely frequent records that suggest read-count inflation rather than unique T cell clones.
Report epitopes presented by multiple distinct MHC molecules or with inconsistent allele resolution.

Invocation

/vdjdb-duplicates

No arguments. Run from the repo root.

Step 1 — Load All Chunks

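import csv, glob, os, re
from collections import defaultdict, Counter

# Read every chunk into one list, tagging each row with its source file.
rows_all = []
for path in sorted(glob.glob('chunks/*.txt')):
    fname = os.path.basename(path)  # portable; split('/') breaks on Windows paths
    with open(path, newline='') as f:  # newline='' as the csv module recommends
        for row in csv.DictReader(f, delimiter='\t'):
            row['_file'] = fname
            rows_all.append(row)

# Null-safe, whitespace-stripped column accessor used by every later step.
def v(row, col): return (row.get(col) or '').strip()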
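
Because `v()` returns an empty string for a column that does not exist, a chunk with a misspelled or missing header would silently drop out of the later keys. The sketch below checks for that up front. `REQUIRED_COLUMNS` and `missing_columns` are illustrative names, not part of the skill, and the column list is simply collected from the code in this document.

```python
# Columns read by Steps 2-6 of this skill (collected from the code in this document).
REQUIRED_COLUMNS = [
    'cdr3.alpha', 'v.alpha', 'j.alpha',
    'cdr3.beta', 'v.beta', 'j.beta',
    'antigen.epitope', 'mhc.a', 'reference.id',
    'meta.subject.id', 'meta.clone.id',
]

def missing_columns(rows, required=REQUIRED_COLUMNS):
    """Map each source file to the required columns absent from its rows."""
    seen = {}
    for row in rows:
        # Rows from one file share a header, so the first row per file suffices.
        seen.setdefault(row.get('_file', '?'), set(row.keys()))
    return {
        fname: [c for c in required if c not in cols]
        for fname, cols in seen.items()
        if any(c not in cols for c in required)
    }
```

Calling `missing_columns(rows_all)` after Step 1 should return an empty dict when every chunk carries the expected header.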

Step 2 — Beta-Only Duplicates

Key: (cdr3.beta, v.beta, antigen.epitope) — records sharing the same beta chain and epitope regardless of alpha chain or donor metadata.

key_beta = defaultdict(list)
for row in rows_all:
    cb = v(row,'cdr3.beta'); vb = v(row,'v.beta'); ep = v(row,'antigen.epitope')
    if cb and ep:
        key_beta[(cb, vb, ep)].append(row)

dups_beta = {k: vs for k, vs in key_beta.items() if len(vs) > 1}

Classify each duplicate group:

Category	Criterion
Within-file	All rows in one chunk file
Cross-file, same reference	Multiple files, all share the same `reference.id`
Cross-file, same lab	Multiple PMIDs with ≥3 shared authors (see Step 4)
Cross-file, independent	Multiple PMIDs with 0–2 shared authors (see Step 4)

Count and report each category. List the top 30 groups by frequency.
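
The table above can be sketched as one function applied to each duplicate group. This is a minimal illustration, not the skill's own implementation: `classify_group` and its `same_lab_pairs` argument are hypothetical names, the same-lab verdicts are assumed to come from Step 4, and a group is treated as independent as soon as one pair of its references is not same-lab. The extra "unknown reference" label for groups without any `reference.id` is also an assumption.

```python
def _val(row, col):
    """Same null-safe accessor as v() in Step 1."""
    return (row.get(col) or '').strip()

def classify_group(rows, same_lab_pairs=frozenset()):
    """Assign one duplicate group to a category.

    rows: list of row dicts carrying '_file' and 'reference.id'.
    same_lab_pairs: set of frozenset({ref_a, ref_b}) judged same-lab
    by the author check (hypothetical input, filled after Step 4).
    """
    files = {r['_file'] for r in rows}
    if len(files) == 1:
        return 'within-file'
    refs = {_val(r, 'reference.id') for r in rows} - {''}
    if not refs:
        return 'cross-file, unknown reference'  # not in the table; assumption
    if len(refs) == 1:
        return 'cross-file, same reference'
    # Every unordered pair of references found in this group.
    pairs = {frozenset((a, b)) for a in refs for b in refs if a < b}
    if pairs <= same_lab_pairs:
        return 'cross-file, same lab'
    return 'cross-file, independent'
```

With the structures from Step 2, `Counter(classify_group(vs) for vs in dups_beta.values())` would give the per-category counts, and sorting `dups_beta` by `len(vs)` gives the top 30 groups.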

Step 3 — Paired Duplicates

Key: (cdr3.alpha, v.alpha, j.alpha, cdr3.beta, v.beta, j.beta, antigen.epitope) — exact paired chain duplicates. Only applied to rows where both CDR3 chains are non-empty.

key_pair = defaultdict(list)
for row in rows_all:
    ca = v(row,'cdr3.alpha'); cb = v(row,'cdr3.beta'); ep = v(row,'antigen.epitope')
    if ca and cb and ep:
        k = (ca, v(row,'v.alpha'), v(row,'j.alpha'),
             cb, v(row,'v.beta'),  v(row,'j.beta'), ep)
        key_pair[k].append(row)

dups_pair = {k: vs for k, vs in key_pair.items() if len(vs) > 1}

Report the same categories as in Step 2.

Step 4 — Author Overlap Check

For each cross-file duplicate group involving ≥2 distinct PMIDs, retrieve author lists and compute overlap:

import urllib.request, json, time

def get_authors(pmid):
    url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id={pmid}&retmode=json'
    with urllib.request.urlopen(url, timeout=10) as r:
        data = json.load(r)
    return [a['name'] for a in data['result'][pmid].get('authors', [])]

# Build PMID-pair → shared author count
pmid_pairs = Counter()
for k, vs in dups_beta.items():
    refs = sorted({v(r,'reference.id') for r in vs if v(r,'reference.id').startswith('PMID:')})
    for i in range(len(refs)):
        for j in range(i+1, len(refs)):
            pmid_pairs[(refs[i], refs[j])] += 1

# For pairs sharing at least OVERLAP_THRESHOLD duplicate groups, check authors
OVERLAP_THRESHOLD = 20
for (r1, r2), cnt in pmid_pairs.most_common():
    if cnt < OVERLAP_THRESHOLD: break
    p1, p2 = r1.replace('PMID:',''), r2.replace('PMID:','')
    try:
        a1 = get_authors(p1); time.sleep(0.4)
        a2 = get_authors(p2); time.sleep(0.4)
        shared = set(a1) & set(a2)
        print(f"{r1} × {r2}: {cnt} shared groups, {len(shared)} common authors")
        if shared:
            print(f"  Authors: {', '.join(sorted(shared)[:6])}")
    except Exception as e:
        print(f"  ERROR for {r1} × {r2}: {e}")

Classification rule:

≥3 shared authors → "same lab, follow-up publication" (expected overlap, not spurious)
0–2 shared authors → "independent replication" (genuine public clonotype)
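
The rule above can be written as a small pure function, which keeps the threshold in one place and makes it testable without network access. This is a sketch under assumptions: `classify_pair` and `SAME_LAB_MIN_SHARED` are illustrative names, and authors are matched on the normalised name string as returned by `get_authors`, so two different people with the same surname and initials would count as shared.

```python
SAME_LAB_MIN_SHARED = 3  # threshold taken from the classification rule above

def classify_pair(authors_a, authors_b, min_shared=SAME_LAB_MIN_SHARED):
    """Return (label, sorted shared authors) for two author lists."""
    def norm(names):
        # Ignore case and surrounding whitespace; skip empty entries.
        return {n.strip().casefold() for n in names if n.strip()}

    shared = norm(authors_a) & norm(authors_b)
    label = 'same lab' if len(shared) >= min_shared else 'independent replication'
    return label, sorted(shared)
```

In the loop above, `classify_pair(a1, a2)` could replace the bare `set(a1) & set(a2)` so that each printed pair also carries its relationship label.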

Step 5 — High-Frequency Record Detection

Within-file groups with frequency ≥50 from ≤3 distinct donors are suspicious. Distinguish two legitimate assay causes before flagging as data errors:

Pattern A — Single TCR tested against many epitopes (combinatorial assay): The same CDR3 appears with many different epitopes in one study. Each row is a genuine specificity claim. High n within one epitope group from few donors still indicates read-depth, not clonal abundance.

Pattern B — Pool of TCRs tested against one epitope/pattern: Many different CDR3s all assigned the same epitope from a single donor. Here n reflects different clones in the repertoire, not read inflation. This is biologically expected in large repertoire studies.

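Distinguishing them: few donors but many distinct CDR3s → Pattern B (normal); few donors with the SAME CDR3 repeated n times → Pattern A (read inflation). Because every group in dups_beta already shares one CDR3, the code below uses the number of distinct meta.clone.id values as the proxy: at most one clone ID is labelled READ-INFLATION (Pattern A), more than one is labelled DEEP-REPERTOIRE (Pattern B).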

FREQ_THRESHOLD = 50
DONOR_THRESHOLD = 3

for k, vs in sorted(dups_beta.items(), key=lambda x: -len(x[1])):
    files = {r['_file'] for r in vs}
    if len(files) > 1: continue   # only single-file
    donors = {v(r,'meta.subject.id') for r in vs}
    clones = {v(r,'meta.clone.id') for r in vs if v(r,'meta.clone.id')}
    if len(vs) >= FREQ_THRESHOLD and len(donors) <= DONOR_THRESHOLD:
        cb, vb, ep = k
        refs = {v(r,'reference.id') for r in vs}
        pattern = 'READ-INFLATION' if len(clones) <= 1 else 'DEEP-REPERTOIRE'
        print(f"{pattern} n={len(vs):4d} CDR3b={cb:22} ep={ep:15} donors={len(donors)} ref={sorted(refs)[0]}")

Known confirmed cases from vdjdb-db audit (2026-05-31):

File	Epitope	CDR3b	n	Donors	Pattern	Cause
PMID_41315082	VEALYLVCG	CASSEAGTGGYEQYF	530	2	A	scTCR-seq read depth; same clone seen many times across cells
PMID_39746936	VISNDVCAQV	multiple	160–271	1	A	Single donor deep-seq; each clone's frequency encoded as row count
PMID_34811538	RAKFKQLL/CLGGLLTMV	multiple	107–241	2–3	B	Bulk repertoire depth; many clones, legitimate
10xgenomics-2019-07-09	IVTDFSVIK/RAKFKQLL	multiple	100–133	2	B	10x Genomics multiplexed assay; cell barcodes give multiplicity

Interpretation: Pattern A records represent sequencing read depth, not unique T cells; flag them in release notes but do not remove them. Pattern B records are biologically valid and expected.

Step 6 — Multi-MHC Epitope Report

key_mhc = defaultdict(set)
for row in rows_all:
    ep = v(row,'antigen.epitope'); mhca = v(row,'mhc.a')
    if ep and mhca: key_mhc[ep].add(mhca)

# Flag epitopes with distinct HLA genes (not just allele sub-typing)
for ep, alleles in sorted(key_mhc.items(), key=lambda x: -len(x[1])):
    genes = set()
    for a in alleles:
        m = re.match(r'(HLA-[A-Z0-9]+|H2-\w+)', a)
        if m: genes.add(m.group(1))
    if len(genes) > 1:
        print(f"{ep:20} {len(alleles)} alleles, {len(genes)} genes: {sorted(genes)}")

# Flag allele resolution inconsistencies (same gene, coarse + fine)
for ep, alleles in key_mhc.items():
    coarse = {a for a in alleles if '*' in a and ':' not in a}
    fine   = {a for a in alleles if ':' in a}
    if coarse and fine:
        coarse_g = {a.split('*')[0] for a in coarse}
        fine_g   = {a.split('*')[0] for a in fine}
        if coarse_g & fine_g:
            print(f"RESOLUTION MIX {ep}: coarse={sorted(coarse)[:2]} fine={sorted(fine)[:2]}")

Step 7 — Summary Report

=== VDJDB DUPLICATE AUDIT SUMMARY ===
Total rows: N
Total chunks: N

BETA-ONLY DUPLICATES
  Unique duplicate groups: N
  Total redundant rows:    N
  Within-file:             N groups
  Cross-file:              N groups
    Same-lab (≥3 shared authors): N
    Independent replication:       N

TOP CROSS-PUBLICATION PAIRS (by shared CDR3b+Vb+epitope groups):
  [N]  PMIDX × PMIDY — [shared authors count] common authors — [lab relationship]
  ...

HIGH-FREQUENCY SUSPECTS (≥50 copies, ≤3 donors, single file):
  [N]  CDR3b / epitope — file — likely cause

MULTI-MHC EPITOPES (distinct HLA genes):
  [N epitopes] — list top cases

MHC FORMAT ISSUES (allele resolution inconsistency):
  [N epitopes] — coarse vs fine allele mix

RECOMMENDATIONS:
  - Flag high-frequency within-file duplicates in the DB release notes
  - Resolve coarse/fine allele mix per paper (check original publication)
  - Same-lab cross-publication duplicates: expected, keep all records
  - Independent cross-publication public clonotypes: expected, keep all records
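
The counts in the template above can be derived directly from the duplicate dictionaries. The following is a minimal sketch: `summarise_duplicates` and `format_summary` are hypothetical helpers, and "redundant rows" is assumed to mean every row beyond the first in each group, which the template does not define.

```python
from collections import Counter

def summarise_duplicates(groups):
    """Count groups and rows for one duplicate dictionary (e.g. dups_beta).

    groups maps a key tuple to the list of row dicts sharing that key;
    every row must carry the '_file' tag added in Step 1.
    """
    scope = Counter(
        'within_file' if len({r['_file'] for r in rows}) == 1 else 'cross_file'
        for rows in groups.values()
    )
    return {
        'groups': len(groups),
        # Assumption: every row beyond the first in a group counts as redundant.
        'redundant_rows': sum(len(rows) - 1 for rows in groups.values()),
        'within_file': scope['within_file'],
        'cross_file': scope['cross_file'],
    }

def format_summary(title, s):
    """Render the counts in the layout of the summary template."""
    return '\n'.join([
        title,
        f"  Unique duplicate groups: {s['groups']}",
        f"  Total redundant rows:    {s['redundant_rows']}",
        f"  Within-file:             {s['within_file']} groups",
        f"  Cross-file:              {s['cross_file']} groups",
    ])
```

For example, `print(format_summary('BETA-ONLY DUPLICATES', summarise_duplicates(dups_beta)))` would fill the first block of the report; the same-lab and independent sub-counts still depend on the Step 4 author check.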

Known Findings from 2026-05-31 Audit

Cross-publication duplicate patterns

PMID pair	Shared groups	Shared authors	Relationship
PMID:37749325 × PMID:40694338	921	17 (Kedzierska K et al.)	Same lab, follow-up study
PMID:28423320 × PMID:37749325	171	0	Public clonotypes (GIL, CMV)
PMID:23267020 × PMID:24512815	126	3 (Koning D, van Baarle D)	Same lab, method comparison
PMID:35589842 × PMID:40713946	95	0	Neoantigen public clonotypes
PMID:28250417 × PMID:28629751	38	4 (Selin LK et al.)	Same lab, influenza repertoire
PMID:18802118 × PMID:21562156	35	7 (Kalams SA et al.)	Same lab, HIV studies
PMID:19017975 × PMID:21135165	31	7 (Price DA, Douek DC)	Same lab, longitudinal HIV/CMV

Interpretation: The dominant source of cross-publication duplicates is same-lab follow-up publications. True independent replication (0 shared authors) reflects genuinely public clonotypes for immunodominant epitopes (GILGFVFTL, GLCTLVAML, NLVPMVATV).

Most replicated epitopes (cross-publication)

Epitope	Cross-pub duplicate rows	Notes
GILGFVFTL	7,436	Influenza GIL — the most public CD8 epitope in humans
GLCTLVAML	1,032	EBV BMLF1 — highly public, 10+ studies
FLRGRAYGL	546	EBV EBNA3A; restricted by HLA-B*08 AND HLA-A*02:01 (genuine bi-restriction)
NLVPMVATV	377	CMV pp65 — dominant CMV epitope
YLQPRTFLL	298	SARS-CoV-2 Spike — multiple COVID-19 cohort studies
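
A per-epitope tally like the one in this table could be produced along the following lines. This is a sketch under assumptions, not the method used for the audit figures: `cross_pub_rows_by_epitope` is a hypothetical helper, a group counts as cross-publication when its rows carry more than one distinct `reference.id`, and all rows of such a group are counted. The epitope is assumed to be the last element of the key, as in the Step 2 and Step 3 keys.

```python
from collections import Counter

def cross_pub_rows_by_epitope(groups):
    """Count rows per epitope in groups that span more than one reference.

    groups maps a key tuple ending in the epitope to the list of row dicts
    sharing that key (the shape of dups_beta and dups_pair).
    """
    counts = Counter()
    for key, rows in groups.items():
        refs = {(r.get('reference.id') or '').strip() for r in rows} - {''}
        if len(refs) > 1:
            counts[key[-1]] += len(rows)
    return counts
```

`cross_pub_rows_by_epitope(dups_beta).most_common(5)` would then list the five most replicated epitopes.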

Spurious high-frequency records

Extremely frequent within-file records (≥100 copies, ≤3 donors) arise from sequencing read depth (Pattern A in Step 5) or deep repertoire sampling (Pattern B), not from data-entry errors. Known cases:

PMID_41315082: CASSEAGTGGYEQYF/VEALYLVCG n=530, 2 donors — high-throughput scTCR-seq
PMID_39746936: 5 CDR3b/VISNDVCAQV combinations n=160–271, 1 donor
PMID_34811538: Multiple CDR3b n=107–241, 2–3 donors — EBV/beta-cell antigen study
10xgenomics-2019-07-09: n=100–133, 2 donors — multiplexed 10x Genomics data

These are not data errors but should be noted in release documentation.

Genuine multi-MHC restriction

Epitope	MHC genes	Notes
FLRGRAYGL	HLA-A + HLA-B	Published cross-restriction A*02 and B*08:01
RAKFKQLL (EBV BZLF1)	HLA-A + HLA-B	Atypical; primary restriction is B*08; A*02 entries warrant review
RPPIFIRRL	HLA-A + HLA-B	Published cross-restriction A*02 and B*07

MHC notation fixes applied (2026-05-31)

H-2Db/H-2Kb/H-2Kd/H-2Ld/H-2Dd → H2-Db/H2-Kb/H2-Kd/H2-Ld/H2-Dd (3,001 rows: hyphen moved from between H and 2 to after the 2)
H-2KB → H2-Kb (capitalization fix)
H2 class I + SIINFEKL → H2-Kb (OVA/C57BL6 context)
H2-b class I + SIINFEKL → H2-Kb
H2-b class II + SIINFEKL → removed (MHC-I epitope mislabeled as class II)
H2 class II + Ins2/SHLVEALYLVCGERG → H2-IAg7 (NOD mouse T1D context)
H2-d class II + SFERFEIFPKE → H2-IEd (BALB/c HA restriction)
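
The fixes listed above could be captured in a reusable normaliser so that later audits apply them consistently. This is a minimal sketch, not the script used on 2026-05-31: `normalise_mhc`, `_H2_CLASS_I`, and `CONTEXT_FIXES` are illustrative names, the label strings in `CONTEXT_FIXES` are assumed to match the data exactly as written in the list, and `None` stands for the entry the list reports as removed.

```python
import re

# Class I locus letter (K, D or L) followed by a haplotype suffix, with the
# hyphen before the "2", after it, or missing.
_H2_CLASS_I = re.compile(r'^H-?2-?([KDL])([A-Za-z0-9]+)$')

# Context rules copied from the list above: (label, epitope) -> replacement.
# None marks the entry reported as removed. Label strings are assumptions.
CONTEXT_FIXES = {
    ('H2 class I', 'SIINFEKL'): 'H2-Kb',
    ('H2-b class I', 'SIINFEKL'): 'H2-Kb',
    ('H2-b class II', 'SIINFEKL'): None,
    ('H2 class II', 'SHLVEALYLVCGERG'): 'H2-IAg7',
    ('H2-d class II', 'SFERFEIFPKE'): 'H2-IEd',
}

def normalise_mhc(allele, epitope=''):
    """Return the normalised MHC label, or None for an entry to drop."""
    a = allele.strip()
    if (a, epitope) in CONTEXT_FIXES:
        return CONTEXT_FIXES[(a, epitope)]
    m = _H2_CLASS_I.match(a)
    if m:
        locus, haplotype = m.groups()
        # Canonical form: hyphen after the 2, lower-case haplotype.
        return f'H2-{locus}{haplotype.lower()}'
    return a  # HLA alleles and class II mouse labels pass through unchanged
```

Labels that match neither a context rule nor the class I pattern are returned untouched, so the function is safe to run over the whole `mhc.a` column.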