ocr-benchmark

name: ocr-benchmark
description: Use to MEASURE OCR hero-card / parse accuracy against the 7k pokercraft ground-truth corpus when establishing a baseline, iterating on hero/board localization or CardCNN changes, or quantifying a fix's before/after. Triggers on "benchmark the OCR", "run the OCR benchmark", "hero localization accuracy", "train/test eval", "how accurate is the parse", "did this change help/hurt OCR". For the hard pre-commit no-regression GATE, use verify-ocr-no-regression instead.

OCR Benchmark

Measure image→hand recognition accuracy against authoritative labels. This skill is for iteration and measurement; the commit-blocking gate is the separate verify-ocr-no-regression skill (run that before you ship).

Ground truth corpus (7183 hands)

Images: data/hand_images/img/<hand_id>.png
Labels: data/pokercraft_corpus/ground_truth/ground_truth.jsonl (HH-derived, authoritative; each line {hand_id, ground_truth:{hero_hand, hero_position, board, preflop_actions, ...}})
Train/test split: data/splits/production_v1.json (production_train / production_val / production_test)
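To make the label layout concrete, here is a minimal sketch of loading the ground-truth file into a dict keyed by hand_id. The helper names load_ground_truth and image_path are hypothetical (they are not part of the repo's scripts); only the paths and the {hand_id, ground_truth:{...}} line shape come from the description above.

```python
import json
from pathlib import Path


def load_ground_truth(path):
    """Return {hand_id: ground_truth dict} from a JSONL label file.

    Hypothetical helper: assumes one JSON object per line shaped like
    {"hand_id": ..., "ground_truth": {"hero_hand": ..., ...}}.
    """
    labels = {}
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            labels[rec["hand_id"]] = rec["ground_truth"]
    return labels


def image_path(hand_id, root="data/hand_images/img"):
    """Path of the replay image paired with a label (hypothetical helper)."""
    return Path(root) / f"{hand_id}.png"
```

Calling load_ground_truth("data/pokercraft_corpus/ground_truth/ground_truth.jsonl") from the repo root would give the labels for the whole corpus; the split file's internal layout is not described here, so it is left out of the sketch.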

Important corpus caveat: the pokercraft replay images are CLEAN (~4.8% degenerate hero crops). Live N8 mobile Telegram screenshots are much harder (~26% blocking, WIN stickers / chips / flags / variable framing) and are NOT in this corpus. Use the corpus for correctness (authoritative labels); use the live mode below (DB snapshots) for the real-traffic speed/blocking metric.

Fast hero-localization bench (minutes) — `scripts/ocr_hero_loc_bench.py`

This bench isolates _locate_hero_cards + CardCNN (it skips panel EasyOCR and the Gemini fallback). Use it to iterate on hero detection and to show that a change fixes hands without regressing any before running the full gate.

# 1. baseline on main (git stash / checkout the base first):
python scripts/ocr_hero_loc_bench.py run old 2000
# 2. candidate on your branch:
python scripts/ocr_hero_loc_bench.py run new 2000
# 3. diff — SHIP ONLY IF REGRESS == 0 and FIXED > 0:
python scripts/ocr_hero_loc_bench.py cmp old new
# real-traffic blocking-Gemini rate (label-free, from DB snapshots):
python scripts/ocr_hero_loc_bench.py live 200

The run subcommand reports hero accuracy, local-emit rate (conf ≥ 0.70), and blocking rate (conf < 0.70). Results are saved to data/_ocr_loc_eval/<label>.jsonl (gitignored). The bench is single-process by design; see the CUDA note below.
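The three reported numbers can be sketched as follows. This is an illustration, not the script's code: the per-hand record keys conf and correct are assumptions, and so is the choice to compute hero accuracy over all hands rather than only the locally emitted ones. Only the 0.70 threshold comes from the text.

```python
EMIT_THRESHOLD = 0.70  # from the text: conf >= 0.70 emits locally, below it blocks


def summarize(records, threshold=EMIT_THRESHOLD):
    """Aggregate per-hand results into the three headline rates.

    Each record is assumed to look like {"conf": float, "correct": bool};
    these key names are hypothetical.
    """
    n = len(records)
    if n == 0:
        return {"n": 0, "hero_accuracy": 0.0,
                "local_emit_rate": 0.0, "blocking_rate": 0.0}
    correct = sum(1 for r in records if r["correct"])
    emitted = sum(1 for r in records if r["conf"] >= threshold)
    return {
        "n": n,
        "hero_accuracy": correct / n,         # assumption: over all hands
        "local_emit_rate": emitted / n,       # confident enough to skip the fallback
        "blocking_rate": (n - emitted) / n,   # complement of the local-emit rate
    }
```

By construction the local-emit and blocking rates sum to 1, so a change that lowers blocking only helps if hero accuracy on the newly emitted hands holds up.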

Full accuracy benchmark (slower, full production parse incl. Gemini)

scripts/ocr_precision.py runs the real parse_n8_screenshot over the corpus and reports per-field accuracy + a confidence/coverage curve, train/test aware.

# full 7k (no --split → all paired hands):
python scripts/ocr_precision.py --workers 12 --dump-all --out data/ocr_prec
# held-out test bucket only:
python scripts/ocr_precision.py --split data/splits/production_v1.json \
    --bucket production_test --out data/ocr_prec_test

Read data/ocr_prec/all_records.jsonl for per-hand fields.hero_hand, card_confidence, etc. (--dump-all writes every record).
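A confidence/coverage curve of the kind mentioned above can be sketched from (confidence, is_correct) pairs. This is an assumed formulation, not ocr_precision.py's code: how each record in all_records.jsonl is paired with its label, and the threshold grid, are left to the caller.

```python
def coverage_curve(scored, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return [(threshold, coverage, accuracy)] for each threshold.

    scored: list of (confidence, is_correct) pairs, one per hand.
    coverage = share of hands with confidence >= threshold;
    accuracy = share of those kept hands that are correct (None if none kept).
    """
    total = len(scored)
    curve = []
    for t in thresholds:
        kept = [ok for conf, ok in scored if conf >= t]
        coverage = len(kept) / total if total else 0.0
        accuracy = sum(kept) / len(kept) if kept else None
        curve.append((t, coverage, accuracy))
    return curve
```

Raising the threshold trades coverage for accuracy; the curve shows where that trade stops paying off.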

scripts/ocr_benchmark.py + scripts/ocr_regression_gate.py are the base-vs-head gate used by verify-ocr-no-regression.

CUDA gotcha (read before running with workers)

Multiprocess GPU runs crash: ocr_precision.py --workers N on the GPU throws CUBLAS_STATUS_NOT_INITIALIZED (N CUDA contexts on one device). Either run it CPU-only (CUDA_VISIBLE_DEVICES="" python scripts/ocr_precision.py ...) or use the single-process ocr_hero_loc_bench.py for GPU runs.
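When the precision run is launched from another Python script rather than a shell, the same CPU-only workaround can be sketched with subprocess. The helper names cpu_only_env and run_precision_cpu are hypothetical; the only thing taken from the note above is setting CUDA_VISIBLE_DEVICES to an empty string.

```python
import os
import subprocess
import sys


def cpu_only_env(base=None):
    """Copy an environment and hide every GPU from the child process."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty string: no CUDA device is visible
    return env


def run_precision_cpu(extra_args):
    """Run scripts/ocr_precision.py CPU-only (hypothetical wrapper)."""
    cmd = [sys.executable, "scripts/ocr_precision.py", *extra_args]
    return subprocess.run(cmd, env=cpu_only_env(), check=True)
```

For example, run_precision_cpu(["--workers", "12", "--out", "data/ocr_prec"]) would mirror the CPU-only shell form; the parent process's own environment is left untouched because the env is copied.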

Decision rule

Net-positive is not enough for a localized fix: any hand that was correct before and is wrong after is a regression, and it blocks the change (cmp REGRESS must be 0). Then run the full verify-ocr-no-regression gate before commit/merge.
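The rule can be sketched as a per-hand comparison of two runs. This is not the cmp subcommand's implementation: the input shape (hand_id mapped to a correct/incorrect flag) and the function name compare_runs are assumptions made for illustration.

```python
def compare_runs(old, new):
    """Compare two runs given as {hand_id: bool} (True = hero hand correct).

    Only hands present in both runs are compared.
    """
    shared = old.keys() & new.keys()
    # correct before, wrong after: a regression, regardless of net gain
    regress = sorted(h for h in shared if old[h] and not new[h])
    # wrong before, correct after
    fixed = sorted(h for h in shared if not old[h] and new[h])
    return {
        "regress": regress,
        "fixed": fixed,
        "ship": not regress and bool(fixed),  # REGRESS == 0 and FIXED > 0
    }
```

A change that fixes ten hands and breaks one still yields ship == False here, which is the point of the rule: the net count never overrides a single regression.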

Related: verify-ocr-no-regression (the commit gate; always run it before shipping a scripts/ocr/** change), retrain-card-classifier, fix-hand, snapshot_test.py (L1/L2 deterministic layer).