name: sitrep-extract
description: "Extract epidemiological tables from MVE SitRep PDFs using Claude Vision API. Use when: running extraction on new PDFs; Claude API errors (auth, rate limit, timeout, JSON parse failure); understanding extraction output files; re-extracting a specific PDF; diagnosing missing, wrong, or rejected table data; data/processed/processed.json or master CSVs are out of sync."
SitRep Extract
Calls the Claude Vision API to extract case count tables from MVE SitRep PDFs, producing per-sitrep CSVs and appending to master linelists.
When to Use
- After sitrep-fetch reports new PDFs downloaded
- To re-extract a specific PDF after fixing a parsing issue
- Diagnosing Claude API errors or output validation warnings
Commands
Update mode — process all new PDFs in the archive:
cd /Users/billyquilty/Documents/Work/bvd_sitrep_extractor
set -a && source .env && set +a
python3 scripts/extract_sitrep.py --update
Single PDF (force re-extract, even if already in processed.json):
python3 scripts/extract_sitrep.py data/raw/<name>/MVE_SitRep_NNN_YYYY-MM-DD.pdf
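Before passing a path to single-PDF mode, it can help to confirm the filename follows the MVE_SitRep_NNN_YYYY-MM-DD.pdf pattern. The following is a minimal sketch of such a check, not part of extract_sitrep.py; the helper name parse_sitrep_name and the exact regular expression are assumptions inferred from the placeholder above.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed pattern: MVE_SitRep_<number>_<YYYY-MM-DD>.pdf (inferred from the placeholder).
SITREP_RE = re.compile(r"^MVE_SitRep_(\d+)_(\d{4})-(\d{2})-(\d{2})\.pdf$")


class SitRepName(NamedTuple):
    number: int
    report_date: date


def parse_sitrep_name(filename: str) -> Optional[SitRepName]:
    """Return the SitRep number and date, or None if the name does not match."""
    match = SITREP_RE.match(filename)
    if match is None:
        return None
    number, year, month, day = (int(group) for group in match.groups())
    try:
        return SitRepName(number, date(year, month, day))
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
```

A None result suggests the file was renamed or is not a SitRep PDF, which is worth checking before spending an API call on it.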
Expected Output
══════════════════════════════════════════════════════
3 new SitRep(s) processed
New rows added : 45
Master total : 212 rows
══════════════════════════════════════════════════════
Outputs written to: data/processed/epicentre_format/
Output Files
| File | Description |
|---|---|
| data/processed/epicentre_format/<name>/new_cases_counts.csv | New cases table from this sitrep |
| data/processed/epicentre_format/<name>/cumulative_counts.csv | Cumulative cases table |
| data/processed/epicentre_format/<name>/combined_counts.csv | Merged new + cumulative |
| data/processed/epicentre_format/<name>/response_counts.csv | Patient movement summary (if present) |
| data/processed/epicentre_format/<name>/poe_counts.csv | Points d'Entrée summary (if present) |
| data/raw/<name>/raw_extraction.json | Raw Claude JSON response (debug) |
| data/processed/master_combined_counts.csv | Master linelist, all sitreps combined |
| data/processed/master_response_counts.csv | Master patient movement data |
| data/processed/master_poe_counts.csv | Master Points d'Entrée data |
| data/processed/processed.json | Registry of extracted PDFs (keys are canonical PDF filenames) |
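To check whether processed.json is out of sync with the archive, one option is to compare SitRep numbers found on disk against those in the registry keys. This is a rough sketch under stated assumptions: processed.json is a JSON object keyed by filename, PDFs sit one folder deep under data/raw/, and names contain MVE_SitRep_<number>. The function names are hypothetical. Matching on number rather than full filename is a deliberate choice, since folder and file dates can differ from the canonical key.

```python
import json
import re
from pathlib import Path

# Assumed naming fragment shared by folder names, PDF names, and registry keys.
NUMBER_RE = re.compile(r"MVE_SitRep_(\d+)")


def sitrep_numbers(names):
    """Collect the SitRep numbers found in an iterable of file or key names."""
    found = set()
    for name in names:
        match = NUMBER_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return found


def unregistered_sitreps(raw_dir: Path, registry_path: Path) -> list[int]:
    """Return SitRep numbers that have a PDF under raw_dir but no registry key."""
    registry = {}
    if registry_path.exists():
        registry = json.loads(registry_path.read_text(encoding="utf-8"))
    on_disk = sitrep_numbers(p.name for p in raw_dir.glob("*/*.pdf"))
    # Iterating a dict yields its keys, which are the canonical filenames.
    return sorted(on_disk - sitrep_numbers(registry))
```

Calling unregistered_sitreps(Path("data/raw"), Path("data/processed/processed.json")) from the project root would list the SitReps that update mode has not yet recorded.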
Error Table
| Error | Cause | Fix |
|---|---|---|
| ANTHROPIC_API_KEY environment variable not set | .env not sourced | Run set -a && source .env && set +a before the script |
| anthropic.AuthenticationError | API key invalid or revoked | Verify key at console.anthropic.com; update .env |
| anthropic.RateLimitError | API quota exceeded | Wait a few minutes and retry; if persistent, check plan limits |
| anthropic.APITimeoutError (second attempt) | API unresponsive | Script retries once automatically; if still failing, wait and retry the whole command |
| ValueError: Failed to extract valid JSON after repair | Claude returned unparseable output for a PDF | Inspect data/raw/<name>/raw_extraction.json; re-run single-PDF extraction: python3 scripts/extract_sitrep.py data/raw/<name>/<name>.pdf |
| WARNING: Rejecting cumulative table — title contains 'alertes' | Heuristic detected an alerts/investigation table instead of a cases table | Expected for some sitreps; check combined_counts.csv manually to confirm new cases were still captured |
| WARNING: Probable cases > 3× suspected | Alerts table heuristic triggered | Same as above: verify combined_counts.csv is correct |
| FileNotFoundError: data/raw/<name>/MVE_SitRep_NNN...pdf | PDF not in archive | Run sitrep-fetch first; verify with ls data/raw/ |
| PDF already in processed.json, skipped | Script skips already-extracted PDFs in update mode | Use single-PDF mode to force re-extract |
| Master CSV has duplicate rows | Single-PDF mode run after update mode | Deduplicate: python3 -c "import pandas as pd; df=pd.read_csv('data/processed/master_combined_counts.csv'); df.drop_duplicates().to_csv('data/processed/master_combined_counts.csv', index=False)" |
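The deduplication one-liner in the last row overwrites the master CSV without reporting what changed. As an alternative, here is a standard-library sketch that removes exact duplicate rows and returns how many were dropped. It is not the project's own tooling, the function name dedupe_csv is hypothetical, and it compares rows as raw text, which may differ slightly from pandas when values are formatted inconsistently.

```python
import csv
from pathlib import Path


def dedupe_csv(path: Path) -> int:
    """Drop exact duplicate data rows in place, keeping the first occurrence.

    Returns the number of rows removed. The header row is always kept.
    """
    with path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return 0

    header, body = rows[0], rows[1:]
    seen = set()
    unique = []
    for row in body:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)

    removed = len(body) - len(unique)
    if removed:
        # Rewrite only when something actually changed.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows([header, *unique])
    return removed
```

For example, dedupe_csv(Path("data/processed/master_combined_counts.csv")) returns 0 when the master file is already clean; copying the file aside first is a sensible precaution.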
Notes
- Model: claude-sonnet-4-6 by default; override with ANTHROPIC_MODEL in .env
- Two-pass extraction: (1) JSON tables, (2) full-text transcription for context
- data/processed/processed.json keys are canonical PDF filenames (e.g., MVE_SitRep_002_2026-05-18.pdf)
- The extraction folder for SitRep 002 is data/processed/epicentre_format/MVE_SitRep_002_2026-05-20/ (date from original download's Last-Modified header); this is correct and expected
- ANTHROPIC_API_KEY must be in environment; it is never read from .env automatically, so always source .env first
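The environment requirements in these notes can be illustrated with a small pre-flight check. This is only a sketch of the described behaviour, not the script's actual code: the function name resolve_settings and the exact error text are assumptions, while the variable names and the default model come from the notes.

```python
import os

DEFAULT_MODEL = "claude-sonnet-4-6"  # default model named in the notes


def resolve_settings(env=None):
    """Return (api_key, model) from the environment, or raise if the key is missing.

    Nothing here reads .env directly; the values must already be exported,
    for example via: set -a && source .env && set +a
    """
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY environment variable not set; source .env first"
        )
    # An unset or empty ANTHROPIC_MODEL falls back to the default.
    model = env.get("ANTHROPIC_MODEL", "").strip() or DEFAULT_MODEL
    return api_key, model
```

Running resolve_settings() in the same shell that will launch the extraction gives a quick confirmation that the key is visible and shows which model would be used.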