name: dpla-ingest-debug description: Debug and fix DPLA hub ingestion failures (harvest/mapping/enrichment/jsonl/s3-sync/anomaly). Use when user asks why a hub failed, to debug an ingest failure, check an escalation report, or retry a failed hub/stage.
DPLA Ingest Debugging
Goal
Quickly identify what stage failed, find the relevant logs/escalation report, apply a targeted fix, and re-run only the necessary steps.
Environment
For any commands that depend on project env (especially the orchestrator), run source .env first so JAVA_HOME, DPLA_DATA, I3_CONF, and SLACK_WEBHOOK are available.
Gather Context (fast path)
- If this was an orchestrator run, start with the escalation report (if present):
ls -lt data/escalations/ | head
- Check current per-hub status (orchestrator writes these continuously):
./scripts/status/ingest-status.sh --watch
# or for one hub:
./scripts/status/ingest-status.sh <hub>
- Find recent logs:
ls -lt logs/ | head
Classify the Failure Stage
Look for one of: harvest, mapping, enrichment, jsonl, sync, anomaly.
If you have only a hub name and "it failed", the quickest approach is:
- check
logs/status/<hub>.status(JSON), and - check the newest matching log in
logs/.
Before Re-running: Resolve harvest type and runbook
Before re-running any scripts, determine the correct runbook for the hub:
- Read harvest type from i3.conf:
grep "^<hub>\.harvest\.type" "$I3_CONF"(or use the dpla-hub-info skill). - Choose the matching runbook for that harvest type:
file,api,localoai, or special-case runbooks (Smithsonian preprocessing, NARA delta workflow). See runbooks/README.md. - Only then run harvest.sh, remap.sh, s3-sync.sh, or orchestrator retries — do not apply a generic rerun that's inappropriate for the hub type.
Re-run the Minimal Steps
All commands below assume you're running from repo root and that you've resolved harvest type and runbook (see above).
Harvest failed:
./scripts/harvest.sh <hub>
Mapping/enrichment/jsonl failed (re-run remap):
./scripts/remap.sh <hub>
S3 sync failed (or you want to re-sync after a successful run):
./scripts/s3-sync.sh <hub>
Orchestrator retry (failed hubs from last run):
./venv/bin/python -m scheduler.orchestrator.main --retry-failed
Orchestrator retry (one hub):
./venv/bin/python -m scheduler.orchestrator.main --hub=<hub>
Common Issues / Fix Heuristics
- Timeout / unreachable feed: retry harvest; if repeated, confirm the endpoint in i3.conf and capture the exact error line for escalation.
- Missing input (mapping complains about harvest input): harvest didn't produce output → re-run harvest, then remap.
- sbt contention / orphan processes: ensure the fat JAR is built (
sbt assembly) before running pipeline scripts. If you must intervene, inspect before killing:pgrep -fl 'sbt|java.*ingestion3'- then
kill <pid>(avoid broadpkillpatterns unless you're sure).
- Smithsonian / NARA special workflows: do not improvise; follow the dedicated runbooks (Smithsonian preprocessing, NARA delta workflow).
Verify Output
After re-running, verify _SUCCESS markers and counts for all pipeline steps (harvest, mapping, enrichment, jsonl):
ls "$DPLA_DATA/<hub>/harvest"/*/_SUCCESS
ls "$DPLA_DATA/<hub>/mapping"/*/_SUCCESS
ls "$DPLA_DATA/<hub>/enrichment"/*/_SUCCESS
ls "$DPLA_DATA/<hub>/jsonl"/*/_SUCCESS
cat "$DPLA_DATA/<hub>/mapping"/*/_SUMMARY | head
cat "$DPLA_DATA/<hub>/enrichment"/*/_SUMMARY | head
The pipeline is not successful if enrichment failed; ensure all four stages have _SUCCESS markers.