name: dpla-run-ingest
description: Execute a single-hub or manual ingest by following the correct runbook and scripts. Use when the user says run ingest for [hub], harvest [hub], remap [hub], or run the full pipeline for [hub]. Ensures harvest type is identified, correct runbook and scripts are used, and outputs are verified.
DPLA Run Ingest
Purpose
Run an ingest for a specific hub using the right runbook and scripts, then verify output. Use this for single-hub or manual runs (for multi-hub/scheduled runs, use the orchestrator instead).
When to Use
- "Run ingest for [hub]"
- "Harvest [hub]"
- "Remap [hub]"
- "Full pipeline for [hub]"
- "Start harvest for [hub]"
Environment: Scripts that source common.sh (harvest.sh, ingest.sh, remap.sh, etc.) automatically load $I3_HOME/.env when present, so JAVA_HOME, DPLA_DATA, I3_CONF, SLACK_WEBHOOK, etc. are set before the JAR is built or the pipeline runs. You do not need to run source .env separately. Full checklist: AGENTS.md § Environment and build.
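As a rough illustration of the "load .env when present" behavior described above, a helper along these lines would do it. This is a sketch only: load_env_if_present is a hypothetical name, and the real logic in common.sh may differ.

```shell
# Sketch only: the actual loading is done by common.sh and may differ.
# load_env_if_present is a hypothetical helper name.
load_env_if_present() {
  env_file="$1/.env"
  if [ -f "$env_file" ]; then
    set -a            # export every variable assigned while sourcing
    . "$env_file"
    set +a
  fi
}

# Usage (I3_HOME is the repo root, as in the text above):
#   load_env_if_present "$I3_HOME"
```

Because the function does nothing when the file is absent, calling it unconditionally is safe.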
Checklist
- JAR is built automatically: When you run ./scripts/harvest.sh (or ingest.sh, remap.sh, etc.), run_entry in common.sh runs sbt assembly if the JAR is missing or if any Scala source is newer than the JAR. So "harvest indiana" will use current code without a separate build step. (You can still run sbt assembly first to avoid a build delay on the first harvest.)
- Identify the hub (e.g. from the user message).
- Get harvest type from i3.conf ($I3_CONF, default ~/dpla/code/ingestion3-conf/i3.conf): <hub>.harvest.type. Values: localoai, api, file, nara.file.delta.
- Pick the runbook: See runbooks/README.md for harvest-type to runbook mapping.
- Run the scripts from the runbook (see scripts/SCRIPTS.md). Examples:
  - Full pipeline: ./scripts/ingest.sh <hub>
  - Harvest only: ./scripts/harvest.sh <hub>
  - Remap (mapping + enrich + jsonl): ./scripts/remap.sh <hub>
  - NARA: ./scripts/harvest/nara-ingest.sh <nara-export.zip>
- Verify outputs: _SUCCESS in the step output dirs; _MANIFEST / _SUMMARY for counts.
- S3 sync when the runbook says so: ./scripts/s3-sync.sh <hub>.
- On failure: Post to #tech-alerts or email tech@dp.la with hub, stage, and error or path to escalation report.
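The harvest-type lookup in the checklist can be sketched as a small shell helper. It assumes i3.conf stores the setting as a flat line such as indiana.harvest.type = "localoai"; if the file nests keys in blocks, this will not find them. Both function names are hypothetical.

```shell
# harvest_type CONF HUB: print the value of <hub>.harvest.type.
# Assumption: flat "key = value" lines, value optionally double-quoted.
harvest_type() {
  conf="$1"; hub="$2"
  sed -n "s/^[[:space:]]*${hub}\.harvest\.type[[:space:]]*=[[:space:]]*\"\{0,1\}\([^\"[:space:]]*\)\"\{0,1\}.*/\1/p" "$conf" | head -n 1
}

# is_known_harvest_type TYPE: succeed only for the values listed above.
is_known_harvest_type() {
  case "$1" in
    localoai|api|file|nara.file.delta) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   conf="${I3_CONF:-$HOME/dpla/code/ingestion3-conf/i3.conf}"
#   type=$(harvest_type "$conf" indiana)
#   is_known_harvest_type "$type" || echo "unexpected harvest type: $type" >&2
```

An empty result means the key was not found, which is a signal to check the hub name before picking a runbook.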
Critical Rules
- NARA / Smithsonian: Do not run the standard ingest pipeline without their dedicated runbooks (NARA: delta merge; Smithsonian: preprocessing e.g. fix-si.sh).
- Output path: All Scala --output must be $DPLA_DATA (the data root), never $DPLA_DATA/<hub>. Scripts handle this. OutputHelper builds paths as rootPath / shortName / activity / timestamp-schema.
- Python/scripts: Use ./venv/bin/python for Python; use ./scripts/ scripts from repo root. AWS: --profile dpla.
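If you ever pass --output by hand rather than through the scripts, a guard like the following could catch the most common mistake named in the output-path rule. It is a heuristic sketch, not part of the repository: check_output_root is a hypothetical name, and it only detects paths whose last component equals the hub short name.

```shell
# check_output_root OUTPUT HUB: fail if OUTPUT looks like <root>/<hub>.
# Heuristic only: a data root that happens to be named like the hub
# would be rejected as well.
check_output_root() {
  output="${1%/}"          # drop one trailing slash, if any
  hub="$2"
  case "$output" in
    */"$hub")
      echo "output must be the data root, not $output" >&2
      return 1 ;;
    *)
      return 0 ;;
  esac
}

# Usage:
#   check_output_root "$DPLA_DATA" indiana || exit 1
```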
Verification Commands
ls $DPLA_DATA/<hub>/harvest/<timestamped-dir>/_SUCCESS
ls $DPLA_DATA/<hub>/mapping/<timestamped-dir>/_SUCCESS
ls $DPLA_DATA/<hub>/jsonl/<timestamped-dir>/_SUCCESS
cat $DPLA_DATA/<hub>/harvest/<timestamped-dir>/_MANIFEST
cat $DPLA_DATA/<hub>/mapping/<timestamped-dir>/_SUMMARY
Incomplete runs (e.g. _temporary but no _SUCCESS) should be deleted before retrying.
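The manual checks above can be wrapped in two small helpers. This sketch assumes the timestamped directory names sort chronologically (timestamp first, as the timestamp-schema naming suggests); latest_run_dir and run_status are hypothetical names, and nothing here deletes anything.

```shell
# latest_run_dir DATA_ROOT HUB STEP: print the newest run directory.
# Relies on glob expansion being sorted, so the last match is the newest
# when names begin with a sortable timestamp (an assumption).
latest_run_dir() {
  latest=""
  for d in "$1/$2/$3"/*/; do
    [ -d "$d" ] && latest="$d"
  done
  printf '%s\n' "${latest%/}"
}

# run_status DIR: print "complete", "incomplete", or "missing".
run_status() {
  if [ -z "$1" ] || [ ! -d "$1" ]; then
    echo missing
  elif [ -e "$1/_SUCCESS" ]; then
    echo complete
  else
    echo incomplete   # e.g. _temporary present but no _SUCCESS
  fi
}

# Usage:
#   dir=$(latest_run_dir "$DPLA_DATA" indiana harvest)
#   echo "harvest: $(run_status "$dir") ($dir)"
```

A directory reported as incomplete is a candidate for deletion before retrying; review it by hand first.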
Before/after checklist
Before running:
- If the run will use the pipeline, run sbt assembly so the fat JAR reflects the current code (or confirm no Scala changes since last build).
- Confirm hub and harvest type; open the correct runbook (or scripts/SCRIPTS.md if runbooks are not yet available).
- If using the orchestrator, ensure SLACK_WEBHOOK is set (or plan to email tech@dp.la on failure).
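To confirm whether any Scala source has changed since the last build, a freshness check along these lines may help. It is a sketch, not the repository's run_entry logic: jar_is_stale is a hypothetical name, and the JAR path and source directory in the usage comment are placeholders.

```shell
# jar_is_stale JAR SRC_DIR: succeed (exit 0) if a rebuild looks necessary.
jar_is_stale() {
  jar="$1"; src="$2"
  [ -f "$jar" ] || return 0        # missing JAR: rebuild
  # Look for any .scala file modified more recently than the JAR.
  newer=$(find "$src" -type f -name '*.scala' -newer "$jar" | head -n 1)
  [ -n "$newer" ]                  # non-empty means at least one newer source
}

# Usage (both paths are placeholders):
#   if jar_is_stale path/to/assembly.jar src/main/scala; then
#     sbt assembly
#   fi
```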
After a run:
- If any hub failed: post failure summary to #tech-alerts or email tech@dp.la; include stage and reference escalation report if present.
- If the run completed: the orchestrator sends the completion notification when applicable; if you ran only scripts, consider posting the status to #tech-alerts or emailing tech@dp.la if that is your team's standard practice.
Resuming failed steps
- Check which steps completed (look for _SUCCESS files in harvest/, mapping/, enrichment/, jsonl/).
- Re-run only the failed step and later steps. Example: if mapping succeeded but enrichment failed, run ./scripts/enrich.sh <hub> then ./scripts/jsonl.sh <hub>.
- The full pipeline (IngestRemap) must be re-run from scratch, since it does mapping + enrichment + jsonl in one Spark application.
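The "failed step and later steps" rule can be sketched as a helper that walks the four step directories in order. It assumes the newest run directory in each step sorts last by name, and steps_to_rerun is a hypothetical name. It prints step names only, because the source does not name a script for every step.

```shell
# steps_to_rerun DATA_ROOT HUB: print the first step whose newest run
# directory lacks _SUCCESS, followed by every later step, one per line.
steps_to_rerun() {
  rerun=0
  for step in harvest mapping enrichment jsonl; do
    if [ "$rerun" -eq 0 ]; then
      latest=""
      for d in "$1/$2/$step"/*/; do
        [ -d "$d" ] && latest="$d"   # last match is the newest (assumed)
      done
      # A missing run directory or a missing marker both trigger a rerun.
      [ -n "$latest" ] && [ -e "${latest%/}/_SUCCESS" ] || rerun=1
    fi
    if [ "$rerun" -eq 1 ]; then echo "$step"; fi
  done
}

# Usage:
#   steps_to_rerun "$DPLA_DATA" indiana
```

Once one step is flagged, later steps are listed even if they have an older _SUCCESS marker, since their outputs would be based on stale input.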
Key References
| Resource | Path |
|---|---|
| Runbook index and mapping | runbooks/README.md |
| Script reference | scripts/SCRIPTS.md |
| Agent guide | AGENTS.md |
| Config | i3.conf at $I3_CONF |