dpla-community-webs-ingest

star 33

Run Community Webs ingest from SQLite DB. Use when the user says harvest community-webs, run community-webs ingest, export community webs, or process community webs DB.

dpla By dpla schedule Updated 2/27/2026

name: dpla-community-webs-ingest description: Run Community Webs ingest from SQLite DB. Use when the user says harvest community-webs, run community-webs ingest, export community webs, or process community webs DB.

DPLA Community Webs Ingest

Purpose

Run the Community Webs ingest workflow: export SQLite DB → JSONL → ZIP → harvest → (optional) full pipeline. Internet Archive sends a SQLite database; the scripts handle the intermediate export step automatically.

When to Use

  • "Harvest community-webs"
  • "Run community-webs ingest"
  • "Export community webs DB"
  • "Process community webs"
  • "Ingest community-webs from database"

Environment: Source .env before running so JAVA_HOME, DPLA_DATA, I3_CONF are set. Scripts that source common.sh load $I3_HOME/.env when present. Full checklist: AGENTS.md § Environment and build.

Workflow

  1. Place the DB in $DPLA_DATA/community-webs/originalRecords/ (or provide path via --db=).
  2. Run export + harvest (or full pipeline):
    • Export + harvest only: ./scripts/harvest/community-webs-ingest.sh
    • Full pipeline (harvest + map + enrich + jsonl): ./scripts/harvest/community-webs-ingest.sh --full
    • Update i3.conf with output path: ./scripts/harvest/community-webs-ingest.sh --full --update-conf
  3. Verify outputs: _SUCCESS in harvest/mapping/enrichment/jsonl dirs.

Scripts

Script Purpose
community-webs-export.sh Export DB → JSONL → ZIP; validates schema before zipping
community-webs-ingest.sh Orchestrates export + harvest (+ optional full pipeline)

Options

community-webs-export.sh:

  • --db=PATH — Explicit DB path (default: auto-detect latest *.db in originalRecords)
  • --update-conf — Update i3.conf community-webs.harvest.endpoint
  • --skip-validate — Skip JSONL schema validation (not recommended)

community-webs-ingest.sh:

  • --db=PATH — Pass to export script
  • --skip-export — Use existing ZIP (endpoint must already point to it)
  • --full — Run harvest + mapping + enrichment + jsonl
  • --update-conf — Update i3.conf with export output directory

Output

  • Export: $DPLA_DATA/community-webs/originalRecords/<YYYYMMDD>/community-webs-<timestamp>.zip
  • Harvest: $DPLA_DATA/community-webs/harvest/
  • Full pipeline: mapping, enrichment, jsonl under $DPLA_DATA/community-webs/

Validation

The export script runs community-webs-validate-jsonl.py on the JSONL before zipping. It checks:

  • Each record has required field id
  • Each line is valid JSON
  • Records with "status":"deleted" are skipped (harvester filters these)

Tests: ./venv/bin/python -m pytest scripts/tests/test_community_webs_validation.py -v

Critical Rules

  • DB location: Place *.db in $DPLA_DATA/community-webs/originalRecords/ or use --db=/path/to/file.db
  • Prerequisites: sqlite3, jq (brew install / apt install)
  • Output path: i3.conf community-webs.harvest.endpoint must point to directory containing the ZIP

Key References

Resource Path
Script reference scripts/SCRIPTS.md
Ingest docs README_INGESTS.md
Agent guide AGENTS.md
Install via CLI
npx skills add https://github.com/dpla/ingestion3 --skill dpla-community-webs-ingest
Repository Details
star Stars 33
call_split Forks 12
navigation Branch main
article Path SKILL.md
More from Creator