ingest - SKILL.md Agent Skill

name: ingest
description: "Pull data out of raw files (PDF, XLSX, CSV/TSV, JSON, HTML, Markdown, plain text), clean each one, and converge everything into one consistent dataset ready for analysis. Use when you have a pile of messy files to extract and normalize. Scales from one file to many via an explicit parallelism ladder."

Ingest Skill

Turn a pile of raw, messy files into one clean, consistent dataset. This is the first stage of a data pipeline, and it runs in three steps: extract → clean → converge (normalize). When you have many files, you extract them in parallel, then bring the results together into a single aligned shape that the next stage can analyze.

Always announce at start: "I'm using the ingest skill to extract and normalize the data."

Work from: the project directory (provided by whoever invoked this skill).

What This Skill Produces

Per-file extracted artifacts in .pipeline-cache/extracted/
One (or a small number of) normalized, analysis-ready artifact(s) in .pipeline-cache/normalized/
A short manifest of what came in, what came out, and any warnings

Persistence (Environment-Adaptive)

The manifest is the canonical output in every environment — always render it inline (the Report Format block below) so whoever invoked the skill sees what came in, what came out, and any warnings.

The pipeline-cache artifacts are the persistence layer, and the mechanism adapts to the environment:

When a durable working directory is available (e.g., opencode): write the extracted and normalized artifacts to the .pipeline-cache/ paths above and reference them by path. Subsequent stages read them from there.
When no durable filesystem is available (e.g., a Claude organizational skill): code execution still works. Produce the same artifacts in the working sandbox and carry them forward to the analyze stage within the session, but do not assume that a path like .pipeline-cache/ persists or is visible to the user. Surface the normalized dataset inline (and as downloadable or copyable output if the environment supports it) rather than pointing at a path.

Either way the shape of the output is identical: the same extracted artifacts, the same normalized dataset, and the same manifest. Only where (and whether) it lands on disk changes.
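
One way to make this adaptation concrete is to resolve the cache location once, up front. The sketch below is only an illustration: the function name `resolve_cache_root` and the "try the project directory, fall back to a temporary sandbox directory" rule are assumptions, not something this skill prescribes.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("extracted", "normalized")


def resolve_cache_root(project_dir=None):
    """Return (cache_root, is_durable).

    Hypothetical helper: prefers <project_dir>/.pipeline-cache and falls
    back to a throwaway temp directory when that cannot be created.
    """
    if project_dir is not None:
        root = Path(project_dir) / ".pipeline-cache"
        try:
            for sub in SUBDIRS:
                (root / sub).mkdir(parents=True, exist_ok=True)
            return root, True
        except OSError:
            pass  # not writable: fall through to the sandbox location

    # Sandbox fallback: usable within the session, not a stable path.
    root = Path(tempfile.mkdtemp(prefix="pipeline-cache-"))
    for sub in SUBDIRS:
        (root / sub).mkdir()
    return root, False
```

When `is_durable` is false, the report would surface the dataset inline instead of quoting the returned path.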

Non-Negotiable Rules

Never mutate source files. Raw inputs are read-only. Every output goes to a new file.
No silent installs. If a required package isn't available, surface it to the user — don't install it silently.
Document judgment calls. Anything that required interpretation (ambiguous columns, how missing values were handled) gets noted so the analysis stage and the user can see it.
Cache, don't reprocess. If a normalized artifact already exists from earlier in the same session, reuse it.
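
The "no silent installs" rule can be enforced with a check that only looks packages up and never installs them. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is hypothetical, and it expects top-level import names (for example `bs4`, not the distribution name `beautifulsoup4`).

```python
import importlib.util


def missing_packages(required):
    """Return the names in `required` that cannot be imported here.

    Only performs a lookup; nothing is installed or imported.
    """
    return [name for name in required if importlib.util.find_spec(name) is None]


# Illustrative requirement list for a run that only touches CSV and JSON.
missing = missing_packages(["csv", "json"])
if missing:
    print("BLOCKED: missing packages: " + ", ".join(missing))
```

A non-empty result is something to surface to the user, not a cue to run an installer.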

Step 1: Extract (per file)

Each input file is extracted by the appropriate parser into a workable form.

Format	Tool	Output → `.pipeline-cache/extracted/`
PDF	`pdfplumber` or `pypdf`	`<basename>.json` (text + metadata)
XLSX/XLS	`openpyxl` (`.xlsx` only) / `pandas` (legacy `.xls` needs `xlrd`)	`<basename>.csv` (per-sheet, merged where sensible)
CSV/TSV	`pandas`	`<basename>.csv` (cleaned)
JSON	`json` / `pandas`	`<basename>.json` (schema-checked if applicable)
HTML tables	`BeautifulSoup` / `pandas.read_html`	`<basename>.csv`
Markdown tables	manual parsing, or convert to HTML first and use `pandas.read_html`	`<basename>.csv`
Plain text	shell (`sed`, `awk`) or Python	`<basename>.txt` (cleaned)
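
Picking the right parser starts with classifying the file. The sketch below maps file extensions to the formats in the table above. The labels and the name `detect_format` are illustrative assumptions, and an extension can lie about the content, so an unknown or suspicious result is a reason to stop and report rather than guess.

```python
from pathlib import Path

# Extension -> format label, following the table above (labels are illustrative).
FORMAT_BY_SUFFIX = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".csv": "delimited",
    ".tsv": "delimited",
    ".json": "json",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".txt": "text",
}


def detect_format(path):
    """Return the format label for `path`, or None if the extension is unknown."""
    return FORMAT_BY_SUFFIX.get(Path(path).suffix.lower())
```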

Each per-file extraction unit does exactly this:

Read the source file with the right parser.
Extract the structured data.
Clean basic formatting issues (whitespace, encoding).
Save to .pipeline-cache/extracted/<basename>.<ext>.
Record: fields extracted, row count, and any warnings (missing/partial data).
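
The five actions above can be sketched for a single CSV/TSV file as follows. This is a standard-library-only illustration, not the skill's reference implementation: the function name `extract_delimited`, the UTF-8 assumption, and the shape of the returned record are all hypothetical choices.

```python
import csv
from pathlib import Path


def extract_delimited(src, out_dir):
    """Extract one CSV/TSV file into out_dir and return a record of the run."""
    src, out_dir = Path(src), Path(out_dir)
    delimiter = "\t" if src.suffix.lower() == ".tsv" else ","
    warnings = []

    # Read and extract. Assumes UTF-8; utf-8-sig drops a leading BOM if present.
    with src.open(newline="", encoding="utf-8-sig") as fh:
        rows = list(csv.reader(fh, delimiter=delimiter))
    if not rows:
        return {"source": src.name, "output": None, "fields": [],
                "rows": 0, "warnings": ["empty file"]}

    # Basic cleaning only (whitespace); deeper fixes belong to the converge step.
    header = [cell.strip() for cell in rows[0]]
    cleaned = []
    for row_no, row in enumerate(rows[1:], start=2):
        row = [cell.strip() for cell in row]
        if not any(row):
            continue  # skip fully blank rows
        if len(row) != len(header):
            warnings.append(f"row {row_no}: expected {len(header)} cells, got {len(row)}")
        cleaned.append(row)

    # Save to a new file; the source is opened read-only and never modified.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{src.stem}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(cleaned)

    # Record what the manifest needs: fields, row count, warnings.
    return {"source": src.name, "output": str(out_path), "fields": header,
            "rows": len(cleaned), "warnings": warnings}
```

Ragged rows are kept and flagged rather than dropped, so the decision about what to do with them stays visible to the normalization step and the user.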

Write extraction scripts to scripts/ by default. They are session-scoped and throwaway unless explicitly promoted (see Tooling & Reuse).

Step 2: Choose Your Parallelism Rung

Extraction across multiple files is independent work — different files, no shared state — so it parallelizes cleanly. Match the mechanism to the scale. Don't over-engineer trivial extractions.

Scale	Rung	Mechanism
Single file	Inline	Just extract it. No parallelism.
2–5 files, same format	In-process concurrency	One script, a thread/process pool (e.g. `concurrent.futures.ThreadPoolExecutor`) iterating the files.
2–5 files, mixed formats	Shell fan-out	One script per format, run concurrently: `python extract_pdfs.py & python extract_xlsx.py & wait`.
Many files, or slow/heavy processing	Sub-agent fan-out	Dispatch one worker per file (or per batch) so each gets isolated context and its own format tooling.

Climb only as high as the work justifies. A single CSV needs no pool. Three PDFs do not need sub-agents. Reserve sub-agent fan-out for genuinely heavy or numerous inputs where isolated context actually helps.
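
For the in-process rung, a thread pool over the file list might look like the sketch below. `extract_all` and the stand-in extractor `count_lines` are hypothetical names; any callable that takes a path and returns a record could be passed in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def count_lines(path):
    """Stand-in for a real per-file extractor (hypothetical)."""
    with Path(path).open(encoding="utf-8") as fh:
        return {"source": Path(path).name, "rows": sum(1 for _ in fh)}


def extract_all(paths, extract_one, max_workers=4):
    """Run extract_one over every path in a thread pool.

    Returns (results, failures). The with-block exits only after every
    worker has finished, so converging afterwards is safe.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results.append(fut.result())
            except Exception as exc:  # surface the failure instead of guessing
                failures.append((str(path), repr(exc)))
    return results, failures
```

Threads suit I/O-bound reads. For CPU-heavy parsing, `concurrent.futures.ProcessPoolExecutor` offers the same `submit` interface, provided the extractor is a picklable top-level function.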

Note for orchestrating agents

If an orchestrator is driving this skill, it may dispatch the per-file extraction unit (Step 1) as parallel sub-agents instead of running the in-process pool — that's the top rung above. Either way the output contract is identical: extracted artifacts land in .pipeline-cache/extracted/, and Step 3 converges them. Whoever fans out owns waiting for all units to finish before Step 3.

Step 3: Converge & Normalize

This is a convergence step — every extraction unit must finish before it begins. Bring all extracted artifacts into one consistent, analysis-ready shape.

Handle the common messes:

Inconsistent column names → align to one naming scheme
Mixed date/number formats → normalize to one representation
Whitespace, encoding problems → clean
Merged cells, multi-header rows → flatten
Missing values → handle explicitly (and document how)

Save the result(s) to .pipeline-cache/normalized/<name>.csv (or .json where that fits the data better). If multiple inputs describe the same entity, join or stack them here so analysis works from one source.

Document every normalization decision that required judgment. The analysis stage and the user both need to trust the clean dataset.
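
As an illustration of normalizing while keeping a decision log, the sketch below aligns column names, converts dates to ISO 8601, and marks missing values explicitly. Everything specific in it is an assumption made for the example: the alias map, the accepted date formats (including reading `05/03/2024` as day-first), the missing-value tokens, and the function names.

```python
from datetime import datetime

# Hypothetical settings; real ones come from inspecting the data at hand.
COLUMN_ALIASES = {"cust id": "customer_id", "customerid": "customer_id",
                  "order date": "order_date", "date": "order_date"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # day-first is itself a judgment call
MISSING_TOKENS = {"", "n/a", "na", "null", "-"}


def canonical_name(raw):
    """Map a raw header to one naming scheme (lower snake_case)."""
    key = " ".join(raw.strip().lower().replace("_", " ").split())
    return COLUMN_ALIASES.get(key, key.replace(" ", "_"))


def normalize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_rows(rows, date_columns=("order_date",)):
    """Normalize a list of dict rows; return (clean_rows, decisions)."""
    decisions = set()
    clean = []
    for row in rows:
        out = {}
        for raw_name, value in row.items():
            name = canonical_name(raw_name)
            if name != raw_name:
                decisions.add(f"renamed column {raw_name!r} to {name!r}")
            text = "" if value is None else str(value).strip()
            if text.lower() in MISSING_TOKENS:
                out[name] = None  # explicit missing marker
                decisions.add(f"treated blank or placeholder values in {name!r} as missing")
            elif name in date_columns:
                out[name] = normalize_date(text)
                if out[name] is None:
                    decisions.add(f"could not parse a date in {name!r}: {text!r}")
            else:
                out[name] = text
        clean.append(out)
    return clean, sorted(decisions)
```

The returned `decisions` list is what would feed the "Normalization decisions" part of the report.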

Tooling & Reuse

Write scripts freely. Wrangling data with a script is expected and encouraged.
Python is the natural default. Prefer common libs: pandas, openpyxl, pdfplumber, csv, json, tabulate. Avoid exotic dependencies; call it out when one is genuinely necessary.
Session-scoped by default. Scripts in scripts/ are throwaway unless promoted.
Promotion path. If you write substantially the same utility more than once in a session, consider extracting it into a named, documented function. Judgment call — do it when it saves real effort, and document the promotion.

When You're in Over Your Head

It is always OK to stop and surface a problem rather than guess. Bad data is worse than no data.

Stop and report when:

A source file is corrupt, encrypted, or in a format you can't parse
The data is too ambiguous to normalize without a decision only the user can make
A required package is missing (surface it — never silently install)

Report Format

When done, report using this format:

Status: DONE | DONE_WITH_CONCERNS | BLOCKED

Inputs:
- <file> (<format>, <rows/pages>)

Extracted artifacts:
- .pipeline-cache/extracted/<name> (<rows>, <warnings or "clean">)

Normalized output:
- .pipeline-cache/normalized/<name> (<rows> × <cols>)

Normalization decisions:
- [judgment calls made, or "none — clean passthrough"]

Concerns:
- [anything worth flagging, or "none"]
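
A small helper can keep the report in this shape every time. The sketch below is one possible rendering; `render_report` and its parameters are hypothetical, and each list is expected to hold strings already formatted as in the template above.

```python
VALID_STATUSES = {"DONE", "DONE_WITH_CONCERNS", "BLOCKED"}


def render_report(status, inputs, extracted, normalized, decisions=(), concerns=()):
    """Build the report text from lists of already-formatted strings."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")

    def section(title, items, empty):
        # Fall back to the placeholder line when a section has no entries.
        lines = [f"{title}:"]
        lines += [f"- {item}" for item in (items or [empty])]
        return lines

    parts = [
        [f"Status: {status}"],
        section("Inputs", inputs, "none"),
        section("Extracted artifacts", extracted, "none"),
        section("Normalized output", normalized, "none"),
        # \u2014 is the dash used in the template's placeholder wording.
        section("Normalization decisions", decisions, "none \u2014 clean passthrough"),
        section("Concerns", concerns, "none"),
    ]
    return "\n\n".join("\n".join(part) for part in parts)
```

For example, `render_report("DONE", ["sales.csv (CSV, 120 rows)"], [...], [...])` yields the block above with the placeholder lines filled in for the two empty sections.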

Hand off to the analyze skill (or back to the orchestrating agent) once the normalized dataset exists.