name: document-processing
description: Turn unstructured documents into validated, auditable structured data with clear schemas and edge-case handling
Document Processing
Use this skill when you need to turn unstructured or semi-structured documents (PDFs, scans, images, HTML, emails, DOCX) into structured outputs (JSON/CSV) with validation, provenance, and clear error handling.
First Questions (Fail Fast)
Before proposing an approach, clarify:
- Document types: invoices, contracts, medical records, policies, resumes, reports, etc.
- Input format: native PDF vs scanned images; single vs multi-file batches; average pages.
- Extraction goal: which fields, tables, entities, and relationships matter.
- Output contract: JSON schema or CSV columns; required vs optional fields; nullability rules.
- Quality bar: acceptable error rate; whether human review is required; SLA and throughput.
- Privacy constraints: PII/PHI handling, retention limits, and redaction requirements.
If any of these are unknown, state assumptions explicitly and propose options.
Output Contract (Recommended)
Prefer an explicit schema and provenance so consumers can trust the results.
Suggested JSON shape
{
  "document_id": "string",
  "document_type": "string",
  "source": {
    "filename": "string",
    "page_count": 12
  },
  "extracted": {
    "fields": {
      "vendor_name": "string",
      "invoice_number": "string",
      "total_amount": 123.45,
      "currency": "USD"
    },
    "tables": [
      {
        "name": "line_items",
        "rows": [
          { "description": "string", "quantity": 1, "unit_price": 1.23, "amount": 1.23 }
        ]
      }
    ]
  },
  "provenance": [
    {
      "path": "extracted.fields.total_amount",
      "evidence": "snippet or normalized text",
      "location": { "page": 2, "bbox": [0, 0, 0, 0] },
      "confidence": 0.92
    }
  ],
  "validation": {
    "status": "pass",
    "errors": []
  }
}
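Provenance is only trustworthy if every entry points at a value that actually exists in the output. A minimal sketch of that check follows, assuming dotted paths that traverse nested objects only; the helper names resolve_path and dangling_provenance are hypothetical, and indexing into table rows is not handled.

```python
def resolve_path(doc, path):
    """Follow a dotted path such as 'extracted.fields.total_amount'.

    Returns (found, value). A field that is present but null yields (True, None).
    """
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False, None
        node = node[key]
    return True, node


def dangling_provenance(doc):
    """List provenance paths that do not resolve to any key in the document."""
    return [
        entry["path"]
        for entry in doc.get("provenance", [])
        if not resolve_path(doc, entry["path"])[0]
    ]
```

A non-empty result from dangling_provenance is a reasonable trigger for a validation error, since it means the evidence and the extracted data have drifted apart.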
Rules:
- Never invent missing facts. Use null or omit optional fields, and add a validation error.
- Normalize units and formats (dates, currencies, decimals) and record the normalization.
- Carry provenance (page/section reference) for high-impact fields.
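The first two rules can be sketched together: a value that is missing or unparseable becomes null with a validation error, and every successful normalization is recorded next to the raw text. This is an illustration under stated assumptions, not a fixed design: the function names, the accepted date formats, the US-style amount handling, and the shape of the normalization record are all hypothetical.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed input formats; only unambiguous ones are listed on purpose.
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw):
    """Return a Decimal for US-style amounts such as '$1,234.50', else None."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def normalize_field(path, raw, normalizer, errors, normalizations):
    """Normalize one raw value; record the change, or report why it is null."""
    if raw is None or not raw.strip():
        errors.append({"path": path, "message": "value not found in document"})
        return None
    value = normalizer(raw)
    if value is None:
        errors.append({"path": path, "message": f"could not parse {raw!r}"})
        return None
    normalizations.append({"path": path, "raw": raw, "normalized": str(value)})
    return value
```

Keeping the raw string in the record lets a reviewer see exactly what was on the page when a normalized value looks wrong.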
Workflow
- Define the extraction schema: required fields, types, enums, and table shapes.
- Segment the document: pages/sections; detect headers/footers; isolate tables.
- Parse/OCR:
  - Native PDFs: text extraction first, OCR only where needed.
  - Scans/images: OCR with layout awareness (reading order matters).
- Extract:
  - Use deterministic parsing where possible (regexes, anchors, table parsers).
  - Use model-based extraction for ambiguous layouts, but validate aggressively.
- Validate:
  - Type checks, required fields, cross-field constraints (totals, sums, date ranges).
  - Emit validation.errors[] with precise paths and user-actionable messages.
- Quality gates:
  - If confidence is low or validation fails, route to human review.
- Export:
  - JSON/CSV conforming to the agreed contract.
  - Include a stable document_id and processing metadata.
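The validate and quality-gate steps above can be sketched as follows, working on a dict in the suggested JSON shape. The required-field list, the 0.85 confidence threshold, the tolerance, and the function names are assumptions for illustration; the totals check also assumes the total is the plain sum of line items, which will not hold for invoices with tax, shipping, or discounts.

```python
CONFIDENCE_THRESHOLD = 0.85  # assumed cut-off; calibrate against reviewed samples
REQUIRED_FIELDS = ("vendor_name", "invoice_number", "total_amount", "currency")  # assumed


def validate(doc, tolerance=0.01):
    """Return validation errors, each with a precise path and an actionable message."""
    errors = []
    fields = doc["extracted"]["fields"]
    for name in REQUIRED_FIELDS:
        if fields.get(name) is None:
            errors.append({"path": f"extracted.fields.{name}",
                           "message": f"{name} is required but was not found"})
    total = fields.get("total_amount")
    if total is not None and not isinstance(total, (int, float)):
        errors.append({"path": "extracted.fields.total_amount",
                       "message": "total_amount must be a number"})
        total = None
    # Cross-field constraint: line item amounts should add up to the total.
    for table in doc["extracted"].get("tables", []):
        if table["name"] == "line_items" and total is not None:
            line_sum = sum(row["amount"] for row in table["rows"])
            if abs(line_sum - total) > tolerance:
                errors.append({
                    "path": "extracted.fields.total_amount",
                    "message": f"total_amount {total} does not match "
                               f"line item sum {line_sum}",
                })
    return errors


def route(doc):
    """Attach the validation result and decide between export and human review."""
    errors = validate(doc)
    doc["validation"] = {"status": "fail" if errors else "pass", "errors": errors}
    low_confidence = [p["path"] for p in doc.get("provenance", [])
                      if p["confidence"] < CONFIDENCE_THRESHOLD]
    return "human_review" if errors or low_confidence else "export"
```

Note that a document can pass validation and still go to review: low confidence on a field is treated as a separate signal from a failed constraint.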
Edge Cases Checklist
Handle explicitly:
- Rotated pages, mixed orientations, skewed scans
- Multi-column layouts, footnotes, wrapped table cells
- Split tables across pages, repeated headers, totals rows
- Missing currency symbols, thousands separators, negative numbers
- Multiple documents inside one PDF, or attachments embedded in emails
- Non-English text, locale-specific dates and decimals
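Several of these cases meet in amount parsing: missing currency symbols, thousands separators, locale-specific decimals, and more than one way to print a negative number. A minimal sketch, assuming the decimal separator is supplied from the document's known locale rather than guessed from the string; parse_amount is a hypothetical helper name.

```python
import re
from decimal import Decimal

MINUS_SIGNS = ("-", "\u2212")  # ASCII hyphen-minus and the Unicode minus sign


def parse_amount(text, decimal_sep="."):
    """Parse a printed amount into a Decimal, or return None if it is not one.

    Handles currency symbols/codes, accounting parentheses, leading or
    trailing minus signs, and grouped thousands. Ambiguous input such as
    '1,5' with decimal_sep='.' is rejected instead of being guessed.
    """
    s = text.strip()
    negative = s.startswith("(") and s.endswith(")")  # accounting style: (1,234.50)
    # Keep only digits, separators, and minus signs; drop symbols, codes, spaces.
    s = re.sub(r"[^\d.,\u2212-]", "", s)
    if s.startswith(MINUS_SIGNS) or s.endswith(MINUS_SIGNS):
        negative = True
    s = s.strip("".join(MINUS_SIGNS))

    thousands_sep = "," if decimal_sep == "." else "."
    pattern = (
        rf"(\d{{1,3}}(?:{re.escape(thousands_sep)}\d{{3}})+|\d+)"
        rf"(?:{re.escape(decimal_sep)}(\d+))?"
    )
    match = re.fullmatch(pattern, s)
    if match is None:
        return None
    integer_part = match.group(1).replace(thousands_sep, "")
    value = Decimal(f"{integer_part}.{match.group(2) or '0'}")
    return -value if negative else value
```

Returning None for input that does not fit the expected grouping keeps the "never invent missing facts" rule intact: the caller can emit a validation error instead of storing a silently wrong number.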
Security & Privacy
- Minimize retention: only store what the pipeline must keep.
- Avoid dumping full documents into logs; log hashes and minimal snippets.
- If the workflow touches sensitive data, pair this with /security-review.
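One way to log hashes and minimal snippets instead of full documents is sketched below. The logger name, the 40-character snippet length, and the function name are assumptions; where the snippet itself could contain PII/PHI, it should be redacted or dropped before logging.

```python
import hashlib
import logging

logger = logging.getLogger("document_processing")  # hypothetical logger name


def log_safe_summary(document_bytes, text, snippet_chars=40):
    """Log a content hash and a short snippet, never the full document text."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    snippet = text[:snippet_chars].replace("\n", " ")
    record = {"sha256": digest, "chars": len(text), "snippet": snippet}
    logger.info("processed document %s", record)
    return record
```

The hash still lets operators correlate a log line with a stored file or detect reprocessing of the same input, without the log becoming a second copy of the document.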
Related Skills
- /plan for clarifying requirements and sequencing delivery
- /architect for end-to-end pipeline design and interfaces
- /iterative-retrieval when context must be gathered progressively
- /security-review for threat modeling and sensitive-data handling