name: document-extractor
description: Extract structured data from documents that resist standard parsing, such as redacted records, scanned forms, inconsistent tables, and OCR artifacts. Use this skill when a journalist needs to transform messy PDFs or images into structured JSON with full provenance tracking. Triggers on requests involving FOIA documents, court records, financial disclosures, government forms, leaked documents, or any document described as "hard to parse," "scanned," "redacted," or "inconsistent."
Document Extractor for Investigative Journalism
Extract structured data from messy documents while maintaining provenance and human oversight.
Workflow Overview
- Convert → Transform document pages to images
- Transcribe → Read each page image, output markdown preserving structure
- Stitch → Combine markdown files with page delineators
- Schema → Propose extraction schema(s), await journalist approval
- Extract → Transform markdown to JSON using approved schema
- Review → Generate a self-contained HTML review interface
Step 1: Convert Document to Images
Run the conversion script:
uv run --with pdf2image --with pillow scripts/convert_to_images.py input.pdf --output-dir ./pages --dpi 300
Output: pages/page_001.png, pages/page_002.png, etc.
For documents that are already images (TIFF files, scanned images), copy them directly into the pages directory using the same sequential naming.
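The sequential naming for inputs that are already images can be sketched as below. This is a minimal illustration, not part of the skill's scripts: `page_filename` and `copy_scans_to_pages` are hypothetical helper names, and the sketch assumes the caller passes the scans in reading order. It keeps each file's original extension and does not convert TIFF to PNG.

```python
import shutil
from pathlib import Path


def page_filename(index: int, suffix: str = ".png") -> str:
    """Return the sequential page name used in pages/, e.g. page_001.png."""
    return f"page_{index:03d}{suffix}"


def copy_scans_to_pages(scans: list[Path], pages_dir: Path) -> list[Path]:
    """Copy existing scans into pages_dir, renamed in the order given.

    Hypothetical helper: the extension is lowercased and kept, only the
    stem is replaced with the sequential page name.
    """
    pages_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for index, scan in enumerate(scans, start=1):
        target = pages_dir / page_filename(index, scan.suffix.lower())
        shutil.copy2(scan, target)
        copied.append(target)
    return copied
```

Passing an explicitly ordered list, rather than sorting filenames inside the helper, keeps the page order a deliberate human decision when scan filenames do not sort in reading order.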
Step 2: Transcribe Each Page
NOTE: For large documents (more than 50 pages), use automated tooling instead of reading every page yourself; see references/automated-extraction.md for that workflow. Otherwise, read and transcribe EVERY page yourself. Do not skip any.
Read the image file for each page in parallel. For each page image, output a markdown file preserving:
- Layout: Use tables, headers, indentation to mirror document structure
- Redactions: Mark as [REDACTED]
- Illegible text: Mark as [ILLEGIBLE] or [UNCLEAR: partial text?]
- Handwriting: Mark as [HANDWRITTEN: transcription] or [HANDWRITTEN: ILLEGIBLE]
- Checkboxes: Use [X] for checked, [ ] for unchecked
- Stamps/signatures: Note as [STAMP: text] or [SIGNATURE]
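Because the markers above follow a fixed bracketed form, a transcript can be scanned for them afterwards, for example to count redactions per page. The sketch below is one possible way to do that; `MARKER_RE`, `find_markers`, and `count_markers` are hypothetical names, and checkbox marks ([X] and [ ]) are deliberately left out.

```python
import re
from collections import Counter
from typing import Optional

# Matches [REDACTED], [REDACTED: ~2 words], [STAMP: text], and so on.
MARKER_RE = re.compile(
    r"\[(REDACTED|ILLEGIBLE|UNCLEAR|HANDWRITTEN|STAMP|SIGNATURE)"
    r"(?::\s*([^\]]*))?\]"
)


def find_markers(markdown: str) -> list[tuple[str, Optional[str]]]:
    """Return (marker_type, detail) pairs in document order.

    detail is None when the marker carries no text after a colon.
    """
    return [(m.group(1), m.group(2)) for m in MARKER_RE.finditer(markdown)]


def count_markers(markdown: str) -> Counter:
    """Count how often each marker type appears in a transcript."""
    return Counter(kind for kind, _ in find_markers(markdown))
```

A marker such as [HANDWRITTEN: ILLEGIBLE] is reported once, as a HANDWRITTEN marker whose detail is "ILLEGIBLE", not as a separate ILLEGIBLE marker.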
Transcription Template
<!-- Page N of document: filename.pdf -->
<!-- Document type: [form/letter/table/mixed] -->
<!-- Quality notes: [any OCR issues, damage, etc.] -->
[Transcribed content here, preserving structure]
Example Transcription
<!-- Page 1 of document: foia_response_2024.pdf -->
<!-- Document type: form -->
<!-- Quality notes: Slight skew, stamp partially cut off -->
# FREEDOM OF INFORMATION ACT REQUEST RESPONSE
**Date:** March 15, 2024
**Case Number:** FOIA-2024-00142
**Requester:** [REDACTED: ~2 words]
## Responsive Documents
| Doc ID | Date | Description | Pages | Disposition |
|--------|------|-------------|-------|-------------|
| A-001 | 2023-01-15 | Email correspondence | 3 | Released in full |
| A-002 | 2023-02-20 | [REDACTED] | 7 | Withheld (b)(6) |
| A-003 | [ILLEGIBLE] | Meeting notes | 2 | Released with redactions |
[STAMP: APPROVED FOR RELEASE - partially visible]
[SIGNATURE]
Save each transcription as transcripts/page_001.md, transcripts/page_002.md, etc.
Step 3: Stitch Transcripts
Combine all page transcripts into a single file:
# Full Document Transcript
**Source:** filename.pdf
**Total Pages:** N
**Processed:** YYYY-MM-DD
---
[Contents of page_001.md]
---
<!-- PAGE BREAK: 1 → 2 -->
---
[Contents of page_002.md]
...
Save as full_transcript.md.
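The stitching layout above can be sketched in a few lines of Python. This is an illustrative helper under stated assumptions, not a script shipped with the skill: `stitch_transcripts` is a hypothetical name, page files are assumed to be named page_NNN.md, and page numbers in the break comments are assumed to be contiguous starting at 1.

```python
from datetime import date
from pathlib import Path


def stitch_transcripts(transcripts_dir: Path, source_name: str,
                       processed: date) -> str:
    """Join page_NNN.md files into one transcript with page-break comments."""
    # Sort numerically so the order survives documents with 1000+ pages.
    pages = sorted(
        (p for p in transcripts_dir.glob("page_*.md") if p.stem[5:].isdigit()),
        key=lambda p: int(p.stem[5:]),
    )
    parts = [
        "# Full Document Transcript",
        f"**Source:** {source_name}",
        f"**Total Pages:** {len(pages)}",
        f"**Processed:** {processed.isoformat()}",
        "---",
    ]
    for number, page in enumerate(pages, start=1):
        if number > 1:
            parts += ["---", f"<!-- PAGE BREAK: {number - 1} → {number} -->", "---"]
        parts.append(page.read_text(encoding="utf-8").strip())
    return "\n".join(parts) + "\n"
```

The returned string can then be written to full_transcript.md with `Path("full_transcript.md").write_text(text, encoding="utf-8")`.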
Step 4: Propose Schema(s)
Analyze the transcript and propose one or more schemas. Present them to the journalist for review.
Schema Proposal Format
## Proposed Extraction Schema(s)
### Schema 1: [Name]
**Applies to:** Pages X-Y (or "all pages," "pages containing tables," etc.)
**Purpose:** [What this schema captures]
| Field | Type | Description | Required | Example |
|-------|------|-------------|----------|---------|
| field_name | string/number/date/boolean/array | What it represents | Yes/No | "example value" |
### Schema 2: [Name]
...
## Open Questions for Review
1. [Question about ambiguous data]
2. [Question about handling edge cases]
3. [Question about field naming preferences]
## Notes
- [Any patterns observed]
- [Potential data quality issues]
- [Recommendations]
Schema Design Principles
See references/schema-patterns.md for detailed guidance. Key principles:
- Flat over nested when possible for easier analysis
- Consistent field names across schemas (use snake_case)
- Always include provenance: source_page, source_document
- Handle missing data explicitly: use null, not empty strings
- Preserve original text alongside normalized values when ambiguous
STOP HERE - Present schema to journalist and await approval before proceeding.
Step 5: Extract to JSON
After journalist approval, transform the markdown transcript to JSON. Do this YOURSELF, not with a script.
Extraction Guidelines
- One JSON file per schema if multiple schemas
- Records as an array under the top-level "records" key
- Include metadata header:
{
  "extraction_metadata": {
    "source_document": "filename.pdf",
    "extraction_date": "2024-03-15",
    "schema_version": "1.0",
    "total_records": 42,
    "notes": ["Any extraction notes"]
  },
  "records": [
    {
      "source_page": 1,
      "field1": "value1",
      "field2": "value2"
    }
  ]
}
- Handle ambiguity transparently:
{
  "date": "2024-03-15",
  "date_raw": "3/15/24",
  "date_confidence": "high"
}
- Mark extraction issues:
{
  "name": null,
  "name_note": "REDACTED in source",
  "amount": 1500,
  "amount_note": "Partially illegible, interpreted from context"
}
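Once the JSON has been written by hand, a small consistency check can catch slips against the guidelines above. The sketch below only inspects a finished extraction and does not perform the extraction itself. `check_extraction` is a hypothetical name, and the specific checks (metadata count, integer source_page, no empty strings) are one reading of the guidelines rather than a fixed specification.

```python
from typing import Any


def check_extraction(doc: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    meta = doc.get("extraction_metadata", {})
    records = doc.get("records", [])

    # Provenance: the metadata header should name the source document.
    if not meta.get("source_document"):
        problems.append("metadata: source_document is missing")
    # The declared record count should match the records actually present.
    if meta.get("total_records") != len(records):
        problems.append(
            f"metadata: total_records is {meta.get('total_records')}, "
            f"but {len(records)} records are present"
        )
    for i, record in enumerate(records):
        # Provenance: every record should point back to a page.
        if not isinstance(record.get("source_page"), int):
            problems.append(f"record {i}: source_page is missing or not an integer")
        # Missing data should be null (None in Python), never "".
        for field, value in record.items():
            if value == "":
                problems.append(f"record {i}: {field} is an empty string, use null")
    return problems
```

A typical use is to load the output file with `json.load` and print each returned problem before handing the file to the journalist.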
Output Files
Save to output/ directory:
- output/[schema_name].json - Extracted data
- output/extraction_report.md - Summary of extraction with any issues
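The source does not prescribe a format for extraction_report.md. As one possible starting point, the per-field notes recorded during extraction can be gathered into markdown bullets. The sketch assumes the `*_note` field naming shown in the examples above; `collect_notes` is a hypothetical helper name.

```python
def collect_notes(records: list[dict]) -> list[str]:
    """List every populated *_note field as a markdown bullet with its page."""
    lines = []
    for record in records:
        page = record.get("source_page")
        for field, value in record.items():
            if field.endswith("_note") and value:
                # Strip the "_note" suffix to recover the field being described.
                lines.append(f"- Page {page}, {field[:-5]}: {value}")
    return lines
```

The resulting bullets can be placed under an issues heading in the report, so the journalist sees every redacted or uncertain value together with its source page.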
File Structure
working_directory/
├── input.pdf                   # Original document
├── pages/                      # Page images
│   ├── page_001.png
│   └── ...
├── transcripts/                # Individual page transcripts
│   ├── page_001.md
│   └── ...
├── full_transcript.md          # Stitched transcript
├── schema_proposal.md          # Schema for journalist review
└── output/
    ├── [schema_name].json      # Final extracted data
    ├── extraction_report.md    # Extraction summary
    └── review_[document].html  # Interactive review interface
Step 6: Generate Review Interface
After extraction, generate a self-contained HTML review interface:
uv run scripts/generate_review_interface.py ./pages output/extracted.json \
--output output/review_document.html \
--document-name "FOIA Response 2024-001"
This creates a single HTML file the journalist can open in any browser, with no server, installation, or technical setup required.