name: adu-pdf-extraction
description: "This skill extracts construction PDF plan binders into agent-consumable formats. It should be used when a contractor or homeowner provides a PDF binder of construction plans (site plans, floor plans, structural drawings, Title 24 reports) that needs to be parsed for permit review, corrections response, or plan check analysis. Produces three outputs: page PNGs for vision analysis, structured markdown per page via vision extraction, and a JSON manifest for routing."
Construction PDF Binder Extraction
Purpose
Extract multi-page construction plan PDF binders into a vision-first structure that enables an AI agent to efficiently navigate, reference, and respond to specific pages and drawing zones within the plans.
Construction PDFs are uniquely challenging because:
- Single PDF pages often contain multiple sub-pages composited together
- Text is rendered by CAD software in non-extractable ways
- Watermarks (e.g., "Study Set - Not For Construction") inject diagonal characters that pollute text extraction
- Drawing content (dimensions, callouts, symbols) carries critical meaning that only vision can interpret
- Title 24 energy reports are often rasterized images, not selectable text
When to Use
Invoke this skill when:
- A PDF binder of construction plans is provided (typically 10-30+ pages)
- Plan check corrections need to reference specific sheets and locations
- A permit checklist needs to be generated from submitted plans
- Any construction document needs to be made queryable by an AI agent
Why Vision-First (with Tesseract Cross-Reference)
Four text extraction methods were tested head-to-head on real construction PDFs.
Vision wins on every page type for structure and layout. See
references/extraction-findings.md for the full comparison data.
| Method | Drawing Pages | Text-Heavy Pages | Rasterized (Title 24) |
|---|---|---|---|
| pdftotext | Garbage | Usable | Empty |
| pdfplumber | Reversed text | Good | 367 chars |
| Tesseract OCR | Garbled | Good | Good |
| Claude Vision | Excellent | Excellent | Excellent |
Vision is the primary extraction method. It handles structure, spatial understanding, watermark transparency, drawing interpretation, and rasterized content reading. No other method comes close on construction PDFs.
Tesseract supplements vision for numeric accuracy. Testing on dense cover sheets revealed that vision at 1568px resolution can hallucinate specific numeric values — "65.0 sq ft" becomes "856", "475 sq ft" gets missed entirely. Tesseract's character-level OCR reliably captures exact digits even on dense pages. The hybrid approach: run both, give subagents both outputs, and cross-reference numbers. On drawing-heavy pages where Tesseract produces garbage, subagents are instructed to ignore it.
Do NOT use pdftotext or pdfplumber for construction PDFs. pdftotext
produces garbage on drawings and empty output on rasterized pages. pdfplumber
produces reversed text (*TEEHS REVOC*) — completely unusable.
At ~1,500 tokens per page PNG, a full 30-page binder costs ~45K tokens for complete vision extraction — trivial at production pricing.
Extraction Process
Step 1: Prepare Output Directory
Create the output directory structure:
{output-dir}/
├── pages-png/ # One PNG per PDF page (resized to 1568px max)
├── pages-text/ # Tesseract OCR text per page (numeric cross-ref)
├── pages-vision/ # Per-page outputs from vision subagents:
│ ├── page-NN.md # Structured markdown (detailed content)
│ └── page-NN.json # Manifest fragment (routing entry)
└── binder-manifest.json # Assembled manifest (routing artifact)
Step 2: Extract Page PNGs + Tesseract Text
Run scripts/extract-pages.sh to split the PDF into page PNGs, resize them
for API consumption, and run Tesseract OCR:
scripts/extract-pages.sh INPUT.pdf OUTPUT_DIR
The script does three things:
- Split PDF: Uses pdftoppm at 200 DPI. Each page becomes pages-png/page-01.png, page-02.png, etc.
- Resize PNGs to 1568px max: Claude's API internally resizes images to 1568px on the longest side. Construction PDFs at 200 DPI on D-size sheets produce 7200x4800 PNGs, so resizing before upload saves bandwidth and avoids the 32MB API payload limit without losing any information the model would actually see. Uses ImageMagick (Linux) or sips (macOS).
- Tesseract OCR: Runs tesseract on each resized PNG to produce pages-text/page-01.txt, etc. These raw text dumps supplement vision extraction by providing reliable numeric values for cross-reference. On drawing-heavy pages, Tesseract output will be garbage; subagents are instructed to recognize and ignore it.
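The resize step is simple aspect-ratio arithmetic. The sketch below is only an illustration of that arithmetic, not the script's actual implementation (the script delegates to ImageMagick or sips, which may round by a pixel differently). The function name `fit_within` is hypothetical.

```python
def fit_within(width: int, height: int, max_side: int = 1568) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# A D-size sheet rendered at 200 DPI (36 in x 24 in) is 7200x4800 pixels.
print(fit_within(7200, 4800))  # (1568, 1045)
```

The resized image carries roughly 1/21 of the original pixel count, which is where the bandwidth saving comes from.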
Note: CAD-generated construction PDFs commonly produce Poppler warnings like
"Syntax Error: insufficient arguments for Marked Content". These are harmless; the PNGs render correctly. The script suppresses these via 2>/dev/null.
If pdftoppm is not available, fall back to ImageMagick:
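magick -density 200 input.pdf -quality 90 -scene 1 output-dir/pages-png/page-%02d.png
ImageMagick numbers output files from 0 by default; -scene 1 starts at page-01.png so the names match the pdftoppm output. Then manually resize and run Tesseract on the resulting PNGs.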
Step 3: Vision Extract Every Page (Rolling Window)
Vision extraction is the most time-intensive step. Use a rolling window
of parallel subagents — one page per subagent, max 3 in flight at any time.
The full prompt template is in prompts/vision-extract-page.md — read it
and use it as the prompt for each subagent.
Why One Page Per Subagent
Each subagent conversation accumulates every image it reads into the message history. With multi-page batches, by the time the subagent processes page 3, all 3 PNGs are in context for every API call. This causes:
- API image limits: Claude's API enforces a 2000px-per-image cap when more than 20 images are in the conversation. Construction PNGs at 200 DPI are typically 7200x4800, well over this limit.
- Token waste: Each additional image in context costs ~1,500 tokens per API round-trip, even when only analyzing the current page.
- Quality degradation: More images in context = more noise for the model.
One page per subagent means exactly one image in context. No multi-image limits, cleaner extraction, and the Tesseract text file provides numeric cross-reference without adding image tokens.
Resource Constraints
Maximum 3 concurrent subagents. One page per subagent.
This is a hard constraint for deployment to Vercel sandboxes (4 GB RAM total). The orchestrator + 3 subagents = 4 processes, each getting ~1 GB RAM. Do not exceed 3 concurrent subagents under any circumstances.
Rolling Window Orchestration
Instead of fixed rounds (launch 3, wait for all 3, launch next 3), use a rolling window: launch 3 subagents, and as each one completes, immediately launch the next. This keeps 3 subagents in flight at all times until all pages are processed.
Task tool parameters (per subagent):
name: "vision-page-NN"
subagent_type: "general-purpose"
mode: "bypassPermissions"
run_in_background: true
prompt: (read from prompts/vision-extract-page.md,
replace {{PAGE_PNG}}, {{TEXT_FILE}}, {{OUTPUT_MD}},
{{OUTPUT_JSON}}, and {{SKILL_DIR}} with actual paths)
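Filling the prompt template is a plain placeholder substitution. The following is a minimal sketch, assuming the placeholders appear literally as `{{NAME}}` in the template; the one-line template string and the `out/` paths are hypothetical stand-ins, and `fill_prompt` is not part of the skill's scripts.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def fill_prompt(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} placeholder; fail loudly if one has no value."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical one-line template; the real one lives in prompts/vision-extract-page.md.
template = "Read {{PAGE_PNG}} and {{TEXT_FILE}}, then write {{OUTPUT_MD}} and {{OUTPUT_JSON}}."
page = 3
prompt = fill_prompt(template, {
    "PAGE_PNG": f"out/pages-png/page-{page:02d}.png",
    "TEXT_FILE": f"out/pages-text/page-{page:02d}.txt",
    "OUTPUT_MD": f"out/pages-vision/page-{page:02d}.md",
    "OUTPUT_JSON": f"out/pages-vision/page-{page:02d}.json",
    "SKILL_DIR": "skills/adu-pdf-extraction",  # unused by this toy template
})
print(prompt)
```

Raising on an unfilled placeholder keeps a subagent from being launched with a literal `{{...}}` in its prompt.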
- Count the total page PNGs in pages-png/
- Launch subagents for pages 1, 2, and 3 (3 in parallel)
- As each subagent completes, immediately launch the next page
- Continue until all pages are queued
- Wait for the final subagents to complete
- Verify all pages-vision/page-NN.md AND page-NN.json files exist
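The rolling-window steps above can be sketched with asyncio. This is an illustration of the scheduling pattern only: `extract_page` is a hypothetical stand-in that sleeps instead of launching a real subagent through the Task tool, and the names are assumptions rather than part of the skill.

```python
import asyncio

MAX_IN_FLIGHT = 3  # hard limit from the Resource Constraints section


async def extract_page(page: int) -> int:
    """Hypothetical stand-in for launching one vision subagent and waiting for it."""
    await asyncio.sleep(0.01 * (page % 3 + 1))  # simulate uneven completion times
    return page


async def rolling_window(pages: list[int], limit: int = MAX_IN_FLIGHT) -> tuple[list[int], int]:
    queue = list(pages)
    in_flight: set[asyncio.Task] = set()
    finished_pages: list[int] = []
    peak = 0
    while queue or in_flight:
        # Top up the window so `limit` tasks are running whenever pages remain.
        while queue and len(in_flight) < limit:
            in_flight.add(asyncio.create_task(extract_page(queue.pop(0))))
        peak = max(peak, len(in_flight))
        # Wake as soon as any one task finishes, instead of waiting for the whole round.
        done, in_flight = await asyncio.wait(in_flight, return_when=asyncio.FIRST_COMPLETED)
        finished_pages.extend(task.result() for task in done)
    return sorted(finished_pages), peak


finished, peak_in_flight = asyncio.run(rolling_window(list(range(1, 16))))
print(len(finished), peak_in_flight)
```

Waiting on `FIRST_COMPLETED` is what distinguishes the rolling window from fixed rounds: a slow page never holds back the launch of the next one.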
Throughput
| Binder Size | Subagents | Max Concurrent | Approx. Wall Time |
|---|---|---|---|
| 9 pages | 9 | 3 | ~3x single page |
| 15 pages | 15 | 3 | ~5x single page |
| 26 pages | 26 | 3 | ~9x single page |
| 30 pages | 30 | 3 | ~10x single page |
Each subagent takes ~3-4 minutes (read references, read PNG, write .md, write .json). With 3 concurrent, a 26-page binder completes in ~30 minutes.
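The multipliers in the table follow from dividing the page count by the concurrency limit and rounding up. A rough estimator, assuming every page takes about the same time (3.5 minutes is simply the midpoint of the ~3-4 minute range quoted above; the function names are hypothetical):

```python
import math


def waves(pages: int, max_concurrent: int = 3) -> int:
    """Number of single-page durations needed when max_concurrent pages run at once."""
    return math.ceil(pages / max_concurrent)


def estimate_wall_minutes(pages: int, minutes_per_page: float = 3.5, max_concurrent: int = 3) -> float:
    return waves(pages, max_concurrent) * minutes_per_page


for size in (9, 15, 26, 30):
    print(size, waves(size), estimate_wall_minutes(size))
```

For 26 pages this gives 9 waves and about 31.5 minutes, in line with the ~30 minute figure.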
Output Format
Each subagent writes two files per page to pages-vision/:
- page-NN.md: Structured markdown with full extracted content:
  - Title block identification (sheet number, title, firm)
  - All text content (tables, notes, schedules, specifications)
  - Spatial zone mapping for every content element
  - Drawing descriptions for non-text content
  - Confidence annotations for watermark-obscured or low-resolution content
- page-NN.json: Manifest fragment for routing:
  - Page metadata (sheet_id, category, subcategory)
  - key_content array with specific values (guided by extraction priorities)
  - topics keyword tags for corrections letter matching
  - drawing_zones spatial map
  - "NOT SHOWN: [item]" entries for expected-but-absent content
  - Cover sheet fragment includes _project metadata
The extraction priorities reference (references/adu-extraction-priorities.md)
guides subagents on what to capture with specificity and what to flag as absent
for each content type. This produces manifest entries targeted for corrections
letter routing without any decision-making about compliance.
See prompts/vision-extract-page.md for the full prompt template including
both output formats.
Step 4: Assemble the Manifest
The manifest is what makes everything else useful. It enables an agent to route to the correct page(s) without loading all pages into context.
Since each vision subagent already wrote a JSON manifest fragment per page (in Step 3), assembly is deterministic — no LLM needed.
Run the assembly script:
python3 scripts/assemble-manifest.py {output}/pages-vision {output}/binder-manifest.json
The script:
- Reads all page-NN.json fragments from pages-vision/
- Extracts _project metadata from the cover sheet fragment
- Combines into { "project": {...}, "pages": [...] }
- Validates required fields and page numbering
- Writes binder-manifest.json
Exit codes: 0 = clean, 1 = assembled with issues, 2 = fatal error.
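To make the assembly steps concrete, here is a minimal sketch of what such a script could do. It is not the contents of scripts/assemble-manifest.py: the `REQUIRED_FIELDS` tuple is an assumed subset of the schema, page numbers are taken from the filenames, and the sample fragments are invented for the demo.

```python
import json
import re
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("sheet_id", "category")  # assumed subset; see references/manifest-schema.md


def assemble(fragment_dir: Path) -> tuple[dict, list[str]]:
    """Combine page-NN.json fragments into one manifest and collect validation issues."""
    issues: list[str] = []
    project: dict = {}
    pages: list[dict] = []
    numbers: list[int] = []
    for path in sorted(fragment_dir.glob("page-*.json")):
        match = re.fullmatch(r"page-(\d+)", path.stem)
        if not match:
            continue
        numbers.append(int(match.group(1)))
        fragment = json.loads(path.read_text())
        meta = fragment.pop("_project", None)  # only the cover sheet carries this
        if meta and not project:
            project = meta
        for field in REQUIRED_FIELDS:
            if field not in fragment:
                issues.append(f"{path.name}: missing {field}")
        pages.append(fragment)
    if sorted(numbers) != list(range(1, len(numbers) + 1)):
        issues.append("page numbering is not contiguous from 1")
    if not project:
        issues.append("no _project metadata found")
    return {"project": project, "pages": pages}, issues


# Demo with two invented fragments.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "page-01.json").write_text(json.dumps(
        {"sheet_id": "CS", "category": "general", "_project": {"address": "1232 Example St"}}))
    (folder / "page-02.json").write_text(json.dumps(
        {"sheet_id": "A1", "category": "architectural"}))
    manifest, issues = assemble(folder)
print(json.dumps(manifest, indent=2), issues)
```

An empty `issues` list would correspond to exit code 0 and a non-empty one to exit code 1.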
The orchestrator MUST always review the assembled manifest (see Step 4a).
The assembly script is deterministic but not smart — it can concatenate JSON
but it cannot catch semantic issues like a wrong category, a vague
key_content entry, or a missing _project field that should have been
extracted. A quick orchestrator read-through catches things the script never
could.
The assembled manifest follows the schema in references/manifest-schema.md.
Each page entry captures:
- Sheet ID and title: from the title block
- Category: general, architectural, structural, energy, code_compliance, mechanical, plumbing, electrical
- What's on the page: key content items with exact values, specific enough to match correction letter items
- What's NOT on the page: "NOT SHOWN: [item]" entries for expected-but-absent content (guided by extraction priorities)
- Topics: keyword tags for routing
- Drawing zones: spatial map of where things are on the page
Step 4a: Orchestrator Review (ALWAYS — Not Optional)
After the assembly script runs, the orchestrator must read
binder-manifest.json and review it. This takes seconds and catches
things the script cannot.
Standard Review Checklist
- Read the full binder-manifest.json
- Verify project metadata is populated (address, type, owner, sqft)
- Verify page count matches PNG count
- Spot-check that sheet_id values look reasonable
- Check that key_content arrays have specific values, not vague entries
- If the script reported issues (exit code 1), fix them
- Fix any JSON errors, missing fields, or wrong categories
Cross-Page Consistency Check (Critical)
Vision models can hallucinate individual digits — a "3" read as "2", a "5"
as "6". When this happens on the cover sheet, the wrong value cascades into
project metadata and poisons everything downstream.
Every page's JSON fragment includes a title_block_address field — the
address as read independently from that page's title block. The orchestrator
must use these to verify project-level values:
- Collect all title_block_address values from every page entry
- Majority vote on the address: the value that appears on the most pages is the correct address. If the project.address differs from the majority, fix it.
- Apply the same logic to other repeated values: project type, designer firm, and structural engineer firm appear on multiple title blocks. When there's a conflict, the majority wins.
- Log any corrections: when the orchestrator overrides a value, note what was changed and why (e.g., "Fixed address from 1222 to 1232 — cover sheet hallucination, 14/15 pages read 1232").
This check exists because in testing, the vision model misread "1232" as "1222" on one page, and that single error propagated through the entire manifest. With 15 pages each independently reading the title block, a single-page hallucination is trivially detectable.
This review is cheap (one file read + a few string comparisons) and prevents the scenario where subagents did great work but a single hallucination or script assembly glitch ruins the output.
Reading Title Blocks
Construction plan title blocks follow consistent conventions:
- Location: Bottom-right corner or right edge of each sheet
- Contains: Sheet number (e.g., "A2", "S1"), sheet title, designer/engineer name, project info, revision dates
- Sheet numbering convention:
  - CS = Cover Sheet
  - A prefix = Architectural (site plans, floor plans, elevations)
  - S prefix = Structural (foundation, framing, details)
  - SN prefix = Structural Notes
  - T prefix = Title 24 / Energy
  - AIA prefix = CalGreen/code checklists
  - M prefix = Mechanical
  - P prefix = Plumbing
  - E prefix = Electrical
Drawing Zone Mapping
To enable precise references like "Sheet S2, detail 8, mid-left quadrant":
- Divide each page into a grid (top/middle/bottom x left/center/right)
- For detail sheets with numbered detail bubbles, map bubble numbers to zones
- For plans, note which drawing is in which half (e.g., "left-half: foundation plan, right-half: framing plan")
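As an illustration of the 3x3 grid, the sketch below maps a pixel position to a zone name. It assumes pixel coordinates with the origin at the top-left of the page PNG and uses "mid" for the middle row to match the "mid-left" wording above; `zone_for_point` is a hypothetical helper, since the vision subagents assign zones by inspection rather than by coordinates.

```python
ROWS = ("top", "mid", "bottom")
COLS = ("left", "center", "right")


def zone_for_point(x: float, y: float, width: float, height: float) -> str:
    """Name the grid cell containing (x, y) on a page of the given size."""
    if not (0 <= x <= width and 0 <= y <= height):
        raise ValueError("point lies outside the page")
    col = min(int(3 * x / width), 2)   # clamp so x == width falls in the last column
    row = min(int(3 * y / height), 2)  # clamp so y == height falls in the last row
    return f"{ROWS[row]}-{COLS[col]}"


# On a 1568x1045 page, a point near the left edge at mid height:
print(zone_for_point(100, 500, 1568, 1045))  # mid-left
```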
Step 5: Validate Outputs
After extraction, verify:
- PNG count matches PDF page count
- Vision markdown files exist for every page
- Manifest JSON is valid and has entries for every page
- Every sheet_id in the manifest matches what's visible in the PNG title block
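The file-level checks can be automated; the sketch below covers only those. It does not compare against the PDF's own page count or confirm that each sheet_id matches its title block, both of which need other tools (a PDF reader and vision, respectively). The function name and the demo layout are assumptions.

```python
import json
import tempfile
from pathlib import Path


def validate_outputs(output_dir: Path) -> list[str]:
    """Return a list of problems found in an extraction output directory."""
    problems: list[str] = []
    pngs = sorted((output_dir / "pages-png").glob("page-*.png"))
    if not pngs:
        problems.append("no page PNGs found")
    for png in pngs:
        for suffix in (".md", ".json"):
            expected = output_dir / "pages-vision" / (png.stem + suffix)
            if not expected.is_file():
                problems.append(f"missing {expected.name}")
    try:
        manifest = json.loads((output_dir / "binder-manifest.json").read_text())
    except (OSError, json.JSONDecodeError) as exc:
        problems.append(f"manifest unreadable: {exc}")
    else:
        entries = len(manifest.get("pages", []))
        if entries != len(pngs):
            problems.append(f"manifest has {entries} page entries for {len(pngs)} PNGs")
    return problems


# Demo on a throwaway directory with one deliberately missing fragment.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for sub in ("pages-png", "pages-vision"):
        (out / sub).mkdir()
    for n in (1, 2):
        (out / "pages-png" / f"page-{n:02d}.png").touch()
        (out / "pages-vision" / f"page-{n:02d}.md").touch()
    (out / "pages-vision" / "page-01.json").touch()  # page-02.json is left out on purpose
    (out / "binder-manifest.json").write_text(json.dumps({"project": {}, "pages": [{}, {}]}))
    problems = validate_outputs(out)
print(problems)
```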
Using Extraction Results
For Corrections Response (Flow 2)
When interpreting a corrections letter against extracted plans:
- Parse each correction item for keywords
- Match keywords against manifest topics and key_content arrays
- Load only the matched page PNGs into context (vision) for verification
- Use the pages-vision/ markdown for quick text searches
- Reference corrections by sheet ID and drawing zone: "See Sheet S2 (page 11), Shearwall Schedule in the mid-left quadrant"
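The keyword-matching step might look like the following. This is a deliberately naive sketch, assuming each page entry has `sheet_id`, `topics`, and `key_content` as described in the manifest; the stopword list, the scoring by token overlap, and the sample manifest are all invented for illustration, and a real router would likely need stemming or synonym handling.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "on", "for", "in", "provide", "show"}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words, minus common filler words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def route_correction(correction: str, manifest: dict) -> list[str]:
    """Return sheet IDs whose topics or key_content share words with the correction item."""
    wanted = tokens(correction)
    scored = []
    for page in manifest["pages"]:
        searchable = " ".join(page.get("topics", []) + page.get("key_content", []))
        score = len(wanted & tokens(searchable))
        if score:
            scored.append((score, page["sheet_id"]))
    scored.sort(key=lambda item: -item[0])  # best match first; ties keep page order
    return [sheet_id for _, sheet_id in scored]


# Invented two-page manifest.
sample = {"pages": [
    {"sheet_id": "A1", "topics": ["site plan", "setbacks"],
     "key_content": ["Setbacks: 4 ft side and rear"]},
    {"sheet_id": "S2", "topics": ["shearwall", "holdown"],
     "key_content": ["Shearwall Schedule: type 3"]},
]}
matches = route_correction("Provide shearwall schedule and holdown callouts", sample)
print(matches)
```

Only the PNGs for the returned sheets then need to be loaded for visual verification.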
For Permit Checklist (Flow 1)
When generating a permit checklist from extracted plans:
- Load the cover sheet manifest entry for project overview
- Walk each category (architectural, structural, energy) loading relevant pages
- Use vision markdown files for data extraction, PNGs for verification
- Cross-reference against ADU regulatory skill requirements
Typical Sheet Types in ADU Binders
For reference, a typical California ADU plan binder contains:
| Category | Typical Sheets | What to Look For |
|---|---|---|
| General | CS (Cover) | Scope of work, sheet index, lot coverage, general notes |
| Code | AIA.1, AIA.2 | CalGreen checklists, compliance checkboxes |
| Architectural | A1-A4 | Site plan, floor plan, elevations, sections, schedules |
| Structural | SN1-SN2, S1-S3 | Notes, foundation, framing, details, shearwall schedules |
| Energy | T-1 through T-3 | CF1R compliance, HVAC specs, mandatory requirements |
| MEP | M1, P1, E1 | Mechanical, plumbing, electrical (not always separate sheets) |
Orchestration Summary
The full extraction workflow. Hard limit: max 3 concurrent subagents, 1 page per subagent (4 GB RAM deployment environment).
Example for a 15-page binder:
Step 1: mkdir -p {output}/pages-png {output}/pages-text {output}/pages-vision
Step 2: bash scripts/extract-pages.sh INPUT.pdf {output}
→ Split PDF into PNGs (200 DPI)
→ Resize PNGs to 1568px max (API internal limit)
→ Run Tesseract OCR → pages-text/page-01.txt through page-15.txt
→ produces pages-png/page-01.png through page-15.png
Step 3: Rolling window of vision subagents (prompts/vision-extract-page.md)
→ Launch page-01, page-02, page-03 in parallel (3 in flight)
→ page-01 completes → launch page-04 (still 3 in flight)
→ page-03 completes → launch page-05
→ ... continue until all 15 pages queued ...
→ Wait for final subagents to complete
→ Verify: pages-vision/page-NN.md AND page-NN.json exist for all 15
Step 4: python3 scripts/assemble-manifest.py {output}/pages-vision {output}/binder-manifest.json
→ Reads all page-NN.json fragments
→ Assembles binder-manifest.json (deterministic, no LLM)
Step 4a: Orchestrator reads binder-manifest.json (ALWAYS, not optional)
→ Cross-page consistency check: majority-vote address + repeated values
→ Verifies project metadata, page count, key_content quality
→ Fixes any assembly issues, hallucinations, or missing fields
Step 5: Validate all outputs
Steps 1-2 are sequential (bash). Step 3 uses a rolling window — as each subagent finishes, the next page launches immediately. Each subagent reads one PNG, writes one .md and one .json. Step 4 is a fast Python script (no LLM call). Step 5 is orchestrator validation.
Why one page per subagent? Each subagent's conversation accumulates every image it reads. With 3 pages per subagent, the 3rd page's API calls include all 3 PNGs in context — wasting tokens, risking API image limits (2000px cap for >20 images in conversation), and degrading quality. One-per-subagent keeps exactly one image in context at all times.
Why inline fragments? Each vision subagent already has the page image in context and has done the deep analysis. Writing a manifest entry at that point is nearly free — just reformatting what it already knows into JSON. This is faster and more accurate than a separate manifest subagent re-reading all the markdown files.
Resources
scripts/
- extract-pages.sh: Split PDF into per-page PNGs (200 DPI), resize to 1568px, run Tesseract OCR for hybrid text cross-reference
- assemble-manifest.py: Assemble page JSON fragments into binder-manifest.json
prompts/
- vision-extract-page.md: Subagent prompt template for single-page vision extraction (produces both markdown and JSON manifest fragment for one page)
- vision-extract-batch.md: Legacy batch prompt (retained for reference; the single-page approach in vision-extract-page.md supersedes this)
- build-manifest.md: Legacy manifest subagent prompt (retained for reference; the inline fragment approach supersedes this)
references/
- manifest-schema.md: JSON schema and field descriptions for binder-manifest.json
- adu-extraction-priorities.md: Domain-aware extraction guide: what to capture, what to flag as absent, and corrections letter terminology by content type
- extraction-findings.md: Lessons learned from testing on real construction PDFs