name: doc_parser
description: Process scanned or structured PDFs using Docling, OCR validation, and VLM extraction to synthesize them into unified Obsidian-ready markdown study notes.
metadata:
  openclaw:
    emoji: π
    display_name: Document Parsing
    state_tracking:
      phases:
        - p0c
        - p0b
        - p0a
        - p1a
        - p1b_s
        - p1b
        - p1c
        - p1d
      labels:
        p0c: P0c (MarkItDown conversion)
        p0b: P0b (image extraction)
        p0a: P0a (diagnosis)
        p1a: P1a (extraction)
        p1b_s: P1b-S (extraction cleanup)
        p1b: P1b (vector charts)
        p1c: P1c (OCR evaluation)
        p1d: P1d (VLM vision)
    io_contracts:
      consumes:
        - application/pdf
        - application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
      produces:
        - text/markdown
Doc Parser Skill
Pipeline: PDF → Diagnose → Extract → OCR Gate → VLM Vision → Synthesis
Quick Start
# 1. Drop PDF files into the Universal Inbox
cp textbook.pdf data/raw/<SubjectName>/
# 2. Run in headless batch mode
python3 skills/doc_parser/scripts/run_all.py --process-all
# 3. Check pipeline progress
cat data/doc_parser/state/checklist.md
Core Extraction Architecture (V2.0 Antigravity)
| Phase | Script | Function |
|---|---|---|
| P00a | `p00a_diagnostic.py` | Security check + PDF metadata extraction |
| P00b | `p00b_png_pipeline.py` | Tesseract OCR and layout extraction for direct PNG inputs |
| P01a | `p01a_engine.py` | Primary Docling extraction → `raw_extracted.md` |
| P01b | `p01b_vector_charts.py` | Vector diagram detection and captioning |
| P01c | `p01c_ocr_gate.py` | Adaptive OCR decision (triggers Telegram HITL when confidence is low) |
| P01d | `p01d_vlm_vision.py` | Figure / image analysis via VLM (vision language model) |
| Handoff | (Autonomous) | The RouterAgent automatically forwards the extracted content to the proofreader skill for AI verification and dashboard review. |
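
The adaptive OCR decision in P01c can be pictured as a threshold on OCR confidence. The sketch below is only an illustration: the function name `ocr_gate`, the 0.80 threshold, and the return labels are assumptions and are not taken from `p01c_ocr_gate.py`.

```python
from statistics import mean


def ocr_gate(page_confidences, threshold=0.80):
    """Return "accept" or "needs_review" (hypothetical labels).

    page_confidences: per-page OCR confidence scores in [0, 1].
    threshold: assumed cut-off; the value used by the real pipeline may differ.
    """
    if not page_confidences:
        # No OCR evidence at all: safest to ask a human.
        return "needs_review"
    # Gate on the mean, and also escalate when any single page is very poor.
    too_low_overall = mean(page_confidences) < threshold
    one_page_very_poor = min(page_confidences) < threshold / 2
    if too_low_overall or one_page_very_poor:
        return "needs_review"
    return "accept"
```

In a design like this, a `"needs_review"` result is the point where a human-in-the-loop step (Telegram, per the table above) would be triggered.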
Common Commands
# Run full pipeline on all pending PDFs
python3 skills/doc_parser/scripts/run_all.py --process-all
# Run from a specific phase
python3 skills/doc_parser/scripts/run_all.py --from 2
# Force re-run on all files
python3 skills/doc_parser/scripts/run_all.py --process-all --force
# Process a single subject
python3 skills/doc_parser/scripts/run_all.py --subject <SubjectName>
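
For readers curious how flags like these are typically wired up, here is a minimal `argparse` sketch. It is not the actual `run_all.py` parser: the `build_parser` name, the `from_phase` attribute, and the integer type for `--from` (inferred from `--from 2` above) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_all.py")
    parser.add_argument("--process-all", action="store_true",
                        help="process every pending file")
    # "from" is a Python keyword, so store the value under another attribute name.
    parser.add_argument("--from", dest="from_phase", type=int, default=None,
                        help="phase to start from")
    parser.add_argument("--force", action="store_true",
                        help="re-run even if a file was already processed")
    parser.add_argument("--subject", default=None,
                        help="limit the run to one subject folder")
    return parser


# Example: parse the "force re-run" command line from the list above.
args = build_parser().parse_args(["--process-all", "--force"])
print(args.process_all, args.force, args.from_phase)  # True True None
```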
Security Contract
- Source PDF files are immutable: they are opened read-only and are never modified or moved.
- All extracted content is written via `AtomicWriter` to guarantee crash-safe output.
- The Security Manager runs a path-traversal check on every filename before any processing begins.
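
The two write-side guarantees above follow a common pattern: reject filenames that carry directory components, then write to a temporary file and rename it into place. The sketch below shows that pattern under stated assumptions. `is_safe_filename` and `atomic_write_text` are hypothetical helpers, not the skill's actual `AtomicWriter` or Security Manager API.

```python
import os
import tempfile
from pathlib import Path


def is_safe_filename(name: str) -> bool:
    """Reject names that could escape the target directory (hypothetical helper)."""
    if not name or name in {".", ".."}:
        return False
    # A plain filename has no path separators and no NUL bytes.
    return "/" not in name and "\\" not in name and "\x00" not in name


def atomic_write_text(path, text: str) -> None:
    """Write text so readers see either the old file or the complete new one."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # replaces the destination in one step (atomic on POSIX)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies before `os.replace`, only a stray `.tmp` file is left behind; the destination is never half-written.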
Global Standards
- Zero Temperature: `config.yaml` enforces `temperature: 0` so that model outputs are as repeatable as possible and less prone to hallucination.
- Headless CLI: Supports `--process-all`, `--from`, `--force`, `--resume`, and `--log-json` for full CI/CD compatibility.
- Preflight Check: Validates all dependencies and config on every run before processing begins.
- Checkpoint Resume: All phase completions are saved to `state/`; use `--resume` to continue after an interruption.
- macOS native notifications (`osascript`) and graceful `KeyboardInterrupt` handling with checkpoint save.
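
Checkpoint resume and graceful interrupt handling can be combined in a single loop. The following is a minimal sketch under stated assumptions: `run_phases`, the JSON list format, and the single state file are hypothetical and do not describe the real layout of `state/`.

```python
import json
from pathlib import Path


def run_phases(phases, state_file, resume=False):
    """Run (name, callable) phases in order, recording each completion.

    Returns the list of phase names completed so far.
    """
    state_file = Path(state_file)
    done = []
    if resume and state_file.exists():
        # Reload the phases finished by an earlier run.
        done = json.loads(state_file.read_text(encoding="utf-8"))
    try:
        for name, func in phases:
            if name in done:
                continue  # already completed, skip on resume
            func()
            done.append(name)
            # Save after every phase so a crash loses at most one phase of work.
            state_file.write_text(json.dumps(done), encoding="utf-8")
    except KeyboardInterrupt:
        # Persist progress before exiting so a later resume can pick up here.
        state_file.write_text(json.dumps(done), encoding="utf-8")
    return done
```

Calling `run_phases(phases, "state.json", resume=True)` after an interruption skips every phase already listed in the state file and continues with the first unfinished one.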