pdf-to-markdown - SKILL.md Agent Skill

name: pdf-to-markdown
description: Transform PDFs into pristine markdown optimized for LLM consumption. Uses docling for initial conversion, then AI-driven iterative cleanup to produce publication-quality output with perfect structure, consistent formatting, and zero artifacts. Handles any PDF type through adaptive pattern recognition - no hardcoded rules. Fast processing with --no-ocr flag.

PDF to Markdown Converter

Convert PDF documents to markdown using docling with intelligent cleanup and splitting.

Mandatory Workflow

You must follow the steps below in order. Use your TODO tool to complete each step, including its sub-tasks, thoroughly and consecutively before proceeding to the next step.

Step 1: Environment Setup (One-time)

bash scripts/setup_venv.sh

Step 2: PDF Conversion

Convert your PDF to initial markdown (use --no-ocr for faster processing):

# Local file:
python scripts/convert_full.py document.pdf --no-ocr

# Or from URL (docling downloads automatically):
python scripts/convert_full.py https://example.com/document.pdf --no-ocr

Step 3: Follow the 5-Phase Cleanup Process

🚨 CRITICAL: Work through all 5 phases sequentially. Each phase builds on the previous one:

Phase 1: Document Denoising → cleanup-phase1-denoising.md
Phase 2: Headers/Footers → cleanup-phase2-headers-footers.md
Phase 3: Basic Numbered Sections → cleanup-phase3-basic-numbered-sections.md
Phase 4: Context-Aware Subordinates → cleanup-phase4-context-aware-subordinates.md
Phase 5: Spacing/Formatting → cleanup-phase5-spacing-formatting.md

For each phase: Follow the iterative process described in the phase-specific documentation.

Step 4: Consider Document Splitting

After completing all 5 phases, evaluate whether splitting the document into smaller files would improve usability:

# Analyze potential split points
python scripts/analyze_split_points.py document.md

# Test splitting (if desired)
python scripts/split_markdown.py document.md --heading-level 2 --dry-run

# Apply splitting (if satisfied with test results)
python scripts/split_markdown.py document.md --heading-level 2
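To judge whether a split is worthwhile, it can help to see what "splitting at heading level 2" would produce. The sketch below is an illustration only: the internals of split_markdown.py are not shown in this document, and the function name `split_at_heading` and the "preamble" label are hypothetical.

```python
import re


def split_at_heading(markdown: str, level: int = 2) -> list[tuple[str, str]]:
    """Split text into (title, body) chunks at headings of exactly `level`."""
    # "## Title" matches for level=2; "### Title" does not, because a space
    # must follow the run of '#' characters.
    heading = re.compile(rf"^{'#' * level} +(.+?)\s*$")
    chunks: list[tuple[str, list[str]]] = [("preamble", [])]
    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            # Start a new chunk; the heading line stays with its section.
            chunks.append((match.group(1), [line]))
        else:
            chunks[-1][1].append(line)
    # Drop the preamble if nothing preceded the first matching heading.
    return [(title, "\n".join(lines)) for title, lines in chunks if lines]


sample = "# Doc\nIntro\n## Setup\nStep one\n### Detail\nMore\n## Usage\nRun it\n"
sections = split_at_heading(sample, level=2)
titles = [title for title, _ in sections]
for title, body in sections:
    print(f"{title}: {len(body.splitlines())} lines")
```

This simple version would also treat a line starting with `## ` inside a fenced code block as a heading, so a real splitter would likely need to track fences.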

Key Cleanup Principles

Work iteratively within each phase: dry-run → apply → verify

Small iterations: Fix 1-2 issues per iteration, not everything at once
Always verify: Extract samples after EACH change to confirm it worked
Test first: ALWAYS use --dry-run before applying substitutions
Document-specific patterns: Each PDF is unique - adapt to actual content
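Safe recovery: sed -i.backup saves the pre-change file as document.md.backup before each change; a later run overwrites that backup, so copy it elsewhere if you need older states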
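The principles above can be sketched as one small iteration in Python. This is a minimal sketch, not part of the skill's scripts: the helper names `dry_run` and `apply_with_backup`, the sample text, and the page-footer pattern are all hypothetical, and the backup naming only mirrors what `sed -i.backup` does.

```python
import re
import shutil
import tempfile
from pathlib import Path


def dry_run(path: Path, pattern: str) -> list[str]:
    """Return the lines the pattern would touch, without modifying the file."""
    regex = re.compile(pattern)
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if regex.search(line)]


def apply_with_backup(path: Path, pattern: str, replacement: str) -> Path:
    """Copy the file to <name>.backup, then apply the substitution in place."""
    backup = path.with_name(path.name + ".backup")
    shutil.copy2(path, backup)  # overwrites any earlier backup, like sed -i.backup
    text = path.read_text(encoding="utf-8")
    new_text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    path.write_text(new_text, encoding="utf-8")
    return backup


# One iteration on a throwaway file: dry-run, apply, verify.
workdir = Path(tempfile.mkdtemp())
doc = workdir / "document.md"
original = "Intro\nPage 1 of 3\nBody text\nPage 2 of 3\n"
doc.write_text(original, encoding="utf-8")

pattern = r"^Page \d+ of \d+$"                 # hypothetical footer pattern
preview = dry_run(doc, pattern)                # dry-run: what would change?
backup = apply_with_backup(doc, pattern, "")   # apply, keeping a backup
remaining = dry_run(doc, pattern)              # verify: nothing should match now
print(preview, remaining)
```

As with a sed substitution, the matched text is replaced but the line itself remains, so the emptied lines are left for a later spacing pass.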

Essential Commands for Each Phase

# Review document structure and patterns
python scripts/extract_samples.py document.md

# Test a substitution pattern  
sed 's/pattern/replacement/' document.md | head -20  # Preview changes

# Apply the pattern (creates automatic backup)
sed -i.backup 's/pattern/replacement/' document.md

# Verify the change worked as expected
python scripts/extract_samples.py document.md

Performance Notes

⚠️ Use --no-ocr flag unless processing scanned documents:

Without OCR: 5-10 minutes for a 180-page PDF ✅
With OCR: 60+ minutes for the same PDF ❌

See performance-guide.md for optimization details.

Core Scripts

convert_full.py

python scripts/convert_full.py <pdf_source> [--no-ocr] [-o output.md]

Supports both local files and URLs. Examples:

python scripts/convert_full.py document.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf --no-ocr  
python scripts/convert_full.py https://example.com/doc.pdf -o custom.md --no-ocr

extract_samples.py

python scripts/extract_samples.py <markdown_file> [--min-repeats N]

Shows document structure and repeated patterns for cleanup planning.
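To make "repeated patterns" concrete, here is a rough sketch of one way such detection could work: count identical non-blank lines and report those that reach a threshold. It is an assumption about the general idea, not the actual logic of extract_samples.py; the function `repeated_lines` and the sample text are hypothetical, and only the `--min-repeats` flag name comes from this document.

```python
from collections import Counter


def repeated_lines(markdown: str, min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (line, count) pairs for non-blank lines seen at least min_repeats times."""
    counts = Counter(
        line.strip() for line in markdown.splitlines() if line.strip()
    )
    # most_common() orders by descending count, so the noisiest lines come first.
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]


sample = "\n".join([
    "# Report",
    "ACME Corp Confidential",
    "First paragraph.",
    "ACME Corp Confidential",
    "Second paragraph.",
    "ACME Corp Confidential",
])
found = repeated_lines(sample, min_repeats=3)
print(found)
```

Lines that repeat this often are typical candidates for the header and footer cleanup in Phase 2.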

Cleanup with Native Sed

Use native sed for regex substitutions (faster and more reliable than Python wrappers):

Preview changes:

sed 's/old/new/g' document.md | head -20  # See first 20 lines of output

Apply changes (with automatic backup):

sed -i.backup 's/old/new/g' document.md

Pattern Development Strategy:

Use extract_samples.py to identify issues
Develop patterns specific to your document
Test with preview (sed 's/pattern/replacement/' file)
Apply with backup (sed -i.backup 's/pattern/replacement/' file)
Verify with extract_samples.py

See phase-specific documentation for detailed patterns and examples.
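Piping sed output through head shows the first 20 lines of the result whether or not they changed, so a pattern that only matches deep in the document can look like a no-op. As a complementary check, the sketch below prints only the lines a substitution would alter, using Python's difflib. The helper `preview_diff` and the numbered-section pattern are hypothetical examples, and Python regex syntax differs from sed's in places, so treat this as a way to explore a pattern rather than an exact preview of the sed command.

```python
import difflib
import re


def preview_diff(text: str, pattern: str, replacement: str) -> list[str]:
    """Return a unified diff (no context lines) of a per-line substitution."""
    before = text.splitlines()
    after = [re.sub(pattern, replacement, line) for line in before]
    return list(
        difflib.unified_diff(before, after, "before", "after", n=0, lineterm="")
    )


text = "# Title\n\n2.1  Scope\nBody text\n2.2  Terms\n"
# Hypothetical pattern: turn "2.1  Scope" style lines into level-3 headings.
diff = preview_diff(text, r"^(\d+\.\d+)\s+(.+)$", r"### \1 \2")
changed = [l for l in diff if l.startswith("+") and not l.startswith("+++")]
print("\n".join(diff))
```

An empty diff means the pattern matched nothing, which is worth knowing before running the in-place command.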

analyze_split_points.py / split_markdown.py

python scripts/analyze_split_points.py <markdown_file>
python scripts/split_markdown.py <markdown_file> --heading-level 2 [--dry-run]

Detailed Guides

Advanced Usage → docling-usage.md
Performance Optimization → performance-guide.md
Workflow Details → workflow-guide.md
Script Reference → script-reference.md