name: pdf-to-markdown
description: Transform PDFs into pristine markdown optimized for LLM consumption. Uses docling for initial conversion, then AI-driven iterative cleanup to produce publication-quality output with perfect structure, consistent formatting, and zero artifacts. Handles any PDF type through adaptive pattern recognition - no hardcoded rules. Fast processing with --no-ocr flag.
PDF to Markdown Converter
Convert PDF documents to markdown using docling with intelligent cleanup and splitting.
Mandatory Workflow
You must follow the steps below in order. Use your TODO tool to complete each step, including its sub-tasks, thoroughly and consecutively before proceeding to the next step.
Step 1: Environment Setup (One-time)
bash scripts/setup_venv.sh
Step 2: PDF Conversion
Convert your PDF to initial markdown (use --no-ocr for faster processing):
# Local file:
python scripts/convert_full.py document.pdf --no-ocr
# Or from URL (docling downloads automatically):
python scripts/convert_full.py https://example.com/document.pdf --no-ocr
Step 3: Follow the 5-Phase Cleanup Process
🚨 CRITICAL: Work through all 5 phases sequentially. Each phase builds on the previous one:
- Phase 1: Document Denoising → cleanup-phase1-denoising.md
- Phase 2: Headers/Footers → cleanup-phase2-headers-footers.md
- Phase 3: Basic Numbered Sections → cleanup-phase3-basic-numbered-sections.md
- Phase 4: Context-Aware Subordinates → cleanup-phase4-context-aware-subordinates.md
- Phase 5: Spacing/Formatting → cleanup-phase5-spacing-formatting.md
For each phase: Follow the iterative process described in the phase-specific documentation.
Step 4: Consider Document Splitting
After completing all 5 phases, evaluate if splitting would improve usability:
# Analyze potential split points
python scripts/analyze_split_points.py document.md
# Test splitting (if desired)
python scripts/split_markdown.py document.md --heading-level 2 --dry-run
# Apply splitting (if satisfied with test results)
python scripts/split_markdown.py document.md --heading-level 2
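As a rough illustration of what a dry run at heading level 2 might report, the sketch below divides markdown text at `## ` headings and lists the resulting sections without writing any files. It is not the implementation of split_markdown.py: the function name `plan_split` and the "preamble" label for text before the first heading are hypothetical, and headings inside fenced code blocks are not handled.

```python
import re


def plan_split(markdown: str, heading_level: int = 2) -> list[tuple[str, int]]:
    """Return (section title, line count) pairs for a split at one heading level.

    Hypothetical helper: it only reports the plan, like a dry run, and writes nothing.
    """
    # Match headings of exactly this level, e.g. "## Title" for level 2.
    heading = re.compile(r"^#{%d}\s+(.+?)\s*$" % heading_level)
    sections: list[tuple[str, int]] = []
    title, count = "preamble", 0
    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            if count:
                sections.append((title, count))
            title, count = match.group(1), 0
        count += 1
    if count:
        sections.append((title, count))
    return sections


sample = "# Doc\nintro\n## First\na\nb\n## Second\nc\n"
for name, lines in plan_split(sample):
    print(f"{name}: {lines} lines")
```

Very short or very uneven sections in such a listing are a hint that a different heading level, or no splitting at all, may serve the document better.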
Key Cleanup Principles
Work iteratively within each phase: dry-run → apply → verify
- Small iterations: Fix 1-2 issues per iteration, not everything at once
- Always verify: Extract samples after EACH change to confirm it worked
- Test first: ALWAYS use --dry-run before applying substitutions
- Document-specific patterns: Each PDF is unique - adapt to actual content
- Safe recovery: Automatic backup files created for each change
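The dry-run, apply, verify cycle can be sketched in Python as follows. This is only an illustration of the loop, not a replacement for the sed commands the workflow uses: `substitute_lines` is a hypothetical helper, the sample file content is invented, and Python's regex syntax differs from sed's in places.

```python
import re
import shutil
import tempfile
from pathlib import Path


def substitute_lines(text: str, pattern: str, replacement: str):
    """Apply the substitution line by line; return (new_text, changed_pairs)."""
    regex = re.compile(pattern)
    out, changed = [], []
    for line in text.split("\n"):
        new_line = regex.sub(replacement, line)
        if new_line != line:
            changed.append((line, new_line))
        out.append(new_line)
    return "\n".join(out), changed


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "document.md"
    doc.write_text("Title\nPage 1 of 9\nBody text\nPage 2 of 9\n", encoding="utf-8")
    pattern, replacement = r"^Page \d+ of \d+$", ""

    # 1. Dry run: inspect what would change without touching the file.
    _, preview = substitute_lines(doc.read_text(encoding="utf-8"), pattern, replacement)
    for before, after in preview[:20]:
        print(f"{before!r} -> {after!r}")

    # 2. Apply: keep a copy next to the original first.
    shutil.copy2(doc, doc.with_name(doc.name + ".backup"))
    new_text, _ = substitute_lines(doc.read_text(encoding="utf-8"), pattern, replacement)
    doc.write_text(new_text, encoding="utf-8")

    # 3. Verify: the pattern should no longer match anywhere.
    remaining = re.findall(pattern, doc.read_text(encoding="utf-8"), flags=re.MULTILINE)
    print("remaining matches:", len(remaining))
```

Keeping each iteration to one pattern makes the verify step meaningful: if the remaining-match count is not zero, or unrelated lines appear in the preview, only that one pattern needs to be revisited.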
Essential Commands for Each Phase
# Review document structure and patterns
python scripts/extract_samples.py document.md
# Test a substitution pattern
sed 's/pattern/replacement/' document.md | head -20 # Preview changes
# Apply the pattern (creates automatic backup)
sed -i.backup 's/pattern/replacement/' document.md
# Verify the change worked as expected
python scripts/extract_samples.py document.md
Performance Notes
⚠️ Use the --no-ocr flag unless processing scanned documents:
- Without OCR: 5-10 minutes for a 180-page PDF
- With OCR: 60+ minutes for the same PDF
See performance-guide.md for optimization details.
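To put those figures in per-page terms, the short calculation below converts them to seconds per page. It uses only the numbers quoted above; applying the resulting rates to documents of other sizes would assume roughly linear scaling, which the source does not state.

```python
def seconds_per_page(minutes: float, pages: int) -> float:
    """Convert a total runtime in minutes to an average per-page time in seconds."""
    return minutes * 60 / pages


PAGES = 180
fast_low = seconds_per_page(5, PAGES)
fast_high = seconds_per_page(10, PAGES)
ocr_min = seconds_per_page(60, PAGES)

print(f"without OCR: {fast_low:.1f}-{fast_high:.1f} s/page")  # 1.7-3.3 s/page
print(f"with OCR: at least {ocr_min:.1f} s/page")             # at least 20.0 s/page
print(f"slowdown: at least {60 / 10:.0f}x")                   # at least 6x
```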
Core Scripts
convert_full.py
python scripts/convert_full.py <pdf_source> [--no-ocr] [-o output.md]
Supports both local files and URLs. Examples:
python scripts/convert_full.py document.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf -o custom.md --no-ocr
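When the conversion is driven from another Python program, the same command line can be assembled as an argument list. The sketch below is a hypothetical wrapper (`build_convert_command` is not part of the skill) and uses only the flags shown in the usage line above.

```python
import sys
from typing import Optional


def build_convert_command(source: str, no_ocr: bool = True,
                          output: Optional[str] = None) -> list[str]:
    """Assemble the argument list for scripts/convert_full.py."""
    command = [sys.executable, "scripts/convert_full.py", source]
    if output is not None:
        command += ["-o", output]
    if no_ocr:
        command.append("--no-ocr")
    return command


command = build_convert_command("https://example.com/doc.pdf", output="custom.md")
print(" ".join(command[1:]))

# To actually run the conversion from the skill's root directory:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list rather than a single shell string avoids quoting problems when a file name or URL contains spaces or special characters.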
extract_samples.py
python scripts/extract_samples.py <markdown_file> [--min-repeats N]
Shows document structure and repeated patterns for cleanup planning.
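One simple way to surface repeated patterns is to count identical lines and keep those above a threshold. The sketch below illustrates that idea only; it is not the logic of extract_samples.py, and the `repeated_lines` name and the default threshold of 3 are assumptions.

```python
from collections import Counter


def repeated_lines(markdown: str, min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (line, count) pairs for non-blank lines seen at least `min_repeats` times."""
    counts = Counter(line.strip() for line in markdown.splitlines() if line.strip())
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]


# Invented sample: the same running header appears on three "pages".
pages = [f"ACME Corp Confidential\nParagraph {i}\n" for i in range(1, 4)]
sample = "\n".join(pages)
print(repeated_lines(sample))
```

Lines that repeat exactly, such as running headers and footers, are usually the safest first candidates for a substitution pattern.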
Cleanup with Native Sed
Use native sed for regex substitutions (faster and more reliable than Python wrappers):
Preview changes:
sed 's/old/new/g' document.md | head -20 # See first 20 lines of output
Apply changes (with automatic backup):
sed -i.backup 's/old/new/g' document.md
Pattern Development Strategy:
- Use extract_samples.py to identify issues
- Develop patterns specific to your document
- Test with preview (sed 's/pattern/replacement/' file)
- Apply with backup (sed -i.backup 's/pattern/replacement/' file)
- Verify with extract_samples.py
See phase-specific documentation for detailed patterns and examples.
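Before running a new pattern against the whole document, it can help to check it against a few lines that should match and a few that should not. The sketch below shows one way to do that; `check_pattern` and the sample lines are hypothetical, and Python's regex dialect is not identical to sed's, so the final pattern still needs a sed preview.

```python
import re


def check_pattern(pattern: str, should_match: list[str],
                  should_not_match: list[str]) -> list[str]:
    """Return descriptions of sample lines the pattern handles incorrectly."""
    regex = re.compile(pattern)
    problems = []
    for line in should_match:
        if not regex.search(line):
            problems.append(f"missed: {line!r}")
    for line in should_not_match:
        if regex.search(line):
            problems.append(f"false positive: {line!r}")
    return problems


# Invented samples, of the kind one might copy from a document under cleanup.
noise = ["Page 3 of 180", "Page 12 of 180"]
content = ["See Page 3 of the appendix", "## 3 Results"]

print(check_pattern(r"^Page \d+ of \d+$", noise, content))  # anchored pattern
print(check_pattern(r"Page \d+", noise, content))           # too loose
```

The anchored pattern reports no problems, while the looser one flags a body sentence as a false positive, which is the kind of damage a preview over only the first 20 lines can miss.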
analyze_split_points.py / split_markdown.py
python scripts/analyze_split_points.py <markdown_file>
python scripts/split_markdown.py <markdown_file> --heading-level 2 [--dry-run]
Detailed Guides
- Advanced Usage → docling-usage.md
- Performance Optimization → performance-guide.md
- Workflow Details → workflow-guide.md
- Script Reference → script-reference.md