pdf-to-markdown

Transform PDFs into pristine markdown optimized for LLM consumption. Uses docling for initial conversion, then AI-driven iterative cleanup to produce publication-quality output with perfect structure, consistent formatting, and zero artifacts. Handles any PDF type through adaptive pattern recognition - no hardcoded rules. Fast processing with --no-ocr flag.

By tbrandenburg. Updated 2/28/2026

PDF to Markdown Converter

Convert PDF documents to markdown using docling with intelligent cleanup and splitting.

Mandatory Workflow

You must follow the steps below in order. Use your TODO tool to complete each step, including its sub-tasks, thoroughly and consecutively before proceeding to the next step.

Step 1: Environment Setup (One-time)

bash scripts/setup_venv.sh

Step 2: PDF Conversion

Convert your PDF to initial markdown (use --no-ocr for faster processing):

# Local file:
python scripts/convert_full.py document.pdf --no-ocr

# Or from URL (docling downloads automatically):
python scripts/convert_full.py https://example.com/document.pdf --no-ocr
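
For orientation, here is a minimal sketch of what a docling-based conversion with OCR switched off can look like. It is not the contents of convert_full.py: the function names `default_output_path` and `convert_pdf` are hypothetical, the "<name>.md" output naming is an assumption, and the docling calls reflect its documented Python API as far as I know it, so check them against your installed version.

```python
from pathlib import Path
from urllib.parse import urlparse


def default_output_path(source: str) -> Path:
    """Derive '<name>.md' from a local path or URL (assumed naming convention)."""
    parsed = urlparse(source)
    # For URLs use only the path component; for local files use the whole string.
    raw = parsed.path if parsed.scheme in ("http", "https") else source
    return Path(f"{Path(raw).stem or 'document'}.md")


def convert_pdf(source: str, use_ocr: bool = False) -> str:
    """Convert a PDF (local path or URL) to markdown text with docling."""
    # Imported lazily so the helper above works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_ocr = use_ocr  # False mirrors the --no-ocr fast path
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

A caller would write the returned text to `default_output_path(source)`, for example `default_output_path("document.pdf").write_text(convert_pdf("document.pdf"), encoding="utf-8")`.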

Step 3: Follow the 5-Phase Cleanup Process

🚨 CRITICAL: Work through all 5 phases sequentially. Each phase builds on the previous one:

  1. Phase 1: Document Denoising → cleanup-phase1-denoising.md
  2. Phase 2: Headers/Footers → cleanup-phase2-headers-footers.md
  3. Phase 3: Basic Numbered Sections → cleanup-phase3-basic-numbered-sections.md
  4. Phase 4: Context-Aware Subordinates → cleanup-phase4-context-aware-subordinates.md
  5. Phase 5: Spacing/Formatting → cleanup-phase5-spacing-formatting.md

For each phase: Follow the iterative process described in the phase-specific documentation.

Step 4: Consider Document Splitting

After completing all 5 phases, evaluate if splitting would improve usability:

# Analyze potential split points
python scripts/analyze_split_points.py document.md

# Test splitting (if desired)
python scripts/split_markdown.py document.md --heading-level 2 --dry-run

# Apply splitting (if satisfied with test results)
python scripts/split_markdown.py document.md --heading-level 2
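
To make the idea of splitting at a heading level concrete, the sketch below cuts a markdown string into chunks at headings of exactly one level. It is an illustration only, not the implementation of split_markdown.py; the function name `split_by_heading` and the "preamble" label for text before the first heading are assumptions.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_by_heading(text: str, level: int = 2) -> list[tuple[str, str]]:
    """Split markdown into (title, body) chunks at headings of exactly `level`."""
    heading = re.compile(rf"^#{{{level}}} +(.*\S)\s*$")
    chunks: list[tuple[str, list[str]]] = [("preamble", [])]
    in_fence = False
    for line in text.split("\n"):
        # Track fenced code blocks so '## ...' inside code is not treated as a heading.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else heading.match(line)
        if match:
            chunks.append((match.group(1), [line]))
        else:
            chunks[-1][1].append(line)
    # Drop the preamble if nothing preceded the first heading.
    return [(title, "\n".join(body)) for title, body in chunks if body]
```

A dry run in this style would print each title with its line count and write nothing; applying the split would write one file per chunk.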

Key Cleanup Principles

Work iteratively within each phase: dry-run → apply → verify

  • Small iterations: Fix 1-2 issues per iteration, not everything at once
  • Always verify: Extract samples after EACH change to confirm it worked
  • Test first: ALWAYS preview before applying. For sed, run the substitution without -i and inspect the output; for split_markdown.py, use --dry-run
  • Document-specific patterns: Each PDF is unique - adapt to actual content
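  • Safe recovery: sed -i.backup saves the previous version as document.md.backup before each change (each run overwrites the prior backup, so copy it elsewhere if you need older states)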
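
The dry-run, apply, verify loop can also be sketched in Python for cases where sed is unavailable. This is a hypothetical helper (`apply_substitution` is not one of the skill's scripts). It uses Python's `re` syntax rather than sed's, and replaces every match on a line, as sed does with the `g` flag.

```python
import re
import shutil
from pathlib import Path


def apply_substitution(path: Path, pattern: str, replacement: str,
                       dry_run: bool = True) -> int:
    """Apply a regex substitution line by line; return the number of changed lines."""
    original = path.read_text(encoding="utf-8")
    regex = re.compile(pattern)
    lines = original.split("\n")
    new_lines = [regex.sub(replacement, line) for line in lines]
    changed = sum(old != new for old, new in zip(lines, new_lines))
    if dry_run or changed == 0:
        return changed  # nothing is written in a dry run
    # Save the pre-change state next to the file, mirroring sed -i.backup.
    shutil.copy2(path, path.with_name(path.name + ".backup"))
    path.write_text("\n".join(new_lines), encoding="utf-8")
    return changed
```

Calling it first with `dry_run=True` gives the number of lines that would change. Once the change is applied, a repeat dry run returning 0 confirms that no matches remain.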

Essential Commands for Each Phase

# Review document structure and patterns
python scripts/extract_samples.py document.md

# Test a substitution pattern  
sed 's/pattern/replacement/' document.md | head -20  # Preview changes

# Apply the pattern (creates automatic backup)
sed -i.backup 's/pattern/replacement/' document.md

# Verify the change worked as expected
python scripts/extract_samples.py document.md

Performance Notes

โš ๏ธ Use --no-ocr flag unless processing scanned documents:

  • Without OCR: 5-10 minutes for a 180-page PDF ✅
  • With OCR: 60+ minutes for the same PDF ❌

See performance-guide.md for optimization details.
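
As a rough worked calculation, the figures above correspond to about 1.7-3.3 seconds per page without OCR and at least 20 seconds per page with it. The sketch below reproduces that arithmetic; the `estimate_minutes` helper assumes conversion time scales linearly with page count, which the source does not confirm.

```python
REFERENCE_PAGES = 180  # page count of the PDF quoted in the figures above


def seconds_per_page(total_minutes: float, pages: int = REFERENCE_PAGES) -> float:
    """Convert a total conversion time in minutes to seconds per page."""
    return total_minutes * 60 / pages


def estimate_minutes(pages: int, reference_minutes: float) -> float:
    """Scale a reference timing to another page count (assumes linear scaling)."""
    return reference_minutes * pages / REFERENCE_PAGES


print(round(seconds_per_page(5), 1), round(seconds_per_page(10), 1))  # 1.7 3.3
print(round(seconds_per_page(60), 1))  # 20.0
```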

Core Scripts

convert_full.py

python scripts/convert_full.py <pdf_source> [--no-ocr] [-o output.md]

Supports both local files and URLs. Examples:

python scripts/convert_full.py document.pdf --no-ocr
python scripts/convert_full.py https://example.com/doc.pdf --no-ocr  
python scripts/convert_full.py https://example.com/doc.pdf -o custom.md --no-ocr

extract_samples.py

python scripts/extract_samples.py <markdown_file> [--min-repeats N]

Shows document structure and repeated patterns for cleanup planning.
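
One simple way to surface repeated patterns, such as running headers and footers, is to count identical lines. The sketch below illustrates that idea only: it is not the logic of extract_samples.py, and the function name `repeated_lines` and the default threshold of 3 are assumptions.

```python
from collections import Counter


def repeated_lines(text: str, min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (line, count) pairs for non-blank lines seen at least `min_repeats` times."""
    counts = Counter(line.strip() for line in text.split("\n") if line.strip())
    # most_common() sorts by descending count; ties keep first-seen order.
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]
```

Lines that recur once per page are typical candidates for the headers/footers phase, while frequent short lines may be legitimate content and need a manual look.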

Cleanup with Native Sed

Use native sed for regex substitutions (faster and more reliable than Python wrappers):

Preview changes:

sed 's/old/new/g' document.md | head -20  # See first 20 lines of output

Apply changes (with automatic backup):

sed -i.backup 's/old/new/g' document.md

Pattern Development Strategy:

  1. Use extract_samples.py to identify issues
  2. Develop patterns specific to your document
  3. Test with preview (sed 's/pattern/replacement/' file)
  4. Apply with backup (sed -i.backup 's/pattern/replacement/' file)
  5. Verify with extract_samples.py
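
When the first 20 lines of sed output contain none of the affected lines, a diff-style preview is easier to read. The following sketch uses Python's standard `difflib` to show only the lines a pattern would change. The helper name `preview_substitution` is hypothetical, and the pattern uses Python `re` syntax, which differs from sed's in places such as group escaping.

```python
import difflib
import re


def preview_substitution(text: str, pattern: str, replacement: str,
                         max_lines: int = 20) -> list[str]:
    """Return a unified diff (no context lines) of what the substitution would change."""
    before = text.split("\n")
    after = [re.sub(pattern, replacement, line) for line in before]
    diff = difflib.unified_diff(before, after, "before", "after", n=0, lineterm="")
    return list(diff)[:max_lines]
```

For example, `preview_substitution(text, r"^# (\d+) ", r"## \1 ")` lists each numbered top-level heading next to its demoted form and leaves the text untouched.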

See phase-specific documentation for detailed patterns and examples.

analyze_split_points.py / split_markdown.py

python scripts/analyze_split_points.py <markdown_file>
python scripts/split_markdown.py <markdown_file> --heading-level 2 [--dry-run]

Detailed Guides
