pdf-to-md - SKILL.md Agent Skill

name: pdf-to-md description: Use this skill when the user wants to convert PDF files to Markdown using Docling. This includes converting single PDFs, batch-converting directories of PDFs, extracting structured content (headings, tables, lists) from academic papers, converting scanned documents with OCR, or configuring Docling pipeline options for table detection and image extraction.

PDF to Markdown Conversion (Docling)

Overview

Convert PDF documents to structured Markdown using the Docling library. Docling uses ML models to understand document layout, preserving headings, tables, lists, and reading order. For advanced pipeline configuration, OCR settings, and performance tuning, see reference.md.

Quick Start

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()

with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown)

Install: pip install docling

Docling Library

Basic Conversion

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")

# Export as Markdown
md = result.document.export_to_markdown()

# Page count
print(f"{len(result.document.pages)} pages")

Batch Conversion

from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
output_dir = Path("markdown_outputs")
output_dir.mkdir(exist_ok=True)

for pdf in Path("papers/").glob("*.pdf"):
    result = converter.convert(str(pdf))
    md = result.document.export_to_markdown()
    (output_dir / f"{pdf.stem}.md").write_text(md, encoding="utf-8")
    print(f"Converted {pdf.name}")

Custom Output Path

from pathlib import Path
from docling.document_converter import DocumentConverter

def convert_pdf(pdf_path, output_path=None):
    pdf_path = Path(pdf_path)
    if output_path is None:
        output_path = pdf_path.with_suffix(".md")

    converter = DocumentConverter()
    result = converter.convert(str(pdf_path))
    Path(output_path).write_text(
        result.document.export_to_markdown(), encoding="utf-8"
    )
    return output_path

Pipeline Configuration

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()

# Use accurate table detection (slower but better for complex tables)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("report.pdf")

Image Extraction

from pathlib import Path
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("paper.pdf")
image_dir = Path("images")
image_dir.mkdir(exist_ok=True)

for element, _level in result.document.iterate_items():
    if hasattr(element, 'image') and element.image is not None:
        img_path = image_dir / f"{element.self_ref}.png"
        element.image.pil_image.save(str(img_path))

OCR for Scanned Documents

from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("scanned_document.pdf")
md = result.document.export_to_markdown()

Command-Line Usage

Basic Script

# Single file
python scripts/pdf_to_markdown.py report.pdf

# Single file with custom output
python scripts/pdf_to_markdown.py report.pdf output.md

# Batch convert a directory
python scripts/pdf_to_markdown.py ./papers/

# Batch with custom output directory
python scripts/pdf_to_markdown.py ./papers/ ./converted/

Advanced Script

# With image extraction
python scripts/pdf_to_markdown_advanced.py paper.pdf paper.md --with-images ./images

# With OCR for scanned documents
python scripts/pdf_to_markdown_advanced.py scanned.pdf output.md --ocr

# With accurate table detection
python scripts/pdf_to_markdown_advanced.py report.pdf report.md --accurate-tables

# Combine options
python scripts/pdf_to_markdown_advanced.py paper.pdf paper.md --ocr --accurate-tables --with-images ./img

Common Tasks

Academic Paper Conversion

Academic papers with sections, references, figures, and tables:

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("paper.pdf")
md = result.document.export_to_markdown()

Lecture Slides

Slides often have sparse text with images — enable image extraction:

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("slides.pdf")

Table-Heavy Documents

For documents where table accuracy is critical:

pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Convert and check tables
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("data_report.pdf")
md = result.document.export_to_markdown()

Scanned PDFs

For scanned documents or image-based PDFs:

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("scanned.pdf")

Quick Reference

Task	Code
Basic conversion	`DocumentConverter().convert("file.pdf")`
Export to Markdown	`result.document.export_to_markdown()`
Page count	`len(result.document.pages)`
Batch convert	Loop over `Path("dir").glob("*.pdf")`
Accurate tables	`TableFormerMode.ACCURATE` in pipeline options
Enable OCR	`pipeline_options.do_ocr = True`
Extract images	`pipeline_options.generate_picture_images = True`
CLI single file	`python scripts/pdf_to_markdown.py file.pdf`
CLI batch	`python scripts/pdf_to_markdown.py ./dir/`
CLI advanced	`python scripts/pdf_to_markdown_advanced.py in.pdf out.md --ocr`

Error Handling

Error	Cause	Solution
`ModuleNotFoundError: docling`	Docling not installed	`pip install docling`
File not found	Invalid path	Check file path exists
Invalid PDF	Corrupted or non-PDF file	Verify file is a valid PDF
Write permission error	Cannot write output	Use a different output directory
Slow first conversion	Model download on first use	Wait for ~500 MB model download; models are cached after
GPU not detected	CUDA/PyTorch not configured	Falls back to CPU automatically; install `torch` with CUDA for GPU
`UnicodeEncodeError`	Non-UTF-8 characters in output	Ensure `encoding='utf-8'` when writing files
`MemoryError`	PDF too large or complex	Process fewer pages at a time; close other applications
Table detection issues	Complex or borderless tables	Use `TableFormerMode.ACCURATE`
Poor OCR quality	Low-resolution scan	Use higher-DPI source; try EasyOCR backend (see reference.md)
Windows symlink warnings	HuggingFace cache issue	Set `HF_HUB_DISABLE_SYMLINKS_WARNING=1` (scripts do this automatically)

Next Steps

For advanced pipeline configuration, OCR backends, export formats, chunking for RAG, and performance tuning, see reference.md
For general PDF operations (merge, split, create, fill forms), see the pdf skill