pdf-to-md

Use this skill when the user wants to convert PDF files to Markdown using Docling. This includes converting single PDFs, batch-converting directories of PDFs, extracting structured content (headings, tables, lists) from academic papers, converting scanned documents with OCR, or configuring Docling pipeline options for table detection and image extraction.

By Bauhaus-InfAU. Updated 3/9/2026

PDF to Markdown Conversion (Docling)

Overview

Convert PDF documents to structured Markdown using the Docling library. Docling uses ML models to understand document layout, preserving headings, tables, lists, and reading order. For advanced pipeline configuration, OCR settings, and performance tuning, see reference.md.

Quick Start

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()

with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown)

Install: pip install docling

Docling Library

Basic Conversion

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")

# Export as Markdown
md = result.document.export_to_markdown()

# Page count
print(f"{len(result.document.pages)} pages")

Batch Conversion

from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
output_dir = Path("markdown_outputs")
output_dir.mkdir(exist_ok=True)

for pdf in Path("papers/").glob("*.pdf"):
    result = converter.convert(str(pdf))
    md = result.document.export_to_markdown()
    (output_dir / f"{pdf.stem}.md").write_text(md, encoding="utf-8")
    print(f"Converted {pdf.name}")
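The loop above stops at the first PDF that fails to convert. A minimal sketch of a more forgiving batch loop follows, assuming the conversion step is passed in as a callable. The names `batch_convert` and `to_markdown` are hypothetical helpers for illustration and are not part of Docling.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def batch_convert(
    input_dir: Path,
    output_dir: Path,
    to_markdown: Callable[[Path], str],
) -> tuple[list[Path], list[tuple[Path, str]]]:
    """Convert every PDF in input_dir; return (written files, failures)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    failed: list[tuple[Path, str]] = []

    # sorted() keeps the processing order stable between runs
    for pdf in sorted(input_dir.glob("*.pdf")):
        try:
            md = to_markdown(pdf)
        except Exception as exc:
            # Record the failure and continue with the remaining files
            failed.append((pdf, str(exc)))
            continue
        target = output_dir / f"{pdf.stem}.md"
        target.write_text(md, encoding="utf-8")
        written.append(target)
    return written, failed
```

With Docling, `to_markdown` could be something like `lambda p: converter.convert(str(p)).document.export_to_markdown()`, reusing one `DocumentConverter` instance for the whole batch as the loop above does.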

Custom Output Path

from pathlib import Path
from docling.document_converter import DocumentConverter

def convert_pdf(pdf_path, output_path=None):
    pdf_path = Path(pdf_path)
    if output_path is None:
        output_path = pdf_path.with_suffix(".md")

    converter = DocumentConverter()
    result = converter.convert(str(pdf_path))
    Path(output_path).write_text(
        result.document.export_to_markdown(), encoding="utf-8"
    )
    return output_path

Pipeline Configuration

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()

# Use accurate table detection (slower but better for complex tables)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("report.pdf")

Image Extraction

from pathlib import Path
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("paper.pdf")
image_dir = Path("images")
image_dir.mkdir(exist_ok=True)

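for index, (element, _level) in enumerate(result.document.iterate_items()):
    if hasattr(element, 'image') and element.image is not None:
        # self_ref is a path-like reference (e.g. "#/pictures/0"), so it is
        # not safe to use as a file name; number the files instead
        img_path = image_dir / f"image_{index}.png"
        element.image.pil_image.save(str(img_path))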

OCR for Scanned Documents

from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
# Set explicitly; recent Docling releases may already enable OCR by default
pipeline_options.do_ocr = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("scanned_document.pdf")
md = result.document.export_to_markdown()

Command-Line Usage

Basic Script

# Single file
python scripts/pdf_to_markdown.py report.pdf

# Single file with custom output
python scripts/pdf_to_markdown.py report.pdf output.md

# Batch convert a directory
python scripts/pdf_to_markdown.py ./papers/

# Batch with custom output directory
python scripts/pdf_to_markdown.py ./papers/ ./converted/
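The script itself is not reproduced in this document. As a rough sketch of how the four invocations above could map inputs to outputs, the hypothetical helper `plan_outputs` below pairs each PDF with a Markdown target. Writing batch output next to the source PDFs when no output directory is given is an assumption, not confirmed behavior of `pdf_to_markdown.py`.

```python
from __future__ import annotations

from pathlib import Path


def plan_outputs(source: Path, dest: Path | None = None) -> list[tuple[Path, Path]]:
    """Map each input PDF to the Markdown file it should produce."""
    if source.is_dir():
        # Batch mode: dest is treated as an output directory
        # (assumption: default to the source directory itself)
        out_dir = dest if dest is not None else source
        return [
            (pdf, out_dir / f"{pdf.stem}.md")
            for pdf in sorted(source.glob("*.pdf"))
        ]
    # Single-file mode: dest is a file path; default to the same name with .md
    return [(source, dest if dest is not None else source.with_suffix(".md"))]
```

For example, `plan_outputs(Path("report.pdf"))` yields a single pair ending in `report.md`, while a directory argument yields one pair per PDF found in it.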

Advanced Script

# With image extraction
python scripts/pdf_to_markdown_advanced.py paper.pdf paper.md --with-images ./images

# With OCR for scanned documents
python scripts/pdf_to_markdown_advanced.py scanned.pdf output.md --ocr

# With accurate table detection
python scripts/pdf_to_markdown_advanced.py report.pdf report.md --accurate-tables

# Combine options
python scripts/pdf_to_markdown_advanced.py paper.pdf paper.md --ocr --accurate-tables --with-images ./img
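The flags above could be parsed along the following lines. This is a hypothetical reconstruction using `argparse`, not the actual contents of `pdf_to_markdown_advanced.py`. In the settings dictionary, `do_ocr` and `generate_picture_images` are the option names used in this document, and `table_mode` is an illustrative key standing in for `table_structure_options.mode`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the flags shown above."""
    parser = argparse.ArgumentParser(description="Convert a PDF to Markdown")
    parser.add_argument("input", help="PDF file to convert")
    parser.add_argument("output", help="Markdown file to write")
    parser.add_argument("--ocr", action="store_true", help="enable OCR")
    parser.add_argument(
        "--accurate-tables",
        action="store_true",
        help="use the accurate table mode",
    )
    parser.add_argument(
        "--with-images",
        metavar="DIR",
        default=None,
        help="extract images into DIR",
    )
    return parser


def pipeline_settings(args: argparse.Namespace) -> dict:
    """Collect only the options the user asked for; defaults stay untouched."""
    settings = {}
    if args.ocr:
        settings["do_ocr"] = True
    if args.accurate_tables:
        settings["table_mode"] = "ACCURATE"  # hypothetical key name
    if args.with_images is not None:
        settings["generate_picture_images"] = True
    return settings
```

Leaving unrequested options out of the dictionary means the library's own defaults apply, which avoids hard-coding assumptions about what those defaults are.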

Common Tasks

Academic Paper Conversion

For academic papers with sections, references, figures, and tables, use the accurate table mode:

from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("paper.pdf")
md = result.document.export_to_markdown()

Lecture Slides

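Slides often combine sparse text with images, so enable image extraction:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("slides.pdf")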

Table-Heavy Documents

For documents where table accuracy is critical:

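from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Convert, then report how many tables were detected
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("data_report.pdf")
print(f"{len(result.document.tables)} tables detected")
md = result.document.export_to_markdown()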
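One way to sanity-check table output is to inspect the exported Markdown itself. The sketch below assumes tables are exported as GitHub-style pipe tables (a header row followed by a `|---|---|` delimiter row); `summarize_tables` is a hypothetical helper and not part of Docling.

```python
import re

# A pipe-table delimiter row such as "|---|:---:|" or "| --- | --- |"
_DELIMITER = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def summarize_tables(markdown: str) -> list[tuple[int, int]]:
    """Return (columns, body_rows) for each pipe table found in the text."""
    lines = markdown.splitlines()
    tables = []
    i = 1
    while i < len(lines):
        is_delimiter = "|" in lines[i] and _DELIMITER.match(lines[i]) is not None
        # A table is a delimiter row directly below a header row
        if is_delimiter and "|" in lines[i - 1]:
            columns = len(lines[i].strip().strip("|").split("|"))
            end = i + 1
            # Body rows continue until the first line without a pipe
            while end < len(lines) and "|" in lines[end]:
                end += 1
            tables.append((columns, end - i - 1))
            i = end + 1
        else:
            i += 1
    return tables
```

Calling `summarize_tables(md)` on the converted output gives a quick view of how many tables survived and their shapes. A table with zero body rows or an unexpected column count is a hint to compare against the source PDF.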

Scanned PDFs

For scanned documents or image-based PDFs:

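from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("scanned.pdf")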

Quick Reference

| Task | Code |
| --- | --- |
| Basic conversion | `DocumentConverter().convert("file.pdf")` |
| Export to Markdown | `result.document.export_to_markdown()` |
| Page count | `len(result.document.pages)` |
| Batch convert | Loop over `Path("dir").glob("*.pdf")` |
| Accurate tables | `TableFormerMode.ACCURATE` in pipeline options |
| Enable OCR | `pipeline_options.do_ocr = True` |
| Extract images | `pipeline_options.generate_picture_images = True` |
| CLI single file | `python scripts/pdf_to_markdown.py file.pdf` |
| CLI batch | `python scripts/pdf_to_markdown.py ./dir/` |
| CLI advanced | `python scripts/pdf_to_markdown_advanced.py in.pdf out.md --ocr` |

Error Handling

| Error | Cause | Solution |
| --- | --- | --- |
| ModuleNotFoundError: docling | Docling not installed | `pip install docling` |
| File not found | Invalid path | Check that the file path exists |
| Invalid PDF | Corrupted or non-PDF file | Verify the file is a valid PDF |
| Write permission error | Cannot write output | Use a different output directory |
| Slow first conversion | Model download on first use | Wait for the ~500 MB model download; models are cached afterwards |
| GPU not detected | CUDA/PyTorch not configured | Falls back to CPU automatically; install torch with CUDA for GPU |
| UnicodeEncodeError | Output file opened with a non-UTF-8 default encoding | Pass `encoding='utf-8'` when writing files |
| MemoryError | PDF too large or complex | Process fewer pages at a time; close other applications |
| Table detection issues | Complex or borderless tables | Use `TableFormerMode.ACCURATE` |
| Poor OCR quality | Low-resolution scan | Use a higher-DPI source; try the EasyOCR backend (see reference.md) |
| Windows symlink warnings | HuggingFace cache issue | Set `HF_HUB_DISABLE_SYMLINKS_WARNING=1` (scripts do this automatically) |
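Several of the errors above can be caught before any model is loaded. The following is a minimal pre-flight sketch; `preflight` is a hypothetical helper, and the header test relies on the convention that a PDF carries the `%PDF-` marker near the start of the file (checked here within the first 1024 bytes).

```python
from __future__ import annotations

import os
from pathlib import Path

# Silence the Windows symlink warning from the HuggingFace cache;
# this only takes effect if set before the model libraries are imported
os.environ.setdefault("HF_HUB_DISABLE_SYMLINKS_WARNING", "1")


def preflight(pdf_path: Path, output_path: Path) -> list[str]:
    """Return a list of problems found before conversion (empty means OK)."""
    problems: list[str] = []

    if not pdf_path.is_file():
        problems.append(f"File not found: {pdf_path}")
    else:
        # Read only the start of the file and look for the PDF marker
        with pdf_path.open("rb") as f:
            header = f.read(1024)
        if b"%PDF-" not in header:
            problems.append(f"Invalid PDF (no %PDF- header): {pdf_path}")

    out_dir = output_path.parent
    if out_dir.exists() and not os.access(out_dir, os.W_OK):
        problems.append(f"Cannot write to output directory: {out_dir}")

    return problems
```

A caller might print the returned messages and skip the conversion when the list is non-empty, which keeps cheap path and format mistakes from surfacing only after the slow model start-up.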

Next Steps

  • For advanced pipeline configuration, OCR backends, export formats, chunking for RAG, and performance tuning, see reference.md
  • For general PDF operations (merge, split, create, fill forms), see the pdf skill
Install via CLI
npx skills add https://github.com/Bauhaus-InfAU/infau-skill-base --skill pdf-to-md