name: docling description: Convert documents (PDF, DOCX, PPTX, XLSX, HTML, images, audio) to structured markdown, HTML, or JSON using Docling. Use when ingesting external documents, research papers, specs, or any non-markdown source into the project knowledge base. Use for document-to-markdown conversion before QMD indexing. category: 04-tooling layer: tooling requires: python: ">=3.10" related_skills: - "@qmd-search" - "@context-discovery"
Docling Skill
Docling converts diverse document formats into structured, AI-friendly representations. Use it to ingest external docs into the project knowledge base.
When to Use
- Converting PDFs (research papers, specs, datasheets) to markdown
- Ingesting DOCX/PPTX/XLSX files from stakeholders
- Extracting text and tables from scanned documents (OCR)
- Processing images containing text or diagrams
- Transcribing audio files (WAV, MP3) via ASR
- Preparing documents for QMD indexing (
@qmd-search) - Batch converting a folder of mixed-format documents
Supported Formats
| Format | Notes |
|---|---|
| Page layout, reading order, tables, formulas, image classification | |
| DOCX | Word documents |
| PPTX | PowerPoint presentations |
| XLSX | Excel spreadsheets |
| HTML | Web pages |
| Images | PNG, TIFF, JPEG — OCR extraction |
| Audio | WAV, MP3 — ASR transcription |
| LaTeX | .tex files |
| Plain text | .txt, .qmd, .Rmd |
| WebVTT | Subtitles/captions |
| USPTO | Patent documents |
| JATS | Academic articles |
| XBRL | Financial reports |
CLI Usage
Basic Conversion
# Convert a single file (outputs markdown by default)
docling document.pdf
# Convert from URL
docling https://arxiv.org/pdf/2408.09869
# Specify output directory
docling document.pdf --output docs/converted/
# Convert to specific format
docling document.pdf --to md # Markdown (default)
docling document.pdf --to json # Lossless JSON
docling document.pdf --to html # HTML
docling document.pdf --to doctags # DocTags format
VLM Pipeline (Better Quality for Complex PDFs)
# Uses GraniteDocling visual language model
docling --pipeline vlm --vlm-model granite_docling document.pdf
Batch Conversion
# Convert all PDFs in a directory
docling ./research-papers/
# Convert mixed formats
docling ./incoming-docs/
Python API
Basic Conversion
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to markdown
md = result.document.export_to_markdown()
# Export to other formats
html = result.document.export_to_html()
json_doc = result.document.export_to_dict() # lossless JSON
Batch Conversion
from docling.document_converter import DocumentConverter
from pathlib import Path
converter = DocumentConverter()
sources = list(Path("./research-papers").glob("*.pdf"))
for source in sources:
result = converter.convert(str(source))
output = Path("docs/converted") / f"{source.stem}.md"
output.write_text(result.document.export_to_markdown())
Advanced: Custom Pipeline
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
converter = DocumentConverter(
format_options={
"pdf": PdfFormatOption(pipeline_cls=StandardPdfPipeline)
}
)
result = converter.convert("document.pdf")
Workflow: Ingest External Docs into QMD
Standard pipeline for adding external documents to the searchable knowledge base:
# 1. Convert documents to markdown
docling ./incoming/ --output docs/external/
# 2. Add to QMD collection (if new directory)
qmd collection add docs/external --name external --mask "**/*.md"
qmd context add qmd://external "Converted external documents: research papers, specs, reference material"
# 3. Index and embed
qmd update
qmd embed
# 4. Search
qmd query "relevant topic" -c external
Workflow: Process Research for RFC Development
# Convert reference papers
docling https://arxiv.org/pdf/XXXX.XXXXX --output docs/research/
# Convert game design docs from stakeholders
docling game-design-spec.docx --output docs/design/
# Make searchable
qmd update && qmd embed
# Find relevant sections for RFC writing
qmd query "procedural dungeon generation algorithms" -c docs
MCP Server
Docling provides an MCP server for agentic applications:
pip install docling[mcp]
# Then configure as MCP server in settings
Workflow: Large PDF Ingestion (Chunked)
Large PDFs (100+ pages, image-heavy ebooks) will hit OCR memory limits if processed in one pass. Use chunked conversion to handle them reliably.
Why This Matters
The standard pipeline processes all pages at once. RapidOCR allocates large float32 arrays per page — a 120-page ebook with full-bleed images will OOM around page 100+. Chunking by page range avoids this.
Chunked Conversion Script
from docling.document_converter import DocumentConverter
from pathlib import Path
import math
def convert_large_pdf(pdf_path: str, output_dir: str, chunk_size: int = 20):
"""Convert a large PDF in page-range chunks to avoid OOM."""
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument(pdf_path)
total_pages = len(pdf)
pdf.close()
converter = DocumentConverter()
output = Path(output_dir)
output.mkdir(parents=True, exist_ok=True)
stem = Path(pdf_path).stem
all_md = []
num_chunks = math.ceil(total_pages / chunk_size)
for chunk_idx in range(num_chunks):
start = chunk_idx * chunk_size + 1
end = min((chunk_idx + 1) * chunk_size, total_pages)
print(f"Converting pages {start}-{end} of {total_pages}...")
try:
result = converter.convert(
pdf_path,
page_range=(start, end)
)
md = result.document.export_to_markdown()
all_md.append(f"<!-- pages {start}-{end} -->\n{md}")
except Exception as e:
print(f" Chunk {start}-{end} failed: {e}")
# Retry with smaller chunks
for page in range(start, end + 1):
try:
result = converter.convert(pdf_path, page_range=(page, page))
md = result.document.export_to_markdown()
all_md.append(f"<!-- page {page} -->\n{md}")
except Exception as e2:
all_md.append(f"<!-- page {page} FAILED: {e2} -->")
combined = "\n\n".join(all_md)
(output / f"{stem}.md").write_text(combined, encoding="utf-8")
print(f"Done. {total_pages} pages -> {output / stem}.md")
return combined
Usage
# For ebooks, game dev guides, large spec documents
convert_large_pdf(
"docs/assets/hud/2022_2DGameArt_EBook.pdf",
"docs/converted/",
chunk_size=15 # Smaller chunks for image-heavy PDFs
)
Quality Validation
After conversion, check quality before indexing:
def validate_conversion(md_path: str) -> dict:
"""Check conversion quality — detect OCR failures and sparse pages."""
text = Path(md_path).read_text(encoding="utf-8")
lines = text.split("\n")
total_lines = len(lines)
image_markers = sum(1 for l in lines if "<!-- image -->" in l)
failed_pages = sum(1 for l in lines if "FAILED:" in l)
empty_sections = sum(1 for l in lines if l.strip() == "")
quality = {
"total_lines": total_lines,
"image_markers": image_markers,
"failed_pages": failed_pages,
"image_density": image_markers / max(1, total_lines),
"text_density": (total_lines - image_markers - empty_sections) / max(1, total_lines),
}
if quality["image_density"] > 0.3:
quality["recommendation"] = "Image-heavy — consider VLM pipeline for better extraction"
elif failed_pages > 0:
quality["recommendation"] = f"{failed_pages} pages failed — retry those with VLM or smaller chunks"
else:
quality["recommendation"] = "Good quality — ready for QMD indexing"
return quality
Decision: Standard vs VLM Pipeline
| PDF Type | Pipeline | Chunk Size |
|---|---|---|
| Text-heavy (specs, research) | Standard | 30-50 pages |
| Mixed text+images (ebooks) | Standard first, VLM retry on failures | 15-20 pages |
| Image-heavy (art guides, slide decks) | VLM | 5-10 pages |
| Scanned documents | VLM | 10-15 pages |
Full Ingest Pipeline (Large PDF → Searchable)
# 1. Chunked conversion
python -c "
from docling_chunked import convert_large_pdf
convert_large_pdf('docs/assets/book.pdf', 'docs/converted/', chunk_size=15)
"
# 2. Validate
python -c "
from docling_validate import validate_conversion
print(validate_conversion('docs/converted/book.md'))
"
# 3. Index into QMD
qmd collection add docs/converted --name reference-books --mask "**/*.md"
qmd context add qmd://reference-books "Converted reference books and guides"
qmd update && qmd embed
# 4. Search
qmd query "2D lighting normal maps" -c reference-books
Important Notes
- First run downloads ML models (~1-2GB) for layout detection and OCR
- Heron is the default layout model — fast and accurate for most PDFs
- VLM pipeline (
--pipeline vlm) gives better results for complex layouts but is slower - Local processing — no cloud dependencies, safe for sensitive documents
- Python 3.10+ required (3.9 support dropped in v2.70.0)
- Already installed: Docling v2.77.0 at
C:\Users\User\AppData\Roaming\Python\Python313\site-packages - Large PDFs (100+ pages): Always use chunked conversion to avoid OOM — see workflow above
- Image-heavy PDFs: Use smaller chunk sizes (5-15 pages) and consider VLM pipeline
Checklist
- Use
doclingCLI for one-off conversions - Use Python API for batch/automated pipelines
- Use chunked conversion for PDFs over 50 pages
- Validate conversion quality before indexing
- Retry failed pages with VLM pipeline or smaller chunks
- Always output to
docs/subdirectory to keep project organized - Run
qmd update && qmd embedafter adding new converted docs - Prefer VLM pipeline for complex PDFs with mixed layouts
- Check conversion quality — review markdown output before indexing