name: glaw-opendataloader-pdf
description: "Extract structured data from PDFs with OpenDataLoader PDF (Apache-2.0, #1 benchmark). Converts PDF → Markdown / JSON (with bounding boxes) / HTML / Tagged-PDF, locally, no cloud. Use for: 'convert PDF to markdown', 'PDF to JSON', 'extract tables from PDF', 'parse PDF for RAG', 'OCR a scanned PDF', 'extract formulas/LaTeX from PDF', 'describe charts in a PDF', 'auto-tag / make PDF accessible', 'hybrid mode PDF', 'opendataloader'. Handles the openjdk PATH + the Apple-Silicon --device cpu requirement automatically."
allowed-tools:
  - Bash
  - Read
OpenDataLoader PDF — extraction skill
Local PDF parser already installed on this machine (CLI via uv tool, JDK via openjdk@21).
Two modes — fast (deterministic, local Java, no server) and hybrid (routes complex pages to a docling AI backend for far better tables/OCR/formulas).
Non-negotiable environment rules
- `java` must be on PATH. `openjdk@21` is keg-only, so prefix every command:
  `export PATH="/opt/homebrew/opt/openjdk@21/bin:$PATH"`
  (The macOS `/usr/bin/java` stub will NOT find it.)
- Hybrid server MUST run with `--device cpu` on this Apple-Silicon Mac. The default `--device auto` selects MPS and crashes the docling layout model with `Cannot convert a MPS Tensor to float64`. CPU works cleanly and uses all cores.
- Batch all files into ONE invocation. Each `glaw-opendataloader-pdf` call spawns a fresh JVM, so pass multiple files/folders at once; don't loop per-file.
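For scripted use, the PATH rule and the batching rule can be sketched in Python. This is a minimal sketch, not part of the tool: `jdk_env` and `batched_command` are hypothetical helper names, and only the JDK path, the `opendataloader-pdf` executable, and the `-o` / `-f` flags come from this document.

```python
import os

# Keg-only JDK location named in the environment rules (assumes Homebrew on Apple Silicon).
JDK_BIN = "/opt/homebrew/opt/openjdk@21/bin"


def jdk_env(base=None):
    """Return a copy of the environment with the JDK bin directory first on PATH."""
    env = dict(os.environ if base is None else base)
    old_path = env.get("PATH", "")
    env["PATH"] = JDK_BIN + (os.pathsep + old_path if old_path else "")
    return env


def batched_command(inputs, output_dir, formats=("markdown", "json")):
    """Build ONE invocation covering every input, so only one JVM is started."""
    if not inputs:
        raise ValueError("pass at least one file or folder")
    return ["opendataloader-pdf", *inputs, "-o", output_dir, "-f", ",".join(formats)]
```

The resulting list can be handed to `subprocess.run(cmd, env=jdk_env(), check=True)`; building the argument list once, rather than calling the CLI inside a loop, is what keeps the run to a single JVM start.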
Decision: which mode?
| Document | Mode | Why |
|---|---|---|
| Standard digital PDF, just need text/structure | Fast | 0.02s/page, no server needed |
| Complex / borderless / nested tables | Hybrid | +90% table accuracy |
| Scanned / image-only PDF | Hybrid + OCR | needs --force-ocr |
| Math formulas (→ LaTeX) | Hybrid + formula | --enrich-formula + --hybrid-mode full |
| Charts/images needing descriptions | Hybrid + picture | --enrich-picture-description + --hybrid-mode full |
| Make an untagged PDF accessible | Fast | -f tagged-pdf |
When unsure, start with Fast; escalate to Hybrid only if tables/scans/formulas look wrong.
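The decision table can be expressed as a small lookup. This is an illustrative sketch: `choose_mode` and its keyword arguments are hypothetical names, and the function only restates the rows above. It returns the flags as one flat list without saying which command takes each flag, and it does not cover the tagged-PDF row.

```python
def choose_mode(scanned=False, complex_tables=False, formulas=False,
                picture_descriptions=False):
    """Return (mode, flags) for a document, following the decision table."""
    flags = []
    if scanned:
        flags.append("--force-ocr")
    if formulas:
        flags.append("--enrich-formula")
    if picture_descriptions:
        flags.append("--enrich-picture-description")
    # Both enrichment features are listed together with --hybrid-mode full.
    if formulas or picture_descriptions:
        flags += ["--hybrid-mode", "full"]
    # Any trait from the Hybrid rows escalates; otherwise stay on Fast.
    mode = "hybrid" if (flags or complex_tables) else "fast"
    return mode, flags
```

Calling `choose_mode()` with no arguments yields `("fast", [])`, which matches the advice to start with Fast and escalate only when the output looks wrong.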
Fast mode (no server)
export PATH="/opt/homebrew/opt/openjdk@21/bin:$PATH"
opendataloader-pdf file1.pdf file2.pdf folder/ -o output/ -f markdown,json
- `-o` / `--output-dir`: output directory (NOT `--output-folder`)
- `-f` / `--format`: comma list: `json`, `text`, `markdown`, `html`, `tagged-pdf`
- Useful flags: `--sanitize` (redact emails/URLs/phones), `--use-struct-tree` (honor native PDF tags)
Hybrid mode (two steps)
Step 1 — start the backend (run in background; first launch downloads docling/OCR models, ~20–30s init):
export PATH="/opt/homebrew/opt/openjdk@21/bin:$PATH"
opendataloader-pdf-hybrid --port 5002 --device cpu
Wait until the log prints Application startup complete before converting. Add as needed:
- `--force-ocr` (scanned PDFs), and `--ocr-lang "ko,en"` / `ja` / `ch_sim` / `ch_tra` / `de` / `fr` / `ar`
- `--enrich-formula` (LaTeX) · `--enrich-picture-description` (chart/image alt-text)
Step 2 — convert (separate shell; batch all inputs):
export PATH="/opt/homebrew/opt/openjdk@21/bin:$PATH"
opendataloader-pdf --hybrid docling-fast file1.pdf folder/ -o output/ -f json,markdown
- Add `--hybrid-mode full` whenever the server has `--enrich-formula` or `--enrich-picture-description` on.
- When done, stop the server: `pkill -f hybrid_server`.
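Because the server flags and the client flags have to agree, it can help to derive both commands from the same options. The sketch below is an assumption-laden illustration: `hybrid_commands` is a hypothetical helper, and placing `--hybrid-mode full` directly after `--hybrid docling-fast` is a guess about argument order. The executables, the port, and the flags themselves are the ones shown in the two steps above.

```python
def hybrid_commands(inputs, output_dir, formats=("json", "markdown"),
                    force_ocr=False, ocr_lang=None,
                    enrich_formula=False, enrich_picture=False):
    """Return (server_cmd, client_cmd) argument lists with consistent flags."""
    # Step 1: backend, always pinned to CPU on Apple Silicon.
    server = ["opendataloader-pdf-hybrid", "--port", "5002", "--device", "cpu"]
    if force_ocr:
        server.append("--force-ocr")
    if ocr_lang:
        server += ["--ocr-lang", ocr_lang]
    if enrich_formula:
        server.append("--enrich-formula")
    if enrich_picture:
        server.append("--enrich-picture-description")

    # Step 2: one batched conversion call.
    client = ["opendataloader-pdf", "--hybrid", "docling-fast"]
    if enrich_formula or enrich_picture:
        # Enrichment on the server requires full hybrid mode on the client.
        client += ["--hybrid-mode", "full"]
    client += [*inputs, "-o", output_dir, "-f", ",".join(formats)]
    return server, client
```

Deriving both lists from one set of options makes it harder to start the server with an enrich flag and then forget `--hybrid-mode full` on the conversion call.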
How to drive this as the agent
- Start the server with `run_in_background: true`, then poll its log file for `Application startup complete` (don't fixed-sleep; init time varies).
- Reuse one running server for the whole batch; only restart to change OCR/enrich flags.
- If a hybrid run returns HTTP 500 `Cannot convert a MPS Tensor to float64`, the server was started without `--device cpu`; restart it correctly.
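Polling the log instead of sleeping for a fixed time might look like the sketch below. It assumes the server's output has been redirected to a log file; `wait_for_startup` and its parameters are hypothetical names, and the timeout and interval defaults are arbitrary.

```python
import time
from pathlib import Path

MARKER = "Application startup complete"  # readiness line named in this document


def wait_for_startup(log_path, timeout=120.0, interval=0.5, marker=MARKER):
    """Poll log_path until marker appears. Return True if seen, False on timeout."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout
    while True:
        # The log file may not exist yet right after the server is launched.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A `False` return is the signal to inspect the log rather than to start converting; the same loop could also be extended to look for the MPS error string.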
Output reference
JSON elements carry type (heading/paragraph/table/list/image/caption/formula), id,
page number, bounding box [left, bottom, right, top] in PDF points, and content —
ideal for RAG chunking + click-to-source citations. Markdown preserves heading hierarchy and
table structure for direct LLM context.
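For click-to-source highlighting, a bounding box usually has to be mapped from PDF points to viewer pixels. The sketch below assumes the standard PDF convention of an origin at the bottom-left of the page and a viewer with its origin at the top-left. `bbox_to_top_left` is a hypothetical helper, and the page height must come from elsewhere, since this document does not say where the JSON records it.

```python
def bbox_to_top_left(bbox, page_height, scale=1.0):
    """Convert [left, bottom, right, top] in PDF points to (x, y, width, height).

    The result uses a top-left origin. scale converts points to pixels,
    e.g. dpi / 72, because one PDF point is 1/72 inch.
    """
    left, bottom, right, top = bbox
    x = left * scale
    y = (page_height - top) * scale  # flip the vertical axis
    width = (right - left) * scale
    height = (top - bottom) * scale
    return (x, y, width, height)
```

For example, on a US Letter page (792 points tall) the box `[72, 700, 300, 720]` starts 72 points from the left edge and 72 points below the top edge.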
Also available
- Python: `import opendataloader_pdf; opendataloader_pdf.convert(input_path=[...], output_dir="out/", format="markdown,json", hybrid="docling-fast")`
- MCP server `glaw-opendataloader-pdf` (user scope): same engine via Model Context Protocol.
- LangChain loader: `langchain-opendataloader-pdf`.