---
name: kreuzberg
description: >-
  Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg CLI.
license: Elastic-2.0
metadata:
  author: kreuzberg-dev
  version: "1.0"
  repository: https://github.com/kreuzberg-dev/kreuzberg
---
Kreuzberg Document Extraction
Kreuzberg extracts text, tables, metadata, and images from 91+ file formats. Use it for document processing, OCR, batch extraction, structured LLM extraction, and embeddings.
Run kreuzberg --help to list all commands, and kreuzberg <command> --help for the full flag reference of a single command.
Core Usage
# Single file → stdout (text) or JSON
kreuzberg extract document.pdf
kreuzberg extract document.pdf --content-format markdown --format json
# Batch
kreuzberg batch *.pdf --content-format markdown
# Detect MIME type
kreuzberg detect unknown-file
LLM Structured Extraction
Extracts typed JSON from a document using a schema file and an LLM. If no API key is passed explicitly, the key falls back to the provider's environment variable (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.).
kreuzberg extract-structured invoice.pdf \
--schema schema.json \
--model openai/gpt-4o \
--strict
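The command above expects a schema file on disk. As a minimal sketch, the Python below writes one, assuming `schema.json` holds an ordinary JSON Schema object. The invoice fields (`invoice_number`, `total`, `currency`) and the helper `write_schema` are hypothetical and only illustrate the shape. Check `kreuzberg extract-structured --help` for the schema dialect the CLI actually accepts.

```python
import json
from pathlib import Path

# Hypothetical invoice fields. ASSUMPTION: the CLI accepts a plain
# JSON Schema object; this is not confirmed by the text above.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total"],
    "additionalProperties": False,
}


def write_schema(path="schema.json"):
    """Serialize INVOICE_SCHEMA to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(json.dumps(INVOICE_SCHEMA, indent=2), encoding="utf-8")
    return target
```

Calling `write_schema()` once produces the `schema.json` that `--schema` points at in the command above.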
Embeddings
Embeddings run either locally via ONNX (no API key needed) or through a hosted provider. The embed command is not included in the Homebrew build; use Docker or cargo install instead.
kreuzberg embed --text "hello world" --preset balanced
echo "some text" | kreuzberg embed --provider llm --model openai/text-embedding-3-small
Chunking
kreuzberg chunk --text "..." --chunk-size 500 --chunk-overlap 50
kreuzberg chunk --chunker-type markdown --text "# Heading\n\nParagraph..."
cat file.txt | kreuzberg chunk --chunker-type semantic --topic-threshold 0.8
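To make the interaction of `--chunk-size` and `--chunk-overlap` concrete, here is a minimal sliding-window sketch. It assumes both values are counted in characters and that each chunk starts `chunk_size - chunk_overlap` characters after the previous one. It illustrates the idea only and is not Kreuzberg's chunker; `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split `text` into fixed-size character windows that overlap.

    Illustration only: ASSUMES sizes are in characters, which may not
    match how the CLI measures them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With a size of 500 and an overlap of 50, the last 50 characters of one chunk reappear as the first 50 of the next, so context that straddles a boundary is kept in both chunks.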
Configuration
Only kreuzberg.toml is auto-discovered. YAML/JSON require --config <path>.
kreuzberg extract doc.pdf # finds kreuzberg.toml automatically
kreuzberg extract doc.pdf --config my.yaml # explicit for non-TOML
kreuzberg extract doc.pdf --config-json '{"ocr":{"language":"deu"}}'
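Hand-quoting inline JSON in a shell is easy to get wrong. The sketch below builds the `--config-json` argument with `json.dumps` and returns an argument list suitable for `subprocess.run`, so no shell quoting is involved. It assumes `kreuzberg` is on `PATH` and uses only the flags shown above; `build_extract_command` is a hypothetical helper.

```python
import json


def build_extract_command(path, overrides):
    """Return an argv list for `kreuzberg extract` with inline JSON config.

    `overrides` is a plain dict, e.g. {"ocr": {"language": "deu"}}.
    """
    config_json = json.dumps(overrides, separators=(",", ":"))
    return ["kreuzberg", "extract", path, "--config-json", config_json]
```

Passing the returned list to `subprocess.run(cmd, capture_output=True, text=True, check=True)` runs the extraction without going through a shell.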
Config file skeleton (field names are snake_case — max_chars not max_characters):
use_cache = true
enable_quality_processing = true
output_format = "markdown" # content format for file output
[ocr]
backend = "tesseract" # tesseract | paddle-ocr | easyocr
language = "eng" # ISO 639-3 for tesseract; short codes for paddle/easyocr
[chunking]
max_chars = 1000 # NOT max_characters
max_overlap = 200 # NOT overlap
[pdf_options]
extract_images = true
[server] # for `kreuzberg serve`
host = "127.0.0.1"
port = 8000
Extracting Images from PDFs
Images are not written to disk — they come back as byte arrays in JSON output. Two-step process required:
# Step 1: capture JSON
kreuzberg extract doc.pdf --pdf-extract-images true --format json > out.json
# Step 2: save images
python3 -c "
import json, pathlib
d = json.load(open('out.json'))
pathlib.Path('images').mkdir(exist_ok=True)
for img in d.get('images', []):
    pathlib.Path(f'images/image_{img[\"image_index\"]}.{img[\"format\"]}').write_bytes(bytes(img['data']))
"
Key Flags (Non-Obvious)
| Flag | Note |
|---|---|
| `--format` | Wire format for CLI output: text (default for extract), json, toon (token-efficient JSON) |
| `--content-format` | Format of extracted text: plain, markdown, djot, html. `--output-format` is a deprecated alias. |
| `--token-reduction` | off/light/moderate/aggressive/maximum; reduces tokens before LLM consumption |
| `--acceleration` | ONNX provider: auto, cpu, coreml (macOS), cuda, tensorrt |
| `--pdf-extract-images` | Embeds image bytes in JSON result (see above) |
Common Pitfalls
- `--format` ≠ `--content-format`: one controls the serialization envelope, the other the text inside it.
- Config auto-discovery only finds `kreuzberg.toml`, not YAML or JSON.
- PDF images land in `result.images[]` as byte arrays; nothing is written to disk automatically.
- `embed` is excluded from the Homebrew build; use Docker or `cargo install`.
- For large docs with `extract-structured`, use `--token-reduction` to stay within LLM context limits.
References
- Supported formats: grep `references/supported-formats.md` instead of reading it whole, e.g. `grep '.mdoc' references/supported-formats.md`