name: pdf-ingest
description: Ingest a PDF (or folder of PDFs) into the local vector database at ~/.wiki-chroma. Use when the user says "ingest this PDF", "embed this PDF", "add to vdb", "index this PDF", or hands over a PDF and wants the vdb-query skill to be able to search it.
PDF Ingest into Vector DB
Writer-side counterpart to the vdb-query skill. Drives vdb-watcher (which wraps wiki_watcher.py) to convert a PDF to markdown via Docling, chunk it, and upsert into Chroma collection pdf_chunks at ~/.wiki-chroma.
No background daemon on this Mac
wiki_watcher.py supports a long-running watcher mode (start / run), but we don't use it here. This skill is one-shot only: each invocation runs ingest <path> or backfill <dir> synchronously in the foreground, then exits. There is no SessionStart hook, no heartbeat file, and no PID file.
Never run these subcommands unless you want the daemon:
- vdb-watcher start: would launch a detached worker.
- vdb-watcher run: would enter the long-running scan loop.
- vdb-watcher heartbeat <cwd>: would register a cwd for the daemon to watch.
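A wrapper script can enforce the rule above mechanically. The following is a minimal sketch, not part of vdb-watcher itself: the function name `check_subcommand` is hypothetical, and the allowed set is simply the four subcommands listed in the CLI section of this document.

```python
# Subcommands this skill uses (taken from the CLI section of this document).
ONE_SHOT = {"ingest", "backfill", "progress", "status"}
# Subcommands that would start or feed the long-running daemon.
DAEMON = {"start", "run", "heartbeat"}


def check_subcommand(subcommand: str) -> None:
    """Raise ValueError unless the subcommand is a known one-shot one.

    Hypothetical helper for illustration; vdb-watcher has no such hook.
    """
    if subcommand in DAEMON:
        raise ValueError(f"refusing daemon subcommand: {subcommand!r}")
    if subcommand not in ONE_SHOT:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
```

Calling `check_subcommand("ingest")` returns silently, while `check_subcommand("start")` raises before anything is launched.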
When to use
- The user asks to add, ingest, embed, or index a PDF.
- The user hands over a PDF and wants future questions answered from it.
- The user wants to refresh a previously ingested PDF. Ingest is idempotent: re-runs replace prior chunks for the same file_sha.
CLI
Ingest one PDF:
$HOME/.local/bin/vdb-watcher ingest "<pdf-path>"
Backfill a whole folder of PDFs:
$HOME/.local/bin/vdb-watcher backfill "<folder-path>"
Watch progress (run in a separate terminal while ingest is running):
$HOME/.local/bin/vdb-watcher progress
Check watcher status:
$HOME/.local/bin/vdb-watcher status
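When the skill is driven from a script, the choice between ingest and backfill can be made from the path itself. This is a sketch under stated assumptions: `build_command` and `run_one_shot` are hypothetical helper names, and treating the process exit code as the success signal is an assumption about vdb-watcher, not something this document states.

```python
import subprocess
from pathlib import Path

# Same binary as in the commands above, with $HOME expanded.
VDB_WATCHER = Path.home() / ".local" / "bin" / "vdb-watcher"


def build_command(target: Path) -> list[str]:
    """Pick `backfill` for a directory and `ingest` for a single PDF."""
    if target.is_dir():
        return [str(VDB_WATCHER), "backfill", str(target)]
    if target.suffix.lower() != ".pdf":
        raise ValueError(f"not a PDF or a folder: {target}")
    return [str(VDB_WATCHER), "ingest", str(target)]


def run_one_shot(target: Path) -> int:
    """Run in the foreground, wait for completion, return the exit code."""
    return subprocess.run(build_command(target)).returncode
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.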
What happens during ingest
- The watcher stages the PDF into a temp dir.
- If the PDF has more than WIKI_WATCHER_DOCLING_PDF_CHUNK_PAGES pages (default 16 on this Mac), it splits with gs (ghostscript) or pdfseparate + pdfunite (poppler).
- Docling converts each slice to markdown via ~/.wiki-venv/bin/docling (CPU-only on Intel Mac).
- The combined markdown is written to ~/.wiki-pdfs/<sha>_<filename>.md.
- wiki_index.py chunks the markdown (char-based, paragraph-boundary preferring) and upserts into Chroma with metadata: file_sha, source_path, source_filename, chunk_index, total_chunks.
- If WIKI_WATCHER_OBSIDIAN_AUTO_UPDATE=1 (default), a landing page is also written to ~/wiki/references/pdf-vdb/.
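Two of the steps above, the page split and the chunk-and-tag step, can be sketched in a few lines. This is an illustration only, not the code in wiki_watcher.py or wiki_index.py: the function names, the 1200-character budget, the equal-size slicing, and the `<file_sha>:<chunk_index>` ID scheme are all assumptions. Only the five metadata keys and the 16-page default come from this document.

```python
from pathlib import Path


def slice_ranges(page_count: int, pages_per_slice: int = 16) -> list[tuple[int, int]]:
    """1-based inclusive page ranges; a PDF at or under the limit yields one range."""
    return [
        (start, min(start + pages_per_slice - 1, page_count))
        for start in range(1, page_count + 1, pages_per_slice)
    ]


def chunk_markdown(text: str, max_chars: int = 1200) -> list[str]:
    """Greedy char-based chunking that prefers to break on blank lines."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A paragraph longer than the budget is hard-split by characters.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def build_records(chunks: list[str], file_sha: str, source_path: str) -> list[dict]:
    """Attach the metadata fields listed above to each chunk."""
    total = len(chunks)
    return [
        {
            "id": f"{file_sha}:{i}",  # hypothetical ID scheme
            "document": chunk,
            "metadata": {
                "file_sha": file_sha,
                "source_path": source_path,
                "source_filename": Path(source_path).name,
                "chunk_index": i,
                "total_chunks": total,
            },
        }
        for i, chunk in enumerate(chunks)
    ]
```

Deriving IDs from file_sha and chunk_index is one way to make re-ingest replace earlier chunks for the same file; a full implementation would also need to remove stale chunks when a re-run produces fewer of them.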
Expectations on this Mac
- First-ever ingest triggers a one-time Docling model download (500MB-1GB into `/.cache`).
- CPU-only ingest on Intel Mac is slow: expect 1-3 minutes per ~10 dense pages.
- Re-ingesting the same PDF is idempotent and updates in place.
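To set expectations before a long run, the rate quoted above can be turned into a rough range. This sketch assumes the 1-3 minutes per 10 pages figure scales linearly with page count, which is a simplification; `estimate_minutes` is a hypothetical helper, and the one-time model download is not included.

```python
def estimate_minutes(page_count: int) -> tuple[float, float]:
    """Return a (low, high) estimate in minutes from the 1-3 min per 10 pages rate."""
    tens_of_pages = page_count / 10
    return (tens_of_pages * 1, tens_of_pages * 3)


low, high = estimate_minutes(40)
print(f"40 pages: roughly {low:.0f} to {high:.0f} minutes")
```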
After ingest
Use the vdb-query skill (or /vdb-query) to ask questions about the ingested PDF. Restrict to one source with --source "<pdf-stem>".
Python env
Pinned to ~/.wiki-venv. Never use system python3 for this stack.
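A script in this stack can check at startup that it is running from the pinned environment. This is a minimal sketch: `running_in_wiki_venv` is a hypothetical helper, and it relies on the standard behaviour that `sys.prefix` points at the virtual environment directory when one is active.

```python
import sys
from pathlib import Path

WIKI_VENV = Path.home() / ".wiki-venv"


def running_in_wiki_venv(prefix: str = sys.prefix) -> bool:
    """True when the given interpreter prefix is the pinned venv."""
    return Path(prefix).resolve() == WIKI_VENV.resolve()


if not running_in_wiki_venv():
    print(f"warning: interpreter prefix is {sys.prefix}, expected {WIKI_VENV}")
```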