pdf-ingest - SKILL.md Agent Skill

name: pdf-ingest
description: Ingest a PDF (or folder of PDFs) into the local vector database at ~/.wiki-chroma. Use when the user says "ingest this PDF", "embed this PDF", "add to vdb", "index this PDF", or hands over a PDF and wants the vdb-query skill to be able to search it.

PDF Ingest into Vector DB

Writer-side counterpart to the vdb-query skill. Drives vdb-watcher (which wraps wiki_watcher.py) to convert a PDF to markdown via Docling, chunk it, and upsert into Chroma collection pdf_chunks at ~/.wiki-chroma.

No background daemon on this Mac

wiki_watcher.py supports a long-running watcher mode (start / run), but this skill does not use it. The skill is one-shot only: each invocation runs ingest <path> or backfill <dir> synchronously in the foreground, then exits. There is no SessionStart hook, no heartbeat file, and no PID file.

Never run these subcommands unless you want the daemon:

vdb-watcher start — would launch a detached worker.
vdb-watcher run — would enter the long-running scan loop.
vdb-watcher heartbeat <cwd> — would register a cwd for the daemon to watch.
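
One way to enforce the rule above is a thin wrapper that only ever builds one-shot commands. This is a minimal sketch, not part of the skill: build_command and run_one_shot are hypothetical helper names, and the only facts taken from this document are the binary path and the subcommand names.

```python
import os
import subprocess

# Binary path as given in the CLI section of this document.
VDB_WATCHER = os.path.expanduser("~/.local/bin/vdb-watcher")

# One-shot subcommands this skill uses, and the daemon ones it avoids.
ONE_SHOT = {"ingest", "backfill", "progress", "status"}
DAEMON_ONLY = {"start", "run", "heartbeat"}


def build_command(subcommand, *args):
    """Return the argv for a one-shot call, refusing daemon subcommands."""
    if subcommand in DAEMON_ONLY:
        raise ValueError(f"'{subcommand}' would start or feed the daemon; refusing")
    if subcommand not in ONE_SHOT:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return [VDB_WATCHER, subcommand, *args]


def run_one_shot(subcommand, *args):
    """Run the command synchronously in the foreground; return its exit code."""
    return subprocess.run(build_command(subcommand, *args)).returncode
```

Calling run_one_shot("ingest", "<pdf-path>") blocks until the ingest finishes, while build_command("start") raises before anything is launched.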

When to use

The user asks to add, ingest, embed, or index a PDF.
The user hands over a PDF and wants future questions answered from it.
The user wants to refresh a previously-ingested PDF (idempotent — re-runs replace prior chunks for the same file_sha).
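
The idempotency claim in the last item can be pictured with a small in-memory model. This is only a sketch under assumptions: a plain dict stands in for the Chroma collection, SHA-256 over the raw bytes stands in for whatever hash the watcher actually uses, and file_sha / replace_chunks are hypothetical helper names.

```python
import hashlib


def file_sha(data: bytes) -> str:
    # Assumption: SHA-256 of the raw PDF bytes; the watcher's real hash may differ.
    return hashlib.sha256(data).hexdigest()


def replace_chunks(store: dict, sha: str, chunks: list) -> None:
    """Drop every chunk stored under this sha, then write the new ones."""
    for key in [k for k in store if k[0] == sha]:
        del store[key]
    for index, text in enumerate(chunks):
        store[(sha, index)] = {
            "file_sha": sha,
            "chunk_index": index,
            "total_chunks": len(chunks),
            "text": text,
        }


store = {}
sha = file_sha(b"%PDF-1.7 example bytes")
replace_chunks(store, sha, ["a", "b", "c"])
replace_chunks(store, sha, ["a", "b"])  # re-run that yields fewer chunks
print(len(store))  # 2: the stale third chunk is gone
```

Deleting by file_sha before writing is what keeps a re-run from leaving orphaned chunks behind when the chunk count changes.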

CLI

Ingest one PDF:

$HOME/.local/bin/vdb-watcher ingest "<pdf-path>"

Backfill a whole folder of PDFs:

$HOME/.local/bin/vdb-watcher backfill "<folder-path>"

Watch progress (run in a separate terminal while ingest is running):

$HOME/.local/bin/vdb-watcher progress

Check watcher status:

$HOME/.local/bin/vdb-watcher status

What happens during ingest

The watcher stages the PDF into a temp dir.
If the PDF has more than WIKI_WATCHER_DOCLING_PDF_CHUNK_PAGES pages (default 16 on this Mac), the watcher splits it into slices with gs (Ghostscript) or pdfseparate + pdfunite (Poppler).
Docling converts each slice to markdown via ~/.wiki-venv/bin/docling (CPU-only on Intel Mac).
The combined markdown is written to ~/.wiki-pdfs/<sha>_<filename>.md.
wiki_index.py chunks the markdown (char-based, paragraph-boundary preferring) and upserts into Chroma with metadata: file_sha, source_path, source_filename, chunk_index, total_chunks.
If WIKI_WATCHER_OBSIDIAN_AUTO_UPDATE=1 (default), a landing page is also written to ~/wiki/references/pdf-vdb/.
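
Two of the steps above, page slicing and paragraph-preferring chunking, can be sketched in a few lines. This is an illustration, not the code in wiki_watcher.py or wiki_index.py: the function names, the assumption that each slice holds at most the threshold number of pages, the 1200-character limit, and the absence of chunk overlap are all assumptions.

```python
def page_slices(total_pages, slice_pages=16):
    """1-based inclusive (first, last) page ranges, one per slice."""
    return [
        (first, min(first + slice_pages - 1, total_pages))
        for first in range(1, total_pages + 1, slice_pages)
    ]


def chunk_markdown(text, max_chars=1200):
    """Cut text into chunks of at most max_chars, preferring blank-line breaks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last paragraph break inside the window, if there is one.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    return chunks


print(page_slices(40))  # [(1, 16), (17, 32), (33, 40)]
print(chunk_markdown("aaaa\n\nbbbb\n\ncccc", max_chars=10))  # ['aaaa', 'bbbb', 'cccc']
```

When a window contains no paragraph break, the chunker falls back to a hard cut at max_chars, so a single very long paragraph still produces bounded chunks.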

Expectations on this Mac

First-ever ingest triggers a one-time Docling model download (~500MB-1GB into ~/.cache).
CPU-only ingest on Intel Mac is slow — expect 1-3 minutes per ~10 dense pages.
Re-ingesting the same PDF is idempotent and updates in place.
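
The throughput figure above can be turned into a rough wait-time estimate before starting a long ingest. This is back-of-the-envelope arithmetic only: estimate_minutes is a hypothetical helper, it scales the quoted 1-3 minutes per 10 dense pages linearly, and it ignores the one-time model download.

```python
def estimate_minutes(pages, low_per_10=1.0, high_per_10=3.0):
    """Rough (low, high) ingest time in minutes for a dense PDF."""
    blocks = pages / 10
    return (blocks * low_per_10, blocks * high_per_10)


print(estimate_minutes(200))  # (20.0, 60.0)
```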

After ingest

Use the vdb-query skill (or /vdb-query) to ask questions about the ingested PDF. Restrict to one source with --source "<pdf-stem>".
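
If <pdf-stem> means the filename without its directory and extension (an assumption; vdb-query may normalize names differently), it can be derived as below. The helper name and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def source_stem(pdf_path):
    """Filename without directory or final extension."""
    return PurePosixPath(pdf_path).stem


print(source_stem("/Users/me/papers/attention-survey.pdf"))  # attention-survey
```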

Python env

Pinned to ~/.wiki-venv. Never use system python3 for this stack.
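
A script that touches this stack can check at startup that it is running from the pinned environment. This is a minimal sketch, assuming the venv was created in the usual way so that sys.prefix points at ~/.wiki-venv; in_pinned_venv is a hypothetical helper name.

```python
import os
import sys

# Venv location as stated in this document.
PINNED_VENV = os.path.expanduser("~/.wiki-venv")


def in_pinned_venv(prefix=None, venv=PINNED_VENV):
    """True when the interpreter prefix resolves to the pinned venv."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.realpath(prefix) == os.path.realpath(venv)


if not in_pinned_venv():
    print("warning: not running from ~/.wiki-venv", file=sys.stderr)
```

The check compares resolved paths, so a symlinked home directory does not produce a false warning.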