name: ingest-vault
description: Bulk-admit files from a Marginalia mirror vault into the database, then let the LLM pipeline catch up in the background. Use when the user has dropped a stack of PDFs / markdown / notes into their vault directory and wants them indexed and searchable.
compatibility: Requires the marginalia CLI (Python 3.11+), a configured LLM ingest profile, and STORAGE_BACKEND=mirror for bulk ingest. For STORAGE_BACKEND=local, use marginalia upload per file instead.
allowed-tools: bash read
Ingest a vault into Marginalia
Marginalia is a personal knowledge base. The user keeps the canonical files on disk (the "mirror vault") and Marginalia tracks each file with a database entry plus AI-extracted metadata. This skill walks through the bulk-ingest path: admit fast, run LLM extraction async.
When to use
- The user dragged a folder of files into the vault and asks "index these".
- The user says "I just downloaded a bunch of papers, can you add them?"
- The user is migrating from another tool and wants files imported.
Prerequisites
- The vault root is configured (MARGINALIA_HOME env or marginalia init).
- STORAGE_BACKEND=mirror (the default). For local, the user must use /upload per file instead; bulk ingest is mirror-only.
- Files are already in the vault directory. Bulk ingest does not COPY files in; it only registers what is already on disk.
Workflow
1. Start the REPL. From the vault directory:
   marginalia
   The prompt looks like
   marginalia[mirror />
   once connected. The bracket shows backend + cwd + queue depth.
2. See what's new on disk. /check runs a scan and reports four categories:
   /check
   Output groups files into: new (on disk, not in db), modified (content changed), moved (folder/name changed), missing (in db, gone from disk). Read the counts before applying: a surprising missing count often indicates the user is in the wrong directory.
3. Apply everything. This is the bulk-ingest entry point:
   /ingest --all
   It admits each new file (creates the db row, hashes the bytes), then queues an LLM extraction task per file. A progress bar shows N/M for admission. When admission finishes, the prompt's N busy count reflects the LLM queue.
4. Let the queue drain in the background. The user can keep working (ask questions, run searches) while ingestion completes. The prompt's N busy reading drops as tasks finish.
5. If the user wants to wait explicitly, tell them: leave the REPL open; on exit, they'll be prompted "wait or quit". q is safe; the next launch resumes via recover_stuck_tasks.
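The four categories that /check reports can be pictured as a comparison between two maps of path to content hash, one built from disk and one from the database. The sketch below is an illustration only, not Marginalia's implementation: the function names, the use of SHA-256, and the rule that a matching hash at a new path counts as a move are all assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's bytes in chunks so large PDFs are not loaded whole.

    SHA-256 is an assumption; the source only says admission "hashes the bytes".
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(on_disk: dict[str, str], in_db: dict[str, str]) -> dict[str, list[str]]:
    """Compare {relative_path: content_hash} maps from disk and from the db."""
    report: dict[str, list[str]] = {"new": [], "modified": [], "moved": [], "missing": []}
    disk_only = {p: h for p, h in on_disk.items() if p not in in_db}
    db_only = {p: h for p, h in in_db.items() if p not in on_disk}

    # Same path on both sides with a different hash: content changed.
    for path in sorted(on_disk.keys() & in_db.keys()):
        if on_disk[path] != in_db[path]:
            report["modified"].append(path)

    # A disk-only path whose hash matches a db-only entry is treated as a move
    # (hypothetical rule); anything else on disk only is new.
    unclaimed = dict(db_only)
    for path, digest in sorted(disk_only.items()):
        old = next((p for p, h in sorted(unclaimed.items()) if h == digest), None)
        if old is not None:
            report["moved"].append(path)
            del unclaimed[old]
        else:
            report["new"].append(path)

    # Whatever the db still lists that the disk does not have is missing.
    report["missing"] = sorted(unclaimed)
    return report
```

Under this model, running the scan from the wrong directory makes nearly every db entry land in missing, which is why an unexpectedly large missing count is worth checking before applying.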
Targeted ingest
If the user only wants part of the vault (say, one new folder):
/ingest path/to/folder
/ingest single_file.pdf
These accept paths relative to cwd. Same admission + queue flow as --all, just scoped.
Common pitfalls
- "Where's my file?" Mirror mode requires the file to live UNDER the vault root. If the user pasted a path outside the vault, the CLI prints
  → /upload is for copying files INTO the vault.
  and refuses. Direct them to either move the file into the vault or use /upload.
- Storage backend mismatch. If the user previously ran with STORAGE_BACKEND=local and switched, lifespan startup raises StorageBackendMismatchError. Tell them to run marginalia storage migrate --from local --to mirror (or revert).
- Long queue, no apparent progress. The N busy count reflects the task queue. If it's stuck above zero with no decrease over several minutes, the LLM provider may be unconfigured or throttled. Check MARGINALIA_LLM_* env settings.
After ingest
Once N busy settles back near zero, the corpus is ready for:
- search-by-question → see research-with-marginalia skill
- discovery / related-entries → see discover-and-curate skill
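Deciding whether N busy has settled, is still draining, or is stuck can be reduced to a check over timestamped readings. The sketch below assumes the caller collects (seconds, busy_count) samples itself, for example from the prompt; queue_status and the 300-second default are hypothetical stand-ins for "several minutes", not values taken from Marginalia.

```python
def queue_status(samples: list[tuple[float, int]], stall_after_s: float = 300.0) -> str:
    """Classify (timestamp_seconds, busy_count) samples, oldest first.

    Returns "unknown", "drained", "draining", or "stalled".
    """
    if not samples:
        return "unknown"
    latest_time, latest_depth = samples[-1]
    if latest_depth == 0:
        return "drained"

    # Track the last moment the count actually went down; if it never did,
    # measure from the first sample.
    last_progress = samples[0][0]
    for (_, prev_depth), (t, depth) in zip(samples, samples[1:]):
        if depth < prev_depth:
            last_progress = t

    if latest_time - last_progress >= stall_after_s:
        return "stalled"
    return "draining"
```

A "stalled" result is the cue to check the MARGINALIA_LLM_* settings rather than keep waiting.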
One-shot commands
All of the above can be driven non-interactively by an external agent:
marginalia check --json
marginalia ingest --all --yes --json
marginalia ingest path/to/folder --yes --json
marginalia background --json
marginalia reprocess failed --json
marginalia reprocess folder <full_folder_id> failed --json
marginalia upload ./somewhere/paper.pdf /papers/
Add --json for machine-parseable output. --yes skips confirmation prompts.
The CLI auto-discovers the backend like the REPL. IDs must be full UUIDs.
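Because IDs must be full UUIDs, an external agent may want to validate an ID before building the command line. A minimal sketch, assuming the canonical 36-character hyphenated form is what "full" means and that letter case does not matter; both helper names are hypothetical, and the argv mirrors the reprocess folder command listed above.

```python
import uuid


def is_full_uuid(value: str) -> bool:
    """True only for the canonical hyphenated form, not prefixes or bare hex."""
    try:
        # uuid.UUID also accepts braces, URNs, and unhyphenated hex, so compare
        # the canonical rendering against the input to reject those.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def reprocess_folder_failed_argv(folder_id: str) -> list[str]:
    """Build argv for: marginalia reprocess folder <full_folder_id> failed --json"""
    if not is_full_uuid(folder_id):
        raise ValueError(f"expected a full UUID, got {folder_id!r}")
    return ["marginalia", "reprocess", "folder", folder_id, "failed", "--json"]
```

The resulting list can be passed to subprocess.run with capture_output=True and text=True. The shape of the JSON on stdout is not described here, so inspect it before depending on specific fields.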