name: kb-ingest-bulk
description: Parallel map-reduce bulk ingest for large source sets. Map phase dispatches read-only knowledge-extractor agents (≤N-wide, default Haiku) emitting per-source JSON extracts; Python routes extracts by target library file (fuzzy-merging new topics, pre-allocating new files, flagging oversized topics); reduce phase dispatches one agent-knowledge-updater per file (parallel, one writer per file); single shelf-index rebuild + one log.md entry. Resumable. Supersedes kb-ingest-batch.
disable-model-invocation: false
argument-hint: "<glob|dir|file-list> [--library <path>] [--parallel <N>] [--extractor-model <id>] [--size-threshold <tokens>] [--retry-failed] [--clean]"
Bulk Ingestion (map-reduce)
Load a large source set into the knowledge base in parallel. Four phases: map (parallel extract) → route (Python) → reduce (parallel synthesis, one writer per file) → finalize (one rebuild + one log entry).
Arguments
| Argument | Description |
|---|---|
| `<glob\|dir\|file-list>` | Sources to ingest |
| `--library <path>` | Target library, bypassing CLAUDE.md resolution (default: CLAUDE.md-resolved). For isolated testing / multi-library; see #209 |
| `--parallel <N>` | Concurrency for map + reduce rounds (default 16, max 64) |
| `--extractor-model <id>` | Map model (default claude-haiku-4-5) |
| `--size-threshold <tokens>` | Per-file reduce size guard (default 200000) |
| `--retry-failed` | Re-queue failed sources/targets from a prior run |
| `--clean` | Remove library/.extracts/ after a successful run |
Preflight
- Resolve `library_path`: if `--library` is given, use it; else read the CLAUDE.md `## Knowledge Base` section. Derive `shelf_index_path = <library>/_shelf-index.md`, `log_path = <library>/log.md`, `extracts_dir = <library>/.extracts`.
- Verify that the `knowledge-extractor` and `agent-knowledge-updater` agents are available.
- Clamp `--parallel` to [1, 64].
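The path derivation and clamping above can be sketched in a few lines. This is a minimal illustration, not the plugin's own code: `derive_paths` and `clamp_parallel` are hypothetical names, and only the file names and the [1, 64] range come from this document.

```python
from pathlib import Path

MAX_PARALLEL = 64  # upper bound stated in the Arguments table


def derive_paths(library_path: str) -> dict[str, Path]:
    """Derive the three per-library paths named in the preflight step."""
    library = Path(library_path)
    return {
        "shelf_index_path": library / "_shelf-index.md",
        "log_path": library / "log.md",
        "extracts_dir": library / ".extracts",
    }


def clamp_parallel(requested: int) -> int:
    """Clamp the requested concurrency into the inclusive range [1, 64]."""
    return max(1, min(MAX_PARALLEL, requested))
```

For example, `clamp_parallel(0)` yields 1 and `clamp_parallel(100)` yields 64, so a bad `--parallel` value never stalls or floods the dispatch rounds.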
Module bootstrap
Use this CLAUDE_PLUGIN_ROOT importlib loader block to import the
sdlc_knowledge_base_scripts.kb_ingest_bulk helpers. It is self-contained — no
other file needs to be opened.
python3 -c "
import sys, os, importlib.util, json
PLUGIN_ROOT = os.environ.get('CLAUDE_PLUGIN_ROOT', '')
SCRIPTS = os.path.join(PLUGIN_ROOT, 'scripts')
INIT = os.path.join(SCRIPTS, '__init__.py')
if os.path.isfile(INIT) and 'sdlc_knowledge_base_scripts' not in sys.modules:
    spec = importlib.util.spec_from_file_location(
        'sdlc_knowledge_base_scripts', INIT, submodule_search_locations=[SCRIPTS])
    if spec and spec.loader:
        mod = importlib.util.module_from_spec(spec)
        sys.modules['sdlc_knowledge_base_scripts'] = mod
        spec.loader.exec_module(mod)
from sdlc_knowledge_base_scripts.kb_ingest_bulk import (
    discover_sources, build_bulk_manifest, load_manifest, save_manifest,
    retry_failed, persist_extract, slug_for_source, mark_source_extracted,
    mark_source_failed, route_extracts, format_extract_prompt,
    format_reduce_prompt, ReduceDispatchRequest, mark_target_reduced,
    mark_target_failed, summarize_run, write_log_entry, ExtractDispatchRequest
)
"
Phase 1 — Map (parallel ≤N)
- `discover_sources(<spec>)` → source list; `build_bulk_manifest(...)` (merge prior manifest via `load_manifest` if present; `retry_failed` if `--retry-failed`); `save_manifest`.
- For each pending source whose extract file does NOT already exist (resume skip): dispatch `knowledge-extractor` with `format_extract_prompt(...)`, up to N concurrently (parallel Agent-tool calls). Capture each agent's JSON reply.
- After each agent: `persist_extract(extracts_dir, slug, json)` then `mark_source_extracted`; on error/invalid-JSON `mark_source_failed`. `save_manifest` after each round.
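The map phase's control flow (resume skip, bounded concurrency, failure on invalid JSON) can be sketched as follows. This is an illustration under stated assumptions, not the real dispatcher: the actual extraction happens through Agent-tool calls, so `extract` and `slug_for` are hypothetical stand-ins, and the `<slug>.json` file naming is assumed rather than taken from `persist_extract`.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def run_map_phase(
    sources: list[str],
    extracts_dir: Path,
    extract: Callable[[str], str],   # hypothetical stand-in for one agent dispatch
    slug_for: Callable[[str], str],  # hypothetical stand-in for slug_for_source
    parallel: int = 16,
) -> dict[str, str]:
    """Return {source: status}; status is 'extracted', 'failed', or 'skipped'."""
    extracts_dir.mkdir(parents=True, exist_ok=True)
    status: dict[str, str] = {}
    pending: list[str] = []
    for src in sources:
        # Assumed naming: one <slug>.json per source inside extracts_dir.
        if (extracts_dir / f"{slug_for(src)}.json").exists():
            status[src] = "skipped"  # resume skip: extract already on disk
        else:
            pending.append(src)

    def one(src: str) -> tuple[str, str]:
        try:
            payload = json.loads(extract(src))  # invalid JSON raises ValueError
        except Exception:
            return src, "failed"
        out = extracts_dir / f"{slug_for(src)}.json"
        out.write_text(json.dumps(payload), encoding="utf-8")
        return src, "extracted"

    # At most `parallel` extractions are in flight at any time.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        for src, result in pool.map(one, pending):
            status[src] = result
    return status
```

Because a failed source never writes an extract file, rerunning the same command picks it up again, while sources that already have an extract on disk are skipped.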
Phase 2 — Route (Python, no agents)
- Load all extract JSON files. Compute `existing_files` = set of `*.md` in the library (excluding `_shelf-index.md`, `log.md`, `_index.md`, and anything under `raw/`).
- `route = route_extracts(extracts, existing_files, size_threshold)`. Record each `route.targets` entry into `manifest["targets"]` as `pending` with `source_count` and `is_new`. Report `route.oversized` (skipped). `save_manifest`.
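One way to compute `existing_files` is sketched below. It is an assumption-laden illustration: `list_existing_files` is a hypothetical helper, the search is assumed to be recursive, the excluded names are assumed to apply at any depth, and the result is assumed to hold library-relative paths. The real `route_extracts` may expect a different shape.

```python
from pathlib import Path

# Bookkeeping files that are never routing targets (from the step above).
EXCLUDED_NAMES = {"_shelf-index.md", "log.md", "_index.md"}


def list_existing_files(library: Path) -> set[str]:
    """Return library-relative POSIX paths of routable *.md files."""
    found: set[str] = set()
    for path in library.rglob("*.md"):
        if not path.is_file():
            continue
        rel = path.relative_to(library)
        if rel.parts[0] == "raw":  # anything under raw/ is excluded
            continue
        if rel.name in EXCLUDED_NAMES:
            continue
        found.add(rel.as_posix())
    return found
```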
Phase 3 — Reduce (parallel ≤N, one writer per file)
- For each pending target: dispatch ONE `agent-knowledge-updater` with `format_reduce_prompt(ReduceDispatchRequest(...))`, up to N concurrently. Exactly one agent per file; never two agents on the same file.
- After each: `mark_target_reduced` (success) or `mark_target_failed` (error). `save_manifest` after each round.
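The one-writer-per-file rule follows from grouping extracts by target before dispatch. The sketch below shows that grouping plus the split into rounds of at most N jobs. `plan_reduce_rounds` and the `(target_file, extract_slug)` pair shape are hypothetical; the real routing output and `ReduceDispatchRequest` fields are not shown in this document.

```python
from collections import defaultdict


def plan_reduce_rounds(
    routed: list[tuple[str, str]],  # (target_file, extract_slug) pairs; assumed shape
    parallel: int,                  # must be >= 1 (already clamped in preflight)
) -> list[list[tuple[str, list[str]]]]:
    """Group extracts per target, then split targets into rounds of <= parallel."""
    by_target: dict[str, list[str]] = defaultdict(list)
    for target, slug in routed:
        by_target[target].append(slug)
    # One job per distinct file, so no two agents ever write the same file.
    jobs = sorted(by_target.items())
    return [jobs[i:i + parallel] for i in range(0, len(jobs), parallel)]
```

Each inner list is one round of concurrent dispatches; saving the manifest between rounds keeps the run resumable if a later round is interrupted.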
Phase 4 — Finalize (once)
- Rebuild the shelf-index once via `build_shelf_index.main([library_path])`.
- `write_log_entry(log_path, "## [<date>] ingest-bulk\n" + summarize_run(manifest, route.oversized))`.
- Print the summary table. If `--clean`, remove `extracts_dir`.
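The single log entry could be assembled along these lines. This is a sketch only: `append_log_entry` is a hypothetical stand-in for `write_log_entry`, and both the ISO date format and the append-to-file behavior are assumptions, since the document specifies only the `## [<date>] ingest-bulk` heading.

```python
from datetime import date
from pathlib import Path


def append_log_entry(log_path: Path, summary: str, today: date) -> str:
    """Append one dated ingest-bulk entry to log.md and return the text written."""
    # Assumed date format: ISO 8601 (YYYY-MM-DD).
    entry = f"## [{today.isoformat()}] ingest-bulk\n{summary.rstrip()}\n"
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```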
Exit / partial failure
A failed source drops out of routing; a failed target leaves its extracts on disk
for --retry-failed. Report all failures + oversized topics. Non-zero exit only if
nothing succeeded.
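The exit rule can be read as the following small check. It is one possible interpretation, not the plugin's logic: `exit_code` is a hypothetical helper, the `"reduced"` status string is assumed from the `mark_target_reduced` name, and "nothing succeeded" is taken here to mean that no target was reduced.

```python
def exit_code(target_statuses: dict[str, str]) -> int:
    """Return 0 when at least one target was reduced, 1 when nothing succeeded."""
    # Assumed status value: "reduced" marks a successfully written target.
    succeeded = any(status == "reduced" for status in target_statuses.values())
    return 0 if succeeded else 1
```

Under this reading a run with some failed targets still exits 0, and the failures surface only in the printed report and the manifest.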