name: fact-extractor version: 0.1.0 description: Stage cleaned chapter text and extract quote-grounded minimal fact/QRA candidates from source text or primary/context chunks, with streaming model calls, retries, validation, and durable artifacts. model: any category: extraction tags: - extraction - facts - qra - jsonl - chutes - scillm entrypoint: ./run.sh complies: - best-practices-skills - best-practices-python - best-practices-scillm - best-practices-arangodb
fact-extractor
Use this skill when a project needs source-grounded, recall-useful fact or QRA candidate extraction from books, chapters, documents, or cleaned text chunks.
This skill creates an intermediate extraction artifact. It does not create
canonical /create-qras records, does not write persona_memory, does not write
lessons, and does not upsert to ArangoDB.
Contract
The model-owned output is minimal JSONL. Each line must contain exactly:
{"question":"string","answer":"string","claim":"string","evidence_quote":"exact quote","factuality":"narration_assertion|character_speech|character_thought|reported_story|uncertain_narration","tom":null}
For non-null theory-of-mind records, tom must contain exactly:
{"holder":"string","mental_state":"belief|intention|emotion|perception|preference|uncertainty|evaluation","target":"string"}
The skill is responsible for deterministic cleaning/staging, chapter metadata, chunk spans, validation, retry degradation, and artifact preservation.
Required Behavior
- Split source text into paragraph-aware, sentence-boundary-safe chunks.
- Stage raw book/audiobook transcript sources into normalized chapter text files before extraction when the source has not already been cleaned.
- Render overlap as
<context_before>,<primary_text>, and<context_after>. - Instruct the model to copy every
evidence_quotefromprimary_textonly. - Use streaming Chutes/scillm-compatible chat completion calls.
- Preserve each rendered prompt payload, raw SSE stream, parsed model JSONL, validation report, accepted records, aggregate report, and status file.
- Validate JSONL shape, exact quote grounding, primary-text quote origin, closed factuality vocabulary, theory-of-mind shape, duplicate records, HTTP status, and stream completion.
- Treat density misses as warnings unless the record count is zero or exceeds the configured hard cap.
- Retry failed chunks with lower output pressure.
- Resume by skipping chunks with existing
accepted_records.jsonlunless--forceis supplied. - Enrich every accepted record with deterministic
book,chapter,chapter_id,chunk_id, source SHA, primary/context spans, and absolute evidence quote spans. Aggregate validation must fail if accepted records lack this metadata.
Commands
Stage a raw book transcript into cleaned chapter files:
skills/fact-extractor/run.sh stage-book ./Galaxy_in_Flames/text.md \
--book "Galaxy in Flames" \
--book-id galaxy_in_flames \
--out /mnt/storage12tb/skills/fact-extractor/staged/galaxy_in_flames \
--force
For audiobook source directories, prefer text.md when present. Do not use
overlapping clean/*.json fragments as chapter inputs unless no coherent text
source exists.
Run a full chapter extraction:
skills/fact-extractor/run.sh chapter ./chapter_02.txt \
--book "Horus Rising" \
--chapter "Chapter 02" \
--chapter-id horus_rising_ch02 \
--persona-id horus_lupercal \
--out /mnt/storage12tb/skills/fact-extractor/outputs/horus_rising_ch02 \
--target-chars 3500 \
--max-chars 4500 \
--concurrency 1
Create chunks only:
skills/fact-extractor/run.sh chunk ./chapter_02.txt \
--out /tmp/chapter_02_chunks.jsonl \
--document-id horus_rising_ch02
Create a book-level chunks artifact from an audiobook extraction manifest:
skills/fact-extractor/run.sh chunk-book \
--chapters /mnt/storage12tb/skills/extract-audiobook/outputs/galaxy_in_flames/chapters.jsonl \
--out /mnt/storage12tb/skills/audiobook-extractor/outputs/galaxy_in_flames/chunks.jsonl \
--book "Galaxy in Flames" \
--book-id galaxy_in_flames
Append a book-level chapter progress event from a per-chapter extraction output:
skills/fact-extractor/run.sh book-progress \
--book-root /mnt/storage12tb/skills/fact-extractor/outputs/galaxy_in_flames \
--chapter-out /mnt/storage12tb/skills/fact-extractor/outputs/galaxy_in_flames/chapter_01 \
--book "Galaxy in Flames" \
--book-id galaxy_in_flames \
--chapter "Chapter 01" \
--chapter-id galaxy_in_flames_ch01 \
--status running
Extract from existing chunk JSONL:
skills/fact-extractor/run.sh extract \
--chunks /tmp/chapter_02_chunks.jsonl \
--out /mnt/storage12tb/skills/fact-extractor/outputs/horus_rising_ch02 \
--book "Horus Rising" \
--chapter "Chapter 02" \
--chapter-id horus_rising_ch02 \
--persona-id horus_lupercal
Check dependencies and executable wiring:
skills/fact-extractor/run.sh doctor --json
Merge per-chapter extraction outputs into a book-level accepted facts artifact:
skills/fact-extractor/run.sh merge-accepted \
/mnt/storage12tb/skills/fact-extractor/outputs/galaxy_in_flames \
--out /mnt/storage12tb/skills/audiobook-extractor/outputs/galaxy_in_flames/accepted_records.jsonl
Runtime Defaults
- Default URL:
http://localhost:4001/v1/chat/completions - Default auth token:
SCILLM_PROXY_TOKEN, thenCHUTES_PROXY_TOKEN, falling back tosk-dev-proxy-123for the local dev proxy. - Default model field:
chutes-deepseek; override with any scillm one-shot chat model such asgemini-flash,gpt-5.5, ormoonshot-textwhen auth is healthy. - Default temperature:
0 - Default streaming:
true - Default concurrency:
1 - Default retry ladder:
8-12, then5-8, then2-4records. - Default
max_tokens: omitted. Pass--max-tokens Nonly when a specific model/provider needs an explicit completion cap.
Output Artifacts
The output directory contains:
stage_manifest.jsonwhenstage-bookis usedcleaned_chapters/chapter_XX.txtwhenstage-bookis usedchunks.jsonlbook_progress.jsonlat a book output root when full-book orchestration callsbook-progressrun_manifest.jsonprogress.as_completed.jsonlaggregate_report.jsonaccepted_records.jsonlstatus.mdchunks/<chunk_id>/input_chunk_metadata.jsonchunks/<chunk_id>/accepted_records.jsonlchunks/<chunk_id>/attempt_XX/prompt_payload.jsonchunks/<chunk_id>/attempt_XX/stream.raw.ssechunks/<chunk_id>/attempt_XX/model_content.raw.jsonlchunks/<chunk_id>/attempt_XX/validation_report.json
For full-book audiobook pipelines, the primary durable JSONL artifacts are:
chapters.jsonl: one chapter provenance record per embedded audio chapter. Rows are shaped for later$memory/upsertintobook_chapters.chunks.jsonl: one fact-extraction chunk record per primary/context span, enriched withbook,book_id,chapter, andchapter_id. Rows are shaped for later$memory/upsertintobook_chunks.accepted_records.jsonl: one validated extraction candidate per accepted fact/QRA-like record, merged from per-chapter fact extraction outputs. Rows are shaped for later$memory/upsertintopersona_memorywhen--persona-idis supplied.
Store full-book artifacts under /mnt/storage12tb/skills/... by default. Use
/tmp only for disposable canaries, prompt experiments, or short proof runs.
The primary JSONL artifacts are memory-upsert-compatible, but this skill still
does not perform the write. Rows include deterministic _key, type,
record_type, memory_collection, text, retrieval_text, tags, source refs,
and status fields. Accepted fact rows also include persona_id,
question_text, answer_text, claim_text, and evidence_text. Rows must not
include inline embedding, embedding_visual, or vector fields. Later
ingestion should batch rows by memory_collection and write through /memory
/upsert so semantic sync and Qdrant metadata are handled by the memory
service.
The aggregate report always includes:
schema_versionacceptedcompleted_chunksaccepted_chunksfailed_chunkstotal_recordsaccepted_record_metadata_okaccepted_record_metadata_defectsmemory_writes_performed: falseforbidden_writes: ["persona_memory", "lessons", "arangodb"]
The book-level progress artifact is append-only JSONL. Each row has
schema_version: fact-extractor-book-progress.v1, book, book_id,
chapter, chapter_id, status, chunk/record counts, timestamp fields,
artifact paths, memory_writes_performed: false, and the same forbidden-write
list. It is a monitoring artifact only and must not replace per-chapter
progress.as_completed.jsonl.
Relationship To Other Skills
/create-qrascreates canonical QRA records. This skill creates validated source-grounded extraction candidates.- A later normalizer can convert accepted candidates to canonical QRA or
persona_memoryschemas. - A later memory upserter can ingest reviewed records. This skill must not perform that write.