adpa-inline-entity-extraction - SKILL.md Agent Skill

name: adpa-inline-entity-extraction description: > Use when working on document generation — specifically when modifying `documentGenerationService.ts`, the `InlineEntityParserService`, the `draftSection` prompt, or the entity registry. Use when asked to add new entity types to inline extraction, change the H8 tag format, alter how entities are saved during generation, or debug missing entities after generation. triggers: - Modifying draftSection() in documentGenerationService.ts - Adding a new entity type to inline extraction - Changing the H8 tag format or parsing logic - Debugging missing entities after document generation - Working with InlineEntityParserService - Editing ExtractionRegistry or ExtractionOrchestrator

ADPA Inline Entity Extraction

What Inline Entity Extraction Is

During document generation the drafting LLM is explicitly instructed — inside the draftSection() prompt — to append structured entity tags at the bottom of every section it writes.

Tags use a reserved Markdown heading level (H8) as a sentinel format:

######## entity_type: {json_object}

After each section is drafted, InlineEntityParserService.parseAndProcess() reads those lines, groups the parsed objects by entity type, and persists them to the database through the savers registered in the ExtractionRegistry.

The H8 lines are deliberately preserved in the final saved document. They are NOT stripped. Frontend renderers use them to highlight extracted entities inline, providing visual lineage back to the source passage.

Why This Approach

Property	Detail
Zero extra LLM calls	Extraction is a byproduct of generation — not a separate batch job.
Zero latency overhead	Entities land in the database as each section completes, not after the full document.
High fidelity	The model tags what it actually wrote, not a post-hoc re-read.
Full traceability	H8 lines stay in the document for visual lineage and audit.

Key Files

File	Purpose
`server/src/services/documentGenerationService.ts`	`draftSection()` prompt construction and `generateDocument()` section-mapping loop
`server/src/services/inlineEntityParserService.ts`	`InlineEntityParserService.parseAndProcess()` — parses H8 lines and dispatches to savers
`server/src/services/extraction/ExtractionRegistry.ts`	`initializeRegistry()` — entity module registration
`server/src/services/extraction/ExtractionOrchestrator.ts`	`saveSingleEntityType()` — transactional persistence per entity type
`server/src/__tests__/services/inlineEntityParserService.test.ts`	Unit tests for the parser

The H8 Tag Format

######## entity_type: {"attribute": "value", ...}

Each H8 line encodes exactly one entity object. Multiple entities of the same type require multiple lines. The entity_type token must match a key registered in the ExtractionRegistry.

Examples

######## stakeholders: {"name": "Project Sponsor", "role": "Funder", "influence_level": "high"}
######## stakeholders: {"name": "IT Lead", "role": "Technical Authority", "influence_level": "medium"}
######## risks: {"title": "Budget overrun", "category": "budget", "probability": "medium", "impact": "high"}
######## milestones: {"name": "Phase 1 Go-Live", "description": "Core platform live", "due_date": "2025-Q3", "status": "pending"}

Parsing rules

Lines are identified by the ######## prefix (8 # characters followed by a space).
Everything after the first : (and its trailing space) is parsed as JSON.
Lines that fail JSON parsing are silently skipped; the section text is never corrupted.
The parser runs after section drafting but before document assembly.

Supported Entity Types and Their Schemas

`stakeholders`

Field	Type	Required	Notes
`name`	string	✅	Display name
`role`	string	✅	Project role or title
`interest_level`	`high` \| `medium` \| `low`	✅
`influence_level`	`high` \| `medium` \| `low`	✅
`expectations`	string	❌	What the stakeholder expects
`concerns`	string	❌	Known concerns or risks

`risks`

Field	Type	Required	Notes
`title`	string	✅	Short risk title
`description`	string	✅	Full description
`category`	`technical` \| `schedule` \| `budget` \| `resource` \| `external` \| `quality`	✅
`probability`	`high` \| `medium` \| `low`	✅
`impact`	`high` \| `medium` \| `low`	✅
`mitigation_strategy`	string	❌
`contingency_plan`	string	❌

`milestones`

Field	Type	Required	Notes
`name`	string	✅	Milestone name
`description`	string	✅	What it represents
`due_date`	string	✅	`YYYY-MM-DD` or `Quarter/Year` (e.g. `2025-Q3`)
`status`	`pending` \| `in_progress` \| `completed` \| `delayed`	✅

`budget_baseline`

Field	Type	Required	Notes
`total_budget`	number	✅	Raw numeric value
`currency`	string	❌	ISO 4217 code (e.g. `USD`)
`categories`	object	❌	Free-form breakdown object

`cost_estimates`

Field	Type	Required	Notes
`item_name`	string	✅	Cost line item
`estimated_cost`	number	✅	Raw numeric value
`basis_of_estimate`	string	❌	How the estimate was derived
`confidence_level`	string	❌	e.g. `high`, `medium`, `low`

`deliverables`

Field	Type	Required	Notes
`name`	string	✅	Deliverable name
`description`	string	✅	What is delivered
`type`	`document` \| `software` \| `hardware` \| `service` \| `report` \| `other`	✅
`status`	`planned` \| `in_progress` \| `completed` \| `delayed` \| `cancelled`	✅
`due_date`	string	❌	`YYYY-MM-DD` or quarter
`owner`	string	❌	Person or team responsible

Adding a New Entity Type

Follow this checklist exactly. Do not skip steps.

Step 1 — Create the entity module

Create four files under server/src/services/extraction/entities/<entity_type>/:

server/src/services/extraction/entities/<entity_type>/
  types.ts        ← TypeScript interface for the entity
  extract<X>.ts   ← validation / normalization logic
  save<X>.ts      ← DB persistence using the pool/transaction
  index.ts        ← re-exports + EntityModule object

types.ts — define the interface:

export interface MyEntity {
  // required fields
  name: string;
  // optional fields
  description?: string;
}

extract<X>.ts — validate raw JSON from the parser:

import { MyEntity } from './types';

export function extractMyEntities(raw: unknown[]): MyEntity[] {
  return raw.filter((item): item is MyEntity =>
    typeof item === 'object' && item !== null && 'name' in item
  );
}

save<X>.ts — persist to the database:

import { PoolClient } from 'pg';
import { MyEntity } from './types';

export async function saveMyEntities(
  client: PoolClient,
  documentId: string,
  entities: MyEntity[]
): Promise<void> {
  for (const entity of entities) {
    await client.query(
      `INSERT INTO my_entities (document_id, name, description)
       VALUES ($1, $2, $3)
       ON CONFLICT DO NOTHING`,
      [documentId, entity.name, entity.description ?? null]
    );
  }
}

index.ts — assemble the module:

import { EntityModule } from '../../ExtractionRegistry';
import { extractMyEntities } from './extractMyEntities';
import { saveMyEntities } from './saveMyEntities';

export const myEntityModule: EntityModule = {
  entityType: 'my_entities',   // must match the H8 tag token
  extract: extractMyEntities,
  save: saveMyEntities,
};

Step 2 — Register in ExtractionRegistry

In initializeRegistry() inside ExtractionRegistry.ts:

import { myEntityModule } from './entities/my_entities';

export function initializeRegistry(): void {
  // ... existing registrations ...
  registry.register(myEntityModule);
}

Step 3 — Add to the draftSection() prompt

In documentGenerationService.ts, inside draftSection(), append the new entity type and its JSON schema to the prompt's entity-tag instruction block:

######## my_entities: {"name": "...", "description": "..."}

Include at least one well-formed example so the model learns the expected shape.

Step 4 — Write unit tests

Add cases to inlineEntityParserService.test.ts covering:

Valid H8 line → parsed and saved correctly
Missing required field → entity is skipped gracefully
Malformed JSON → parser recovers, section text unchanged

Critical Rules

[!IMPORTANT] These rules must never be violated, even under time pressure.

Never strip H8 lines from the saved document content. They are stored for frontend visual lineage. Removing them breaks the UI highlighting feature.
Never call saveSingleEntityType() outside a transaction context. The orchestrator manages BEGIN / COMMIT / ROLLBACK internally. Calling it bare will leave the connection in an inconsistent state.
If the parser fails for a section, it falls back silently. The original markdown is preserved and document generation continues. Entity extraction is non-blocking — a saver failure must never abort the overall generation job.
The inline parser runs AFTER section drafting but BEFORE document assembly. Do not reorder this pipeline. Entities must be persisted per-section so partial results survive if generation is interrupted.
One JSON object per H8 line. Do not emit arrays or multi-line JSON inside a single H8 line. The parser processes lines individually.
Retain H8 tags in the 100% full document. The final 100% complete document (documentText100 or the main stored document text) must retain all ######## tags verbatim in their original spots. Never strip or truncate them.
Scrub H8 tags in the context summaries. Multi-scale recursive context summaries (summary80, summary60, summary40, summary20) must never contain ######## tags. They must be completely scrubbed of H8 lines to save tokens.
Preserve entity terms in summaries. In all summaries, although the H8 prefix tags are scrubbed, the actual entity names, milestones, and framework terms must be retained in the text narrative to ensure semantic connectivity.

Multi-Scale Context Compaction (Summaries)

During final compilation (Phase 6), the generation service produces the full document along with four recursive context compression tiers. These summaries must follow strict density and formatting rules:

Density Level	Target Field	Key Rules
100% Full Document	`documentText100`	MUST retain every single H8 Entity Tag verbatim. No truncation or flimsiness.
80% Summary	`summary80`	Eliminate narrative fluff, preserve core metrics, omit H8 tags (but keep entity names).
60% Summary	`summary60`	Tighter compression, preserve critical entities, omit H8 tags.
40% Summary	`summary40`	Focus on structural boundaries, omit H8 tags.
20% Summary	`summary20`	High-density core capsule optimized for token-starved injections, omit H8 tags.

Principles of Context Compression:

Fluff Elimination: Remove narrative filler, conversational transitions, and introductory/concluding remarks.
Density Increase: Every sentence must pack critical technical info, milestones, budget figures, and stakeholder roles.
Traceability: Keep the H8 tags intact in the 100% document; they are the sole anchors for frontend line highlight mapping.

Debugging Missing Entities

If entities are absent from the database after generation:

Check the raw section content — confirm the LLM actually emitted H8 lines. If not, the prompt in draftSection() may be missing or malformed.
Check the entity type token — the token after ######## must exactly match a key registered in the ExtractionRegistry. Case and underscore differences will cause silent drops.
Check JSON validity — malformed JSON is silently skipped. Enable debug logging in InlineEntityParserService to surface parse errors.
Check the saver — run the unit tests for the specific entity module. A constraint violation in the DB insert can silently swallow entities if the error is caught at the orchestrator level.
Check transaction isolation — if saveSingleEntityType() was called outside a transaction, the ROLLBACK on a later error may have undone earlier successful inserts.

Data Flow Summary

draftSection() prompt
       │
       ▼
   LLM drafts section text
   + appends H8 entity tags
       │
       ▼
InlineEntityParserService.parseAndProcess()
   ├─ splits lines on ########
   ├─ groups by entity_type
   └─ for each type → ExtractionOrchestrator.saveSingleEntityType()
             │
             ▼
         DB transaction
         (BEGIN / INSERT / COMMIT)
       │
       ▼
Section text (H8 lines preserved) stored in document
       │
       ▼
generateDocument() assembles full document
       │
       ▼
Frontend renders H8 lines as entity highlights