
name: ai-observability
description: Use when designing AI agent observability architecture, integrating mission-control dashboards, or defining AI-specific KPI schemas
version: "1.1"
owner: platform-governance
tier: full
source: .enterprise/governance/agent-skills/ai-observability/SKILL.md
quick: .enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md
portable: true

AI Observability — Full Reference

Tier 2: complete observability model, metric schemas, mission-control integration, and KPI dashboard. Source: AI-SDLC v1.0 §9 (Observabilidade), §10 (KPIs), mission-control API (see /opt/references/mission-control/).

1. Metric Schema (AI-SDLC §9)

Full metric set as defined by AI-SDLC:

metrics:
  session:
    tokens_input: integer        # tokens in context window at session start
    tokens_output: integer       # tokens generated by model in session
    context_usage_pct: float     # context_used / context_max × 100
    session_duration_s: integer  # wall clock seconds

  task:
    task_id: string
    agent: string                # GHOST, ORBIT, KUBE, etc.
    tasks_executed: integer
    execution_time_s: integer
    status: success | failure | partial
    rework: boolean              # true if task was re-executed after failure

  workflow:
    epic_id: string
    phases_completed: integer
    total_phases: integer
    gate_failures: integer
    gate_warnings: integer
    cycle_time_s: integer        # phase_0.start → phase_10.end

  cost:
    model: string                # claude-sonnet-4-6, etc.
    tokens_total: integer        # tokens_input + tokens_output
    cost_usd: float              # tokens_total × model_price
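
As a worked example of the derived fields in the schema above, the sketch below computes context_usage_pct, tokens_total, and cost_usd from raw counts. The function names and the example price are illustrative assumptions, not part of AI-SDLC; model_price is taken here to be quoted in USD per million tokens.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and the example price are illustrative only.

# context_usage_pct = context_used / context_max × 100
context_usage_pct() {
  awk -v used="$1" -v max="$2" 'BEGIN { printf "%.1f\n", used / max * 100 }'
}

# tokens_total = tokens_input + tokens_output
tokens_total() {
  echo $(( $1 + $2 ))
}

# cost_usd = tokens_total × model_price (price given per million tokens)
cost_usd() {
  awk -v total="$1" -v per_mtok="$2" 'BEGIN { printf "%.4f\n", total * per_mtok / 1000000 }'
}

context_usage_pct 90000 200000   # 90k of a 200k context window
tokens_total 90000 4000
cost_usd 94000 3.00              # assumed price: 3.00 USD per million tokens
```

If input and output tokens are priced differently for a given model, cost_usd would need one term per direction; the schema as written uses a single model_price.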

2. Native HSEOS Metrics Collection

2.1 Workflow state file

Path: .hseos-output/<epic-id>/state.yaml

Extract delivery metrics:

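# Cycle time
# Assumes started_at/completed_at are ISO 8601 timestamps and GNU date is available;
# if the state file already stores epoch seconds, drop the date -d conversion.
STATE=".hseos-output/<epic-id>/state.yaml"
START=$(date -d "$(yq '.phases.preflight.started_at' "$STATE")" +%s)
END=$(date -d "$(yq '.phases.consolidation.completed_at' "$STATE")" +%s)
echo "Cycle time: $(( END - START )) seconds"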

# Gate failures across epic
grep -r "FAILURES" .logs/validation/ | grep -v "FAILURES : 0" | wc -l

2.2 Quality gate logs

Path: .logs/validation/gate-<timestamp>.log

# Aggregate gate results (the division is skipped when no results exist)
PASSES=$(grep "PASS" .logs/validation/*.log | wc -l)
FAILS=$(grep "FAIL\b" .logs/validation/*.log | wc -l)
TOTAL=$(( PASSES + FAILS ))
if [ "$TOTAL" -gt 0 ]; then
  echo "Gate failure rate: $(echo "scale=2; $FAILS * 100 / $TOTAL" | bc)%"
fi

2.3 Story throughput

# Completed tasks in epic
grep -c "^\- \[x\]" .specs/features/<feature>/tasks.md
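
The count above gives completed tasks only. To compare against a completion-rate target, it helps to divide by the total number of checklist items. A minimal sketch, assuming tasks.md uses Markdown checklist lines that start with "- [x]" or "- [ ]"; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: completion rate = checked items / all checklist items, as a percentage.
story_completion_rate() {
  local file="$1" done_count total
  done_count=$(grep -c '^- \[x\]' "$file")
  total=$(grep -c '^- \[[x ]\]' "$file")
  if [ "$total" -eq 0 ]; then
    echo "n/a"   # no checklist items found
    return 0
  fi
  awk -v d="$done_count" -v t="$total" 'BEGIN { printf "%.1f\n", d / t * 100 }'
}

# Demo with a throwaway file: 3 of 4 tasks checked
tmp=$(mktemp)
printf '%s\n' '- [x] T1' '- [x] T2' '- [ ] T3' '- [x] T4' > "$tmp"
story_completion_rate "$tmp"
rm -f "$tmp"
```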

3. mission-control Integration

3.1 What mission-control provides

Mission-control (/opt/references/mission-control/) is an AI agent orchestration dashboard with:

Agent registry and heartbeat tracking
Task assignment and status reporting
Skill registry
Real-time SSE/WebSocket events
REST API at http://localhost:3000 (default)

Auth: x-api-key: $MC_API_KEY

3.2 Installation

cd /opt/references/mission-control
npm install
cp .env.example .env   # configure MC_API_KEY, DB path
npm run dev            # or docker-compose up

3.3 SABLE integration — register + report

SABLE registers itself and reports metrics at the end of each workflow phase:

# Register SABLE at workflow start
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "register",
    "payload": {
      "agentId": "sable-01",
      "name": "SABLE",
      "metadata": { "epic": "<epic-id>", "capabilities": ["runtime-verify", "finops-audit"] }
    }
  }'

# Report task completion (per phase)
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "task_complete",
    "payload": {
      "agentId": "sable-01",
      "taskId": "phase-9-runtime-verify",
      "status": "success",
      "metrics": {
        "gate_failures": 0,
        "cycle_time_s": 1240,
        "tasks_executed": 3
      }
    }
  }'
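
When the report is sent from a script, the body can be assembled from variables instead of a hand-edited literal. The helper below is a hypothetical sketch that reproduces the task_complete body shown above; it is not part of mission-control, and it does no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the task_complete body from shell arguments.
# String values are inserted verbatim, so they must not contain quotes or backslashes.
mc_task_complete_payload() {
  local agent_id="$1" task_id="$2" status="$3"
  local gate_failures="$4" cycle_time_s="$5" tasks_executed="$6"
  printf '{"framework":"generic","action":"task_complete","payload":{"agentId":"%s","taskId":"%s","status":"%s","metrics":{"gate_failures":%d,"cycle_time_s":%d,"tasks_executed":%d}}}\n' \
    "$agent_id" "$task_id" "$status" "$gate_failures" "$cycle_time_s" "$tasks_executed"
}

mc_task_complete_payload sable-01 phase-9-runtime-verify success 0 1240 3

# To send it, pipe the body into curl (same endpoint and headers as above):
#   mc_task_complete_payload sable-01 phase-9-runtime-verify success 0 1240 3 |
#     curl -X POST http://localhost:3000/api/adapters \
#       -H "x-api-key: $MC_API_KEY" -H "Content-Type: application/json" -d @-
```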

3.4 ORBIT integration — workflow orchestration events

ORBIT emits phase transitions to mission-control:

# On each phase completion
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "framework": "generic", "action": "heartbeat",
        "payload": { "agentId": "orbit-01", "status": "online",
                     "metrics": { "current_phase": 8, "phases_completed": 7 } } }'

4. KPI Dashboard (AI-SDLC §10)

4.1 KPIs available today (no mission-control)

KPI	Target	How to measure
Gate failure rate	< 5% per epic	`.logs/validation/` aggregation
Delivery cycle time	Trending down	Workflow state timestamps
Story completion rate	> 95% per sprint	`tasks.md` `[x]` count

4.2 KPIs with mission-control

KPI	Target	How to measure
Cost per feature (USD)	Trending down	`tokens_total × model_price`
% stateless execution	> 90%	Sessions without history dependency
Context budget adherence	> 80% of sessions at or below 60% context usage	`context_usage_pct ≤ 60`
Rework rate	< 10%	Tasks re-executed after failure
Average tasks/session	Stable	`tasks_executed` per session
Error rate per agent	< 2%	Failed tasks per agent role
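
Two of the KPIs above (rework rate and error rate per agent) reduce to simple ratios over task records. The sketch below assumes a hypothetical export format with one task per line as "<agent> <status> <rework>"; mission-control's actual export shape is not specified here.

```shell
#!/usr/bin/env bash
# Sketch: KPI ratios over task records.
# Assumed input: whitespace-separated "<agent> <status> <rework>", e.g. "KUBE failure true".

# Rework rate: share of tasks with rework == true, as a percentage
rework_rate() {
  awk '$3 == "true" { r++ }
       END { if (NR) printf "%.1f\n", r / NR * 100; else print "n/a" }' "$1"
}

# Error rate for one agent: failed tasks / all tasks for that agent, as a percentage
error_rate_for_agent() {
  awk -v agent="$2" '
    $1 == agent { n++; if ($2 == "failure") f++ }
    END { if (n) printf "%.1f\n", f / n * 100; else print "n/a" }
  ' "$1"
}

# Demo with four illustrative records
records=$(mktemp)
printf '%s\n' 'GHOST success false' 'GHOST failure false' \
              'KUBE success true' 'ORBIT success false' > "$records"
rework_rate "$records"
error_rate_for_agent "$records" GHOST
rm -f "$records"
```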

4.3 Suggested dashboard layout

┌─ AI-SDLC Dashboard ──────────────────────────────────────────┐
│                                                               │
│  DELIVERY                     COST                           │
│  ┌─────────────────┐          ┌─────────────────┐           │
│  │ Cycle time      │          │ Cost/feature    │           │
│  │ Gate fail rate  │          │ Tokens/session  │           │
│  │ Story rate      │          │ Cost trend      │           │
│  └─────────────────┘          └─────────────────┘           │
│                                                               │
│  CONTEXT                      QUALITY                        │
│  ┌─────────────────┐          ┌─────────────────┐           │
│  │ Budget adherence│          │ % stateless     │           │
│  │ Avg context %   │          │ Rework rate     │           │
│  │ Sessions >60%   │          │ Error rate      │           │
│  └─────────────────┘          └─────────────────┘           │
└───────────────────────────────────────────────────────────────┘

5. Escalation

Escalate to SABLE (governance audit) when:

Gate failure rate > 10% in a sprint
Any session exceeds 60% context (without middleware enforcement)
Cost per feature increases > 20% week-over-week
Rework rate > 15% (indicates task contracts are too loose)
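
The four thresholds above can be checked mechanically once the rates are known. A minimal sketch with a hypothetical function name; it does not model the "without middleware enforcement" condition, which has to be decided by the caller.

```shell
#!/usr/bin/env bash
# Sketch of the escalation rules. Arguments are percentages:
#   $1 gate failure rate (sprint)   $2 highest session context usage
#   $3 week-over-week cost change   $4 rework rate
# Prints one line per breached rule; returns 1 if any rule is breached.
escalation_reasons() {
  awk -v gate="$1" -v ctx="$2" -v cost="$3" -v rework="$4" 'BEGIN {
    n = 0
    if (gate > 10)   { print "gate failure rate > 10%"; n++ }
    if (ctx > 60)    { print "session context > 60%"; n++ }
    if (cost > 20)   { print "cost per feature up > 20% week-over-week"; n++ }
    if (rework > 15) { print "rework rate > 15%"; n++ }
    exit (n > 0)
  }'
}

# Example: gate failures and rework are over threshold
escalation_reasons 12 45 5 16 || echo "escalate to SABLE"
```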

6. Telemetry Export Bridge (OTLP / Loki)

ADR-0014 (telemetry export bridge). Cross-reference: observability-compliance skill, ADR-0010 (shared collector).

SQLite (via state-emit-hook.sh) is the canonical observability sink. OTLP/Loki is an ADDITIONAL opt-in sink — a TEE that mirrors events to an external collector without affecting the primary state record.

6.1 Architecture

Tool event
    │
    ├─► state-emit-hook.sh  →  SQLite (CANONICAL — always active)
    │
    └─► telemetry-export-*.sh  →  OTLP / Loki (OPT-IN — env-gated)

The shared OTLP/Loki collector referenced by ADR-0010 (platform-shared-dev namespace in k3s, or shared-otel-collector / shared-loki containers locally) is the intended downstream receiver.

6.2 Opt-in environment variables

Variable	Purpose	Example
`OTEL_EXPORTER_OTLP_ENDPOINT`	Standard OTel env — enables OTLP export for both metrics and logs	`http://localhost:4318`
`HSEOS_LOKI_ENDPOINT`	Loki push API endpoint — used by `telemetry-export-session.sh` when OTLP is not set	`http://localhost:3100`
`HSEOS_OTEL_EXPORT`	HSEOS convenience alias — set to `1` to enable OTLP metrics even without the standard env var	`1`
`HSEOS_ENV`	Deployment environment label attached to every emitted metric/log record	`dev` (default)

All four variables are optional. When none are set, the telemetry handlers exit immediately with zero network calls.
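
The gating described in the table can be pictured as a small decision function. This is a sketch inferred from the table, not the handlers' actual code; the function name is hypothetical and the real scripts may apply the variables differently per handler.

```shell
#!/usr/bin/env bash
# Sketch: decide which sink an export handler would use.
# Prints "otlp", "loki", or "none".
telemetry_sink() {
  if [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ] || [ "${HSEOS_OTEL_EXPORT:-}" = "1" ]; then
    echo "otlp"
  elif [ -n "${HSEOS_LOKI_ENDPOINT:-}" ]; then
    echo "loki"   # Loki is used only when OTLP is not configured
  else
    echo "none"   # nothing set: exit before making any network call
  fi
}

telemetry_sink
```

A handler built this way would check the result first and return immediately on "none", which is what keeps the default path free of network calls.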

6.3 Handler map

Handler	Event	Emits	Endpoint
`.agents/hooks/handlers/telemetry-export-tool.sh`	PostToolUse	OTLP `resourceMetrics` (`claude_tool_use_total`, `claude_tool_duration_ms`)	`$OTEL_EXPORTER_OTLP_ENDPOINT/v1/metrics`
`.agents/hooks/handlers/telemetry-export-session.sh`	Stop	OTLP `resourceLogs` or Loki push JSON (`session_ended`)	`$OTEL_EXPORTER_OTLP_ENDPOINT/v1/logs` or `$HSEOS_LOKI_ENDPOINT/loki/api/v1/push`
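
For orientation, the Loki push endpoint accepts a JSON body of the form streams → stream (labels) + values (timestamp, line). The sketch below builds such a body for a session_ended event. Only the envelope comes from the Loki push API; the label names ("source", "env"), the log line text, and the function name are assumptions, and the real handler's payload may differ.

```shell
#!/usr/bin/env bash
# Sketch: body for POST $HSEOS_LOKI_ENDPOINT/loki/api/v1/push.
# Labels and log line are illustrative, not taken from telemetry-export-session.sh.
loki_session_ended_payload() {
  local ts_ns="$1" env_label="${2:-dev}"
  printf '{"streams":[{"stream":{"source":"hseos","env":"%s"},"values":[["%s","session_ended"]]}]}\n' \
    "$env_label" "$ts_ns"
}

# Loki expects the timestamp as a string of nanoseconds since the Unix epoch;
# here whole seconds are padded with nine zeros.
loki_session_ended_payload "$(date +%s)000000000" "${HSEOS_ENV:-dev}"
```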

6.4 Activation

Both telemetry handlers are status: active in .agents/hooks/registry.yaml and compiled into .claude/hooks.json. They are inert by default; setting any of the endpoint or export variables above activates the export path without requiring a recompile.

Quick Mode

For low-context activation, load .enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md or QUICK.md first. Load this full skill for deep analysis, violation fixing, or formal review gates.