ai-observability

Use when designing AI agent observability architecture, integrating mission-control dashboards, or defining AI-specific KPI schemas

By marciohideaki · Updated 6/3/2026

name: ai-observability
tier: full
version: "1.1"
description: "Use when designing AI agent observability architecture, integrating mission-control dashboards, or defining AI-specific KPI schemas"

AI Observability — Full Reference

Tier 2: complete observability model, metric schemas, mission-control integration, and KPI dashboard. Source: AI-SDLC v1.0 §9 (Observability), §10 (KPIs), mission-control API (see /opt/references/mission-control/).


1. Metric Schema (AI-SDLC §9)

Full metric set as defined by AI-SDLC:

metrics:
  session:
    tokens_input: integer        # tokens in context window at session start
    tokens_output: integer       # tokens generated by model in session
    context_usage_pct: float     # context_used / context_max × 100
    session_duration_s: integer  # wall clock seconds

  task:
    task_id: string
    agent: string                # GHOST, ORBIT, KUBE, etc.
    tasks_executed: integer
    execution_time_s: integer
    status: success | failure | partial
    rework: boolean              # true if task was re-executed after failure

  workflow:
    epic_id: string
    phases_completed: integer
    total_phases: integer
    gate_failures: integer
    gate_warnings: integer
    cycle_time_s: integer        # phase_0.start → phase_10.end

  cost:
    model: string                # claude-sonnet-4-6, etc.
    tokens_total: integer        # tokens_input + tokens_output
    cost_usd: float              # tokens_total × model_price
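
The derived fields in this schema (context_usage_pct, tokens_total, cost_usd) can be computed from raw counts. The following is a minimal sketch, assuming a single blended price per million tokens; the helper names and the price unit are hypothetical, and real pricing usually differs between input and output tokens.

```shell
# Sketch only: helper names and the per-million-token price unit are
# assumptions, not part of AI-SDLC or HSEOS.

# context_usage_pct <context_used> <context_max>  ->  percentage, 2 decimals
context_usage_pct() {
  awk -v used="$1" -v max="$2" 'BEGIN {
    if (max <= 0) { print "0.00"; exit }   # avoid division by zero
    printf "%.2f\n", used / max * 100
  }'
}

# tokens_total <tokens_input> <tokens_output>
tokens_total() {
  echo $(( $1 + $2 ))
}

# cost_usd <tokens_total> <price_usd_per_million_tokens>  ->  USD, 4 decimals
cost_usd() {
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.4f\n", t * p / 1000000 }'
}
```

For example, `cost_usd "$(tokens_total 900000 300000)" 3` prices 1.2 million tokens at a hypothetical 3 USD per million.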

2. Native HSEOS Metrics Collection

2.1 Workflow state file

Path: .hseos-output/<epic-id>/state.yaml

Extract delivery metrics:

# Cycle time (the arithmetic below assumes epoch-second timestamps)
STATE=.hseos-output/<epic-id>/state.yaml
START=$(yq '.phases.preflight.started_at' "$STATE")
END=$(yq '.phases.consolidation.completed_at' "$STATE")
echo "Cycle time: $(( END - START )) seconds"

# Gate failures across epic
grep -r "FAILURES" .logs/validation/ | grep -v "FAILURES : 0" | wc -l
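
The format of started_at and completed_at in state.yaml is not shown here. If they are stored as ISO 8601 strings rather than epoch seconds (an assumption), the cycle-time subtraction needs a conversion step first. A sketch using GNU date; cycle_time_s is a hypothetical helper name:

```shell
# Assumption: timestamps are ISO 8601 strings such as 2026-03-01T10:00:00Z.
# Requires GNU date (-d); BSD/macOS date uses different flags.

# cycle_time_s <start_iso> <end_iso>  ->  elapsed seconds
cycle_time_s() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( end - start ))
}

# Possible usage against the state file:
#   cycle_time_s "$(yq '.phases.preflight.started_at' state.yaml)" \
#                "$(yq '.phases.consolidation.completed_at' state.yaml)"
```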

2.2 Quality gate logs

Path: .logs/validation/gate-<timestamp>.log

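# Aggregate gate results across all gate logs
PASSES=$(grep "PASS" .logs/validation/*.log | wc -l)
FAILS=$(grep "FAIL\b" .logs/validation/*.log | wc -l)
TOTAL=$(( PASSES + FAILS ))
# Skip the division when no gate results exist yet (avoids divide-by-zero in bc)
if [ "$TOTAL" -gt 0 ]; then
  echo "Gate failure rate: $(echo "scale=2; $FAILS * 100 / $TOTAL" | bc)%"
fi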

2.3 Story throughput

# Completed tasks in epic
grep -c "^\- \[x\]" .specs/features/<feature>/tasks.md
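
The count above gives completed tasks only. To turn it into a completion rate, the open items need counting too. A minimal sketch, assuming tasks.md uses `- [x]` and `- [ ]` checklist lines at the start of a line; story_completion_pct is a hypothetical helper name:

```shell
# story_completion_pct <tasks.md>  ->  percent of "- [x]" items among all
# "- [ ]" / "- [x]" checklist lines, 2 decimals.
story_completion_pct() {
  awk '
    /^- \[x\]/   { checked++ }
    /^- \[[ x]\]/ { total++ }
    END {
      if (total == 0) { print "0.00"; exit }   # no checklist items found
      printf "%.2f\n", checked / total * 100
    }
  ' "$1"
}
```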

3. mission-control Integration

3.1 What mission-control provides

Mission-control (/opt/references/mission-control/) is an AI agent orchestration dashboard with:

  • Agent registry and heartbeat tracking
  • Task assignment and status reporting
  • Skill registry
  • Real-time SSE/WebSocket events
  • REST API at http://localhost:3000 (default)

Auth: x-api-key: $MC_API_KEY

3.2 Installation

cd /opt/references/mission-control
npm install
cp .env.example .env   # configure MC_API_KEY, DB path
npm run dev            # or docker-compose up

3.3 SABLE integration — register + report

SABLE registers itself and reports metrics at the end of each workflow phase:

# Register SABLE at workflow start
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "register",
    "payload": {
      "agentId": "sable-01",
      "name": "SABLE",
      "metadata": { "epic": "<epic-id>", "capabilities": ["runtime-verify", "finops-audit"] }
    }
  }'

# Report task completion (per phase)
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "framework": "generic",
    "action": "task_complete",
    "payload": {
      "agentId": "sable-01",
      "taskId": "phase-9-runtime-verify",
      "status": "success",
      "metrics": {
        "gate_failures": 0,
        "cycle_time_s": 1240,
        "tasks_executed": 3
      }
    }
  }'
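
The register and task_complete calls above share the same envelope (framework, action, payload), so they can be wrapped in a small helper. This is a sketch, not part of mission-control: mc_body, mc_post, and the MC_URL variable are hypothetical names, and the payload is passed through as-is without JSON validation.

```shell
# mc_body <action> <payload_json>  ->  request body for POST /api/adapters
mc_body() {
  printf '{"framework":"generic","action":"%s","payload":%s}\n' "$1" "$2"
}

# mc_post <action> <payload_json>
# Sends the body with the same headers as the curl calls above.
# MC_URL is an assumed override; it defaults to the documented local address.
mc_post() {
  curl -sS -X POST "${MC_URL:-http://localhost:3000}/api/adapters" \
    -H "x-api-key: ${MC_API_KEY:?MC_API_KEY is not set}" \
    -H "Content-Type: application/json" \
    -d "$(mc_body "$1" "$2")"
}

# Possible usage:
#   mc_post task_complete '{"agentId":"sable-01","taskId":"phase-9-runtime-verify","status":"success"}'
```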

3.4 ORBIT integration — workflow orchestration events

ORBIT emits phase transitions to mission-control:

# On each phase completion
curl -X POST http://localhost:3000/api/adapters \
  -H "x-api-key: $MC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "framework": "generic", "action": "heartbeat",
        "payload": { "agentId": "orbit-01", "status": "online",
                     "metrics": { "current_phase": 8, "phases_completed": 7 } } }'

4. KPI Dashboard (AI-SDLC §10)

4.1 KPIs available today (no mission-control)

| KPI | Target | How to measure |
|---|---|---|
| Gate failure rate | < 5% per epic | .logs/validation/ aggregation |
| Delivery cycle time | Trending down | Workflow state timestamps |
| Story completion rate | > 95% per sprint | tasks.md [x] count |

4.2 KPIs with mission-control

| KPI | Target | How to measure |
|---|---|---|
| Cost per feature (USD) | Trending down | tokens_total × model_price |
| % stateless execution | > 90% | Sessions without history dependency |
| Context budget adherence | > 80% sessions within 60% | context_usage_pct ≤ 60 |
| Rework rate | < 10% | Tasks re-executed after failure |
| Average tasks/session | Stable | tasks_executed per session |
| Error rate per agent | < 2% | Failed tasks per agent role |
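
As an illustration of one of these KPIs, context budget adherence is the share of sessions whose context_usage_pct stays at or below 60. A minimal sketch, assuming one context_usage_pct value per line on stdin; how those values are exported from mission-control is not covered here, and budget_adherence_pct is a hypothetical helper name:

```shell
# budget_adherence_pct [limit]
# Reads one context_usage_pct value per line on stdin and prints the
# percentage of sessions at or below the limit (default 60), 2 decimals.
budget_adherence_pct() {
  awk -v limit="${1:-60}" '
    NF { total++; if ($1 + 0 <= limit + 0) ok++ }   # skip blank lines
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", ok / total * 100
    }
  '
}
```

A result above 80 would meet the adherence target in the table.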

4.3 Suggested dashboard layout

┌─ AI-SDLC Dashboard ──────────────────────────────────────────┐
│                                                               │
│  DELIVERY                     COST                           │
│  ┌─────────────────┐          ┌─────────────────┐           │
│  │ Cycle time      │          │ Cost/feature    │           │
│  │ Gate fail rate  │          │ Tokens/session  │           │
│  │ Story rate      │          │ Cost trend      │           │
│  └─────────────────┘          └─────────────────┘           │
│                                                               │
│  CONTEXT                      QUALITY                        │
│  ┌─────────────────┐          ┌─────────────────┐           │
│  │ Budget adherence│          │ % stateless     │           │
│  │ Avg context %   │          │ Rework rate     │           │
│  │ Sessions >60%   │          │ Error rate      │           │
│  └─────────────────┘          └─────────────────┘           │
└───────────────────────────────────────────────────────────────┘

5. Escalation

Escalate to SABLE (governance audit) when:

  • Gate failure rate > 10% in a sprint
  • Any session exceeds 60% context (without middleware enforcement)
  • Cost per feature increases > 20% week-over-week
  • Rework rate > 15% (indicates task contracts are too loose)
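
The four thresholds above can be checked mechanically once the KPI values are available. This is a sketch under stated assumptions: should_escalate is a hypothetical helper, inputs are plain percentages, and the "without middleware enforcement" condition is left to the caller.

```shell
# should_escalate <gate_fail_pct> <max_context_pct> <cost_change_pct> <rework_pct>
# Prints one line per breached threshold.
# Returns 0 if at least one threshold is breached, 1 otherwise.
should_escalate() {
  awk -v gate="$1" -v ctx="$2" -v cost="$3" -v rework="$4" 'BEGIN {
    n = 0
    if (gate > 10)   { print "gate failure rate > 10%"; n++ }
    if (ctx > 60)    { print "session exceeded 60% context"; n++ }
    if (cost > 20)   { print "cost per feature up > 20% week-over-week"; n++ }
    if (rework > 15) { print "rework rate > 15%"; n++ }
    exit (n > 0 ? 0 : 1)
  }'
}

# Possible usage:
#   if reasons=$(should_escalate 12 55 5 16); then echo "Escalate to SABLE: $reasons"; fi
```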

6. Telemetry Export Bridge (OTLP / Loki)

ADR-0014 (telemetry export bridge). Cross-reference: observability-compliance skill, ADR-0010 (shared collector).

SQLite (via state-emit-hook.sh) is the canonical observability sink. OTLP/Loki is an ADDITIONAL opt-in sink — a TEE that mirrors events to an external collector without affecting the primary state record.

6.1 Architecture

Tool event
    │
    ├─► state-emit-hook.sh  →  SQLite (CANONICAL — always active)
    │
    └─► telemetry-export-*.sh  →  OTLP / Loki (OPT-IN — env-gated)

The shared OTLP/Loki collector referenced by ADR-0010 (platform-shared-dev namespace in k3s, or shared-otel-collector / shared-loki containers locally) is the intended downstream receiver.

6.2 Opt-in environment variables

| Variable | Purpose | Example |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Standard OTel env; enables OTLP export for both metrics and logs | http://localhost:4318 |
| HSEOS_LOKI_ENDPOINT | Loki push API endpoint; used by telemetry-export-session.sh when OTLP is not set | http://localhost:3100 |
| HSEOS_OTEL_EXPORT | HSEOS convenience alias; set to 1 to enable OTLP metrics even without the standard env var | 1 |
| HSEOS_ENV | Deployment environment label attached to every emitted metric/log record | dev (default) |

All four variables are optional. When none are set, the telemetry handlers exit immediately with zero network calls.
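
The precedence described for the session handler (OTLP first, Loki as fallback, otherwise no export) can be sketched as follows. This is not the actual handler code: select_session_sink is a hypothetical name, and HSEOS_OTEL_EXPORT is left out because its target endpoint is not specified here.

```shell
# select_session_sink  ->  prints "otlp", "loki", or "none"
# Mirrors the documented order: OTLP when the standard env var is set,
# Loki when only HSEOS_LOKI_ENDPOINT is set, otherwise no network calls.
select_session_sink() {
  if [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ]; then
    echo "otlp"
  elif [ -n "${HSEOS_LOKI_ENDPOINT:-}" ]; then
    echo "loki"
  else
    echo "none"
  fi
}
```

A handler built this way would exit immediately on "none", which keeps the SQLite record as the only sink.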

6.3 Handler map

| Handler | Event | Emits | Endpoint |
|---|---|---|---|
| .agents/hooks/handlers/telemetry-export-tool.sh | PostToolUse | OTLP resourceMetrics (claude_tool_use_total, claude_tool_duration_ms) | $OTEL_EXPORTER_OTLP_ENDPOINT/v1/metrics |
| .agents/hooks/handlers/telemetry-export-session.sh | Stop | OTLP resourceLogs or Loki push JSON (session_ended) | $OTEL_EXPORTER_OTLP_ENDPOINT/v1/logs or $HSEOS_LOKI_ENDPOINT/loki/api/v1/push |
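
For orientation, a Loki push request carries a list of streams, each with a label set and timestamped log lines (timestamps are Unix epoch nanoseconds as strings). The sketch below builds such a body for a session_ended event. The label names and the helper loki_push_body are assumptions, not the output of telemetry-export-session.sh, and the log line is not JSON-escaped.

```shell
# loki_push_body <env> <timestamp_ns> <log_line>
# Prints a Loki push API body with one stream and one log entry.
# Label names ("event", "env") are illustrative only.
loki_push_body() {
  printf '{"streams":[{"stream":{"event":"session_ended","env":"%s"},"values":[["%s","%s"]]}]}\n' \
    "$1" "$2" "$3"
}

# Possible usage (date +%s%N requires GNU date):
#   loki_push_body "${HSEOS_ENV:-dev}" "$(date +%s%N)" "session ended" |
#     curl -sS -X POST "$HSEOS_LOKI_ENDPOINT/loki/api/v1/push" \
#       -H "Content-Type: application/json" -d @-
```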

6.4 Activation

Both telemetry handlers are status: active in .agents/hooks/registry.yaml and compiled into .claude/hooks.json. They are inert by default: setting one of the export variables above (OTEL_EXPORTER_OTLP_ENDPOINT, HSEOS_LOKI_ENDPOINT, or HSEOS_OTEL_EXPORT) activates the export path without requiring a recompile. HSEOS_ENV only labels the emitted records.

Install via CLI
npx skills add https://github.com/marciohideaki/enterprise-hseos --skill ai-observability