name: ai-observability
description: Use when designing AI agent observability architecture, integrating mission-control dashboards, or defining AI-specific KPI schemas
version: "1.1"
owner: platform-governance
tier: full
source: .enterprise/governance/agent-skills/ai-observability/SKILL.md
quick: .enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md
portable: true
AI Observability — Full Reference
Tier 2: complete observability model, metric schemas, mission-control integration, and KPI dashboard. Source: AI-SDLC v1.0 §9 (Observability), §10 (KPIs), and the mission-control API (see /opt/references/mission-control/).
1. Metric Schema (AI-SDLC §9)
Full metric set as defined by AI-SDLC:
metrics:
  session:
    tokens_input: integer        # tokens in context window at session start
    tokens_output: integer       # tokens generated by model in session
    context_usage_pct: float     # context_used / context_max × 100
    session_duration_s: integer  # wall clock seconds
  task:
    task_id: string
    agent: string                # GHOST, ORBIT, KUBE, etc.
    tasks_executed: integer
    execution_time_s: integer
    status: success | failure | partial
    rework: boolean              # true if task was re-executed after failure
  workflow:
    epic_id: string
    phases_completed: integer
    total_phases: integer
    gate_failures: integer
    gate_warnings: integer
    cycle_time_s: integer        # phase_0.start → phase_10.end
  cost:
    model: string                # claude-sonnet-4-6, etc.
    tokens_total: integer        # tokens_input + tokens_output
    cost_usd: float              # tokens_total × model_price
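The derived fields in this schema (context_usage_pct, tokens_total, cost_usd) follow directly from the raw counters. The sketch below shows the arithmetic in Python. The class name, the context_used/context_max inputs, and the flat per-token price are assumptions for illustration; real model pricing usually differs between input and output tokens.

```python
from dataclasses import dataclass


@dataclass
class SessionMetrics:
    """Hypothetical container mirroring the session and cost fields above."""
    tokens_input: int
    tokens_output: int
    context_used: int
    context_max: int

    @property
    def context_usage_pct(self) -> float:
        # context_used / context_max × 100, as defined in the schema
        return 100 * self.context_used / self.context_max

    @property
    def tokens_total(self) -> int:
        # tokens_input + tokens_output
        return self.tokens_input + self.tokens_output

    def cost_usd(self, price_per_token_usd: float) -> float:
        # tokens_total × model_price (a flat per-token price is an assumption)
        return self.tokens_total * price_per_token_usd


session = SessionMetrics(tokens_input=12_000, tokens_output=3_000,
                         context_used=50_000, context_max=200_000)
print(session.context_usage_pct)  # 25.0
print(session.tokens_total)       # 15000
```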
2. Native HSEOS Metrics Collection
2.1 Workflow state file
Path: .hseos-output/<epic-id>/state.yaml
Extract delivery metrics:
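# Cycle time (assumes started_at and completed_at are epoch seconds;
# convert ISO 8601 values first, e.g. with GNU date -d "$START" +%s)
START=$(yq '.phases.preflight.started_at' state.yaml)
END=$(yq '.phases.consolidation.completed_at' state.yaml)
echo "Cycle time: $(( END - START )) seconds"
# Count gate log lines that report a non-zero FAILURES total across the epic
grep -r "FAILURES" .logs/validation/ | grep -v "FAILURES : 0" | wc -l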
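If the timestamps in state.yaml are stored as ISO 8601 strings rather than epoch seconds (the source does not specify the format), the cycle-time subtraction needs a parsing step first. A minimal Python sketch under that assumption, with a hypothetical function name:

```python
from datetime import datetime


def cycle_time_seconds(started_at: str, completed_at: str) -> int:
    """Return whole seconds between two ISO 8601 timestamps."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(completed_at)
    return int((end - start).total_seconds())


# Example values are made up; real ones would come from
# .phases.preflight.started_at and .phases.consolidation.completed_at
print(cycle_time_seconds("2025-01-01T10:00:00+00:00",
                         "2025-01-01T10:20:40+00:00"))  # 1240
```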
2.2 Quality gate logs
Path: .logs/validation/gate-<timestamp>.log
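# Aggregate gate results across all gate logs
PASSES=$(grep -h "PASS" .logs/validation/*.log | wc -l)
FAILS=$(grep -h "FAIL\b" .logs/validation/*.log | wc -l)
TOTAL=$(( PASSES + FAILS ))
# Prints a fraction (0.05 = 5%); skipped when no results exist to avoid dividing by zero
[ "$TOTAL" -gt 0 ] && echo "Gate failure rate: $(echo "scale=2; $FAILS / $TOTAL" | bc)"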
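The same aggregation can be expressed in Python, which makes the empty-log case explicit. This is a sketch only: the sample log lines are invented, since the actual gate log format is not shown here, and the function name is hypothetical.

```python
import re


def gate_failure_rate(log_lines):
    """Return FAIL / (FAIL + PASS) over the given lines, or None if no results."""
    passes = sum(1 for line in log_lines if "PASS" in line)
    # FAIL\b matches the standalone result FAIL but not the FAILURES summary line
    fails = sum(1 for line in log_lines if re.search(r"FAIL\b", line))
    total = passes + fails
    return fails / total if total else None


# Hypothetical log content for illustration
sample = ["lint: PASS", "types: PASS", "build: PASS", "tests: FAIL", "FAILURES : 1"]
print(gate_failure_rate(sample))  # 0.25
```

Returning None for an empty log set leaves it to the caller to decide whether "no data" counts as a pass or as a missing measurement.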
2.3 Story throughput
# Completed tasks in epic
grep -c "^\- \[x\]" .specs/features/<feature>/tasks.md
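The count above gives completed tasks only; the story completion rate also needs the number of open tasks. A small Python sketch, assuming open tasks are written as "- [ ]" in tasks.md (a common checklist convention, not confirmed by the source):

```python
import re


def completion_rate_pct(tasks_md: str) -> float:
    """Percentage of checklist items marked [x] among all checklist items."""
    done = len(re.findall(r"^- \[x\]", tasks_md, flags=re.MULTILINE))
    pending = len(re.findall(r"^- \[ \]", tasks_md, flags=re.MULTILINE))
    total = done + pending
    return 100 * done / total if total else 0.0


# Hypothetical tasks.md content
sample_tasks = "- [x] T1\n- [x] T2\n- [ ] T3\n- [x] T4\n"
print(completion_rate_pct(sample_tasks))  # 75.0
```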
3. mission-control Integration
3.1 What mission-control provides
Mission-control (/opt/references/mission-control/) is an AI agent orchestration dashboard with:
- Agent registry and heartbeat tracking
- Task assignment and status reporting
- Skill registry
- Real-time SSE/WebSocket events
- REST API at http://localhost:3000 (default)
Auth: x-api-key: $MC_API_KEY
3.2 Installation
cd /opt/references/mission-control
npm install
cp .env.example .env # configure MC_API_KEY, DB path
npm run dev # or docker-compose up
3.3 SABLE integration — register + report
SABLE registers itself and reports metrics at the end of each workflow phase:
# Register SABLE at workflow start
curl -X POST http://localhost:3000/api/adapters \
-H "x-api-key: $MC_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"framework": "generic",
"action": "register",
"payload": {
"agentId": "sable-01",
"name": "SABLE",
"metadata": { "epic": "<epic-id>", "capabilities": ["runtime-verify", "finops-audit"] }
}
}'
# Report task completion (per phase)
curl -X POST http://localhost:3000/api/adapters \
-H "x-api-key: $MC_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"framework": "generic",
"action": "task_complete",
"payload": {
"agentId": "sable-01",
"taskId": "phase-9-runtime-verify",
"status": "success",
"metrics": {
"gate_failures": 0,
"cycle_time_s": 1240,
"tasks_executed": 3
}
}
}'
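For agents that report from Python rather than shell, the same requests can be built with the standard library. The sketch below only constructs the request object so it can be inspected without a running server. The helper name is hypothetical, and the endpoint and payload shape are copied from the curl calls above rather than from verified mission-control API documentation.

```python
import json
import os
import urllib.request


def build_adapter_request(action, payload, base_url="http://localhost:3000"):
    """Build (but do not send) a POST to /api/adapters for the generic framework."""
    body = json.dumps({"framework": "generic", "action": action, "payload": payload})
    return urllib.request.Request(
        url=f"{base_url}/api/adapters",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "x-api-key": os.environ.get("MC_API_KEY", ""),
            "Content-Type": "application/json",
        },
    )


req = build_adapter_request("task_complete", {
    "agentId": "sable-01",
    "taskId": "phase-9-runtime-verify",
    "status": "success",
    "metrics": {"gate_failures": 0, "cycle_time_s": 1240, "tasks_executed": 3},
})
# Send with urllib.request.urlopen(req) once mission-control is running.
```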
3.4 ORBIT integration — workflow orchestration events
ORBIT emits phase transitions to mission-control:
# On each phase completion
curl -X POST http://localhost:3000/api/adapters \
-H "x-api-key: $MC_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "framework": "generic", "action": "heartbeat",
"payload": { "agentId": "orbit-01", "status": "online",
"metrics": { "current_phase": 8, "phases_completed": 7 } } }'
4. KPI Dashboard (AI-SDLC §10)
4.1 KPIs available today (no mission-control)
| KPI | Target | How to measure |
|---|---|---|
| Gate failure rate | < 5% per epic | .logs/validation/ aggregation |
| Delivery cycle time | Trending down | Workflow state timestamps |
| Story completion rate | > 95% per sprint | tasks.md [x] count |
4.2 KPIs with mission-control
| KPI | Target | How to measure |
|---|---|---|
| Cost per feature (USD) | Trending down | tokens_total × model_price |
| % stateless execution | > 90% | Sessions without history dependency |
| Context budget adherence | > 80% of sessions at or below 60% context usage | context_usage_pct ≤ 60 |
| Rework rate | < 10% | Tasks re-executed after failure |
| Average tasks/session | Stable | tasks_executed per session |
| Error rate per agent | < 2% | Failed tasks per agent role |
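Several of these KPIs are simple ratios over the task and session records defined in the metric schema. The following Python sketch computes three of them. The record shapes (dicts with agent, status, and rework keys) mirror the schema fields, but the function names and sample data are illustrative assumptions.

```python
from collections import defaultdict


def context_budget_adherence_pct(context_usage_pcts, budget_pct=60.0):
    """Percentage of sessions whose context usage is at or below the budget."""
    within = sum(1 for pct in context_usage_pcts if pct <= budget_pct)
    return 100 * within / len(context_usage_pcts)


def rework_rate_pct(tasks):
    """Percentage of tasks flagged rework=True."""
    return 100 * sum(1 for t in tasks if t["rework"]) / len(tasks)


def error_rate_per_agent_pct(tasks):
    """Percentage of failed tasks for each agent."""
    totals, failures = defaultdict(int), defaultdict(int)
    for t in tasks:
        totals[t["agent"]] += 1
        if t["status"] == "failure":
            failures[t["agent"]] += 1
    return {agent: 100 * failures[agent] / totals[agent] for agent in totals}


# Made-up sample records
tasks = [
    {"agent": "GHOST", "status": "success", "rework": False},
    {"agent": "GHOST", "status": "failure", "rework": True},
    {"agent": "ORBIT", "status": "success", "rework": False},
    {"agent": "ORBIT", "status": "success", "rework": False},
]
print(context_budget_adherence_pct([35, 50, 58, 72, 40]))  # 80.0
print(rework_rate_pct(tasks))                              # 25.0
print(error_rate_per_agent_pct(tasks))  # {'GHOST': 50.0, 'ORBIT': 0.0}
```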
4.3 Suggested dashboard layout
┌─ AI-SDLC Dashboard ──────────────────────────────────────────┐
│                                                              │
│  DELIVERY                      COST                          │
│  ┌─────────────────┐           ┌─────────────────┐           │
│  │ Cycle time      │           │ Cost/feature    │           │
│  │ Gate fail rate  │           │ Tokens/session  │           │
│  │ Story rate      │           │ Cost trend      │           │
│  └─────────────────┘           └─────────────────┘           │
│                                                              │
│  CONTEXT                       QUALITY                       │
│  ┌─────────────────┐           ┌─────────────────┐           │
│  │ Budget adherence│           │ % stateless     │           │
│  │ Avg context %   │           │ Rework rate     │           │
│  │ Sessions >60%   │           │ Error rate      │           │
│  └─────────────────┘           └─────────────────┘           │
└──────────────────────────────────────────────────────────────┘
5. Escalation
Escalate to SABLE (governance audit) when:
- Gate failure rate > 10% in a sprint
- Any session exceeds 60% context (without middleware enforcement)
- Cost per feature increases > 20% week-over-week
- Rework rate > 15% (indicates task contracts are too loose)
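The four escalation triggers above can be checked mechanically once the KPIs are computed. A minimal Python sketch follows. The function name, parameter names, and reason labels are hypothetical, and the thresholds are taken directly from the list.

```python
def escalation_reasons(gate_failure_rate_pct, max_context_usage_pct,
                       cost_change_pct_wow, rework_rate_pct,
                       middleware_enforced=False):
    """Return the list of escalation triggers that fired (empty if none)."""
    reasons = []
    if gate_failure_rate_pct > 10:
        reasons.append("gate_failure_rate")
    # The context trigger applies only when no middleware enforces the budget
    if max_context_usage_pct > 60 and not middleware_enforced:
        reasons.append("context_budget")
    if cost_change_pct_wow > 20:
        reasons.append("cost_per_feature")
    if rework_rate_pct > 15:
        reasons.append("rework_rate")
    return reasons


print(escalation_reasons(12, 55, 5, 16))  # ['gate_failure_rate', 'rework_rate']
```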
6. Telemetry Export Bridge (OTLP / Loki)
ADR-0014 (telemetry export bridge). Cross-reference: observability-compliance skill, ADR-0010 (shared collector).
SQLite (via state-emit-hook.sh) is the canonical observability sink. OTLP/Loki is an additional, opt-in sink: a tee that mirrors events to an external collector without affecting the primary state record.
6.1 Architecture
Tool event
│
├─► state-emit-hook.sh → SQLite (CANONICAL — always active)
│
└─► telemetry-export-*.sh → OTLP / Loki (OPT-IN — env-gated)
The shared OTLP/Loki collector referenced by ADR-0010 (platform-shared-dev namespace in k3s, or shared-otel-collector / shared-loki containers locally) is the intended downstream receiver.
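The tee arrangement described in this section can be illustrated in a few lines of Python: the canonical write always happens, and mirror failures are swallowed so they cannot disturb it. This is a conceptual sketch with hypothetical names, not the logic of the actual shell handlers.

```python
def emit_event(event, canonical_sink, optional_sinks=()):
    """Write to the canonical sink first, then mirror to any optional sinks."""
    canonical_sink(event)  # primary record; errors here propagate to the caller
    for sink in optional_sinks:
        try:
            sink(event)
        except Exception:
            # A failing export must not affect the primary state record
            pass


def unreachable_collector(event):
    """Stand-in for an external collector that cannot be reached."""
    raise ConnectionError("collector unavailable")


canonical_store = []  # stand-in for the SQLite sink
emit_event({"event": "tool_use"}, canonical_store.append, [unreachable_collector])
print(canonical_store)  # [{'event': 'tool_use'}]
```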
6.2 Opt-in environment variables
| Variable | Purpose | Example |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Standard OTel env; enables OTLP export for both metrics and logs | http://localhost:4318 |
| HSEOS_LOKI_ENDPOINT | Loki push API endpoint; used by telemetry-export-session.sh when OTLP is not set | http://localhost:3100 |
| HSEOS_OTEL_EXPORT | HSEOS convenience alias; set to 1 to enable OTLP metrics even without the standard env var | 1 |
| HSEOS_ENV | Deployment environment label attached to every emitted metric/log record | dev (default) |
All four variables are optional. When none are set, the telemetry handlers exit immediately with zero network calls.
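One way to read the gating rules in the table is as a small resolution function over the environment. The Python sketch below encodes that reading. The precedence (OTLP before Loki for session logs) follows the table's "when OTLP is not set" wording, but the function name and return shape are assumptions, and the real handlers may differ in detail.

```python
def resolve_export_config(env):
    """Derive which export paths are active from a mapping of env variables."""
    otlp_endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    loki_endpoint = env.get("HSEOS_LOKI_ENDPOINT")
    force_otlp_metrics = env.get("HSEOS_OTEL_EXPORT") == "1"

    if otlp_endpoint:
        session_log_sink = "otlp"
    elif loki_endpoint:
        session_log_sink = "loki"  # Loki is used only when OTLP is not set
    else:
        session_log_sink = None

    return {
        "environment": env.get("HSEOS_ENV", "dev"),
        "metrics_via_otlp": bool(otlp_endpoint) or force_otlp_metrics,
        "session_log_sink": session_log_sink,
    }


print(resolve_export_config({}))
# {'environment': 'dev', 'metrics_via_otlp': False, 'session_log_sink': None}
```

In practice the mapping passed in would be os.environ; taking it as a parameter keeps the function easy to test.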
6.3 Handler map
| Handler | Event | Emits | Endpoint |
|---|---|---|---|
| .agents/hooks/handlers/telemetry-export-tool.sh | PostToolUse | OTLP resourceMetrics (claude_tool_use_total, claude_tool_duration_ms) | $OTEL_EXPORTER_OTLP_ENDPOINT/v1/metrics |
| .agents/hooks/handlers/telemetry-export-session.sh | Stop | OTLP resourceLogs or Loki push JSON (session_ended) | $OTEL_EXPORTER_OTLP_ENDPOINT/v1/logs or $HSEOS_LOKI_ENDPOINT/loki/api/v1/push |
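For reference, a Loki push body consists of one or more streams, each with a label set and a list of [nanosecond timestamp, log line] pairs. The Python sketch below builds such a body for a session_ended event. The label names and event fields are illustrative assumptions; the source does not show what the session handler actually emits.

```python
import json
import time


def loki_push_body(event, labels, timestamp_ns=None):
    """Build a Loki push API body containing one stream with one log line."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": labels,
                # Loki expects the timestamp as a string of Unix nanoseconds
                "values": [[str(timestamp_ns), json.dumps(event)]],
            }
        ]
    }


body = loki_push_body(
    {"event": "session_ended"},
    labels={"job": "hseos", "env": "dev"},  # hypothetical label set
    timestamp_ns=1_700_000_000_000_000_000,
)
print(json.dumps(body))
```

The resulting JSON would be POSTed to $HSEOS_LOKI_ENDPOINT/loki/api/v1/push with a Content-Type of application/json.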
6.4 Activation
Both telemetry handlers are status: active in .agents/hooks/registry.yaml and compiled into .claude/hooks.json. They are inert by default — setting any of the opt-in env vars above activates the export path without requiring a recompile.
Quick Mode
For low-context activation, load .enterprise/governance/agent-skills/ai-observability/SKILL-QUICK.md or QUICK.md first. Load this full skill for deep analysis, violation fixing, or formal review gates.