name: ring:creating-grafana-dashboards description: "Authoring Grafana dashboards for Lerian Go services from lib-observability telemetry (tracing, metrics, log), plus a reference mode for RED/USE panel patterns and Grafonnet templates. Sweep mode inventories telemetry, runs PM deliberation on themes/SLIs/alerts, authors Grafonnet libsonnet compiled to JSON, and installs a CI drift gate. Use when scaffolding dashboards. Skip when the service is non-Go or emits no telemetry."
Creating Grafana Dashboards (lib-observability, PM-team)
When to use
Sweep mode:
- "Create / scaffold Grafana dashboards for this service"
- "Inventory telemetry / build telemetry dictionary"
- "Audit observability before designing dashboards"
- "Produce dashboards as code for {service}"
- "PM wants visibility into {domain} — what dashboards do we need?"
Reference mode:
- "What's the right panel for HTTP request latency?"
- "RED vs USE methodology for this metric type?"
- "How do I compose Grafonnet panels?"
- "Which Grafonnet template fits a counter / histogram / gauge?"
Skip when
- Service is not a Go project (lib-observability is Go-only at this skill's scope)
- Service emits no telemetry (pre-instrumentation; instrument the service before dashboard authoring, then use ring:implementing-tasks to verify observability checks pass)
- Task is purely Grafana folder organization or dashboard import (no authoring)
- Service is consumer-only sidecar with no metrics surface
Sequence
Runs before: ring:running-dev-cycle, ring:running-dev-cycle-frontend
Related
Complementary: ring:implementing-tasks, ring:codebase-explorer, ring:mapping-streaming-events, ring:using-lib-observability, ring:using-tracing Similar: ring:using-runtime, ring:using-assert
Prerequisites
- Go service with lib-observability initialized in bootstrap (
tracing.NewTelemetry,metrics.NewFactory,zap.NewLogger) - At least one metric, span, or structured log emission point present
- docs/ directory writable
- Grafonnet toolchain available in CI (jsonnet + grafonnet-lib) — installer instructions in ci-drift-check.md
Orchestrates a 3-phase, 8-gate workflow to produce Grafana dashboards grounded in real telemetry. You orchestrate. Agents explore. PM iterates. You NEVER read, write, or edit source code directly during the sweep.
Announce at start: "Using ring:creating-grafana-dashboards through 8 gates (0–7)."
Mode Selection
| Request Shape | Mode |
|---|---|
| "Create / scaffold dashboards" / "build telemetry dictionary" | Sweep (run gates 0–7) |
| "Which panel for X?" / "RED vs USE?" / "Grafonnet template for Y?" | Reference (load sub-files/reference.md) |
SWEEP MODE
Telemetry Architecture (lib-observability)
Lerian Go services emit telemetry through github.com/LerianStudio/lib-observability:
- Tracing via
lib-observability/tracing—tracer.Start(ctx, name, opts...)returningcontext.Context, trace.Span - Metrics via
lib-observability/metrics— fluent factory producingmeter.Int64Counter,meter.Float64Histogram,meter.Int64UpDownCounter,meter.Int64ObservableGauge - Logs via
lib-observability/log(interface) andlib-observability/zap(implementation) — structured fields, automatically correlated with active span viatrace_id/span_id - OTel attribute / metric / event names via
lib-observability/constants— canonical string constants; dashboards reference these for label and metric names - Cross-cutting —
tenant_idpropagation through context, error attribution viaspan.RecordError+span.SetStatus
Deprecated shims:
lib-commons/v5/commons/{opentelemetry,zap,log,metrics}still compile but route through lib-observability. New emission sites MUST import lib-observability directly. The sweep detects both canonical and shim imports.
WebFetch canonical docs (lib-observability — develop branch; main has only LICENSE + README):
- Tracing:
https://raw.githubusercontent.com/LerianStudio/lib-observability/develop/tracing/doc.go - Metrics:
https://raw.githubusercontent.com/LerianStudio/lib-observability/develop/metrics/doc.go - Log:
https://raw.githubusercontent.com/LerianStudio/lib-observability/develop/log/doc.go - Constants:
https://raw.githubusercontent.com/LerianStudio/lib-observability/develop/constants/doc.go
WebFetch changelog: https://raw.githubusercontent.com/LerianStudio/lib-observability/develop/CHANGELOG.md
Authoring Format: Grafonnet (Mandatory)
Dashboards are authored as Grafonnet (Jsonnet templating language) — compiled to JSON in CI. Raw JSON dashboards are FORBIDDEN.
Reasons:
- Diffable in PR review (libsonnet is code-shaped, JSON is not)
- Composable via
importand inheritance - Templated panel patterns reusable across themes
- Single source of truth — JSON is a build artifact, not a checked-in source
Toolchain setup: sub-files/ci-drift-check.md. Panel templates: sub-files/grafonnet-templates/.
Theme Taxonomy
Free-form per service. PM defines the theme directories under docs/dashboards/{theme}/ during Gate 5. No enforced taxonomy — Lerian services are observability islands and theme naming reflects each service's domain.
Common-but-not-mandatory examples: transactions/, auth/, ledger/, infrastructure/, business-kpis/, sla/. The skill SUGGESTS themes from dictionary contents in Gate 4; PM ACCEPTS, RENAMES, MERGES, or SPLITS in Gate 5.
Drift Gate Posture
CI drift detection is BLOCKING from day 1. Any divergence between regenerated dictionary and committed telemetry-dictionary.md fails the PR. This is a deliberate cold-start choice — the skill is greenfield, no installed base to retrofit, every new metric emits under the strict regime.
Drift gate spec: sub-files/ci-drift-check.md.
Gate Overview
| Gate | Name | Agent | Cadence |
|---|---|---|---|
| 0 | Stack Detection | Orchestrator (grep + read) | Once per run |
| 1 | Telemetry Sweep (7 angles) | ring:codebase-explorer × 7 parallel | Once per run |
| 2 | Dictionary Assembly + Validation | Orchestrator (deterministic merge) | Once per run |
| 3 | Dictionary Rendering | Orchestrator → markdown writer | Once per run |
| 4 | Theme Proposal + Dashboard Plans | Orchestrator (LLM opinion via reference.md) | Once per run |
| 5 | PM Iteration — NEVER SKIPPABLE | User (PM team) | Loops until APPROVED |
| 6 | Grafonnet Authoring | ring:backend-go per theme | Per approved theme |
| 7 | CI Drift Gate Setup | Orchestrator | Once (idempotent) |
Gates execute sequentially. Gate 1 parallelizes internally across 7 angles. Gate 6 parallelizes per approved theme.
Gate 0: Stack Detection
Orchestrator executes directly. Detect in parallel:
1. Go version: grep "^go " go.mod | head -1
2. lib-observability version: grep "lib-observability" go.mod
3. lib-commons version: grep "lib-commons" go.mod
4. Tracing package present: grep -rn "lib-observability/tracing\|lib-commons/v5/commons/opentelemetry" internal/ cmd/ # canonical + deprecated shim
5. Metrics package present: grep -rn "lib-observability/metrics\|lib-commons/v5/commons/opentelemetry" internal/ cmd/ # canonical + deprecated shim
6. Meter init: grep -rn "Meter(\|NewMeter\|meter.Int64Counter\|meter.Float64Histogram" internal/ cmd/
7. Tracer init: grep -rn "Tracer(\|NewTracer\|tracer.Start" internal/ cmd/
8. Log emission: grep -rn "lib-observability/log\|lib-observability/zap\|lib-commons/v5/commons/log\|lib-commons/v5/commons/zap" internal/ cmd/ # canonical + deprecated shim
9. HTTP framework: grep -rn "gofiber/fiber\|labstack/echo\|gin-gonic" go.mod
10. gRPC server: grep -rn "grpc.NewServer" internal/ cmd/
11. RabbitMQ command consumers: grep -rn "lib-commons/v5/commons/rabbitmq" internal/ cmd/ # command queues; event emission goes through lib-streaming
12. lib-streaming present: grep "lib-streaming" go.mod
13. Tenant source: grep -rn "tmcore.GetTenantIDContext\|GetTenantID" internal/
14. Existing dictionary: test -f docs/dashboards/telemetry-dictionary.md
15. Existing dashboards: ls docs/dashboards/ 2>/dev/null
16. Grafonnet in CI: test -f .github/workflows/telemetry-drift.yml
17. Service identity: cat go.mod | grep "^module"
Emit /tmp/dashboards-recon.json:
{
"service_name": "...",
"go_version": "...",
"lib_observability_version": "...",
"lib_commons_version": "...",
"lib_streaming_present": false,
"tracing_initialized": true,
"metrics_initialized": true,
"metric_emission_present": true,
"trace_emission_present": true,
"structured_log_present": true,
"deprecated_shim_imports": ["lib-commons/v5/commons/opentelemetry"],
"http_framework": "fiber|echo|gin|none",
"grpc_server_present": true,
"rabbitmq_command_consumers_present": true,
"tenant_source": "tmcore.GetTenantIDContext",
"existing_dictionary": true,
"existing_themes": ["transactions", "ledger"],
"drift_gate_installed": false
}
HARD GATE:
- If not Go → STOP.
- If no lib-observability tracing/metrics usage detected (canonical or deprecated shim) → STOP, surface "service is not instrumented; instrument the service before dashboard authoring, then use ring:implementing-tasks to verify observability checks pass" to user.
- If service has < 3 metric/trace/log emissions → STOP, surface "insufficient telemetry surface for dashboards".
Gate 1: Telemetry Sweep (7 Parallel Angles)
⛔ STOP-CHECK BEFORE DISPATCH
Before emitting any Task call, count the explorers you intend to launch in this turn.
- Count MUST equal 7.
- If count < 7 → STOP. Do not partial-dispatch. Reconcile against the 7 angles below and try again.
- The 7 angles are the canonical sweep. No substitutions, no omissions.
⛔ MUST NOT trickle-dispatch
All 7 explorers leave in the SAME TURN, before reading any explorer output.
Forbidden sequences:
- Dispatch explorer 1 → read result → dispatch explorer 2
- Dispatch a subset → wait → dispatch the rest
- Dispatch follow-up explorers conditioned on partial output
- Loop sequentially over the angle list
If you find yourself about to dispatch an explorer in a turn AFTER any explorer has already returned a result → STOP. You violated parallel dispatch. Report the violation and mark the gate INCOMPLETE rather than completing the trickle.
Self-verify after dispatch
After the dispatch turn, verify all 7 Task calls were emitted in that single turn. If fewer than 7 went out, the gate did NOT execute correctly. Mark INCOMPLETE and surface the dispatch failure — do NOT silently continue with a partial pool.
Parallel dispatch — atomic batch
Emit all 7 Task calls in a SINGLE TURN, as one atomic batch.
If your runtime exposes a multi_tool_use.parallel wrapper, use it to dispatch the complete pool in one wrapped invocation. This is the canonical fan-out mechanism on OpenAI-style tool envelopes and on certain Anthropic SDK consumers — naming it explicitly activates parallel emission on runtimes where trickle-dispatch is the default behavior.
If your runtime emits parallel tool_use blocks natively (Claude Code with Claude models), multi_tool_use.parallel may not be needed — but naming it is harmless and serves as an enforcement anchor.
The STOP-CHECK, anti-trickle, and self-verify guards above remain binding regardless of which mechanism your runtime uses.
Dispatch all 7 angles in one parallel batch. Wait for all before Gate 2.
Per-explorer dispatch (subagent_type: ring:codebase-explorer):
## Target: <absolute path>
## Your Angle: <angle number + name>
## Severity / Detection Patterns / Schema / Notes
<verbatim from sub-files/sweep-angles.md for this angle>
## Output
Write to: /tmp/dashboards-sweep-{N}-{angle-slug}.json
Schema: { angle_number, angle_name, primitives: [...] }
Each primitive includes file:line, name, description, labels/attributes, unit, type-specific fields.
If no findings: write file with empty primitives array.
The 7 angles cover:
- Counter metrics —
meter.Int64Counter,Float64Counter, increments, labels, descriptions, units - Histogram metrics —
meter.Float64Histogram,Int64Histogram, boundaries, units, labels - Gauge metrics —
meter.Int64UpDownCounter,Int64ObservableGauge, callbacks - Trace spans —
tracer.Start, span names, kind, attributes, parent-child structure, error recording - Structured log fields —
log.With, level usage, contexts where emitted, trace correlation - Cross-cutting concerns —
tenant_idlabeling,trace_id/span_idpropagation, error attribution, request correlation - Framework instrumentation — Fiber/gRPC/RabbitMQ middleware, auto-spans, manual override sites
Full angle specifications: sub-files/sweep-angles.md.
Verification: 7 JSON files exist, all parse, schema-valid per sub-files/dictionary-schema.md.
HARD GATE: Missing/malformed file → re-dispatch ONLY failing angle.
Gate 2: Dictionary Assembly + Validation
Orchestrator merges 7 angle JSONs into /tmp/dashboards-dictionary.json. Validate per sub-files/dictionary-schema.md:
- Metric names match
^[a-z][a-z0-9_]*$(Prometheus convention) - Span names match
^[a-z][a-z0-9_.-]*$ - All primitives have
description≥ 30 chars - Histograms declare
unit(seconds, bytes, count) andboundariesif custom - Tenant-scoped primitives have
tenant_idin labels/attributes - No duplicate
(name, type)pairs across angle outputs - Cross-cutting Angle 6 findings cross-reference primitives from Angles 1–4
Validation failures → re-dispatch failing angle's explorer with correction notes. Do NOT manually edit JSON.
Gate 3: Dictionary Rendering
Orchestrator writes docs/dashboards/telemetry-dictionary.md from validated JSON, following sub-files/dictionary-schema.md rendering contract:
- YAML frontmatter
_metablock: service name, generated-at timestamp, source commit SHA, lib-commons version, primitive counts - Metrics section: one
### {metric_name}per metric, with stable YAML block (type, unit, labels, description, emission_sites) - Traces section: one
### {span_name}per span, with stable YAML block (kind, attributes, parents, emission_sites) - Logs section: structured fields catalog with levels and emission contexts
- Cross-cutting section: tenant propagation map, trace correlation map, error attribution map
Critical: rendering MUST be deterministic — same input JSON produces byte-identical output. Order alphabetically within each section. Sort labels alphabetically within each primitive. This is what makes drift detection in Gate 7 possible.
Gate 4: Theme Proposal + Dashboard Plans
Orchestrator analyzes the dictionary and proposes themes + dashboards. This is the LLM-opinion gate — apply sub-files/reference.md (RED/USE methodology, panel pattern catalog) to dictionary contents.
For each proposed theme, produce a dashboard plan stub at /tmp/dashboards-plan-{theme}.md:
# Theme: {theme_name}
## Audience (proposed)
- Primary: <engineering | product | exec | ops | support>
- Secondary: <...>
## Dashboards
### {dashboard_1_name}
- Methodology: RED | USE | hybrid
- SLIs surfaced: <list>
- Time range default: <1h | 6h | 24h | 7d>
- Panels:
1. {panel_name} — {panel_pattern} on metric {metric_ref} — Grafonnet template: {template}
2. ...
- Alert candidates: <list with thresholds>
### {dashboard_2_name}
...
Themes are SUGGESTIONS only. PM may rename, merge, split, or reject in Gate 5.
Gate 5: PM Iteration — NEVER SKIPPABLE
Present sub-files/pm-iteration-prompts.md checklist to PM team:
- Theme names — accept, rename, merge, split?
- Audience per theme — correct?
- Methodology choice (RED vs USE vs hybrid) — sound for this domain?
- SLIs surfaced — match what stakeholders actually need?
- Time range defaults — match operational cadence?
- Alert thresholds vs informational — which panels need alerts attached?
- Missing dashboards — anything PM expected that wasn't proposed?
Response options:
APPROVED: <theme1> <theme2> ...→ proceed to Gate 6 for listed themesREVISE theme {name}: <change>→ loops Gate 4 for that theme onlyRENAME theme {old} -> {new}→ renames in plan, loops Gate 4 lightREJECT theme {name}→ drops from approval listADD theme {name}: <description>→ orchestrator generates new plan, loops Gate 4BLOCKED: <reason>→ halts skill, returns with surface for triage
HARD GATE: Must not proceed to Gate 6 without explicit APPROVED: ... listing at least one theme.
Gate 6: Grafonnet Authoring (Per Approved Theme)
For EACH approved theme, dispatch ring:backend-go (Lerian's Go specialist; Grafonnet is jsonnet, but the engineer's discipline around code quality and reusability transfers — and they own the lib-commons mental model that makes label correctness checkable).
Per-theme dispatch:
## Target: docs/dashboards/{theme}/
## Inputs:
- /tmp/dashboards-plan-{theme}.md (PM-approved plan)
- docs/dashboards/telemetry-dictionary.md (canonical primitive contract)
- pm-team/skills/creating-grafana-dashboards/sub-files/grafonnet-templates/ (panel libsonnet templates)
- pm-team/skills/creating-grafana-dashboards/sub-files/reference.md (panel pattern → template mapping)
## Task:
1. Create docs/dashboards/{theme}/ directory
2. Write {theme}.libsonnet importing relevant grafonnet-templates panels
3. Materialize each panel from the plan with concrete metric refs from the dictionary
4. Compose into a Dashboard with rows/grid per plan structure
5. Write README.md per theme explaining: audience, SLIs, intended use, alert thresholds
6. Validate: jsonnet compiles cleanly, panel queries reference only primitives present in dictionary
## Output:
- docs/dashboards/{theme}/{theme}.libsonnet
- docs/dashboards/{theme}/README.md
- /tmp/dashboards-build-{theme}.log (compilation output)
## Constraints:
- MUST NOT invent metric names — every PromQL/LogQL/TraceQL query references primitives from telemetry-dictionary.md
- MUST template tenant_id as a Grafana variable when present in primitive labels
- MUST follow Grafonnet conventions (no raw JSON; if a panel pattern isn't in templates/, propose a new template)
Verification: Per theme — libsonnet compiles to JSON, README present, no metric references missing from dictionary.
HARD GATE: Compilation failure → re-dispatch with diagnostic, do not move to Gate 7.
Gate 7: CI Drift Gate Setup
Orchestrator installs (idempotent) the drift detection workflow per sub-files/ci-drift-check.md:
- Write
.github/workflows/telemetry-drift.yml(blocking PR check) - Write
scripts/regenerate-telemetry-dictionary.sh(regenerates dictionary; called by CI and locally) - Update
Makefile: addmake telemetry-dictionarytarget invoking the regenerate script - Compile all theme libsonnet to JSON in
docs/dashboards/{theme}/{theme}.jsonand add to .gitignore (build artifact) - Surface to user: workflow path, local regen command, expected first-CI-run behavior
Idempotence: If .github/workflows/telemetry-drift.yml already exists, diff against canonical version. Update only if drift detected. Surface diff to user before overwriting.
State Persistence
Save to /tmp/dashboards-state.json:
{
"skill": "creating-grafana-dashboards",
"service_name": "<from Gate 0>",
"current_gate": 0,
"gates": {
"0": "PENDING",
"1": "PENDING",
"2": "PENDING",
"3": "PENDING",
"4": "PENDING",
"5": "PENDING_USER_APPROVAL",
"6": "PENDING",
"7": "PENDING"
},
"metrics": {
"primitives_counters": 0,
"primitives_histograms": 0,
"primitives_gauges": 0,
"primitives_spans": 0,
"primitives_log_fields": 0,
"themes_proposed": 0,
"themes_approved": 0,
"dashboards_authored": 0
}
}
REFERENCE MODE
Full reference content in sub-files/reference.md. Load sections relevant to the question.
Quick Navigation
| # | Section | What you'll find |
|---|---|---|
| 1 | RED Methodology | Rate, Errors, Duration — when each metric type fits |
| 2 | USE Methodology | Utilization, Saturation, Errors — for resources |
| 3 | Panel Pattern Catalog | Mapping primitive type → panel pattern → Grafonnet template |
| 4 | Theme Decision Tree | How to suggest themes from dictionary contents |
| 5 | Grafonnet Conventions | Naming, composition, variable conventions, tenant templating |
| 6 | Alert Threshold Heuristics | When to attach alerts, default thresholds, escalation tiers |
| 7 | Cross-cutting Patterns | Tenant variable, trace exemplars, log-to-trace links |
| 8 | Anti-pattern Catalog | Six failure modes (vanity panels, alert noise, etc.) |
Read sub-files/reference.md for full detail.