name: grafana description: > Query Grafana dashboards and Prometheus metrics for service health, LLM latency, and infrastructure monitoring. Use this skill when the user asks about service health, monitoring, alerts, or performance metrics.
Grafana Integration
Connection
- Base URL:
$GRAFANA_URL(e.g.https://your-org.grafana.net) - Auth:
Authorization: Bearer $GRAFANA_API_TOKEN(service account token) - Credentials:
GRAFANA_URL,GRAFANA_API_TOKENfrom Settings → Data sources - Caching: 10-minute cache for metadata; query results are NOT cached (time-sensitive)
- Key datasource: Prometheus UID
grafanacloud-prom
Server Lib & API Routes
- File:
server/lib/grafana.ts
Exported Functions
| Function | Description |
|---|---|
listDashboards(query?) |
Search dashboards by query |
getDashboard(uid) |
Full dashboard JSON with panels |
getDatasources() |
List all datasources |
getAlertRules() |
All alert rules (flattened from groups) |
getAlertInstances() |
Currently firing alert instances |
queryDatasource(uid, queries[], from?, to?) |
Proxy to Grafana's /api/ds/query |
API Routes
| Route | Description |
|---|---|
GET /api/grafana/dashboards |
Search dashboards |
GET /api/grafana/dashboard?uid=... |
Full dashboard JSON |
GET /api/grafana/datasources |
List datasources |
GET /api/grafana/alerts |
Alert rules and firing instances |
POST /api/grafana/query |
Query datasource (Prometheus, Loki, etc.) |
Agent Action
Use grafana for agent-facing Grafana work. Do not call /api/grafana/*
directly from the agent.
| Mode | Args | Description |
|---|---|---|
dashboards |
search |
Search dashboards |
dashboard |
uid |
Full dashboard JSON |
datasources |
List datasources | |
alerts |
Alert rules and firing instances | |
query |
datasourceUid, queries, from, to |
Query a datasource |
Dashboard
/adhoc/engineering— Engineering dashboard (mirrors a Grafana dashboard by UID)
Grafana Response Frame Format
{ results: { [refId]: { frames: [{ schema: { fields }, data: { values } }] } } }
values[0]= timestamps (epoch ms),values[1+]= metric values- Series labels from
schema.fields[i].labels(e.g.{ model: "claude-3.5" }) - Transform to Recharts:
{ time: number, [seriesName]: number }[]
Template Variables
$CodegenMode→ defaultquality-v4$AIModel→ regex pattern, default.*(all)$environment→cloudorcloud-v2, defaultcloud- Variables interpolated via string replace before querying
Key Prometheus Metrics
- Codegen:
vcpcodegen_completion_total,vcpcodegen_completion_latency_bucket,vcpcodegen_feedback_total,vcpcodegen_error_total - LLM:
llm_completion_cost_total,llm_input_tokens_total,llm_output_tokens_total,llm_latency_bucket,llm_failures_total,llm_completions_total - Projects:
projects_proposed_config_total,projects_status_total,projects_remote_machine_*,projects_start_duration_bucket - API/Runtime:
api_request_total,with_span_duration_bucket,memory_heap_usage_percent_bucket - Fly.io:
fly_endpoint_total,fly_machine_* - GitHub:
builderbot_pr_created_total,builderbot_pr_closed_total
Key Patterns & Gotchas
queryDatasourceconstructs body with datasource UID in each query target; timestamps as string msgetAlertRulesflattens nested rule groups from unified alerting APIgetAlertInstanceshandles both array and wrapped object response shapes- POST helper caches by default (fine for idempotent endpoints) but
queryDatasourceskips cache - Heatmap panels render as multi-series line/area (Recharts has no native heatmap; PromQL
histogram_quantile()computes quantiles)
Incident Investigation Pattern
For production issues, query Grafana/Prometheus FIRST:
- LLM latency by model (
llm_latency_bucket) - Request rates (
api_request_total) - Error rates (
llm_failures_total) - Instance counts (via Cloud Monitoring)
Then check Sentry for application errors, Cloud Logging for raw logs.