grafana - SKILL.md Agent Skill

name: grafana
description: >
  Query Grafana dashboards and Prometheus metrics for service health, LLM latency, and infrastructure monitoring. Use this skill when the user asks about service health, monitoring, alerts, or performance metrics.

Base URL: `$GRAFANA_URL` (e.g. `https://your-org.grafana.net`)
Auth: `Authorization: Bearer $GRAFANA_API_TOKEN` (service account token)
Credentials: `GRAFANA_URL`, `GRAFANA_API_TOKEN` from Settings → Data sources
Caching: 10-minute cache for metadata; query results are NOT cached (time-sensitive)
Key datasource: Prometheus, UID `grafanacloud-prom`
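
The connection settings above can be illustrated with a minimal sketch of how a request to Grafana might be assembled. `buildGrafanaRequest` is a hypothetical helper, not part of this skill; `/api/search` with a `query` parameter is Grafana's standard dashboard search endpoint, and the base URL and token are assumed to come from `GRAFANA_URL` and `GRAFANA_API_TOKEN`.

```typescript
// Hypothetical helper (not from the skill's source): assembles the URL and
// headers for a Grafana HTTP API call using a service account token.
interface GrafanaRequest {
  url: string;
  headers: Record<string, string>;
}

function buildGrafanaRequest(
  baseUrl: string,
  token: string,
  path: string,
  params: Record<string, string> = {},
): GrafanaRequest {
  // Drop trailing slashes so base + path never produces "//".
  const base = baseUrl.replace(/\/+$/, "");
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join("&");
  return {
    url: `${base}${path}${query ? `?${query}` : ""}`,
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/json",
    },
  };
}

// Example: search dashboards whose title matches "llm latency".
// In the real skill the two values would come from GRAFANA_URL / GRAFANA_API_TOKEN.
const searchRequest = buildGrafanaRequest(
  "https://your-org.grafana.net/",
  "example-token",
  "/api/search",
  { query: "llm latency" },
);
```

The resulting `url` and `headers` could be handed to `fetch`; the sketch stops short of the network call so it stays side-effect free.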

| Function | Description |
| --- | --- |
| `listDashboards(query?)` | Search dashboards by query |
| `getDashboard(uid)` | Full dashboard JSON with panels |
| `getDatasources()` | List all datasources |
| `getAlertRules()` | All alert rules (flattened from groups) |
| `getAlertInstances()` | Currently firing alert instances |
| `queryDatasource(uid, queries[], from?, to?)` | Proxy to Grafana's `/api/ds/query` |

| Route | Description |
| --- | --- |
| `GET /api/grafana/dashboards` | Search dashboards |
| `GET /api/grafana/dashboard?uid=...` | Full dashboard JSON |
| `GET /api/grafana/datasources` | List datasources |
| `GET /api/grafana/alerts` | Alert rules and firing instances |
| `POST /api/grafana/query` | Query datasource (Prometheus, Loki, etc.) |

Use `grafana` for agent-facing Grafana work. Do not call `/api/grafana/*` directly from the agent.

| Mode | Args | Description |
| --- | --- | --- |
| `dashboards` | `search` | Search dashboards |
| `dashboard` | `uid` | Full dashboard JSON |
| `datasources` | | List datasources |
| `alerts` | | Alert rules and firing instances |
| `query` | `datasourceUid`, `queries`, `from`, `to` | Query a datasource |

`/adhoc/engineering`: Engineering dashboard (mirrors a Grafana dashboard by UID)

{ results: { [refId]: { frames: [{ schema: { fields }, data: { values } }] } } }
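
To make the response shape above concrete, here is a minimal sketch that turns the column-oriented frames into row objects. It assumes the usual Grafana data frame layout, where `schema.fields` lists the columns and `data.values` holds one array per field in the same order; `framesToRows` and the type names are hypothetical and not part of this skill.

```typescript
// Types mirror the documented shape; names are illustrative only.
interface FrameField {
  name: string;
  type?: string;
  labels?: Record<string, string>;
}

interface Frame {
  schema: { fields: FrameField[] };
  data: { values: unknown[][] };
}

interface QueryResponse {
  results: Record<string, { frames?: Frame[] }>;
}

type Row = Record<string, unknown>;

// Convert every frame under one refId into a flat list of rows keyed by field name.
function framesToRows(response: QueryResponse, refId: string): Row[] {
  const result = response.results[refId];
  const frames = result && result.frames ? result.frames : [];
  const rows: Row[] = [];
  for (const frame of frames) {
    const fields = frame.schema.fields;
    const columns = frame.data.values;
    // Assumption: all columns in a frame have the same length.
    const first = columns[0];
    const length = first ? first.length : 0;
    for (let i = 0; i < length; i++) {
      const row: Row = {};
      fields.forEach((field, col) => {
        const column = columns[col];
        row[field.name] = column ? column[i] : undefined;
      });
      rows.push(row);
    }
  }
  return rows;
}
```

Row objects keyed by field name are convenient for charting libraries that expect an array of records; an unknown `refId` simply yields an empty array.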

Codegen: `vcpcodegen_completion_total`, `vcpcodegen_completion_latency_bucket`, `vcpcodegen_feedback_total`, `vcpcodegen_error_total`
LLM: `llm_completion_cost_total`, `llm_input_tokens_total`, `llm_output_tokens_total`, `llm_latency_bucket`, `llm_failures_total`, `llm_completions_total`
Projects: `projects_proposed_config_total`, `projects_status_total`, `projects_remote_machine_*`, `projects_start_duration_bucket`
API/Runtime: `api_request_total`, `with_span_duration_bucket`, `memory_heap_usage_percent_bucket`
Fly.io: `fly_endpoint_total`, `fly_machine_*`
GitHub: `builderbot_pr_created_total`, `builderbot_pr_closed_total`

`queryDatasource` builds the request body with the datasource UID in each query target; timestamps are sent as strings of epoch milliseconds.
`getAlertRules` flattens the nested rule groups returned by the unified alerting API.
`getAlertInstances` handles both array and wrapped-object response shapes.
The POST helper caches by default (fine for idempotent endpoints), but `queryDatasource` skips the cache.
Heatmap panels render as multi-series line/area charts (Recharts has no native heatmap; PromQL `histogram_quantile()` computes the quantiles).
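
The first three notes above can be sketched as follows. This is an illustration under stated assumptions, not the skill's actual code: the helper names, the one-hour default time window, the `{ name, rules }` group shape, and the `data` / `alerts` wrapper keys are all hypothetical.

```typescript
// Build a body for Grafana's /api/ds/query: each target carries the
// datasource UID, and the time range is sent as epoch-millisecond strings.
function buildQueryBody<T extends { refId: string }>(
  uid: string,
  queries: T[],
  from?: number,
  to?: number,
) {
  const toMs = to ?? Date.now();
  const fromMs = from ?? toMs - 60 * 60 * 1000; // assumed default: last hour
  return {
    queries: queries.map((q) => ({ ...q, datasource: { uid } })),
    from: String(fromMs),
    to: String(toMs),
  };
}

// Assumed group shape: a named group holding an array of rule objects.
interface RuleGroup {
  name: string;
  rules: Array<Record<string, unknown>>;
}

type FlatRule = Record<string, unknown> & { group: string };

// Flatten groups into one list, keeping the group name on each rule.
function flattenRuleGroups(groups: RuleGroup[]): FlatRule[] {
  const flat: FlatRule[] = [];
  for (const group of groups) {
    for (const rule of group.rules) {
      flat.push({ ...rule, group: group.name });
    }
  }
  return flat;
}

// Accept either a bare array or an object wrapping the array.
function normalizeInstances(payload: unknown): unknown[] {
  if (Array.isArray(payload)) return payload;
  if (payload !== null && typeof payload === "object") {
    const wrapped = payload as Record<string, unknown>;
    for (const key of ["data", "alerts"]) { // assumed wrapper keys
      const candidate = wrapped[key];
      if (Array.isArray(candidate)) return candidate;
    }
  }
  return []; // anything else is treated as "no instances"
}
```

Returning an empty array for unrecognized payloads keeps callers simple, at the cost of hiding malformed responses; a stricter variant could throw instead.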

For production issues, query Grafana/Prometheus FIRST:
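
As a hedged illustration of what such a first query might look like, the sketch below builds two PromQL expressions from metrics listed in this document: a p95 latency from `llm_latency_bucket` and a failure ratio from `llm_failures_total` over `llm_completions_total`. The helper names, the 5-minute rate window, the `expr` target field, and the choice of these two signals are assumptions, not something the skill prescribes.

```typescript
// Quantile over a Prometheus histogram: aggregate bucket rates by `le`,
// then let histogram_quantile() interpolate the requested quantile.
function quantileExpr(bucketMetric: string, quantile: number, window = "5m"): string {
  return `histogram_quantile(${quantile}, sum by (le) (rate(${bucketMetric}[${window}])))`;
}

// Ratio of two counters' rates, e.g. failures per completion.
function ratioExpr(numerator: string, denominator: string, window = "5m"): string {
  return `sum(rate(${numerator}[${window}])) / sum(rate(${denominator}[${window}]))`;
}

// Hypothetical query targets; `expr` is the field Grafana's Prometheus
// datasource uses for the PromQL string.
const healthQueries = [
  { refId: "A", expr: quantileExpr("llm_latency_bucket", 0.95) },
  { refId: "B", expr: ratioExpr("llm_failures_total", "llm_completions_total") },
];
```

Targets like these could be passed as the `queries` argument of `queryDatasource` with the `grafanacloud-prom` UID, with results keyed by `refId` in the response.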

Then check Sentry for application errors and Cloud Logging for raw logs.