debug-with-grafana - SKILL.md Agent Skill

name: debug-with-grafana description: > Structured diagnostic workflow for debugging application issues using Grafana observability data. Use when the user reports errors, latency spikes, service degradation, HTTP 500s, or wants to investigate why a service is behaving unexpectedly. Triggers for: "my API is returning 500 errors", "latency is spiking", "service seems down", "help me debug using Grafana", "investigate why requests are failing", "something is wrong with my service".

Debug with Grafana

A structured 7-step diagnostic workflow for debugging application issues using Prometheus metrics, Loki logs, and Grafana resources. Follow steps in order — each step informs the next.

Prerequisites

gcx must be installed and configured with a valid context before running any commands. If not configured, use the setup-gcx skill first:

# Verify configuration
gcx config view

# Switch context if needed
gcx config use-context <context-name>

Diagnostic Workflow

Step 1: Discover Datasources

List all available datasources to identify Prometheus and Loki UIDs. All subsequent query commands require a datasource UID via -d <uid>.

# List all datasources
gcx datasources list -o json

# Filter by type for scripting
gcx datasources list -t prometheus -o json
gcx datasources list -t loki -o json

# Capture UIDs for use in subsequent steps
PROM_UID=$(gcx datasources list -t prometheus -o json 2>/dev/null | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['datasources'][0]['uid'])")
LOKI_UID=$(gcx datasources list -t loki -o json 2>/dev/null | \
  python3 -c "import json,sys; print(json.load(sys.stdin)['datasources'][0]['uid'])")

Expected output shape:

{
  "datasources": [
    {"uid": "<uid>", "name": "<display-name>", "type": "prometheus", ...},
    {"uid": "<uid>", "name": "<display-name>", "type": "loki", ...}
  ]
}

If no datasources appear, confirm the context is pointing at the correct Grafana instance. See references/error-recovery.md for auth and datasource-not-found recovery patterns.

JSON output piping: When piping gcx output through external tools, never use 2>&1 — gcx writes hints to stderr that break JSON parsers. Use 2>/dev/null to suppress stderr, or use --json field1,field2 to select fields directly without piping:
gcx datasources list -t prometheus --json uid
gcx metrics query -d <prom-uid> 'up' --json metric,value
Use --json list to discover available fields for any command.

Step 2: Confirm Data Availability

Before querying specific metrics, confirm the target service is instrumented and data is flowing. This avoids wasting time on empty results.

# Check that the target service is being scraped
gcx metrics query -d <prom-uid> 'up' -o json

# Verify the relevant job label exists
gcx metrics labels -d <prom-uid> -l job -o json

# For Loki: confirm log streams exist for the service
gcx logs labels -d <loki-uid> -l job -o json
gcx logs series -d <loki-uid> -M '{job="<service-name>"}' -o json

# Spot-check: confirm uptime metrics are present for the service
gcx metrics query -d <prom-uid> 'up{job="<service-name>"}' -o json

Expected output shape:

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "<service-name>", "instance": "<host:port>"}, "value": [<timestamp>, "<0-or-1>"]}
    ]
  }
}

A value of "0" means the service is down or not being scraped. Empty result array means the metric is absent — see Failure Mode 3 in references/error-recovery.md.

Step 3: Query Error Rates

Query the HTTP 5xx error rate over the relevant time window to establish whether an error spike exists and when it began.

# HTTP 5xx error rate (range query for trend)
gcx metrics query -d <prom-uid> \
  'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m])' \
  --from now-1h --to now --step 1m -o json

# Visualize the trend
gcx metrics query -d <prom-uid> \
  'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m])' \
  --from now-1h --to now --step 1m -o graph

# Error ratio (errors / total)
gcx metrics query -d <prom-uid> \
  'rate(http_requests_total{job="<service-name>",status=~"5.."}[5m]) / rate(http_requests_total{job="<service-name>"}[5m])' \
  --from now-1h --to now --step 1m -o json

# Break down by status code to identify 500 vs 503 vs 504
gcx metrics query -d <prom-uid> \
  'sum by(status) (rate(http_requests_total{job="<service-name>"}[5m]))' \
  --from now-1h --to now --step 1m -o json

Expected output shape (matrix for range queries):

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"job": "<service-name>", "status": "<code>"},
        "values": [[<timestamp>, "<rate>"], ...]
      }
    ]
  }
}

Note the timestamp where the rate increases — this is the incident start time. Use this window in subsequent steps.

Step 4: Query Latency

Query request latency to determine whether the service is slow (latency issue) or failing fast (error issue). High latency often precedes error spikes.

# P50/P95/P99 latency from histogram
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="<service-name>"}[5m]))' \
  --from now-1h --to now --step 1m -o json

# Visualize P95 latency trend
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="<service-name>"}[5m]))' \
  --from now-1h --to now --step 1m -o graph

# Average latency as a simpler signal if histograms are unavailable
gcx metrics query -d <prom-uid> \
  'rate(http_request_duration_seconds_sum{job="<service-name>"}[5m]) / rate(http_request_duration_seconds_count{job="<service-name>"}[5m])' \
  --from now-1h --to now --step 1m -o json

# Latency by endpoint (if label available)
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, sum by(le, handler) (rate(http_request_duration_seconds_bucket{job="<service-name>"}[5m])))' \
  --from now-1h --to now --step 1m -o json

Expected output shape:

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"job": "<service-name>"},
        "values": [[<timestamp>, "<seconds>"], ...]
      }
    ]
  }
}

Compare the latency onset time with the error onset time from Step 3. If latency rose before errors, a dependency or resource constraint is likely.

Step 5: Correlate Logs

Query Loki for error logs in the time window identified in Steps 3 and 4. Logs provide the specific error messages, stack traces, and context that metrics cannot.

# Error logs for the service in the incident window
gcx logs query -d <loki-uid> \
  '{job="<service-name>"} |= "error"' \
  --from now-1h --to now -o json

# JSON-parsed logs with level filter (if structured logging)
gcx logs query -d <loki-uid> \
  '{job="<service-name>"} | json | level="error"' \
  --from now-1h --to now -o json

# Error rate from logs (count over time)
gcx logs query -d <loki-uid> \
  'count_over_time({job="<service-name>"} |= "error" [5m])' \
  --from now-1h --to now --step 1m -o json

# Grep for specific error patterns
gcx logs query -d <loki-uid> \
  '{job="<service-name>"} |~ "timeout|connection refused|OOM|panic"' \
  --from now-1h --to now -o json

Expected output shape (streams):

{
  "status": "success",
  "data": {
    "resultType": "streams",
    "result": [
      {
        "stream": {"job": "<service-name>", "level": "<level>"},
        "values": [["<ns-timestamp>", "<log-line>"], ...]
      }
    ]
  }
}

LogQL pitfall: Loki requires at least one non-empty label matcher in the stream selector. {} and {} |~ "pattern" will be rejected. Always include at least one label, e.g., {job=~".+"} as a catch-all.

Look for:

Repeated error messages pointing to a specific code path or dependency
Timestamps of first error matching the metric spike time from Step 3
Stack traces or panic messages that identify the root cause
Upstream service names in error messages (database, external APIs)

Step 5b: Correlate Traces (if Tempo is available)

If a Tempo datasource exists, search for traces matching the incident window. Traces show individual request paths and identify slow or failing spans.

# Check for Tempo datasources
gcx datasources list -t tempo -o json

# Search for error traces in the incident window
gcx traces query -d <tempo-uid> '{ status = error }' --from now-1h --to now

# Search by service name
gcx traces query -d <tempo-uid> '{ resource.service.name = "<service-name>" }' --from now-1h --to now

# Search for slow traces (duration > 1s)
gcx traces query -d <tempo-uid> \
  '{ resource.service.name = "<service-name>" && duration > 1s }' \
  --from now-1h --to now

# Fetch a specific trace by ID for analysis (from search results or log trace IDs)
# Always use --llm so Tempo returns its token-efficient LLM trace encoding.
gcx traces get -d <tempo-uid> <trace-id> --llm -o json

TraceQL attribute scoping: Tempo requires scoped attribute names. Use resource. for resource-level attributes and span. for span-level:

resource.service.name (not service.name)
span.http.status_code (not http.status_code)

Use name (unscoped) for the span name, duration for span duration, and status for span status. Use trace:rootService and trace:rootName for root span attributes (not rootServiceName or rootTraceName).

When inspecting trace bodies, use gcx traces get <trace-id> --llm -o json. Do not fetch the OTLP-shaped default trace and manually compact it unless the user explicitly needs raw trace JSON for schema/debugging work.

Discover available labels and values:

gcx traces labels -d <tempo-uid>
# For agent workflows, request Tempo's compact LLM label-value encoding.
gcx traces tags -d <tempo-uid> -l resource.service.name --llm -o json

Common mistake: gcx traces labels -l service.name will fail — Tempo parses the dot as an identifier boundary. Always fully qualify: -l resource.service.name, not -l service.name.

See references/traceql-patterns.md for full TraceQL syntax reference.

Step 6: Check Related Dashboards and Resources

Check whether relevant dashboards exist that give broader context, and inspect related Grafana resources that may explain the issue (e.g., alert rules that are firing).

# List all alert rules to find any firing for this service
gcx alert rules list -o json | jq '.[] | .rules[]? | select(.labels.job == "<service-name>")'

# Pull dashboards locally to inspect their panel queries
gcx resources pull dashboards -o json

# List available resources to find service-specific dashboards
gcx resources get dashboards -o json | jq '.items[] | select(.metadata.name | test("<service-name>"; "i"))'

# If a relevant dashboard UID is known, get it directly
gcx resources get dashboards/<dashboard-uid> -o json

Capture a visual snapshot of a relevant dashboard

If a relevant dashboard UID is known, capture a PNG snapshot to visually inspect panel layout and current state. This is especially useful when diagnosing layout regressions, missing data, or anomalous panel values.

# First, discover which template variables the dashboard uses so you can
# pin them to the values relevant to the incident being debugged
gcx resources get dashboards/<dashboard-uid> -ojson | \
  jq '.spec.templating.list[] | {name, type, current: .current.value}'

# Capture a full dashboard snapshot with variables matching the incident context
# (requires grafana-image-renderer plugin on the Grafana instance)
gcx dashboards snapshot <dashboard-uid> --output-dir ./debug-snapshots \
  --var cluster=<cluster> --var job=<service-name> --since 1h

# Capture the incident time window explicitly
gcx dashboards snapshot <dashboard-uid> --from now-1h --to now \
  --var cluster=<cluster> --var job=<service-name> --output-dir ./debug-snapshots

# Capture a specific panel (find panel IDs: .spec.panels[].id in the dashboard JSON)
gcx dashboards snapshot <dashboard-uid> --panel <panel-id> \
  --output-dir ./debug-snapshots

# If stuck with flags: gcx dashboards snapshot --help

Cross-reference with metrics and logs:

Are there alert rules in a firing or pending state for this service?
Do existing dashboards show additional signals (queue depth, DB connections, memory pressure)?
Do dashboard panel queries reveal which metrics are being monitored?
Does the dashboard snapshot show unexpected panel states or missing data?

Step 7: Summarize Findings

After completing Steps 1-6, synthesize the findings into a clear diagnostic summary for the user.

Structure the summary as:

Service: <service-name>
Time window: <from> to <to>
Incident start: <timestamp from error rate onset>

Error signal:
  - Error rate: <trend description, not fabricated value>
  - Status codes: <which codes are elevated>

Latency signal:
  - P95 latency: <trend description>
  - Latency onset: <before/after/same time as errors>

Log evidence:
  - Error pattern: <recurring message or exception>
  - First occurrence: <timestamp>
  - Frequency: <how often in the window>

Related resources:
  - Firing alerts: <names or "none found">
  - Relevant dashboards: <names or UIDs>

Likely root cause:
  - <Primary hypothesis based on all signals>

Recommended next actions:
  1. <Specific action — check dependency, review deploy, inspect resource usage>
  2. <Additional action>

Use -o graph for any visualizations shared with the user. Use -o json for data retrieved for your own analysis.

Example Scenarios

Scenario 1: HTTP 500 Error Spike

Trigger: User reports "my API started returning 500 errors 30 minutes ago".

Command sequence:

# Step 1: Find datasource UIDs
gcx datasources list -t prometheus -o json
gcx datasources list -t loki -o json

# Step 2: Confirm service is being scraped
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json

# Step 3: Observe error rate over last 2 hours (wider window to see the spike start)
gcx metrics query -d <prom-uid> \
  'rate(http_requests_total{job="api",status=~"5.."}[5m])' \
  --from now-2h --to now --step 1m -o graph

# Identify which status codes are elevated
gcx metrics query -d <prom-uid> \
  'sum by(status) (rate(http_requests_total{job="api"}[5m]))' \
  --from now-2h --to now --step 1m -o json

# Step 4: Check if latency rose at the same time
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))' \
  --from now-2h --to now --step 1m -o graph

# Step 5: Get error logs in the spike window
gcx logs query -d <loki-uid> \
  '{job="api"} |= "error"' \
  --from now-2h --to now -o json

# Step 6: Check alert rules
gcx alert rules list -o json | jq '.[] | .rules[]? | select(.state == "firing")'

Expected output shape at Step 3 (matrix):

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"job": "api", "status": "500"},
        "values": [[<timestamp>, "<rate>"], ...]
      }
    ]
  }
}

Interpretation: Look for the timestamp where values shows the rate increasing from baseline. Match this to log timestamps in Step 5.

Scenario 2: Latency Degradation

Trigger: User reports "requests are taking much longer than usual, no errors yet".

Command sequence:

# Step 1: Find datasource UIDs
gcx datasources list -t prometheus -o json

# Step 2: Confirm service health (latency without errors suggests slow dependency)
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json

# Step 3: Error rate (confirm it's not elevated yet)
gcx metrics query -d <prom-uid> \
  'rate(http_requests_total{job="api",status=~"5.."}[5m])' \
  --from now-1h --to now --step 1m -o json

# Step 4: P95 latency is the primary signal — visualize trend
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))' \
  --from now-2h --to now --step 1m -o graph

# Break down by endpoint to isolate which routes are slow
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, sum by(le, handler) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))' \
  --from now-1h --to now --step 1m -o json

# Step 5: Check for timeout log patterns suggesting upstream dependency issue
gcx logs query -d <loki-uid> \
  '{job="api"} |~ "timeout|slow|waiting"' \
  --from now-2h --to now -o json

# Check database or downstream service latency if metrics available
gcx metrics query -d <prom-uid> \
  'rate(db_query_duration_seconds_sum{job="api"}[5m]) / rate(db_query_duration_seconds_count{job="api"}[5m])' \
  --from now-2h --to now --step 1m -o json

Expected output shape at Step 4 (histogram):

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"job": "api"},
        "values": [[<timestamp>, "<seconds>"], ...]
      }
    ]
  }
}

Interpretation: Rising values across all endpoints suggests a shared resource or dependency. Rising values for one endpoint only suggests a handler-specific issue. Compare latency onset time with log timestamps.

Scenario 3: Service Down / No Data

Trigger: User reports "service seems completely down" or dashboard shows no data.

Command sequence:

# Step 1: Verify datasource connectivity first (simplest possible query)
gcx datasources list -o json

# Step 2: Check whether the service is being scraped at all
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json

# Confirm up metric — value "0" means scrape failure, absent means not scraped
gcx metrics query -d <prom-uid> 'up{job="api"}' -o json

# Check if the job label exists at all (absence = service was never registered)
gcx metrics labels -d <prom-uid> -l job -o json

# Step 3: Without error rate data, check for recent data gaps
gcx metrics query -d <prom-uid> \
  'absent(up{job="api"})' \
  --from now-1h --to now --step 1m -o json

# Step 4: Query latency from any recent data before the outage
gcx metrics query -d <prom-uid> \
  'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job="api"}[5m]))' \
  --from now-3h --to now --step 5m -o graph

# Step 5: Check Loki for last known logs before data disappeared
gcx logs query -d <loki-uid> \
  '{job="api"}' \
  --from now-3h --to now -o json

# Crash or OOM signals in logs
gcx logs query -d <loki-uid> \
  '{job="api"} |~ "panic|OOM|killed|crashed|SIGTERM"' \
  --from now-3h --to now -o json

# Step 6: Check alert rules for any firing service-down alerts
gcx alert rules list -o json | jq '.[] | .rules[]? | select(.state == "firing")'

Expected output shape when service is down (up=0):

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "up", "job": "api", "instance": "<host:port>"},
        "value": [<timestamp>, "0"]
      }
    ]
  }
}

Expected output shape when service was never scraped (absent):

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": []
  }
}

Interpretation:

up=0: Service is registered but failing health checks — check pod/process status
Empty result for up{job="api"}: Job never existed or was removed from scrape config
Data present up to a specific timestamp then absent: Service crashed at that time — correlate with crash logs

References

references/error-recovery.md — Recovery patterns for auth errors (401/403), datasource not found, empty results, query timeouts, and malformed PromQL/LogQL syntax.
references/query-patterns.md — Advanced query patterns for Prometheus and Loki datasources, including time range formats, aggregation patterns, Loki stream operators, and output format reference.
references/traceql-patterns.md — TraceQL query patterns for Tempo trace search, attribute scoping rules, and the distinction between traces query and traces get.