logql-observability

star 1

Queries workload logs and builds log dashboards on Control Plane. Use when the user asks about LogQL syntax, log search, viewing logs in Grafana, stream selectors, log filtering, log patterns, or log-based metrics and dashboards.

controlplane-com By controlplane-com schedule Updated 6/12/2026

name: logql-observability description: "Queries workload logs with LogQL on Control Plane. Use to troubleshoot a workload from its logs, or for log search, access or egress logs, cron run logs, missing or truncated logs, retention, or logs in Grafana."

LogQL & Log Observability

Tool availability: some MCP tools named here live in the full toolset profile — if one is not advertised on this connection, tell the user to reconnect the MCP server with ?toolsets=full (or use the cpln CLI fallback). Reads and deletes work on every profile via the generic list_resources / get_resource / delete_resource tools.

Control Plane stores workload stdout/stderr in Loki and queries it with LogQL. The org is the Loki tenant — it comes from the endpoint path, so org is never a query label and queries cannot cross orgs. Reading logs requires the org-level readLogs permission, and the in-pod CPLN_TOKEN cannot authenticate to the logs endpoint — use a user or service-account token (see the workload skill). The recurring agent failure is passing a raw query to the MCP tool alongside structured params: a raw query replaces them entirely (the tool rejects the combination), so a raw query must embed every label itself.

Two ways to query

  • MCP (primary for agents): mcp__cpln__get_workload_logs — structured params gvc (required), workload, container, location, filter (literal substring, LogQL |=, not regex); window since (default 1h) or from/to (ISO 8601, from inclusive, to exclusive); limit (default 30, max 999, single request — truncated: true means narrow the window or filter harder); order (oldest_first default, or newest_first). For regex, parsers, or the replica/stream/version labels, pass a raw query (max 500 chars) instead of the structured selectors.
  • CLI: cpln logs '<LOGQL>' — interactive debugging (live --tail), scripts, CI.
# Defaults: --since 1h, --limit 30, --direction forward
cpln logs '{gvc="GVC", workload="WORKLOAD"}' --org ORG
cpln logs '{gvc="GVC", workload="WORKLOAD"} |= "error"' --limit 100
cpln logs '{gvc="GVC", workload="WORKLOAD", container="main"}' --since 7d
cpln logs '{gvc="GVC", workload="WORKLOAD"}' --tail              # live follow; the server ends a tail session after 30m
cpln logs '{gvc="GVC", workload="WORKLOAD"}' \
  --from 2026-06-01T00:00:00Z --to 2026-06-02T00:00:00Z          # ISO 8601 only; from inclusive, to exclusive
cpln logs '{gvc="GVC", workload="WORKLOAD"} |= "error"' --since 24h --limit 0   # 0 = unlimited, auto-paginates
cpln logs '{gvc="GVC", workload="WORKLOAD"}' -o jsonl            # one JSON object per line; -o raw = bare lines
  • --since takes relative durations (ms s m h d w mo y, compound like 1h30m). --from/--to take ISO 8601 timestamps only--from now-2d and --from 7d fail with Cannot parse ... into a valid date.
  • cpln workload eventlog WORKLOAD (alias cpln workload log) is resource event history, not container output — for application logs always use cpln logs.

Labels

Label Value
gvc GVC name
workload Workload name
container Container name, or a built-in stream below
location Deployment location, e.g. aws-us-east-1
provider Cloud provider
replica Replica (pod) name — unique per cron execution
stream stdout or stderr
version Workload deployment version that wrote the line

At least one non-empty matcher is required; regex matchers work — {gvc=~".+"} spans every GVC in the org.

Filters and LogQL features

Operator Meaning Example
|= "text" contains |= "error"
!= "text" does not contain != "health"
|~ "regex" matches regex |~ "timeout|crash"
!~ "regex" does not match !~ "debug|trace"

Loki is current (3.x), so full LogQL works: parsers (| json, | logfmt, | pattern, | regexp), post-parse label filters (| latency > 100), line_format, and metric queries (count_over_time, rate, sum ... by). The CLI and MCP tool print log lines only — run metric queries in Grafana: the Explore on Grafana link on the console Logs page opens it with the query prefilled (the org grafanaAdmin permission grants the Grafana Admin role; everyone else is Viewer).

{gvc="GVC", workload="WORKLOAD"} |= "error" != "health"               # errors minus noise
{gvc="GVC", workload="WORKLOAD"} |~ "panic|fatal|exception"           # crashes and stack traces
{gvc="GVC", workload="WORKLOAD", container="_accesslog"} |= "\" 50"   # HTTP 5xx in access logs
sum(count_over_time({gvc="GVC", container="_accesslog"} |= "\" 50"[1m])) by (workload)   # 5xx rate (Grafana)

Built-in log streams

Selector Contents
container="_accesslog" Inbound requests (Envoy access-log format) on the workload's ports
container="_requestlog" The workload's outbound (egress) requests through the sidecar, same format
workload="_loadbalancer" Access logs of the GVC's dedicated load balancer
container="_alerts" Threat-detection (Falco) alerts, with extra labels rule, priority, source

Platform health probes and unroutable-request noise are filtered out of _accesslog by design, so probe traffic never shows up there. Sidecar and system containers (istio-init, istio-validation, cpln-*, debugger-*) are never collected.

Why logs go missing (pipeline limits)

  • Lines over 16 KiB are cut at 16 KiB with a ... [truncated N bytes] suffix; empty lines are dropped.
  • Per-replica rate limit: each replica+container pair is capped (roughly 10,000 lines/s on managed clusters; effectively unlimited on BYOK). Excess lines are dropped, and when collection resumes one marker line appears: #### Replica logs were rate-limited: N lines in the last Xs were not collected ####.
  • Retention: org spec observability.logsRetentionDays, default 30 (0 turns log collection off entirely; the org-management skill owns org spec edits). Queries beyond retention return nothing, not an error.
  • Server caps: one query may span at most 31 days and times out after 2 minutes; tail sessions end after 30 minutes.

Cron workloads: logs for one execution

{gvc=, workload=} on a cron workload interleaves every past run. Each execution runs in its own replica, so scope to one run with the replica label plus the execution's time window.

1. List executions. mcp__cpln__list_deployments with location returns the full deployment JSON; status.jobExecutions[] holds the per-execution metadata (without location the tool returns only a readiness summary). CLI:

cpln workload get-deployments WORKLOAD --gvc GVC -o json | jq '
  [(.items // .)[] as $d | $d.status.jobExecutions[]? | . + {location: $d.name}]
  | sort_by(.startTime)[] | {location, name, status, startTime, completionTime, replica}'

Schema facts (nodelibs cronjob.ts) that bite scripts:

  • status is one of successful | failed | active | pending | invalid | removed (default pending). Running means status == "active" AND no completionTime; a missing completionTime alone proves nothing (the schema warns it is not an indication of success).
  • completionTime: null is stripped server-side and startTime may be absent for never-started runs — each field is present-with-value or absent, so guard flags with ${VAR:+--from "$VAR"} and jq // empty.
  • replica is optional: when missing, the run never got a pod and there are no logs — diagnose with the execution's containers map and message field (aggregated pod events) and mcp__cpln__get_workload_events.

2. Query that replica, time-bounded. MCP: raw query embedding ALL labels (structured params must be omitted) plus from/to. CLI:

cpln logs '{gvc="GVC", workload="WORKLOAD", location="LOCATION", replica="REPLICA-ID"}' \
  --org ORG --from START_TIME --to COMPLETION_TIME

jobExecutions timestamps are ISO 8601 and pass straight through. Pad the window a minute or two on each side, and widen it before concluding logs do not exist. For a live run (active, no completionTime), drop --to and add --tail.

Quick reference

Tool Use
mcp__cpln__get_workload_logs LogQL queries — structured params or raw query
mcp__cpln__list_deployments Deployment health; cron status.jobExecutions (pass location)
mcp__cpln__get_workload_events Probe failures, scheduling, restarts — events, not app logs

CI/CD and headless use: set CPLN_TOKEN and run cpln logs directly (no profile needed); the principal must hold org readLogs.

Troubleshooting

Symptom Cause and fix
403 requires permission "readLogs" in org Grant readLogs on the org via policy (it implies view)
Cannot parse now-2d into a valid date --from/--to are ISO 8601 only; relative lookback is --since
Empty result for known activity Window outside retention (default 30d), span over 31 days, or filters too narrow; app may not write to stdout/stderr
#### Replica logs were rate-limited ... #### marker Per-replica cap was hit; lines in that interval are gone — reduce log volume
Line ends with ... [truncated N bytes] 16 KiB per-line cap — emit smaller lines
Health checks absent from _accesslog Filtered by design; probe failures surface in mcp__cpln__get_workload_events
MCP rejects raw query combined with workload etc. A raw query replaces the structured params — embed all labels in the query itself
cpln workload log shows no app output That is the eventlog alias; use cpln logs

Related skills

Skill Owns
workload Deploy and diagnose flow, injected CPLN_* env vars, canonical URLs
metrics-observability PromQL, default metrics, Grafana alert rules, Prometheus federation
external-logging Shipping logs to S3, Datadog, Coralogix, and other providers
mk8s-byok mk8s cluster logs add-on (cluster_name / namespace labels)

Documentation

Install via CLI
npx skills add https://github.com/controlplane-com/ai-plugin --skill logql-observability
Repository Details
star Stars 1
call_split Forks 0
navigation Branch main
article Path SKILL.md
More from Creator
controlplane-com
controlplane-com Explore all skills →