observe-platform

name: observe-platform
description: Use when investigating a service issue, checking logs, querying metrics, or verifying the health of any notom-platform resource on Scaleway staging or prod. Queries Loki logs and Prometheus metrics directly via the cockpit API; never punts to Grafana.
argument-hint: [service-or-issue]
effort: high
context: fork
agent: stack-golem:platform-scout
allowed-tools: Read, Bash(scw config get:*), Bash(scw account project list:*), Bash(scw containers container list:*), Bash(scw rdb instance list:*), Bash(scw redis cluster list:*), Bash(scw instance server list:*), Bash(scw cockpit data-source list:*)

Voice

Read ../../persona.md at the start of this skill. That persona is canonical for all output of this skill. Do not restate persona tone, vocabulary, or emoji rules here; apply the persona with concrete workflow strings only when this skill needs them.

Scope: local to this skill's execution only. Once the final report is printed, revert to the session default voice immediately. Keep scope rules in this section; do not add a separate ## Persona scope section.

This skill is rigid — execute steps in order.

Language

Adapt all output to match the user's language. If the user writes in French, respond in French; if English, in English; if mixed, follow their lead. Technical identifiers (file paths, code symbols, CLI flags, tool names) stay in their original form regardless of language.

When you're invoked

Use this skill to investigate a service issue, check logs, query metrics, or verify the health of any notom-platform resource on Scaleway staging or prod.

Core principle: investigate first, ask later. All tools are available via CLI — never ask the user to open Grafana for something you can query yourself.

Step 0 — Preconditions

Verify scw CLI is available and authenticated (scw config get or scw account project list).
Verify curl, python3, and jq are available (curl and python3 query Loki/Prometheus; jq filters the Scaleway CLI output in Step 5).
Read ../../shared/infra-map.md — the single source of truth for endpoints, SSH aliases, and Loki resource_name values. Substitute its values wherever a step references an infra-map key (LOKI_ENDPOINT, PROM_ENDPOINT, LOKI_RESOURCE_*, SSH_AUTHENTIK_*, ...).
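
The tool checks above can be run in one pass. A minimal sketch, assuming the four binaries are looked up on PATH under these exact names; jq is included because Step 5 pipes through it. It confirms presence only, not that scw is authenticated.

```python
import shutil


def missing_tools(required=("scw", "curl", "python3", "jq")):
    """Return the required binaries that cannot be found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]
```

An empty list means every binary was found; authentication still has to be verified with one of the scw commands named above.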

Step 1 — Classify the issue

Start from $ARGUMENTS (a service name or issue description) when non-empty; otherwise classify from the user's report.

Service down / crash? → query Loki logs (Step 3) + Scaleway CLI state (Step 5)
Performance / usage? → query Prometheus metrics (Step 4)
Resource state unclear? → Scaleway CLI state (Step 5)
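
The routing above can be read as a lookup from issue class to the steps to run. A minimal sketch, assuming three hypothetical class names (down, performance, state) that are not part of this skill; only the step numbers come from the list above.

```python
# Hypothetical class names; the step assignments mirror the routing list.
ROUTES = {
    "down": ["Step 3 (Loki logs)", "Step 5 (Scaleway CLI state)"],
    "performance": ["Step 4 (Prometheus metrics)"],
    "state": ["Step 5 (Scaleway CLI state)"],
}


def needs_token(issue_class):
    """Step 2 is only required when the route touches Loki or Prometheus."""
    return any("Loki" in step or "Prometheus" in step for step in ROUTES[issue_class])


def plan(issue_class):
    """Return the ordered steps for one issue class, adding Step 2 when needed."""
    steps = list(ROUTES[issue_class])
    if needs_token(issue_class):
        steps.insert(0, "Step 2 (temporary cockpit token)")
    return steps
```

A pure resource-state check never creates a token, which keeps the "Token: deleted" line of the final report trivially true.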

Step 2 — Create a temporary cockpit token (REQUIRED for Loki & Prometheus)

The cockpit token secret is returned only once, at creation, and cannot be retrieved afterwards. Always create a temporary token, query, then delete it.

# 1. Create (the JSON response is the only place the secret ever appears)
TOKEN_JSON=$(scw cockpit token create name=tmp-query \
  token-scopes.0=read_only_logs \
  token-scopes.1=read_only_metrics \
  -o json)
# Quote the variable so the shell does not word-split or glob the JSON
TOKEN=$(printf '%s' "$TOKEN_JSON" | python3 -c "import json,sys; print(json.load(sys.stdin)['secret_key'])")
TOKEN_ID=$(printf '%s' "$TOKEN_JSON" | python3 -c "import json,sys; print(json.load(sys.stdin)['id'])")

# 2. Query (Steps 3 / 4)

# 3. Delete (ALWAYS, even on error)
scw cockpit token delete "$TOKEN_ID"

Use only the scopes you need: read_only_logs, read_only_metrics, write_only_logs, write_only_metrics.
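
The "delete even on error" requirement maps naturally onto a try/finally. A minimal sketch of the create, query, delete cycle as a context manager; it reuses the scw arguments from the commands above, while the names run_scw and temporary_cockpit_token and the injectable run parameter are hypothetical.

```python
import json
import subprocess
from contextlib import contextmanager


def run_scw(args):
    """Run the scw CLI and return its stdout; raises if the command fails."""
    return subprocess.run(
        ["scw", *args], check=True, capture_output=True, text=True
    ).stdout


@contextmanager
def temporary_cockpit_token(scopes=("read_only_logs", "read_only_metrics"), run=run_scw):
    """Create a token, yield its secret, and delete it on every exit path."""
    args = ["cockpit", "token", "create", "name=tmp-query"]
    args += [f"token-scopes.{i}={scope}" for i, scope in enumerate(scopes)]
    token = json.loads(run(args + ["-o", "json"]))
    try:
        yield token["secret_key"]
    finally:
        # Runs on normal exit and when the query inside the with-block raises.
        run(["cockpit", "token", "delete", token["id"]])
```

Wrapping the Step 3 and Step 4 queries in `with temporary_cockpit_token() as token:` keeps the secret in one local variable and guarantees the delete call. If the create call itself fails, nothing is yielded and there is nothing to delete.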

Step 3 — Loki (logs)

Endpoint: LOKI_ENDPOINT (see infra-map)
Auth header: X-Token: $TOKEN
API path: /loki/api/v1/

Service → resource_name mapping

See the LOKI_RESOURCE_* keys in ../../shared/infra-map.md (Atlas API → LOKI_RESOURCE_ATLAS_API, PostgreSQL → LOKI_RESOURCE_POSTGRES, Redis → LOKI_RESOURCE_REDIS, App → LOKI_RESOURCE_APP).

Query last N minutes

LOKI="<LOKI_ENDPOINT — see infra-map>"
# Portable across GNU and BSD/macOS date (date -v-30M is BSD-only)
NOW=$(date +%s)
START=$(( NOW - 30 * 60 ))000000000
END=${NOW}000000000

curl -sG -H "X-Token: $TOKEN" \
  --data-urlencode 'query={resource_name="<LOKI_RESOURCE_* — see infra-map>"}' \
  --data "limit=50&start=$START&end=$END&direction=backward" \
  "$LOKI/loki/api/v1/query_range" | python3 -c "
import json, sys, datetime
for stream in json.loads(sys.stdin.read()).get('data',{}).get('result',[]):
    for ts, line in stream.get('values',[]):
        t = datetime.datetime.fromtimestamp(int(ts)//1_000_000_000)
        try: msg = json.loads(line).get('message', line)
        except (ValueError, AttributeError): msg = line  # not JSON, or JSON that is not an object
        print(f'[{t}] {str(msg)[:200]}')
"

Discover labels

curl -s -H "X-Token: $TOKEN" "$LOKI/loki/api/v1/labels"
curl -s -H "X-Token: $TOKEN" "$LOKI/loki/api/v1/label/resource_name/values"
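
The response handling for the requests in this step can also be kept as small functions. A minimal sketch, assuming the standard Loki response shapes (data.result[].values for query_range, a flat data list for label values); the function names are hypothetical, and times are rendered in UTC rather than local time.

```python
import json
import time
from datetime import datetime, timezone

NS_PER_SECOND = 1_000_000_000


def window_ns(minutes, now=None):
    """Return (start, end) as nanosecond epoch integers, as used in the query above."""
    end = int(time.time() if now is None else now)
    return (end - minutes * 60) * NS_PER_SECOND, end * NS_PER_SECOND


def extract_lines(payload, max_len=200):
    """Flatten a query_range response into (utc_time, message) pairs."""
    rows = []
    for stream in payload.get("data", {}).get("result", []):
        for ts, line in stream.get("values", []):
            when = datetime.fromtimestamp(int(ts) // NS_PER_SECOND, tz=timezone.utc)
            try:
                msg = json.loads(line).get("message", line)
            except (ValueError, AttributeError):
                msg = line  # plain text, or JSON that is not an object
            rows.append((when, str(msg)[:max_len]))
    return rows


def missing_resources(label_values_payload, expected):
    """Return the expected resource_name values absent from the label response."""
    present = set(label_values_payload.get("data") or [])
    return sorted(set(expected) - present)
```

Running missing_resources against the infra-map values before concluding that a service is silent helps separate "no logs in the window" from "wrong resource_name".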

Step 4 — Prometheus (metrics)

Endpoint: PROM_ENDPOINT (see infra-map)
Auth header: X-Token: $TOKEN
API path: /prometheus/api/v1/

Metric families by service

Service	Metric prefix
VM Authentik	`instance_server_*`
PostgreSQL	`rdb_instance_postgresql_*`
Redis	`rkv_cluster_*`
Atlas API container	`serverless_container_*`
App S3 bucket	`object_storage_bucket_*`
CDN (Edge Services)	`edge_content_delivery_service_*`
Private network	`vpc_pn_*`

Query a metric

PROM="<PROM_ENDPOINT — see infra-map>"

# Instant query
curl -sG -H "X-Token: $TOKEN" \
  --data-urlencode 'query=serverless_container_requests_per_second' \
  "$PROM/prometheus/api/v1/query" | python3 -c "
import json,sys
for r in json.loads(sys.stdin.read()).get('data',{}).get('result',[]):
    print(r.get('metric'), '->', r.get('value'))
"

# Discover all metric names
curl -s -H "X-Token: $TOKEN" "$PROM/prometheus/api/v1/label/__name__/values"

Key health metrics

serverless_container_cpu_usage_ratio                                          # Container CPU (0–1)
serverless_container_memory_usage_bytes / serverless_container_memory_limit_bytes  # Container memory
serverless_container_instances_total                                          # Container scaling
rdb_instance_postgresql_pg_stat_activity_count                                # PostgreSQL connections
rkv_cluster_redis_memory_used_bytes / rkv_cluster_redis_memory_max_bytes      # Redis memory
instance_server_agent_up                                                      # VM health
instance_server_memory_used / instance_server_memory_total                    # VM memory
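
The ratio expressions above return one sample per series, so a health check reduces to comparing each value with a limit. A minimal sketch, assuming the standard Prometheus instant-query response shape; the function names are hypothetical, and any threshold (for example 0.9) is an assumption, not a value defined by this skill.

```python
def instant_values(payload):
    """Turn an instant-query response into (labels, float) pairs."""
    pairs = []
    for series in payload.get("data", {}).get("result", []):
        _timestamp, raw = series["value"]  # the sample value arrives as a string
        pairs.append((series.get("metric", {}), float(raw)))
    return pairs


def above(payload, threshold):
    """Keep only the series whose current value exceeds the threshold."""
    return [(labels, v) for labels, v in instant_values(payload) if v > threshold]
```

For instance, feeding the response of `rkv_cluster_redis_memory_used_bytes / rkv_cluster_redis_memory_max_bytes` to `above(payload, 0.9)` would list the Redis clusters using more than 90% of their memory.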

Step 5 — Scaleway CLI (management plane — no token needed)

scw containers container list -o json | jq '[.[] | {name, status, min_scale, max_scale}]'  # Container status & scaling
scw rdb instance list -o json | jq '[.[] | {name, status, node_type}]'                     # PostgreSQL health
scw redis cluster list -o json | jq '[.[] | {name, status, node_type}]'                    # Redis health
scw instance server list -o json | jq '[.[] | {name, state, public_ip}]'                   # VM (Authentik) health
scw cockpit data-source list -o json | jq '[.[] | {name, type, synchronized_with_grafana}]'  # Cockpit datasources
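
When the lists are long, filtering for anything not in a healthy state is quicker than reading every row. A minimal sketch that takes the raw `-o json` output of one of the commands above; the function name is hypothetical, and the healthy values ("ready" for containers, databases and Redis, "running" for instances) are assumptions to confirm against the actual output.

```python
import json


def unhealthy(resources_json, status_field, healthy):
    """Return (name, status) for every resource whose status is not healthy."""
    resources = json.loads(resources_json)
    return [
        (r.get("name"), r.get(status_field))
        for r in resources
        if r.get(status_field) not in healthy
    ]
```

Note that the instance list exposes `state` where the other commands expose `status`, so status_field has to be passed per resource type.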

Step 6 — SSH into Authentik VMs (when logs/metrics aren't enough)

The Authentik instances are plain Scaleway VMs (Ubuntu 24.04). SSH in as root to inspect docker, journald, or disk directly. Host aliases and IPs: see SSH_AUTHENTIK_STAGING / SSH_AUTHENTIK_PROD in ../../shared/infra-map.md.

ssh <SSH_AUTHENTIK_STAGING — see infra-map> 'docker ps'
ssh <SSH_AUTHENTIK_PROD — see infra-map> 'journalctl -u docker --since "1 hour ago"'

Both aliases live in ~/.ssh/config (User root, auth via the Bitwarden SSH agent at SSH_AGENT_SOCK — see infra-map). The infra-map's Maintenance section covers agent troubleshooting (ssh-add -l, locked agent) and how to refresh an IP after an instance rebuild.

Grafana (visual exploration only)

Dashboards: GRAFANA_DASHBOARDS (see ../../shared/infra-map.md)
If datasources appear empty: scw cockpit grafana sync-data-sources

Final report

stack-golem:observe-platform report
  Issue:        <what was investigated>
  Source:       <Loki / Prometheus / Scaleway CLI / SSH>
  Findings:     <logs / metrics summary>
  Diagnosis:    <root cause or status>
  Token:        deleted ✓
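
The layout above is fixed-width, so it can be rendered from the five fields. A minimal sketch, assuming a 14-character label column as in the template; the function name and the "NOT deleted" wording for the failure case are hypothetical.

```python
def format_report(issue, source, findings, diagnosis, token_deleted):
    """Render the final report in the fixed layout shown above."""
    rows = [
        ("Issue", issue),
        ("Source", source),
        ("Findings", findings),
        ("Diagnosis", diagnosis),
        ("Token", "deleted ✓" if token_deleted else "NOT deleted"),
    ]
    lines = ["stack-golem:observe-platform report"]
    # Two-space indent, then the label and colon padded to 14 characters.
    lines += [f"  {(label + ':').ljust(14)}{value}" for label, value in rows]
    return "\n".join(lines)
```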

Hard rules

Always delete the temporary cockpit token after querying — even on error paths.
Use only the scopes you need when creating tokens.
Never git commit, git push, or git rebase.
Never punt to Grafana for something queryable via CLI/API.
Never store or echo the token secret beyond the ephemeral shell variable.