observe-platform

Use when investigating a service issue, checking logs, querying metrics, or verifying the health of any notom-platform resource on Scaleway staging or prod. Queries Loki logs and Prometheus metrics directly via the cockpit API — never punts to Grafana.

By g-bastianelli · Updated 6/11/2026

name: observe-platform
description: Use when investigating a service issue, checking logs, querying metrics, or verifying the health of any notom-platform resource on Scaleway staging or prod. Queries Loki logs and Prometheus metrics directly via the cockpit API — never punts to Grafana.
argument-hint: [service-or-issue]
effort: high
context: fork
agent: stack-golem:platform-scout
allowed-tools: Read, Bash(scw config get:*), Bash(scw account project list:*), Bash(scw containers container list:*), Bash(scw rdb instance list:*), Bash(scw redis cluster list:*), Bash(scw instance server list:*), Bash(scw cockpit data-source list:*)

Voice

Read ../../persona.md at the start of this skill. That persona is canonical for all output of this skill. Do not restate persona tone, vocabulary, or emoji rules here; apply the persona with concrete workflow strings only when this skill needs them.

Scope: local to this skill's execution only. Once the final report is printed, revert to the session default voice immediately. Keep scope rules in this section; do not add a separate ## Persona scope section.

This skill is rigid — execute steps in order.

Language

Adapt all output to match the user's language. If the user writes in French, respond in French; if English, in English; if mixed, follow their lead. Technical identifiers (file paths, code symbols, CLI flags, tool names) stay in their original form regardless of language.

When you're invoked

Use this skill to investigate a service issue, check logs, query metrics, or verify the health of any notom-platform resource on Scaleway staging or prod.

Core principle: investigate first, ask later. All tools are available via CLI — never ask the user to open Grafana for something you can query yourself.

Step 0 — Preconditions

  1. Verify scw CLI is available and authenticated (scw config get or scw account project list).
  2. Verify curl, python3, and jq are available (curl and python3 query Loki/Prometheus; jq filters the scw output in Step 5).
  3. Read ../../shared/infra-map.md — the single source of truth for endpoints, SSH aliases, and Loki resource_name values. Substitute its values wherever a step references an infra-map key (LOKI_ENDPOINT, PROM_ENDPOINT, LOKI_RESOURCE_*, SSH_AUTHENTIK_*, ...).
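
The tool-availability part of these preconditions could be checked along the following lines. This is a minimal sketch, not part of the skill: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the check only confirms that each binary is on PATH. It does not confirm that scw is authenticated, which still needs `scw config get` or `scw account project list`.

```python
import shutil
from typing import Callable, Iterable, Optional

# Tools the steps rely on; jq is listed because Step 5 pipes scw output through it.
REQUIRED_TOOLS = ("scw", "curl", "python3", "jq")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` with no arguments uses the real PATH; an empty list means every tool was found. The `which` parameter exists only so the lookup can be swapped out in tests.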

Step 1 — Classify the issue

Start from $ARGUMENTS (a service name or issue description) when non-empty; otherwise classify from the user's report.

  • Service down / crash? → query Loki logs (Step 3) + Scaleway CLI state (Step 5)
  • Performance / usage? → query Prometheus metrics (Step 4)
  • Resource state unclear? → Scaleway CLI state (Step 5)
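
The routing above can be written down as a small lookup table. The sketch below is illustrative only: the `Issue` enum, `PLAN`, and `needs_token` are hypothetical names, and the token rule simply encodes that Steps 3 and 4 use the cockpit token while Step 5 does not.

```python
from enum import Enum


class Issue(Enum):
    DOWN = "service down / crash"
    PERFORMANCE = "performance / usage"
    STATE_UNCLEAR = "resource state unclear"


# Step numbers follow the routing bullets: Loki = 3, Prometheus = 4, Scaleway CLI = 5.
PLAN = {
    Issue.DOWN: (3, 5),
    Issue.PERFORMANCE: (4,),
    Issue.STATE_UNCLEAR: (5,),
}

# Only the Loki and Prometheus steps need the temporary cockpit token from Step 2.
TOKEN_STEPS = {3, 4}


def plan_for(issue: Issue) -> tuple:
    """Return the step numbers to run for a classified issue."""
    return PLAN[issue]


def needs_token(issue: Issue) -> bool:
    """True when at least one planned step queries Loki or Prometheus."""
    return any(step in TOKEN_STEPS for step in PLAN[issue])
```

With this table, a pure "resource state unclear" investigation skips token creation entirely, which keeps the number of short-lived tokens to a minimum.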

Step 2 — Create a temporary cockpit token (REQUIRED for Loki & Prometheus)

The cockpit token secret is returned only at creation and cannot be retrieved afterwards. Always create a temporary token, run the queries, then delete it.

# 1. Create
TOKEN_JSON=$(scw cockpit token create name=tmp-query \
  token-scopes.0=read_only_logs \
  token-scopes.1=read_only_metrics \
  -o json)
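# printf with a quoted variable keeps the JSON intact (an unquoted echo can mangle backslashes and whitespace)
TOKEN=$(printf '%s' "$TOKEN_JSON" | python3 -c "import json,sys; print(json.load(sys.stdin)['secret_key'])")
TOKEN_ID=$(printf '%s' "$TOKEN_JSON" | python3 -c "import json,sys; print(json.load(sys.stdin)['id'])")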

# 2. Query (Steps 3 / 4)

# 3. Delete (ALWAYS — even on error)
scw cockpit token delete "$TOKEN_ID"

Use only the scopes you need: read_only_logs, read_only_metrics, write_only_logs, write_only_metrics.
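
One way to make the "always delete, even on error" guarantee structural is a context manager whose cleanup runs in a `finally` block. The following is a sketch under stated assumptions rather than the skill's own implementation: `temporary_cockpit_token` and `run_scw` are hypothetical helpers, and the scw arguments are copied from the shell commands above.

```python
import json
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence


def run_scw(args: Sequence[str]) -> str:
    """Run the scw CLI and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(["scw", *args], check=True, capture_output=True, text=True)
    return result.stdout


@contextmanager
def temporary_cockpit_token(
    scopes: Sequence[str],
    run: Callable[[Sequence[str]], str] = run_scw,
    name: str = "tmp-query",
) -> Iterator[str]:
    """Create a cockpit token, yield its secret, and delete it afterwards."""
    args = ["cockpit", "token", "create", f"name={name}"]
    args += [f"token-scopes.{i}={scope}" for i, scope in enumerate(scopes)]
    args += ["-o", "json"]
    token = json.loads(run(args))
    try:
        yield token["secret_key"]
    finally:
        # Runs on normal exit and when the body raises, so the token never outlives the query.
        run(["cockpit", "token", "delete", token["id"]])
```

A caller would wrap the Loki or Prometheus queries in `with temporary_cockpit_token(["read_only_logs"]) as token:` and pass `token` as the X-Token header. The secret then lives only inside that block, in line with the rule against storing it.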

Step 3 — Loki (logs)

Endpoint: LOKI_ENDPOINT (see infra-map) · Auth header: X-Token: $TOKEN · API path: /loki/api/v1/

Service → resource_name mapping

See the LOKI_RESOURCE_* keys in ../../shared/infra-map.md (Atlas API → LOKI_RESOURCE_ATLAS_API, PostgreSQL → LOKI_RESOURCE_POSTGRES, Redis → LOKI_RESOURCE_REDIS, App → LOKI_RESOURCE_APP).

Query last N minutes

LOKI="<LOKI_ENDPOINT — see infra-map>"
START=$(( $(date +%s) - 30 * 60 ))000000000  # 30 minutes ago in nanoseconds; avoids the BSD-only "date -v" flag
END=$(date +%s)000000000

curl -sG -H "X-Token: $TOKEN" \
  --data-urlencode 'query={resource_name="<LOKI_RESOURCE_* — see infra-map>"}' \
  --data "limit=50&start=$START&end=$END&direction=backward" \
  "$LOKI/loki/api/v1/query_range" | python3 -c "
import json, sys, datetime
for stream in json.loads(sys.stdin.read()).get('data',{}).get('result',[]):
    for ts, line in stream.get('values',[]):
        t = datetime.datetime.fromtimestamp(int(ts)//1_000_000_000)
        try: msg = json.loads(line).get('message', line)
        except (ValueError, AttributeError): msg = line  # not JSON, or JSON that is not an object
        print(f'[{t}] {str(msg)[:200]}')
"

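The time-window arithmetic and the response parsing in the query above can also be expressed as two small functions, which makes them easier to check in isolation. This is a sketch, assuming the standard Loki `query_range` response shape (`data.result[].values` as `[timestamp_ns, line]` pairs); `window_ns` and `format_streams` are hypothetical names, and timestamps are rendered in UTC rather than local time.

```python
import datetime
import json
import time
from typing import Optional

NS_PER_SECOND = 1_000_000_000


def window_ns(minutes: int, now: Optional[float] = None) -> tuple:
    """Return (start, end) in Unix-epoch nanoseconds covering the last `minutes`."""
    end_s = int(time.time() if now is None else now)
    return (end_s - minutes * 60) * NS_PER_SECOND, end_s * NS_PER_SECOND


def format_streams(payload: dict, max_len: int = 200) -> list:
    """Flatten a query_range response into '[timestamp] message' lines (UTC)."""
    lines = []
    for stream in payload.get("data", {}).get("result", []):
        for ts, raw in stream.get("values", []):
            when = datetime.datetime.fromtimestamp(
                int(ts) // NS_PER_SECOND, tz=datetime.timezone.utc
            )
            try:
                parsed = json.loads(raw)
                # Structured logs carry the text under "message"; anything else is kept raw.
                msg = parsed.get("message", raw) if isinstance(parsed, dict) else raw
            except ValueError:
                msg = raw
            lines.append(f"[{when:%Y-%m-%d %H:%M:%S}] {str(msg)[:max_len]}")
    return lines
```

The `start` and `end` values from `window_ns(30)` correspond to the `START` and `END` shell variables, and `format_streams` mirrors the inline python3 filter.
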
Discover labels

curl -s -H "X-Token: $TOKEN" "$LOKI/loki/api/v1/labels"
curl -s -H "X-Token: $TOKEN" "$LOKI/loki/api/v1/label/resource_name/values"

Step 4 — Prometheus (metrics)

Endpoint: PROM_ENDPOINT (see infra-map) · Auth header: X-Token: $TOKEN · API path: /prometheus/api/v1/

Metric families by service

| Service | Metric prefix |
| --- | --- |
| VM Authentik | instance_server_* |
| PostgreSQL | rdb_instance_postgresql_* |
| Redis | rkv_cluster_* |
| Atlas API container | serverless_container_* |
| App S3 bucket | object_storage_bucket_* |
| CDN (Edge Services) | edge_content_delivery_service_* |
| Private network | vpc_pn_* |

Query a metric

PROM="<PROM_ENDPOINT — see infra-map>"

# Instant query
curl -sG -H "X-Token: $TOKEN" \
  --data-urlencode 'query=serverless_container_requests_per_second' \
  "$PROM/prometheus/api/v1/query" | python3 -c "
import json,sys
for r in json.loads(sys.stdin.read()).get('data',{}).get('result',[]):
    print(r.get('metric'), '->', r.get('value'))
"

# Discover all metric names
curl -s -H "X-Token: $TOKEN" "$PROM/prometheus/api/v1/label/__name__/values"
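
The instant-query output is easier to reason about once each sample is turned into a label set and a float. The sketch below assumes the standard Prometheus HTTP API response shape for vector results (`status`, `data.result[].metric`, `data.result[].value` as `[timestamp, "string value"]`); `parse_instant_vector` is a hypothetical helper name.

```python
def parse_instant_vector(payload: dict) -> list:
    """Return (labels, value) pairs from a Prometheus instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = []
    for series in payload.get("data", {}).get("result", []):
        # Each sample is [unix_timestamp, "value"]; the value arrives as a string.
        _timestamp, value = series["value"]
        samples.append((series.get("metric", {}), float(value)))
    return samples
```

Raising on a non-success status surfaces an expired token or a malformed query immediately, where the inline filter would silently print nothing.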

Key health metrics

serverless_container_cpu_usage_ratio                                          # Container CPU (0–1)
serverless_container_memory_usage_bytes / serverless_container_memory_limit_bytes  # Container memory
serverless_container_instances_total                                          # Container scaling
rdb_instance_postgresql_pg_stat_activity_count                                # PostgreSQL connections
rkv_cluster_redis_memory_used_bytes / rkv_cluster_redis_memory_max_bytes      # Redis memory
instance_server_agent_up                                                      # VM health
instance_server_memory_used / instance_server_memory_total                    # VM memory
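
Several of these entries are ratios of a usage metric to a limit metric. A minimal sketch of turning such pairs into a health signal follows. The 0.9 threshold and the names `usage_ratio` and `under_pressure` are assumptions made for illustration, not values taken from the skill.

```python
from typing import Dict, List, Optional


def usage_ratio(used: float, limit: float) -> Optional[float]:
    """Return used / limit, or None when the limit is missing or zero."""
    if limit <= 0:
        return None
    return used / limit


def under_pressure(ratios: Dict[str, Optional[float]], threshold: float = 0.9) -> List[str]:
    """List resources whose ratio is at or above the (assumed) threshold, sorted by name."""
    return sorted(
        name for name, ratio in ratios.items() if ratio is not None and ratio >= threshold
    )
```

Returning `None` for a zero limit avoids reporting a division error as an outage; such a resource is left out of the pressure list and can be checked through the Scaleway CLI instead.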

Step 5 — Scaleway CLI (management plane — no token needed)

scw containers container list -o json | jq '[.[] | {name, status, min_scale, max_scale}]'  # Container status & scaling
scw rdb instance list -o json | jq '[.[] | {name, status, node_type}]'                     # PostgreSQL health
scw redis cluster list -o json | jq '[.[] | {name, status, node_type}]'                    # Redis health
scw instance server list -o json | jq '[.[] | {name, state, public_ip}]'                   # VM (Authentik) health
scw cockpit data-source list -o json | jq '[.[] | {name, type, synchronized_with_grafana}]'  # Cockpit datasources
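
Where jq is unavailable, the same field projection can be done in Python on the `-o json` output. This is a sketch only: `project` is a hypothetical helper, and the sample record is invented to show the shape of the result, not real output from the platform.

```python
import json
from typing import Iterable, List


def project(resources: Iterable[dict], fields: Iterable[str]) -> List[dict]:
    """Keep only `fields` from each resource; absent keys become None, like jq's null."""
    wanted = list(fields)
    return [{field: resource.get(field) for field in wanted} for resource in resources]


# Hypothetical sample shaped like the JSON list that a `scw ... list -o json` call prints.
sample = json.loads(
    '[{"name": "atlas-api", "status": "ready", "min_scale": 1, "max_scale": 5, "region": "fr-par"}]'
)
print(project(sample, ["name", "status", "min_scale", "max_scale"]))
```

The field list passed to `project` plays the role of the `{name, status, ...}` object in the jq filters.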

Step 6 — SSH into Authentik VMs (when logs/metrics aren't enough)

The Authentik instances are plain Scaleway VMs (Ubuntu 24.04). SSH in as root to inspect docker, journald, or disk directly. Host aliases and IPs: see SSH_AUTHENTIK_STAGING / SSH_AUTHENTIK_PROD in ../../shared/infra-map.md.

ssh <SSH_AUTHENTIK_STAGING — see infra-map> 'docker ps'
ssh <SSH_AUTHENTIK_PROD — see infra-map> 'journalctl -u docker --since "1 hour ago"'

Both aliases live in ~/.ssh/config (User root, auth via the Bitwarden SSH agent at SSH_AGENT_SOCK — see infra-map). The infra-map's Maintenance section covers agent troubleshooting (ssh-add -l, locked agent) and how to refresh an IP after an instance rebuild.

Grafana (visual exploration only)

Dashboards: GRAFANA_DASHBOARDS (see ../../shared/infra-map.md)

If datasources appear empty: scw cockpit grafana sync-data-sources

Final report

stack-golem:observe-platform report
  Issue:        <what was investigated>
  Source:       <Loki / Prometheus / Scaleway CLI / SSH>
  Findings:     <logs / metrics summary>
  Diagnosis:    <root cause or status>
  Token:        deleted ✓
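
The report layout can be produced mechanically so that the column alignment stays stable. The sketch below reproduces the template's spacing; `render_report` is a hypothetical helper, and the "NOT deleted" wording for the failure case is an assumption, since the template only shows the success case.

```python
def render_report(
    issue: str, source: str, findings: str, diagnosis: str, token_deleted: bool
) -> str:
    """Render the final report with labels padded to the template's 14-character column."""
    rows = [
        ("Issue", issue),
        ("Source", source),
        ("Findings", findings),
        ("Diagnosis", diagnosis),
        # The failure wording is illustrative; the template only defines "deleted ✓".
        ("Token", "deleted ✓" if token_deleted else "NOT deleted"),
    ]
    lines = ["stack-golem:observe-platform report"]
    for label, value in rows:
        key = f"{label}:"
        lines.append(f"  {key:<14}{value}")
    return "\n".join(lines)
```

Passing `token_deleted` explicitly forces the caller to confirm the cleanup step before the report is printed.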

Hard rules

  • Always delete the temporary cockpit token after querying — even on error paths.
  • Use only the scopes you need when creating tokens.
  • Never git commit, git push, or git rebase.
  • Never punt to Grafana for something queryable via CLI/API.
  • Never store or echo the token secret beyond the ephemeral shell variable.