debug-pipeline - SKILL.md Agent Skill

name: debug-pipeline
description: Use when the user reports a GlassFlow pipeline failure, missing events, or unexpected behavior. Starts with component logs (via kubectl or whichever log backend the OTel Collector routes to) and only reaches for metrics when the logs are clean or ambiguous. Triggers on phrases like "pipeline is failing", "no events arriving in ClickHouse", "diagnose pipeline X", "why is pipeline Y stuck".
tools: Bash, Read

Debug a GlassFlow pipeline

Localize a failure or data-flow problem on a specific pipeline. Logs are the primary diagnostic surface because they usually contain the exact error; metrics come second, as a check for silent failures and throughput problems or as confirmation that a fix landed. Move from broad (which component) to narrow (which line in which log) as fast as the evidence allows.

Prerequisites

A reachable GlassFlow API. Default http://localhost:8081; ask the user if remote.
Log access. One of:
- kubectl configured for the cluster, with read access to the per-pipeline namespace (pipeline-<id>) and the GlassFlow control-plane namespace (typically glassflow).
- A log backend the OTel Collector routes to (Loki, Elasticsearch, CloudWatch, Splunk, etc., depending on the deployment). Logs are routed through the Collector only when the GlassFlow chart's logging exporter has been enabled; if the backend is not evident from the deployment, ask the user which one they use.
Optional: a Prometheus endpoint that scrapes the OTel Collector. Used only in step 4 when logs do not explain the symptom.

Inputs to gather

pipeline_id: required.
api_endpoint: optional; default http://localhost:8081.
Log source: required. Either kubectl_context (cluster name) or the URL and query syntax for the user's log backend.
A description of the symptom in the user's words (failing? slow? no rows? wrong data?). Use it to pick which component to read first.
prometheus_endpoint and metric_namespace: optional, only needed if the diagnosis lands in step 4.

Steps

1. Snapshot pipeline state and DLQ depth

Two cheap HTTP calls give the baseline. Always run both before reading any logs. Per-message failures (bad transform expression, malformed input JSON, filter eval error) do not appear in pod logs because the DLQ middleware swallows them; the only signal is the DLQ count rising. A pipeline can look perfectly healthy in the logs while the DLQ has thousands of rejected records.

ENDPOINT="<api-endpoint>"
PID="<id>"

# Control-plane state
curl -s "$ENDPOINT/api/v1/pipeline/$PID/health" | python3 -m json.tool

# DLQ depth (silent-rejection signal)
curl -s "$ENDPOINT/api/v1/pipeline/$PID/dlq/state" | python3 -m json.tool

Branch on overall_status from the health response:

Created: still reconciling, not running yet. Go to step 2 and read the operator log and the per-pipeline Pod events; metrics may not exist yet.
Running: control plane is up. Continue.
Failed: control plane errored. Go to step 2 and read the operator logs first, then the per-pipeline pods.
Stopping / Stopped / Terminating: lifecycle event in progress. Confirm with the user whether intentional.
Resuming: wait and re-poll.
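
The branching above can be sketched as a small shell helper. The function name and the wording of each action are illustrative, not part of GlassFlow; the status values are the ones listed above. The commented extraction line assumes overall_status is a top-level field of the health response.

```shell
# Map an overall_status value to the next action from the list above.
# Hypothetical helper: the name and the action wording are illustrative.
next_step_for_status() {
  case "$1" in
    Created)  echo "step 2: operator log and per-pipeline Pod events" ;;
    Running)  echo "continue to the DLQ check" ;;
    Failed)   echo "step 2: operator logs first, then per-pipeline pods" ;;
    Stopping|Stopped|Terminating)
              echo "confirm with the user whether this is intentional" ;;
    Resuming) echo "wait and re-poll" ;;
    *)        echo "unrecognized status: report it verbatim" ;;
  esac
}

# Assumed usage (field location in the JSON is an assumption):
# STATUS=$(curl -s "$ENDPOINT/api/v1/pipeline/$PID/health" \
#   | python3 -c 'import json,sys; print(json.load(sys.stdin)["overall_status"])')
# next_step_for_status "$STATUS"
```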

Then branch on unconsumed_messages from the DLQ response:

unconsumed_messages == 0 and total_messages == 0: no records have ever been rejected. Proceed to step 2.
unconsumed_messages == 0 but total_messages > 0: rejections happened in the past but have been drained. Note the last_received_at timestamp; if it is recent, treat as an active signal and run the DLQ-consume step below.
unconsumed_messages > 0: records are currently rejected. Do not stop at "logs are clean"; the actual error is in the DLQ payload. Run the DLQ-consume step now and surface the result before doing any further investigation.
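
A minimal sketch of the same three-way DLQ branch, taking the two counters as arguments. The function name and the labels clean, drained, and active are made up for illustration; only the comparison logic comes from the list above.

```shell
# Classify DLQ state from the two counters in the /dlq/state response.
# Hypothetical helper: prints one of clean | drained | active.
classify_dlq() {
  unconsumed="$1"
  total="$2"
  if [ "$unconsumed" -gt 0 ]; then
    echo "active"    # records are rejected right now: consume the DLQ first
  elif [ "$total" -gt 0 ]; then
    echo "drained"   # past rejections: check how recent last_received_at is
  else
    echo "clean"     # nothing was ever rejected: proceed to step 2
  fi
}
```

For example, classify_dlq 0 12 prints drained, which is the case where last_received_at decides whether the signal is still live.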

# When unconsumed_messages > 0: fetch the actual rejected records with their errors
curl -s "$ENDPOINT/api/v1/pipeline/$PID/dlq/consume?batch_size=10" | python3 -m json.tool

Each item in the response carries component (which stage rejected the record), error (the verbatim error message), and original_message (the record content). The error field usually localizes the problem completely. For example:

run transformation 5: cannot fetch source from string: stateless transform expression references a field that does not exist on the record.
convert result for column amount: unable to cast nil to float64: expression returned nil for a non-nullable column.
filter evaluation error: type string has no field source: filter expression references a nested field on a flat string.
unmarshal input data: invalid character ...: source is producing malformed JSON.

If the DLQ error explains the symptom, surface it to the user with the verbatim error and propose the fix. Skip steps 2 through 4 and go straight to the report in step 5.
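
The four example errors above can be matched on their stable prefixes. The sketch below is a hypothetical helper, not a GlassFlow tool; it only recognizes the patterns listed in this step and falls back to reporting the error verbatim for anything else.

```shell
# Map a DLQ error string to a short likely cause, based only on the
# example errors listed above. Hypothetical helper for illustration.
explain_dlq_error() {
  case "$1" in
    "run transformation "*"cannot fetch "*)
      echo "transform expression references a field missing from the record" ;;
    "convert result for column "*"unable to cast nil"*)
      echo "expression returned nil for a non-nullable column" ;;
    "filter evaluation error:"*)
      echo "filter expression does not match the record shape" ;;
    "unmarshal input data:"*)
      echo "source is producing malformed JSON" ;;
    *)
      echo "unrecognized: report the error verbatim" ;;
  esac
}
```

The cause string is a hint for the proposed fix only; the verbatim error still goes into the report.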

2. Pull the logs for the component(s) most likely to explain the symptom

The operator stamps one Deployment per role per pipeline in the namespace pipeline-<id> (or the operator's configured namespace when chart auto=false). Pick the roles to read based on the symptom:

| Symptom from the user | Read first | Read next if first is clean |
| --- | --- | --- |
| "Nothing arrives in ClickHouse" | sink, then ingestor | dedup, transform, OTLP receiver (if OTLP source) |
| "Pipeline fails on create or edit" | operator (control-plane namespace) | api (control-plane namespace) |
| "Started failing this morning" | the component the user last touched | sink (most common silent failure surface) |
| "Backpressure / slow" | ingestor + sink | dedup if enabled, NATS pod for stream pressure |
| Generic / unknown | sink, then ingestor, then operator | all the rest |

kubectl path

kubectl --context <ctx> -n pipeline-<id> get pods
kubectl --context <ctx> -n pipeline-<id> logs deploy/glassflow-sink-<id> --tail 200
kubectl --context <ctx> -n pipeline-<id> logs deploy/glassflow-ingestor-<id> --tail 200
# StatefulSet for dedup if enabled:
kubectl --context <ctx> -n pipeline-<id> logs sts/glassflow-dedup-<id> --tail 200

The shared OTLP receiver runs in the control-plane namespace, not the per-pipeline one:

kubectl --context <ctx> -n glassflow logs deploy/glassflow-otlp-receiver --tail 200

For lifecycle / Failed / Created issues, the operator log is where the reconciliation error surfaces:

kubectl --context <ctx> -n glassflow logs deploy/glassflow-controller-manager --tail 200

If a pod is crash-looping, add --previous to read the prior container's logs.
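
To find which pods need --previous, one option is to filter the pod listing on its RESTARTS column. This is a sketch under the assumption that the listing uses kubectl's default column order (NAME, READY, STATUS, RESTARTS, AGE); the function name is hypothetical.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names
# of pods whose RESTARTS count (4th column) is greater than zero.
# "$4 + 0" forces a numeric comparison; values like "12 (30s ago)" still
# work because only the leading number lands in field 4.
restarted_pods() {
  awk '$4 + 0 > 0 { print $1 }'
}

# Assumed usage:
# kubectl --context <ctx> -n pipeline-<id> get pods --no-headers \
#   | restarted_pods \
#   | while read -r pod; do
#       kubectl --context <ctx> -n pipeline-<id> logs "$pod" --previous --tail 200
#     done
```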

OTel Collector log backend path

When the deployment exports logs through the Collector to an external backend, query that backend with the standard log selectors. The Collector adds pipeline_id as an attribute on every log line it forwards. Three filter expressions cover the common backends:

Loki / LogQL: {namespace="pipeline-<id>"} | json | pipeline_id="<id>"
Elasticsearch / OpenSearch: filter on kubernetes.namespace_name: "pipeline-<id>" and attributes.pipeline_id: "<id>"
CloudWatch Logs Insights: log group /aws/containerinsights/<cluster>/application, filter fields @message | filter kubernetes.namespace_name = "pipeline-<id>"

Ask the user for the exact log-backend URL or saved query if they have one.
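
For the Loki case, the selector above can be built from the pipeline ID so it is not retyped by hand. The function name is hypothetical; the query text is the LogQL expression from the list above.

```shell
# Build the LogQL selector shown above for a given pipeline ID.
# Hypothetical helper; it only formats the string and runs no query.
loki_query() {
  printf '{namespace="pipeline-%s"} | json | pipeline_id="%s"\n' "$1" "$1"
}

# Assumed usage, if Grafana's logcli is installed and LOKI_ADDR is set:
# logcli query --since=1h --limit=200 "$(loki_query "$PID")"
```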

3. Read the logs for known error patterns

The most common failure modes and the log lines that announce them:

| Component | Pattern in the log | What it means |
| --- | --- | --- |
| ingestor | `consumer group rebalance failed`, `OFFSET_OUT_OF_RANGE` | Kafka topic was rotated/recreated, or the consumer group offsets are stale. Reset via `kafka-consumer-groups --reset-offsets`. |
| ingestor | `SASL authentication failed`, `EOF reading SASL` | Kafka credentials are wrong or expired. |
| dedup | `batch processing failed ... failed to compile transformation` | A stateless transform expression no longer compiles against the current schema version. The component signals itself to stop; the pipeline transitions to `Failed`. Roll back the schema version or fix the expression and re-edit the pipeline. |
| dedup | `batch processing failed ... signal to stop component is sent` | Same root cause as above: the component just sent its fatal signal and is shutting down. Confirm with the operator log to see the signal arrive. |
| dedup | `batch processing failed ... schema id is missing in header` | Messages arrive at dedup without the upstream `gf-schema-version-id` header. Indicates the ingestor or a chained component is misconfigured. |
| dedup | `badger: ErrConflict`, `value log out of space` | BadgerDB PVC is full or under contention. Grow `resources.transform[].storage.size`. |
| dedup | `dlq_records_written_total{reason="dedup_overflow"}` rate > 0 with no error in the pod log | Per-message filter, transform, or dedup failures are being silently routed to the DLQ. They do not appear in pod logs because the component routes them via the DLQ middleware and continues. Inspect the DLQ contents to see the actual errors (`unmarshal input data: ...`, `run transformation N: cannot fetch X from string`, `convert result for column X: unable to cast Y to Z`, `filter evaluation error: ...`). |
| sink | `code: 60 ... unknown table` | The destination table does not exist. Create it before retrying. |
| sink | `code: 47 ... unknown identifier`, `cannot parse string as <type>` | Sink mapping references a column that does not exist or the column type does not match the source field. |
| sink | `clickhouse: read tcp ... i/o timeout` | ClickHouse is unreachable or overloaded; sink will NACK and retry. |
| OTLP receiver | `missing x-glassflow-pipeline-id header` | OTLP client is misconfigured; the receiver rejects every request. |
| operator | `failed to reconcile`, `image pull backoff` | Reconciliation cannot finish; check Pod events for the underlying cause. |

If a log line directly explains the symptom, surface it to the user with the verbatim error and propose the fix. Skip step 4.
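
A quick way to apply the table is to filter a log stream for the listed fragments. The sketch below uses only substrings taken from the table; the variable and function names are hypothetical, and the list is not exhaustive, so a clean result still warrants reading the tail by eye.

```shell
# Fragments copied from the error-pattern table above, joined as one
# extended regular expression. Hypothetical names; the list is not complete.
KNOWN_ERRORS='consumer group rebalance failed|OFFSET_OUT_OF_RANGE|SASL authentication failed|EOF reading SASL|failed to compile transformation|signal to stop component is sent|schema id is missing in header|ErrConflict|value log out of space|unknown table|unknown identifier|cannot parse string as|i/o timeout|missing x-glassflow-pipeline-id header|failed to reconcile|image pull backoff'

# Read log lines on stdin and print only those matching a known pattern.
# "|| true" keeps the exit status zero when nothing matches.
scan_log() {
  grep -E "$KNOWN_ERRORS" || true
}

# Assumed usage:
# kubectl --context <ctx> -n pipeline-<id> logs deploy/glassflow-sink-<id> --tail 200 | scan_log
```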

Special case: per-message expression failures. The dedup component (which hosts the filter, dedup, and stateless transform stages) does not log per-message failures to the pod's stdout. A bad expression hitting a single record routes that record to the DLQ and keeps the pipeline running. The user sees this only in DLQ contents or metrics, not in kubectl logs. This is exactly what step 1's DLQ check catches; if you reached step 3 without seeing DLQ activity, this case is ruled out.

4. Cross-check with metrics when logs are clean or ambiguous

Logs do not catch silent failures (records routed to DLQ without raising an error, slow drift in throughput, backpressure that has not yet caused timeouts). When step 3 turned up nothing actionable, the metric surface confirms whether data is actually moving.

PROM="<prometheus_endpoint>"
NS="<metric_namespace>"

# Source rate (Kafka)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=rate(${NS}_gfm_kafka_records_read_total{pipeline_id=\"$PID\"}[5m])"

# Sink rate
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=rate(${NS}_gfm_clickhouse_records_written_total{pipeline_id=\"$PID\"}[5m])"

# DLQ rate by reason (silent rejection surface)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=sum by (reason) (rate(${NS}_gfm_dlq_records_written_total{pipeline_id=\"$PID\"}[5m]))"

# Backpressure right now (1 = active, 0 = clear)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=${NS}_gfm_ingestor_backpressure_active{pipeline_id=\"$PID\"}"
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=${NS}_gfm_component_backpressure_active{pipeline_id=\"$PID\"}"

# Stream fullness
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=${NS}_gfm_stream_depth_ratio{pipeline_id=\"$PID\"}"

What the patterns mean:

Source rate > 0, sink rate 0, no log errors: records are stuck somewhere between source and sink. Check the dedup or transform component logs again with a longer tail, then check NATS pod logs for publish failures.
DLQ rate > 0, no sink errors in logs: records are being silently rejected before reaching the sink. The reason label tells you which classifier triggered. The /dlq/consume endpoint (already covered in step 1 above) is the source of truth for the actual error per record; the metric rate just tells you how fast the rejections are happening. Note that reason="dedup_overflow" is a catch-all for any failure inside the dedup component, including bad filter expression, transform expression error, type-conversion failure, or actual BadgerDB-state overflow. The label alone does not tell you which one; you have to read the per-record error.
Backpressure active but no errors: the pipeline is degraded but still keeping up, not failing. This is a tune-pipeline job, not a debug job.
stream_depth_ratio sustained > 0.8: the stream is filling because the downstream consumer is slower than the source.
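
The four readings above can be combined into one check once the values have been pulled out of the Prometheus responses. This is a sketch: the function name and message wording are illustrative, and a single sample cannot show that a depth ratio is sustained, so repeat the query over time before acting on that line.

```shell
# Interpret one sample of the step 4 metrics.
# Args: source_rate sink_rate dlq_rate backpressure(0|1) depth_ratio
# Hypothetical helper; thresholds come from the list above.
interpret_metrics() {
  awk -v src="$1" -v sink="$2" -v dlq="$3" -v bp="$4" -v depth="$5" 'BEGIN {
    found = 0
    if (src > 0 && sink == 0) { print "stuck between source and sink"; found = 1 }
    if (dlq > 0)              { print "records silently rejected to DLQ"; found = 1 }
    if (bp == 1)              { print "backpressure active: tuning job"; found = 1 }
    if (depth > 0.8)          { print "stream filling: consumer slower than source"; found = 1 }
    if (!found)               { print "no metric anomaly" }
  }'
}
```

For example, interpret_metrics 120 0 0 0 0.2 prints only the stuck line, which points back at the dedup, transform, and NATS logs.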

For component-level detail when needed:

# Per-stage processing latency (p95)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.95, sum by (stage, le) (rate(${NS}_gfm_processing_duration_seconds_bucket{pipeline_id=\"$PID\"}[5m])))"

# Processor message rates by status (ok / error / filtered / duplicate)
curl -sG "$PROM/api/v1/query" \
  --data-urlencode "query=sum by (component, status) (rate(${NS}_gfm_processor_messages_total{pipeline_id=\"$PID\"}[5m]))"

5. Output

Report a structured diagnosis:

Pipeline:    <id>
Status:      <overall_status>
DLQ:         <unconsumed_messages> unconsumed / <total_messages> total
             last_received_at=<timestamp>
Bottleneck:  <ingestor | dedup | join | sink | source | none>
Evidence:
  primary error:    "<verbatim DLQ error OR pod log line>"
  source:           <DLQ component / pod role / metric name>
  metric backup:    <only when reached: rate / backpressure values>
Next move: <run tune-pipeline | check kafka creds | drain DLQ | fix CH schema | escalate>

Always report DLQ counts even when zero (the absence is a positive signal). Always cite the verbatim error (DLQ payload, log line, or metric value) that drove the conclusion. Do not propose destructive actions (delete, restart, purge) without explicit user confirmation. The skill diagnoses; the user authorizes the fix.

Do not report "no problem found" while unconsumed_messages > 0. A non-zero DLQ is a problem; if you cannot explain it from the DLQ payload, escalate to the user with the evidence and let them decide what to do.
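
The template and the DLQ rule can be sketched together as one helper. It assumes that a bottleneck of none is how "no problem found" gets reported, and it leaves out the optional metric backup line; the function name and argument order are hypothetical.

```shell
# Print the structured diagnosis from positional arguments.
# Args: id status unconsumed total last_received_at bottleneck error source next_move
# Refuses to print when the DLQ is non-empty but the bottleneck is "none".
print_diagnosis() {
  if [ "$3" -gt 0 ] && [ "$6" = "none" ]; then
    echo "refusing: $3 unconsumed DLQ records, bottleneck cannot be none" >&2
    return 1
  fi
  printf 'Pipeline:    %s\n' "$1"
  printf 'Status:      %s\n' "$2"
  printf 'DLQ:         %s unconsumed / %s total\n' "$3" "$4"
  printf '             last_received_at=%s\n' "$5"
  printf 'Bottleneck:  %s\n' "$6"
  printf 'Evidence:\n'
  printf '  primary error:    "%s"\n' "$7"
  printf '  source:           %s\n' "$8"
  printf 'Next move: %s\n' "$9"
}
```

Passing the DLQ counts as required arguments means they are always reported, even when both are zero.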

Common pitfalls

overall_status: Running does not mean events are flowing. It only confirms the control plane stamped the workloads. If logs are clean, step 4 metrics are the next check.
Pod naming varies. The operator's exact Deployment/StatefulSet naming pattern can drift between versions. Always run kubectl get pods -n pipeline-<id> before constructing log commands.
--previous is essential for crash loops. A pod restarting every few seconds has its current container too young to log anything useful; the previous container's logs hold the actual error.
The operator log is in the control-plane namespace, not the per-pipeline one. Forgetting this hides Failed and Created root causes.
DLQ uses ack-on-read. A record is acknowledged as soon as it is read, so a crash between the read and the end of processing can lose DLQ records. Cumulative counters (total_messages) survive; live depth (unconsumed_messages) does not.
Encryption key absent silently disables encryption. A pipeline written without the key reads back as plaintext; one written with the key fails to decrypt if the key is rotated or removed. If the user reports "config looks corrupted", check the encryption-key Secret before deeper digging.
Stream naming differs between api and operator. Stream names in api-side logs do not match stream names in operator-side logs or in the stream_depth_ratio metric. Confirm which side produced the name before assuming you know which stream the user means.
The DLQ reason label dedup_overflow is overloaded. Every dedup-component failure (filter expression error, transform expression error, type-conversion failure, BadgerDB state issue, etc.) lands under this single label. Do not infer the cause from the label alone; read the per-record error in the DLQ payload.