workflow-rca


Runtime evidence collection for workflow and CI incidents, covering workflow run ids, CI job ids, GCP Cloud Logging queries, GCS artifact lookup, workflow DB artifacts, cache hit/miss proof, and worker behavior. Use when the input is a run identifier or a question about why a workflow failed, retried, slowed down, loaded cache, or built fresh. This skill gathers and proves runtime facts, then hands its evidence packet to rca-investigation for the cumulative code-level root cause.

By 1ikeadragon. Updated 6/10/2026

name: workflow-rca
description: Runtime evidence collection for workflow and CI incidents, covering workflow run ids, CI job ids, GCP Cloud Logging queries, GCS artifact lookup, workflow DB artifacts, cache hit/miss proof, and worker behavior. Use when the input is a run identifier or a question about why a workflow failed, retried, slowed down, loaded cache, or built fresh. This skill gathers and proves runtime facts, then hands its evidence packet to rca-investigation for the cumulative code-level root cause.
version: 1.1.0

Workflow RCA

Use this skill for workflow-centric runtime evidence collection. It is designed to work with rca-investigation: this skill gathers and proves what happened at the workflow/runtime/artifact layer; rca-investigation integrates that evidence with code, config, DB interpretation, git history, and systemic fixes.

Do not make this skill project-specific. Do not assume a default GCP project, bucket, namespace, workflow type, table schema, or cache convention unless the user, runtime logs, deployment config, or repository code proves it.

Core Rules

  • Start from the exact workflow id, run id, job id, PR number, commit, finding id, pod, namespace, timestamp, or artifact path the user gave.
  • Always say how you found each fact, with the command/query/log/artifact that proves it.
  • Separate proven runtime facts from inferred workflow behavior.
  • If the user did not explicitly ask to download a DB or artifact, find the object path and metadata first, then ask before downloading.
  • For latest/current status, trust live logs and provider APIs over memory or stale artifacts.
  • Empty results from one project, log name, namespace, or time window do not prove absence.

1. Classify The Workflow Input

Use this structure first:

Workflow input classification:
- input type: Temporal workflow id / Temporal run id / CI run URL / job id / pod+namespace / GCS object / DB path / finding id / cache key / log excerpt / timestamp only / mixed / unknown
- parsed identifiers:
- missing identifiers:
- likely platform: Temporal / CI / Kubernetes worker / batch job / serverless job / unknown
- likely artifact classes: logs / GCS objects / DB / metrics / traces / code config / git history
- first lookup path:

If the identifier is ambiguous, search for the literal string in logs/artifacts before assigning meaning.
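As a rough first pass, the shape of the input can suggest a classification before any search. The sketch below is only an illustration: `classify_input` is a hypothetical helper, and its patterns (a `gs://` prefix, a URL scheme, a SQLite-style file extension, a UUID-shaped string) are heuristics, not proof. Treat the result as a hint for the first lookup path and still confirm it by searching for the literal string.

```shell
# Hypothetical helper: guess an input type from the shape of the string.
# The patterns are assumptions for illustration; a match is a hint, not proof.
classify_input() {
  case "$1" in
    gs://*)
      echo "GCS object" ;;
    http://*|https://*)
      echo "URL (possibly a CI run URL)" ;;
    *.db|*.sqlite|*.sqlite3)
      echo "DB path" ;;
    [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-*)
      # Temporal run ids are UUIDs, but so are many other identifiers.
      echo "UUID-shaped (possibly a Temporal run id)" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Because the `gs://` branch comes first, a path such as `gs://<bucket>/<prefix>/results.db` is reported as a GCS object rather than a local DB path, which matches the rule of locating the object before downloading it.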

2. GCP And Cloud Logging Workflow

When GCP is plausible, verify identity and available logging surface before widening slow or empty queries:

gcloud config list --format='text(core.project,core.account)'
gcloud auth list --filter=status:ACTIVE
gcloud projects list
gcloud logging logs list --project="<project>"

Project selection rules:

  • Prefer an explicit project from the user.
  • Next prefer project values from deployment config, workflow metadata, artifact paths, or logs.
  • Treat the local active gcloud project as a hint, not proof.
  • If no project is known, state that log lookup is blocked and show the exact project-dependent command to run.
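The precedence above can be sketched as a small shell function. `resolve_project` and its two arguments are hypothetical names introduced for illustration; the function deliberately never falls back to the active gcloud project, since that value is only a hint.

```shell
# Hypothetical helper: pick a project following the precedence above.
# $1 = project named explicitly by the user (may be empty)
# $2 = project found in deployment config, workflow metadata,
#      artifact paths, or logs (may be empty)
resolve_project() {
  if [ -n "$1" ]; then
    printf '%s (source: user)\n' "$1"
  elif [ -n "$2" ]; then
    printf '%s (source: discovered evidence)\n' "$2"
  else
    # No proven project: report the block and show the command that needs one.
    printf 'BLOCKED: no proven project. Needs: gcloud logging read "<filter>" --project="<project>"\n' >&2
    return 1
  fi
}
```

For example, `resolve_project "" "<project-from-logs>"` prints the discovered value along with where it came from, so the source of the project can be recorded in the evidence packet.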

Minimum literal workflow-id log search:

PROJECT="<project>"
WORKFLOW_ID="<workflow-id>"
gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=200 --format=json
gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=200 --format='table(timestamp,resource.type,resource.labels.namespace_name,resource.labels.pod_name,resource.labels.container_name,severity,logName)'

If a failure window is known, use timestamp bounds instead of broad freshness:

gcloud logging read "\"${WORKFLOW_ID}\" AND timestamp>=\"<start-iso>\" AND timestamp<=\"<end-iso>\"" --project="$PROJECT" --limit=500 --format=json
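Nested quoting makes bounded filters easy to get wrong by hand. A minimal sketch, assuming a hypothetical `window_filter` helper, builds the same filter string from its three parts so the escaping lives in one place. It assumes the workflow id contains no double quotes and that both bounds are RFC 3339 timestamps.

```shell
# Hypothetical helper: build a Cloud Logging filter for one identifier
# bounded by a time window.
# $1 = literal identifier, $2 = start timestamp, $3 = end timestamp
window_filter() {
  printf '"%s" AND timestamp>="%s" AND timestamp<="%s"' "$1" "$2" "$3"
}

# Usage (not executed here; requires gcloud and a proven project):
#   gcloud logging read "$(window_filter "$WORKFLOW_ID" "<start-iso>" "<end-iso>")" \
#     --project="$PROJECT" --limit=500 --format=json
```

Widening the window by a few minutes on each side is often worthwhile, since the first relevant log line may precede the reported failure time.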

If logs reveal namespace, pod, worker service, task queue, run id, activity, or request id, pivot into narrower queries:

gcloud logging read "\"${WORKFLOW_ID}\" AND resource.labels.namespace_name=\"<namespace>\"" --project="$PROJECT" --freshness=14d --limit=500 --format=json
gcloud logging read "\"<run-id>\" OR \"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=500 --format=json
gcloud logging read "\"<task-queue>\" AND (\"error\" OR \"failed\" OR \"timeout\" OR \"panic\" OR \"exception\" OR \"429\")" --project="$PROJECT" --freshness=14d --limit=500 --format=json

If gcloud is flaky but auth works, use the Cloud Logging API directly:

TOKEN="$(gcloud auth print-access-token)"
curl -sS -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  "https://logging.googleapis.com/v2/entries:list" \
  -d '{"resourceNames":["projects/<project>"],"filter":"\"<workflow-id>\"","pageSize":200}'

3. Temporal Workflow Evidence

For Temporal workflows, extract these fields from logs, Temporal metadata, or worker output before declaring the incident identified:

  • namespace
  • workflow id
  • run id
  • workflow type
  • task queue
  • worker service/deployment
  • first observed timestamp
  • first failure timestamp
  • last retry/recovery timestamp
  • terminal state if known
  • exact error type and message
  • activity name, workflow task phase, or child workflow if present
  • pod/container/service instance that emitted the decisive log line

Distinguish:

  • retryable activity failure
  • workflow task failure
  • child workflow failure
  • worker crash/restart
  • external API limit or timeout
  • cache/artifact load failure
  • downstream scanner or validation failure

4. Build A Runtime Timeline

Pull enough logs before, during, and after the failure to answer:

  • current phase or terminal state
  • cache hit, parent hit, miss, bypass, or rebuild
  • major activity durations
  • worker/pod/container identity
  • error signatures such as 429, RESOURCE_EXHAUSTED, TTFTTimeoutError, config failures, checkpoint errors, upload/download failures, DB errors, and provider timeouts
  • recovery signal or lack of recovery

Separate symptoms from causes. Do not infer root cause from timestamp adjacency alone.

Use this format:

Workflow timeline:
- <timestamp> | <source> | <event> | proof:
- <timestamp> | <source> | <event> | proof:

Symptoms:
- ...

Candidate causes:
- ...

Still unknown:
- ...
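The timeline lines above can be drafted mechanically from `gcloud logging read ... --format=json` output. This is a sketch, assuming `jq` is installed; `logs_to_timeline` is a hypothetical helper, and the choice of pod name as the source and `insertId` as the proof reference is an assumption rather than a requirement of this skill. The sort is lexical, so it assumes uniformly formatted UTC timestamps.

```shell
# Hypothetical helper: read a JSON array of Cloud Logging entries on stdin
# and print one timeline line per entry, oldest first.
logs_to_timeline() {
  jq -r '
    sort_by(.timestamp)[]
    | "- \(.timestamp) | \(.resource.labels.pod_name // .resource.type // "unknown") | \(.textPayload // .jsonPayload.message // "<structured payload>") | proof: insertId=\(.insertId // "n/a")"
  '
}

# Usage (not executed here; requires gcloud and a proven project):
#   gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" \
#     --freshness=14d --limit=200 --format=json | logs_to_timeline
```

The output is a draft only. Each line still needs a human judgment about whether it is a symptom, a candidate cause, or noise, and the proof column should be completed with the query that returned the entry.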

5. GCS And Artifact Handling

Trace object paths from logs, workflow metadata, config, or code. Do not guess bucket layout from memory.

Before downloading, confirm metadata:

gcloud storage objects describe "gs://<bucket>/<object>"
gcloud storage ls -l "gs://<bucket>/<prefix>/**"

If gsutil is the available tool:

gsutil ls -L "gs://<bucket>/<object>"
gsutil ls -l "gs://<bucket>/<prefix>/**"

Record:

  • object path
  • size
  • generation/metageneration when available
  • updated timestamp
  • content type if useful
  • whether multiple writes/overwrites happened during the run
  • whether the object timestamp aligns with the workflow timeline

Ask before downloading unless the user explicitly requested download/DB inspection.

6. SQLite / DB Artifact Inspection

When the user approves DB download or provides a local DB path, inspect schema first:

sqlite3 "<db-path>" ".tables"
sqlite3 "<db-path>" ".schema"
sqlite3 "<db-path>" "SELECT name, type FROM sqlite_master ORDER BY type, name;"
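Once the schema is known, per-table row counts give a quick sense of what the artifact actually contains. A minimal sketch, assuming the `sqlite3` CLI is available: `db_row_counts` is a hypothetical helper, and it opens the file with `-readonly` so the inspection cannot modify the evidence.

```shell
# Hypothetical helper: print "<table> <row-count>" for every user table.
# $1 = path to a local SQLite file
db_row_counts() {
  db_path="$1"
  # List user tables, skipping SQLite's internal sqlite_* tables.
  sqlite3 -readonly "$db_path" \
    "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name;" |
  while IFS= read -r table; do
    count="$(sqlite3 -readonly "$db_path" "SELECT COUNT(*) FROM \"${table}\";")"
    printf '%s %s\n' "$table" "$count"
  done
}
```

An empty table is a runtime fact worth recording, but it does not by itself explain why the table is empty. That interpretation belongs to rca-investigation.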

7. Cache And Rebuild Proof

Do not say "cache hit" unless logs prove the specific cache state:

  • exact hit
  • parent hit
  • miss
  • bypass
  • stale object reuse
  • rebuild/fresh generation
  • restore failure
  • save/upload failure

If asked how to evict or bypass cache, separate:

  • bypass for one run
  • namespace or key isolation
  • actual object deletion
  • cache TTL or retention change

8. Workflow Evidence Packet

End workflow evidence collection with a packet that rca-investigation can consume:

Workflow evidence packet:
- identifiers:
- project/account/log scope:
- runtime platform:
- timeline:
- decisive log evidence:
- artifacts found:
- artifacts not downloaded:
- DB paths inspected:
- cache state:
- errors and provider signals:
- recovery signals:
- commands/queries used:
- proven:
- inferred:
- still unknown:
- recommended code/config paths to inspect next:

If a full RCA is requested, hand this packet to rca-investigation and continue with code/config/git-history proof before naming a root cause.

Install via CLI
npx skills add https://github.com/1ikeadragon/slopflow --skill workflow-rca