workflow-rca - SKILL.md Agent Skill

name: workflow-rca
description: Runtime evidence collection for workflow and CI incidents, covering workflow run ids, CI job ids, GCP Cloud Logging queries, GCS artifact lookup, workflow DB artifacts, cache hit/miss proof, and worker behavior. Use when the input is a run identifier or a question about why a workflow failed, retried, slowed down, loaded cache, or built fresh. This skill gathers and proves runtime facts, then hands its evidence packet to rca-investigation for the cumulative code-level root cause.
version: 1.1.0

Workflow RCA

Use this skill for workflow-centric runtime evidence collection. It is designed to work with rca-investigation: this skill gathers and proves what happened at the workflow/runtime/artifact layer; rca-investigation integrates that evidence with code, config, DB interpretation, git history, and systemic fixes.

Do not make this skill project-specific. Do not assume a default GCP project, bucket, namespace, workflow type, table schema, or cache convention unless the user, runtime logs, deployment config, or repository code proves it.

Core Rules

Start from the exact workflow id, run id, job id, PR number, commit, finding id, pod, namespace, timestamp, or artifact path the user gave.
Always say how you found each fact, with the command/query/log/artifact that proves it.
Separate proven runtime facts from inferred workflow behavior.
If the user did not explicitly ask to download a DB or artifact, find the object path and metadata first, then ask before downloading.
For latest/current status, trust live logs and provider APIs over memory or stale artifacts.
Empty results from one project, log name, namespace, or time window do not prove absence.

1. Classify The Workflow Input

Use this structure first:

Workflow input classification:
- input type: Temporal workflow id / Temporal run id / CI run URL / job id / pod+namespace / GCS object / DB path / finding id / cache key / log excerpt / timestamp only / mixed / unknown
- parsed identifiers:
- missing identifiers:
- likely platform: Temporal / CI / Kubernetes worker / batch job / serverless job / unknown
- likely artifact classes: logs / GCS objects / DB / metrics / traces / code config / git history
- first lookup path:

If the identifier is ambiguous, search for the literal string in logs/artifacts before assigning meaning.
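
A first pass at the input type can be sketched from the shape of the string alone. The helper name `classify_input`, its patterns, and its labels are illustrative assumptions, not part of this skill; anything it cannot place stays "unknown" so the literal-string search still runs before meaning is assigned.

```shell
# Hypothetical helper: guess an input type from the shape of the string alone.
# The patterns and labels are illustrative assumptions, not a fixed convention.
classify_input() {
  case "$1" in
    gs://*)                  echo "GCS object" ;;
    http://*|https://*)      echo "URL (possible CI run URL)" ;;
    *.db|*.sqlite|*.sqlite3) echo "DB path" ;;
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]*)
                             echo "timestamp only" ;;
    *)
      # A UUID shape is only a candidate run id; confirm it in logs first.
      if printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'; then
        echo "unknown (UUID-shaped, possible run id)"
      else
        echo "unknown"
      fi
      ;;
  esac
}
```

Treat the output as a starting guess for the "input type" field, never as proof of what the identifier means.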

2. GCP And Cloud Logging Workflow

When GCP is plausible, verify the active identity and the available log names before widening a slow or empty query:

gcloud config list --format='text(core.project,account)'
gcloud auth list --filter=status:ACTIVE
gcloud projects list
gcloud logging logs list --project="<project>"

Project selection rules:

Prefer an explicit project from the user.
Next prefer project values from deployment config, workflow metadata, artifact paths, or logs.
Treat the local active gcloud project as a hint, not proof.
If no project is known, state that log lookup is blocked and show the exact project-dependent command to run.
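
The precedence above can be sketched as a small function that also records where the project value came from. `resolve_project` and its source labels are hypothetical names used only for illustration; the caller supplies each candidate value, and an empty string means "not known".

```shell
# Hypothetical helper: choose a project by precedence and report its source.
# $1 explicit project from the user
# $2 project proven by deployment config, workflow metadata, artifact paths, or logs
# $3 local active gcloud project (a hint only)
resolve_project() {
  if [ -n "$1" ]; then
    echo "$1 (source: user)"
  elif [ -n "$2" ]; then
    echo "$2 (source: config-or-logs)"
  elif [ -n "$3" ]; then
    echo "$3 (source: gcloud-hint, not proof)"
  else
    echo "BLOCKED: no project known; log lookup needs --project=<project>" >&2
    return 1
  fi
}

# The active gcloud project would be passed only as the last-resort hint, e.g.:
# resolve_project "" "" "$(gcloud config get-value project 2>/dev/null)"
```

Carrying the source label into the evidence packet keeps a hinted project from being reported as a proven one.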

Minimum literal workflow-id log search:

PROJECT="<project>"
WORKFLOW_ID="<workflow-id>"
gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=200 --format=json
gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=200 --format='table(timestamp,resource.type,resource.labels.namespace_name,resource.labels.pod_name,resource.labels.container_name,severity,logName)'

If a failure window is known, use timestamp bounds instead of broad freshness:

gcloud logging read "\"${WORKFLOW_ID}\" AND timestamp>=\"<start-iso>\" AND timestamp<=\"<end-iso>\"" --project="$PROJECT" --limit=500 --format=json
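
Nested quoting in these filters is easy to get wrong by hand. One possible approach is to assemble the filter string in a function and pass the result to `gcloud logging read`; `build_log_filter` is a hypothetical helper, and it assumes the id and timestamps contain no double quotes.

```shell
# Hypothetical helper: build a Cloud Logging filter for a literal id,
# optionally bounded in time.
# $1 literal id, $2 start timestamp, $3 end timestamp (pass "" to omit a bound).
build_log_filter() {
  local filter="\"$1\""
  if [ -n "$2" ]; then filter="$filter AND timestamp>=\"$2\""; fi
  if [ -n "$3" ]; then filter="$filter AND timestamp<=\"$3\""; fi
  printf '%s\n' "$filter"
}

# Example use with the variables defined earlier:
# FILTER="$(build_log_filter "$WORKFLOW_ID" "<start-iso>" "<end-iso>")"
# gcloud logging read "$FILTER" --project="$PROJECT" --limit=500 --format=json
```

Record the final filter string in the evidence packet so the query can be rerun exactly.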

If logs reveal namespace, pod, worker service, task queue, run id, activity, or request id, pivot into narrower queries:

gcloud logging read "\"${WORKFLOW_ID}\" AND resource.labels.namespace_name=\"<namespace>\"" --project="$PROJECT" --freshness=14d --limit=500 --format=json
gcloud logging read "\"<run-id>\" OR \"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=500 --format=json
gcloud logging read "\"<task-queue>\" AND (\"error\" OR \"failed\" OR \"timeout\" OR \"panic\" OR \"exception\" OR \"429\")" --project="$PROJECT" --freshness=14d --limit=500 --format=json

If gcloud is flaky but auth works, use the Cloud Logging API directly:

TOKEN="$(gcloud auth print-access-token)"
curl -sS -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  "https://logging.googleapis.com/v2/entries:list" \
  -d '{"resourceNames":["projects/<project>"],"filter":"\"<workflow-id>\"","pageSize":200}'
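
A single `entries:list` call returns at most `pageSize` entries, and the response carries a `nextPageToken` when more exist, so one page is not proof of the full log set. A minimal sketch of a request-body builder for follow-up pages is below; `entries_list_body` is a hypothetical helper, and it assumes the project, id, and token contain no double quotes or backslashes because it does no JSON escaping.

```shell
# Hypothetical helper: build an entries:list request body.
# $1 project, $2 literal id to search for,
# $3 nextPageToken from the previous response ("" for the first page).
# No JSON escaping is performed; inputs must not contain quotes or backslashes.
entries_list_body() {
  if [ -n "$3" ]; then
    printf '{"resourceNames":["projects/%s"],"filter":"\\"%s\\"","pageSize":200,"pageToken":"%s"}\n' "$1" "$2" "$3"
  else
    printf '{"resourceNames":["projects/%s"],"filter":"\\"%s\\"","pageSize":200}\n' "$1" "$2"
  fi
}

# Example use with the token obtained above:
# curl -sS -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
#   "https://logging.googleapis.com/v2/entries:list" \
#   -d "$(entries_list_body "<project>" "<workflow-id>" "")"
```

Repeat the call with each returned token until the response has no `nextPageToken`, and note in the packet how many pages were read.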

3. Temporal Workflow Evidence

For Temporal workflows, extract these fields from logs, Temporal metadata, or worker output before declaring the incident identified:

namespace
workflow id
run id
workflow type
task queue
worker service/deployment
first observed timestamp
first failure timestamp
last retry/recovery timestamp
terminal state if known
exact error type and message
activity name, workflow task phase, or child workflow if present
pod/container/service instance that emitted the decisive log line
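
One way to avoid declaring the incident identified too early is to check a notes file for fields that are still empty. The sketch below assumes a hypothetical notes layout of one `field name: value` pair per line and checks only a subset of the fields listed above; extend the list as needed.

```shell
# Hypothetical helper: list required Temporal fields that are absent or empty.
# $1 path to a notes file with one "field name: value" pair per line.
# The file layout and the chosen subset of fields are illustrative assumptions.
missing_temporal_fields() {
  local field
  for field in "namespace" "workflow id" "run id" "workflow type" "task queue" \
               "first failure timestamp" "exact error type and message"; do
    # A field counts as present only when something non-blank follows the colon.
    grep -Eq "^${field}:[[:space:]]*[^[:space:]]" "$1" || echo "$field"
  done
}
```

Any field the function prints belongs under "still unknown" until a log line or metadata record fills it.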

Distinguish:

retryable activity failure
workflow task failure
child workflow failure
worker crash/restart
external API limit or timeout
cache/artifact load failure
downstream scanner or validation failure

4. Build A Runtime Timeline

Pull enough logs before, during, and after the failure to answer:

current phase or terminal state
cache hit, parent hit, miss, bypass, or rebuild
major activity durations
worker/pod/container identity
error signatures such as 429, RESOURCE_EXHAUSTED, TTFTTimeoutError, config failures, checkpoint errors, upload/download failures, DB errors, and provider timeouts
recovery signal or lack of recovery

Separate symptoms from causes. Do not infer root cause from timestamp adjacency alone.

Use this format:

Workflow timeline:
- <timestamp> | <source> | <event> | proof:
- <timestamp> | <source> | <event> | proof:

Symptoms:
- ...

Candidate causes:
- ...

Still unknown:
- ...
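
Events usually arrive out of order when they come from several queries. A minimal sketch for merging them into the timeline format above follows; `format_timeline` and its pipe-separated input layout are hypothetical, and the sort is chronological only if every timestamp is UTC and written in one consistent format.

```shell
# Hypothetical helper: turn "timestamp|source|event|proof" records into
# sorted timeline lines.
# $1 path to a file with one pipe-separated record per line.
# Assumes UTC timestamps in a single consistent format, so a byte sort is chronological.
format_timeline() {
  LC_ALL=C sort -t '|' -k1,1 "$1" |
    while IFS='|' read -r ts src event proof; do
      printf '%s\n' "- $ts | $src | $event | proof: $proof"
    done
}
```

Sorting orders the events; it says nothing about which event caused which.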

5. GCS And Artifact Handling

Trace object paths from logs, workflow metadata, config, or code. Do not guess bucket layout from memory.

Before downloading, confirm metadata:

gcloud storage objects describe "gs://<bucket>/<object>"
gcloud storage ls -l "gs://<bucket>/<prefix>/**"

If gsutil is the available tool:

gsutil ls -L "gs://<bucket>/<object>"
gsutil ls -l "gs://<bucket>/<prefix>/**"

Record:

object path
size
generation/metageneration when available
updated timestamp
content type if useful
whether multiple writes/overwrites happened during the run
whether the object timestamp aligns with the workflow timeline
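
Checking alignment with the workflow timeline can be sketched as a string-order test. `updated_in_window` is a hypothetical helper; it assumes all three values are UTC timestamps at whole-second precision in the same format (for example `2024-05-01T12:00:00Z`). Mixed precision or timezone offsets would need real date parsing instead.

```shell
# Hypothetical helper: succeed when an object's updated time lies inside the window.
# $1 object updated timestamp, $2 window start, $3 window end.
# Assumes same-format UTC timestamps at whole-second precision.
updated_in_window() {
  # The three values are in range exactly when start, updated, end are already sorted.
  printf '%s\n' "$2" "$1" "$3" | LC_ALL=C sort -c 2>/dev/null
}

# Example use with values copied from the describe output and the timeline:
# updated_in_window "<object-updated>" "<run-start>" "<run-end>" && echo "inside run window"
```

An object updated outside the window may be a stale artifact from an earlier run, which is worth recording before any cache conclusion is drawn.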

Ask before downloading unless the user explicitly requested download/DB inspection.

6. SQLite / DB Artifact Inspection

When the user approves DB download or provides a local DB path, inspect schema first:

sqlite3 "<db-path>" ".tables"
sqlite3 "<db-path>" ".schema"
sqlite3 "<db-path>" "SELECT name, type FROM sqlite_master ORDER BY type, name;"
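
After the schema is known, a row count per table is a cheap next step that makes no assumption about table names. The sketch below discovers tables from `sqlite_master` and opens the file read-only; `table_row_counts` is a hypothetical helper, and it assumes table names contain no double quotes.

```shell
# Hypothetical helper: print "table<TAB>row count" for each user table.
# $1 path to the SQLite file, opened read-only so inspection cannot modify it.
# Table names come from sqlite_master; none are assumed in advance.
table_row_counts() {
  local table count
  sqlite3 -readonly "$1" \
    "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name;" |
    while IFS= read -r table; do
      count="$(sqlite3 -readonly "$1" "SELECT COUNT(*) FROM \"$table\";" </dev/null)"
      printf '%s\t%s\n' "$table" "$count"
    done
}
```

An empty table is a finding in its own right: record it as a proven fact about the artifact, separate from any inference about why the workflow wrote nothing.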

7. Cache And Rebuild Proof

Do not claim a cache state such as "cache hit" unless logs prove it. Name the specific state:

exact hit
parent hit
miss
bypass
stale object reuse
rebuild/fresh generation
restore failure
save/upload failure

If asked how to evict or bypass cache, separate:

bypass for one run
namespace or key isolation
actual object deletion
cache TTL or retention change

8. Workflow Evidence Packet

End workflow evidence collection with a packet that rca-investigation can consume:

Workflow evidence packet:
- identifiers:
- project/account/log scope:
- runtime platform:
- timeline:
- decisive log evidence:
- artifacts found:
- artifacts not downloaded:
- DB paths inspected:
- cache state:
- errors and provider signals:
- recovery signals:
- commands/queries used:
- proven:
- inferred:
- still unknown:
- recommended code/config paths to inspect next:

If a full RCA is requested, hand this packet to rca-investigation and continue with code/config/git-history proof before naming a root cause.