name: workflow-rca
description: Runtime evidence collection for workflow and CI incidents (workflow run ids, CI job ids, GCP Cloud Logging queries, GCS artifact lookup, workflow DB artifacts, cache hit/miss proof, and worker behavior). Use when the input is a run identifier or a question about why a workflow failed, retried, slowed down, loaded cache, or built fresh. This skill gathers and proves runtime facts, then hands its evidence packet to rca-investigation for the cumulative code-level root cause.
version: 1.1.0
Workflow RCA
Use this skill for workflow-centric runtime evidence collection. It is designed to work with rca-investigation: this skill gathers and proves what happened at the workflow/runtime/artifact layer; rca-investigation integrates that evidence with code, config, DB interpretation, git history, and systemic fixes.
Do not make this skill project-specific. Do not assume a default GCP project, bucket, namespace, workflow type, table schema, or cache convention unless the user, runtime logs, deployment config, or repository code proves it.
Core Rules
- Start from the exact workflow id, run id, job id, PR number, commit, finding id, pod, namespace, timestamp, or artifact path the user gave.
- Always say how you found each fact, with the command/query/log/artifact that proves it.
- Separate proven runtime facts from inferred workflow behavior.
- If the user did not explicitly ask to download a DB or artifact, find the object path and metadata first, then ask before downloading.
- For latest/current status, trust live logs and provider APIs over memory or stale artifacts.
- Empty results from one project, log name, namespace, or time window do not prove absence.
1. Classify The Workflow Input
Use this structure first:
Workflow input classification:
- input type: Temporal workflow id / Temporal run id / CI run URL / job id / pod+namespace / GCS object / DB path / finding id / cache key / log excerpt / timestamp only / mixed / unknown
- parsed identifiers:
- missing identifiers:
- likely platform: Temporal / CI / Kubernetes worker / batch job / serverless job / unknown
- likely artifact classes: logs / GCS objects / DB / metrics / traces / code config / git history
- first lookup path:
If the identifier is ambiguous, search for the literal string in logs/artifacts before assigning meaning.
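As a minimal sketch of the first classification pass, the helper below guesses an input type from the literal shape of the string. The function name `classify_input` and every pattern in it are illustrative assumptions, not a fixed taxonomy; anything it reports as unknown still needs the literal log search described above.

```shell
# Hypothetical helper: guess the input type from the string's shape only.
# The patterns are assumptions for illustration; confirm any guess in logs.
classify_input() (
  input="${1:-}"
  case "$input" in
    gs://*)
      echo "GCS object" ;;
    http://*|https://*)
      echo "URL, possible CI run URL" ;;
    *.db|*.sqlite|*.sqlite3)
      echo "DB path" ;;
    ????????-????-????-????-????????????)
      # 8-4-4-4-12 layout; the characters themselves are not validated.
      echo "UUID-shaped, possible Temporal run id" ;;
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T*)
      echo "timestamp only" ;;
    *)
      echo "unknown" ;;
  esac
)
```

For example, `classify_input "gs://<bucket>/<object>"` prints `GCS object`, while a free-form name falls through to `unknown`.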
2. GCP And Cloud Logging Workflow
When GCP is plausible, verify identity and available logging surface before widening slow or empty queries:
gcloud config list --format='text(core.project,core.account)'
gcloud auth list --filter=status:ACTIVE
gcloud projects list
gcloud logging logs list --project="<project>"
Project selection rules:
- Prefer an explicit project from the user.
- Next prefer project values from deployment config, workflow metadata, artifact paths, or logs.
- Treat the local active gcloud project as a hint, not proof.
- If no project is known, state that log lookup is blocked and show the exact project-dependent command to run.
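The precedence in these rules can be sketched as a small function. The name `resolve_project` and its two arguments are hypothetical; the sketch deliberately never falls back to the local gcloud configuration, since that is only a hint.

```shell
# resolve_project EXPLICIT DERIVED
# EXPLICIT: project named by the user. DERIVED: project proven by deployment
# config, workflow metadata, artifact paths, or logs. Both are hypothetical names.
resolve_project() (
  explicit="${1:-}"
  derived="${2:-}"
  if [ -n "$explicit" ]; then
    printf '%s user-provided\n' "$explicit"
  elif [ -n "$derived" ]; then
    printf '%s config-or-log-derived\n' "$derived"
  else
    # No proven project: report the block and the command that needs one.
    echo 'log lookup blocked: no proven project; run: gcloud logging read "\"<workflow-id>\"" --project="<project>" --freshness=14d --limit=200' >&2
    exit 1  # the function body is a subshell, so this only ends the function
  fi
)
```

The second word of the output records where the project came from, so the source of the value can be cited alongside each query.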
Minimum literal workflow-id log search:
PROJECT="<project>"
WORKFLOW_ID="<workflow-id>"
gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=200 --format=json
gcloud logging read "\"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=200 --format='table(timestamp,resource.type,resource.labels.namespace_name,resource.labels.pod_name,resource.labels.container_name,severity,logName)'
If a failure window is known, use timestamp bounds instead of broad freshness:
gcloud logging read "\"${WORKFLOW_ID}\" AND timestamp>=\"<start-iso>\" AND timestamp<=\"<end-iso>\"" --project="$PROJECT" --limit=500 --format=json
If logs reveal namespace, pod, worker service, task queue, run id, activity, or request id, pivot into narrower queries:
gcloud logging read "\"${WORKFLOW_ID}\" AND resource.labels.namespace_name=\"<namespace>\"" --project="$PROJECT" --freshness=14d --limit=500 --format=json
gcloud logging read "\"<run-id>\" OR \"${WORKFLOW_ID}\"" --project="$PROJECT" --freshness=14d --limit=500 --format=json
gcloud logging read "\"<task-queue>\" AND (\"error\" OR \"failed\" OR \"timeout\" OR \"panic\" OR \"exception\" OR \"429\")" --project="$PROJECT" --freshness=14d --limit=500 --format=json
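Hand-escaping nested quotes in these filters is error-prone, so one option is to compose the filter string separately and pass it to `gcloud logging read` as a single argument. The sketch below assumes a hypothetical helper named `build_filter`; it only builds the string and runs no query.

```shell
# build_filter ID [START_ISO] [END_ISO] [NAMESPACE]
# Hypothetical helper: prints a Cloud Logging filter for a literal id,
# optionally narrowed by timestamp bounds and a Kubernetes namespace.
build_filter() (
  filter="\"${1:-}\""
  if [ -n "${2:-}" ]; then
    filter="$filter AND timestamp>=\"$2\""
  fi
  if [ -n "${3:-}" ]; then
    filter="$filter AND timestamp<=\"$3\""
  fi
  if [ -n "${4:-}" ]; then
    filter="$filter AND resource.labels.namespace_name=\"$4\""
  fi
  printf '%s\n' "$filter"
)

# Usage (placeholders left as-is):
#   gcloud logging read "$(build_filter "$WORKFLOW_ID" "<start-iso>" "<end-iso>")" \
#     --project="$PROJECT" --limit=500 --format=json
```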
If gcloud is flaky but auth works, use the Cloud Logging API directly:
TOKEN="$(gcloud auth print-access-token)"
curl -sS -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
"https://logging.googleapis.com/v2/entries:list" \
-d '{"resourceNames":["projects/<project>"],"filter":"\"<workflow-id>\"","pageSize":200}'
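Because the filter itself contains double quotes, the JSON body is easy to get wrong by hand. A minimal sketch of building it follows; `json_escape` and `entries_list_body` are hypothetical helper names, and the escaping covers only backslashes and double quotes, which is an assumption that the filter contains no control characters.

```shell
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# Control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# entries_list_body PROJECT FILTER [PAGE_SIZE]
# Prints a request body for the entries:list endpoint shown above.
entries_list_body() (
  project="${1:-}"
  filter="${2:-}"
  page_size="${3:-200}"
  printf '{"resourceNames":["projects/%s"],"filter":"%s","pageSize":%s}\n' \
    "$project" "$(json_escape "$filter")" "$page_size"
)

# Usage (placeholders left as-is):
#   curl -sS -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
#     "https://logging.googleapis.com/v2/entries:list" \
#     -d "$(entries_list_body "<project>" "\"<workflow-id>\"")"
```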
3. Temporal Workflow Evidence
For Temporal workflows, extract these fields from logs, Temporal metadata, or worker output before declaring the incident identified:
- namespace
- workflow id
- run id
- workflow type
- task queue
- worker service/deployment
- first observed timestamp
- first failure timestamp
- last retry/recovery timestamp
- terminal state if known
- exact error type and message
- activity name, workflow task phase, or child workflow if present
- pod/container/service instance that emitted the decisive log line
Distinguish:
- retryable activity failure
- workflow task failure
- child workflow failure
- worker crash/restart
- external API limit or timeout
- cache/artifact load failure
- downstream scanner or validation failure
4. Build A Runtime Timeline
Pull enough logs before, during, and after the failure to answer:
- current phase or terminal state
- cache hit, parent hit, miss, bypass, or rebuild
- major activity durations
- worker/pod/container identity
- error signatures such as 429, RESOURCE_EXHAUSTED, TTFTTimeoutError, config failures, checkpoint errors, upload/download failures, DB errors, and provider timeouts
- recovery signal or lack of recovery
Separate symptoms from causes. Do not infer root cause from timestamp adjacency alone.
Use this format:
Workflow timeline:
- <timestamp> | <source> | <event> | proof:
- <timestamp> | <source> | <event> | proof:
Symptoms:
- ...
Candidate causes:
- ...
Still unknown:
- ...
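When timeline entries come from several sources, they may need to be merged into one chronological list. The sketch below assumes each entry is one line in the `<timestamp> | <source> | <event> | proof:` shape without the leading bullet, and that all timestamps are UTC in the same format and precision so that byte-wise ordering matches time ordering. The name `sort_timeline` is hypothetical.

```shell
# sort_timeline: read timeline entries on stdin, print them in time order.
# Assumes same-format UTC timestamps so that plain string order is time order.
sort_timeline() {
  # -t '|' -k1,1 sorts on the timestamp field only;
  # -s keeps the input order of entries that share a timestamp.
  LC_ALL=C sort -s -t '|' -k1,1
}

# Usage: cat log-events.txt gcs-events.txt | sort_timeline
# (both file names are hypothetical)
```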
5. GCS And Artifact Handling
Trace object paths from logs, workflow metadata, config, or code. Do not guess bucket layout from memory.
Before downloading, confirm metadata:
gcloud storage objects describe "gs://<bucket>/<object>"
gcloud storage ls -l "gs://<bucket>/<prefix>/**"
If gsutil is the available tool:
gsutil ls -L "gs://<bucket>/<object>"
gsutil ls -l "gs://<bucket>/<prefix>/**"
Record:
- object path
- size
- generation/metageneration when available
- updated timestamp
- content type if useful
- whether multiple writes/overwrites happened during the run
- whether the object timestamp aligns with the workflow timeline
Ask before downloading unless the user explicitly requested download/DB inspection.
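To check whether an object's updated timestamp aligns with the workflow timeline, a simple window test can help. This is a sketch under the assumption that all three timestamps are UTC strings in the same format and precision; mixed offsets or fractional seconds would need real date parsing. The name `in_window` is hypothetical.

```shell
# in_window TS START END
# Succeeds when START <= TS <= END, comparing same-format UTC strings.
in_window() (
  ts="${1:-}"
  start="${2:-}"
  end="${3:-}"
  # Smallest of (start, ts) must be start; largest of (ts, end) must be end.
  first="$(printf '%s\n' "$start" "$ts" | LC_ALL=C sort | head -n 1)"
  last="$(printf '%s\n' "$ts" "$end" | LC_ALL=C sort | tail -n 1)"
  [ "$first" = "$start" ] && [ "$last" = "$end" ]
)

# Usage:
#   in_window "<object-updated>" "<run-start>" "<run-end>" && echo "inside run window"
```

A timestamp inside the window is consistent with the run writing the object, but it does not prove that this run was the writer.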
6. SQLite / DB Artifact Inspection
When the user approves DB download or provides a local DB path, inspect schema first:
sqlite3 "<db-path>" ".tables"
sqlite3 "<db-path>" ".schema"
sqlite3 "<db-path>" "SELECT name, type FROM sqlite_master ORDER BY type, name;"
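Once the schema is known, per-table row counts are a cheap next step. The sketch below generates the counting SQL from table names read on stdin rather than assuming any schema; `count_sql` is a hypothetical helper, and opening the database with the sqlite3 `-readonly` option in the usage comment is a suggested precaution.

```shell
# count_sql: read table names on stdin, one per line, and print one
# COUNT(*) statement per table. Quotes in names are doubled for safety.
count_sql() {
  while IFS= read -r table; do
    [ -n "$table" ] || continue
    ident="$(printf '%s' "$table" | sed 's/"/""/g')"
    label="$(printf '%s' "$table" | sed "s/'/''/g")"
    printf "SELECT '%s' AS table_name, COUNT(*) AS row_count FROM \"%s\";\n" "$label" "$ident"
  done
}

# Usage (placeholders left as-is):
#   sqlite3 -readonly "<db-path>" "SELECT name FROM sqlite_master WHERE type='table';" \
#     | count_sql | sqlite3 -readonly "<db-path>"
```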
7. Cache And Rebuild Proof
Do not say "cache hit" unless logs prove the specific cache state:
- exact hit
- parent hit
- miss
- bypass
- stale object reuse
- rebuild/fresh generation
- restore failure
- save/upload failure
If asked how to evict or bypass cache, separate:
- bypass for one run
- namespace or key isolation
- actual object deletion
- cache TTL or retention change
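One way to keep cache claims tied to proof is to pull the candidate log lines with their line numbers before naming a state. The sketch below works on plain-text log lines that were already fetched; the helper name `cache_evidence` and the search terms are assumptions, and the real wording has to come from the workflow's own logs or code.

```shell
# cache_evidence: read plain-text log lines on stdin and print the ones that
# mention cache-related terms, prefixed with their line number.
# The term list is an illustrative assumption, not a known log convention.
cache_evidence() {
  grep -n -i -E 'cache|restore|rebuild|bypass' || true
}

# Usage: cache_evidence < worker.log   (worker.log is a hypothetical file)
```

An empty result here only means these terms did not appear in the lines searched; it does not prove a miss or a rebuild.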
8. Workflow Evidence Packet
End workflow evidence collection with a packet that rca-investigation can consume:
Workflow evidence packet:
- identifiers:
- project/account/log scope:
- runtime platform:
- timeline:
- decisive log evidence:
- artifacts found:
- artifacts not downloaded:
- DB paths inspected:
- cache state:
- errors and provider signals:
- recovery signals:
- commands/queries used:
- proven:
- inferred:
- still unknown:
- recommended code/config paths to inspect next:
If a full RCA is requested, hand this packet to rca-investigation and continue with code/config/git-history proof before naming a root cause.