---
name: must-gather-investigation
description: Investigate failed e2e tests from Ginkgo JSON reports and must-gather artifacts, systematically analyzing logs, events, and resource states to identify root causes.
metadata:
  audience: maintainers
---
Must-Gather Investigation
You are a Kubernetes operator e2e test failure investigator. Your goal is to analyze test artifacts from a failed test run, reconstruct the sequence of events, and identify the root cause.
Inputs
The user provides:
- A path to the root directory containing `e2e.json` (a Ginkgo JSON report) and associated artifacts. All artifact paths below are relative to this root directory. You should assume you are already in this directory.
- Optionally, a specific test name to investigate
Phase 1: Extract Failed Tests from e2e.json
The e2e.json file is a Ginkgo JSON report. Use jq to navigate it.
Key jq paths
List all failed tests:
jq -r '.[] | .SpecReports[] | select(.State == "failed") | .ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")' e2e.json
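The filter above matches only the "failed" state. Ginkgo v2 also records unsuccessful specs as "panicked", "timedout", "interrupted", or "aborted", so a run can have unsuccessful specs that the query misses. The following is a minimal sketch, assuming a Ginkgo v2 report and jq 1.6 or newer (for `IN`); the function name `list_unsuccessful_specs` is hypothetical and not part of any tool.

```shell
# Hypothetical helper: list every spec that ended in a non-successful state.
# Assumes a Ginkgo v2 JSON report and jq >= 1.6 (for the IN builtin).
list_unsuccessful_specs() {
  report="${1:-e2e.json}"
  jq -r '
    .[] | .SpecReports[]?
    | select(.State | IN("failed", "panicked", "timedout", "interrupted", "aborted"))
    | "\(.State)\t\((.ContainerHierarchyTexts // []) + [.LeafNodeText] | join(" > "))"
  ' "$report"
}
```

Calling `list_unsuccessful_specs e2e.json` prints one line per spec: the state, a tab, and the full spec name.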
Extract failure details for a specific failed test:
jq '.[] | .SpecReports[] | select(.State == "failed") | {
  name: (.ContainerHierarchyTexts + [.LeafNodeText] | join(" > ")),
  state: .State,
  startTime: .StartTime,
  endTime: .EndTime,
  runTime: .RunTime,
  failureMessage: .Failure.Message,
  failureLocation: (.Failure.Location.FileName + ":" + (.Failure.Location.LineNumber | tostring)),
  ginkgoOutput: .CapturedGinkgoWriterOutput
}' e2e.json
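The query above returns details for every failed spec. When the user names a specific test, it can be narrowed by passing the name to jq as a variable. This is a sketch under that assumption; `failure_for_test` is a hypothetical helper name, and it matches on a substring of the leaf node text only.

```shell
# Hypothetical helper: print "file:line<TAB>message" for failed specs whose
# It-block text contains the given substring.
failure_for_test() {
  name="$1"
  report="${2:-e2e.json}"
  jq -r --arg name "$name" '
    .[] | .SpecReports[]?
    | select(.State == "failed")
    | select(.LeafNodeText | contains($name))
    | "\(.Failure.Location.FileName):\(.Failure.Location.LineNumber)\t\(.Failure.Message)"
  ' "$report"
}
```

Passing the name through `--arg` avoids quoting problems when the test name contains quotes or brackets.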
Extract SpecEvents timeline for a failed test:
jq '.[] | .SpecReports[] | select(.State == "failed") | .SpecEvents[] | {type: .SpecEventType, message: .Message, duration: .Duration, codeLocation: (.CodeLocation.FileName + ":" + (.CodeLocation.LineNumber | tostring))}' e2e.json
Key fields in a SpecReport
| Field | Description |
|---|---|
| `.State` | `"passed"`, `"failed"`, `"skipped"`, `"pending"`; Ginkgo v2 also reports `"panicked"`, `"interrupted"`, `"aborted"`, and `"timedout"` |
| `.ContainerHierarchyTexts` | Array of Describe/Context block names (outermost first) |
| `.LeafNodeText` | The It block name |
| `.StartTime` / `.EndTime` | ISO 8601 timestamps |
| `.RunTime` | Duration in nanoseconds |
| `.Failure.Message` | The assertion error message |
| `.Failure.Location` | `{FileName, LineNumber}` of the failing assertion |
| `.CapturedGinkgoWriterOutput` | All GinkgoWriter output during the test (contains namespace names, resource names, log lines) |
| `.SpecEvents[]` | Timeline of `By()` steps, `DeferCleanup` calls, etc. Each has `.SpecEventType`, `.Message`, `.Duration`, `.CodeLocation` |
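Because `.RunTime` is in nanoseconds, raw values are hard to compare against the timeouts in the test source. A small sketch that converts them to seconds and lists the slowest failed specs first may help; `failed_spec_durations` is a hypothetical name.

```shell
# Hypothetical helper: print "seconds<TAB>It-block text" for failed specs,
# slowest first. RunTime is divided by 1e9 to convert nanoseconds to seconds.
failed_spec_durations() {
  report="${1:-e2e.json}"
  jq -r '
    [ .[] | .SpecReports[]? | select(.State == "failed")
      | {name: .LeafNodeText, seconds: (.RunTime / 1e9)} ]
    | sort_by(.seconds) | reverse
    | .[] | "\(.seconds)\t\(.name)"
  ' "$report"
}
```

A run time that sits just above a timeout used in an `Eventually` block is a hint that the spec failed by timing out rather than on an immediate assertion.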
Extracting the test namespace
The test namespace is typically logged in CapturedGinkgoWriterOutput. Search for patterns like e2e-test-* or look for lines containing "namespace" or "ns".
jq -r '.[] | .SpecReports[] | select(.State == "failed") | .CapturedGinkgoWriterOutput' e2e.json | grep -oE 'e2e-test-[a-z0-9-]+'
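The same namespace is usually logged many times, so the command above can print long runs of duplicates. A sketch that reduces the matches to a unique list, assuming the `e2e-test-` prefix mentioned above; `unique_test_namespaces` is a hypothetical helper that reads the captured output on stdin.

```shell
# Hypothetical helper: read captured test output on stdin and print each
# distinct namespace name once, in sorted order.
unique_test_namespaces() {
  grep -oE 'e2e-test-[a-z0-9-]+' | sort -u
}
```

It can be placed at the end of the jq pipeline in place of the bare `grep`. More than one result for a single spec suggests the test used several namespaces, each of which may have its own artifact directory.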
Phase 2: Locate the Test Source Code
Search test/e2e/ in the repository for the test name (the LeafNodeText or a unique substring from it):
grep -rn "LEAF_NODE_TEXT_SUBSTRING" test/e2e/
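Test names often contain characters that grep treats as regular-expression syntax, such as brackets and parentheses. A sketch that searches for the text literally with `-F`; `find_test_source` is a hypothetical helper name, and the default directory is the `test/e2e/` path named above.

```shell
# Hypothetical helper: search for a literal substring of the test name.
# -F disables regex interpretation; -- protects needles that start with "-".
find_test_source() {
  needle="$1"
  dir="${2:-test/e2e/}"
  grep -rnF -- "$needle" "$dir"
}
```

For example, `find_test_source "[Serial] (2 nodes)"` matches those characters literally rather than as a character class and a group.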
Read the test to understand:
- What resources it creates (ScyllaCluster, ScyllaDBDatacenter, etc.)
- What it waits for (rollout, conditions, specific states)
- What it asserts (cleanup jobs completed, connections succeed, etc.)
- Timeout values and polling intervals
- Any `Eventually`/`Consistently` blocks, since these are where timeouts cause failures
Phase 3: Examine Test Namespace Artifacts
Navigate to e2e/cluster/namespaces/<test-namespace>/. This contains the state of all resources in the test namespace at the time the must-gather was collected (typically after the test failed, during namespace teardown).
Priority order for examination
1. Events (`events.events.k8s.io/*.yaml`): Chronological record of what happened. Look for warnings, errors, and unusual sequences.
2. ScyllaCluster / ScyllaDBDatacenter status (`scyllaclusters.scylla.scylladb.com/*.yaml` or `scylladbdatacenters.scylla.scylladb.com/*.yaml`): Check `.status.conditions`, especially `Available`, `Progressing`, and `Degraded`. The `reason` and `message` fields explain why a condition is set.
3. Pod status and logs (`pods/<pod-name>/`):
   - `<container>.current`: current container logs
   - `<container>.terminated`: logs from a previous container instance (if it restarted)
   - `<pod-name>.yaml`: full pod spec and status, including conditions, container states, restart counts
   - `df.log`: disk usage (for Scylla data pods)
   - `nodetool-status.log`: Scylla cluster membership
   - `nodetool-gossipinfo.log`: Scylla gossip state
4. Jobs (`jobs/*.yaml`): Check `.status` for `completionTime`, `conditions`, and the `ready`, `active`, `failed` counts. Compare job UIDs with pod `controller-uid` labels to verify ownership.
5. Services (`services/*.yaml`): Check annotations such as `CurrentTokenRingHash`, `LastCleanedUpTokenRingHash`, and `HostID`. Compare across nodes.
6. StatefulSets (`statefulsets.apps/*.yaml`): Check `.status.readyReplicas`, `.status.currentRevision`, `.status.updateRevision`.
7. Other resources: ConfigMaps, Secrets, PVCs, Ingresses, EndpointSlices, as relevant to the test.
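As a starting point for the events step, warning events can be pulled out with plain grep. This is a sketch that assumes each YAML file holds a single event with `type` and `reason` as top-level keys at column 0; the actual dump layout may differ, and `warning_events` is a hypothetical helper name.

```shell
# Hypothetical helper: print "reason<TAB>file" for every event file whose
# top-level type is Warning. Assumes one event per *.yaml file.
warning_events() {
  dir="${1:-.}"
  grep -lE '^type: Warning$' "$dir"/*.yaml 2>/dev/null | while IFS= read -r f; do
    reason=$(grep -m1 -E '^reason:' "$f" | sed 's/^reason: *//')
    printf '%s\t%s\n' "$reason" "$f"
  done
}
```

Running `warning_events e2e/cluster/namespaces/<test-namespace>/events.events.k8s.io` gives a short list of files to read in full.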
Phase 4: Examine Operator Logs
Operator logs are at must-gather/cluster/namespaces/scylla-operator/pods/<operator-pod>/scylla-operator.current.
These are structured JSON logs (one JSON object per line). Key fields:
- `"ts"`: timestamp
- `"msg"`: log message
- `"controller"`: which controller emitted the log
- `"namespace"` / `"name"`: the resource being reconciled
- `"err"`: error details
Filter by the test namespace to find relevant reconciliation activity:
grep '<test-namespace>' scylla-operator.current
Look for:
- Reconciliation start/end and duration
- Error messages or warnings
- Resource creation, update, deletion events
- Status condition changes
- Queuing and re-queuing patterns
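Taking the log format described above at face value (one JSON object per line with `ts`, `msg`, `controller`, `namespace`, and `err` keys), error entries for the test namespace can be isolated with jq. This is only a sketch: if the logs are not JSON, the plain `grep` above remains the fallback. `operator_errors_for_namespace` is a hypothetical helper name.

```shell
# Hypothetical helper: print "ts<TAB>controller<TAB>msg<TAB>err" for log
# entries in the given namespace that carry an "err" key.
# -R reads raw lines; "fromjson?" silently skips lines that are not JSON.
operator_errors_for_namespace() {
  ns="$1"
  log="${2:-scylla-operator.current}"
  jq -rR --arg ns "$ns" '
    fromjson?
    | select(type == "object")
    | select(.namespace == $ns and has("err"))
    | "\(.ts)\t\(.controller)\t\(.msg)\t\(.err)"
  ' "$log"
}
```

The timestamps in the first column can then be lined up against the test's start and end times from `e2e.json`.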
Phase 5: Examine Infrastructure Logs
Depending on the test, check logs from infrastructure components:
- HAProxy ingress (`must-gather/cluster/namespaces/haproxy-ingress/`): Backend configuration, reload events, connection logs
- Scylla Manager (`must-gather/cluster/namespaces/scylla-manager/`): Task scheduling, repair/backup operations
- cert-manager: Certificate issuance and renewal
- Scylla Manager Agent (sidecar in Scylla pods, `scylla-manager-agent.current`): API calls, health checks
Phase 6: Timeline Reconstruction
Build a chronological timeline from all log sources, correlating timestamps. Include:
- Pod lifecycle events (created, scheduled, started, ready)
- Controller reconciliation actions
- Resource state changes
- The test's own actions (from `CapturedGinkgoWriterOutput` and `SpecEvents`)
- Infrastructure events (reloads, connection attempts)
This timeline is the core artifact for identifying the root cause. It should make the causal chain visible.
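Once each source has been reduced to lines that begin with a timestamp, the merge itself can be mechanical. The sketch below assumes every line starts with an RFC 3339 timestamp in the same time zone and with the same fractional precision, followed by a space; mixed formats would need normalizing first. `merge_timeline` and its `LABEL=FILE` argument convention are hypothetical.

```shell
# Hypothetical helper: merge timestamp-prefixed lines from several files into
# one timeline. Each argument is LABEL=FILE. Output: ts<TAB>LABEL<TAB>text.
# Sorting is lexicographic, which is only chronological when all timestamps
# share one format and time zone.
merge_timeline() {
  for spec in "$@"; do
    label=${spec%%=*}
    file=${spec#*=}
    awk -v label="$label" 'NF {
      ts = $1
      rest = substr($0, length(ts) + 2)
      print ts "\t" label "\t" rest
    }' "$file"
  done | LC_ALL=C sort -s -k1,1
}
```

For example, `merge_timeline operator=operator.txt test=ginkgo.txt` interleaves the two sources, with the label column showing where each line came from.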
Phase 7: Root Cause Analysis
Trace the causal chain from the failure backward:
- What assertion failed, and what was the actual vs expected state?
- Why was the resource/condition in that state?
- What controller/component was responsible for getting it to the expected state?
- What prevented it from doing so?
- Was it a timing issue, a logic bug, an infrastructure failure, or a test design issue?
Artifact Structure Reference
./
├── e2e.json                          # Ginkgo JSON test report
├── junit.e2e.xml                     # JUnit XML test report
├── deploy/                           # Deployment manifests used
│   ├── operator/
│   ├── manager/
│   ├── prometheus-operator/
│   └── haproxy-ingress/
├── e2e/cluster/                      # Resources collected during test execution
│   ├── cluster-scoped/               # Cluster-wide resources
│   │   ├── nodes/
│   │   ├── persistentvolumes/
│   │   └── ...
│   └── namespaces/
│       └── <test-namespace>/         # Test-specific namespace
│           ├── pods/
│           │   └── <pod-name>/
│           │       ├── <container>.current       # Container logs
│           │       ├── <container>.terminated    # Previous container logs
│           │       ├── df.log                    # Disk usage
│           │       ├── nodetool-gossipinfo.log   # Scylla gossip info
│           │       └── nodetool-status.log       # Scylla cluster status
│           ├── events.events.k8s.io/
│           ├── statefulsets.apps/
│           ├── jobs/
│           ├── services/
│           ├── configmaps/
│           ├── secrets/
│           ├── scyllaclusters.scylla.scylladb.com/
│           └── scylladbdatacenters.scylla.scylladb.com/
└── must-gather/cluster/              # Must-gather output
    ├── cluster-scoped/
    └── namespaces/
        ├── scylla-operator/
        │   ├── pods/
        │   │   └── <operator-pod>/
        │   │       └── scylla-operator.current   # Operator logs
        │   ├── events.events.k8s.io/
        │   └── ...
        ├── scylla-manager/
        ├── haproxy-ingress/
        └── ...
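Before starting an investigation it can be worth confirming that the root directory actually has this layout. A sketch that checks a few of the paths shown above; the choice of which paths to treat as required is an assumption, and `check_artifact_layout` is a hypothetical helper name.

```shell
# Hypothetical helper: report which expected artifact paths are missing under
# the given root. Returns 0 when all are present, 1 otherwise.
check_artifact_layout() {
  root="${1:-.}"
  missing=0
  for p in e2e.json e2e/cluster/namespaces must-gather/cluster/namespaces/scylla-operator/pods; do
    if [ ! -e "$root/$p" ]; then
      echo "missing: $p"
      missing=1
    fi
  done
  return "$missing"
}
```

A missing path is worth reporting to the user up front, since later phases depend on it.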
Important Notes
- Always reference specific file paths, line numbers, and timestamps when citing evidence.
- Focus on the causal chain — what led to what.
- Consider whether the issue is in the test, the operator, or the infrastructure.
- If you need more information from a file, say what you need and why.
- This skill provides investigation methodology only. The calling agent determines the output format.