name: debug-e2e description: Interactive debugging for failed e2e tests. Orchestrates the debugging session but delegates log reading to subagents to keep the main conversation clean. Use for ping-pong debugging sessions where you want to form and test hypotheses together with the user. argument-hint: <hash, PR, URL, or test name>
E2E Test Debugging
Interactive debugging for failed e2e tests. This skill orchestrates the debugging session but never reads logs directly - it delegates to subagents to keep the conversation context clean.
Invocation
The user can invoke this skill with:
- CI log hash:
/debug-e2e 343c52b17688d2cd - PR number:
/debug-e2e #19783or/debug-e2e 19783 - CI URL:
/debug-e2e http://ci.aztec-labs.com/... - Test name:
/debug-e2e epochs_l1_reorgs(for general investigation) - No argument:
/debug-e2ethen ask the user what they want to debug
When to Use
- Debugging flaky or failing e2e tests
- Investigating CI failures that need deep analysis
- When you want to collaborate with the user on forming hypotheses
- When comparing failed and successful runs
When NOT to Use
- Obvious assertion failures: If the test output clearly shows
expected 5, got 3, just investigate the code directly - Build/compilation errors: Use standard debugging, not log analysis
- Simple configuration issues: Missing env vars, wrong paths, etc.
- When user just wants a quick answer: This skill is for interactive ping-pong debugging sessions
Key Principle
Never read logs directly in this conversation. Logs can be 50k+ lines and would pollute the context. Instead:
- Use
identify-ci-failuressubagent to find failures and download logs - Use
analyze-logssubagent to deep-dive specific logs - Work with the summaries they return
Workflow
Step 1: Identify Failures
Spawn the identify-ci-failures subagent:
Use Task tool with subagent_type: "identify-ci-failures"
Prompt: "Identify CI failures for [PR number / CI URL / hash]"
This returns:
- List of failures with types
- Local file paths for downloaded logs (e.g.,
/tmp/<hash>.log) - History URL for finding successful runs
Step 2: Discuss with User
Present findings to the user:
- What tests failed?
- What type of failure (timeout, assertion, error)?
- Form initial hypotheses together
Step 3: Deep Dive with analyze-logs
Spawn the analyze-logs subagent with the local file path:
Use Task tool with subagent_type: "analyze-logs"
Prompt: "Analyze /tmp/<hash>.log focusing on test '<test_name>'. Look for [specific thing based on hypothesis]"
For comparison:
Prompt: "Compare /tmp/<failed>.log with /tmp/<success>.log for test '<test_name>'. Find divergence points."
Step 4: Refine Hypothesis
Based on the summary:
- Does the evidence support the hypothesis?
- What contradicts it?
- What new questions arise?
Discuss with user, then spawn another analyze-logs if needed.
Step 5: Investigate Codebase
Once you have a theory, search the codebase:
- Use Grep to find where specific log messages are generated
- Read the code context around log emission points
- Trace execution paths
Step 6: Suggest Fix or Local Test
Either:
- Propose a code fix based on findings
- Suggest running the test locally to verify:
yarn workspace @aztec/end-to-end test:e2e <file>.test.ts -t '<test name>'
Hypothesis Formation
Take time to think deeply before proposing theories.
For each hypothesis:
- Clearly state the theory: "The test fails because X happens when Y"
- Identify expected evidence: "If this is correct, we should see log entries for Z"
- Ask analyze-logs to verify: Spawn subagent to look for specific evidence
- Look for contradictions: What would disprove this theory?
- Assign confidence: high / medium / low based on evidence
Formulate multiple competing hypotheses when the cause is unclear.
Investigation Principles
- Be systematic: Follow the workflow, don't jump to conclusions
- Be evidence-based: Every theory must be backed by log entries or code
- Be critical: Actively seek to disprove your own hypotheses
- Be thorough: Check timing, sequence, missing events, code context
- Be clear: Use specific timestamps and quotes from summaries
- Be practical: Suggest fixes that address root causes
History Investigation
To understand when a test started failing:
- Look for the
history:marker at the beginning of the log file (first few lines) - The history shows recent runs of this exact test with PASSED/FAILED/FLAKED status:
01-23 17:10:11: PASSED (2614d91ec48f4047): ... (Author: commit message (#PR)) 01-23 17:08:30: FLAKED (10d5f47f04025f1c): ... (code: 1) group:e2e-p2p-epoch-flakes (Author: commit message (#PR)) 01-23 16:51:21: FLAKED (512e978edff9e471): ... (code: 1) group:e2e-p2p-epoch-flakes (Author: commit message (#PR)) - Identify the transition point where test started failing/flaking
- Check the PR mentioned in the commit message to understand what changed
- Download logs from both passing and failing runs to compare:
- Use hash from history (e.g.,
2614d91ec48f4047for passed,10d5f47f04025f1cfor failed) yarn ci dlog <hash> > /tmp/<hash>.log 2>&1downloads the log to a local tmp file
- Use hash from history (e.g.,
Important: Do NOT use gh run list - the history in the log file is more accurate for this specific test.
Local Test Running
To run tests locally for verification:
# Run specific test
yarn workspace @aztec/end-to-end test:e2e <file>.test.ts -t '<test name>'
# With verbose logging
LOG_LEVEL=verbose yarn workspace @aztec/end-to-end test:e2e <file>.test.ts -t '<test name>'
# With debug logging (very detailed)
LOG_LEVEL=debug yarn workspace @aztec/end-to-end test:e2e <file>.test.ts -t '<test name>'
# With specific module logging
LOG_LEVEL='info; debug:sequencer,p2p' yarn workspace @aztec/end-to-end test:e2e <file>.test.ts -t '<test name>'
Log Structure
Timestamp Format
Logs use ISO timestamps: 2024-01-23T17:08:30.123Z - useful for correlating events across nodes.
Log Levels
ERROR- Failures, exceptionsWARN- Potential issues, recoverable problemsINFO- Key events, state transitionsVERBOSE- Detailed operational infoDEBUG- Fine-grained debugging (very noisy)
Component Prefixes
Log lines are prefixed with the component name (e.g., aztec:sequencer, aztec:p2p, aztec:archiver). These map to the Key Packages section in CLAUDE.md - use that as a reference for understanding what each component does.
Multi-Node Debugging
E2E tests often spawn multiple nodes. Key tips:
Identifying Nodes
- Look for node identifiers in log prefixes:
node-0,node-1,validator-0, etc. - Each node has its own log stream but they're interleaved in the combined output
- Ask
analyze-logsto filter by node when needed
Cross-Node Correlation
- Use timestamps to correlate events across nodes
- Look for message propagation: "Node A sends X" → "Node B receives X"
- Check for missing events: if Node A sent but Node B never received, that's a clue
Common Multi-Node Issues
- Network partition: Nodes can't reach each other
- Clock skew: Timestamps don't align, causing validation failures
- Split brain: Nodes have divergent views of state
- Message ordering: Events arrive in unexpected order
Common Failure Patterns
Timeout Failures
- Action executed, expected reaction didn't occur
- Ask analyze-logs to find: what was the last action? What was expected?
- Check for blocked operations, missing events, stuck processes
Assertion Failures
- Test expectation not met
- Compare actual vs expected values
- Trace back to find where the wrong value originated
State Pollution
- Previous test left system in bad state
- Compare logs from the failing test with the same test in isolation
- Check
beforeEach/afterEachhooks
Race Conditions
- Timing-dependent failures
- Look for operations completing in different order
- Check for missing synchronization
Example Session
User: Debug this flaky test: 343c52b17688d2cd
Claude: [Spawns identify-ci-failures subagent]
Returns: epochs_l1_reorgs.test.ts failed with timeout after 383s
Log downloaded to /tmp/343c52b17688d2cd.log
History shows test started flaking recently
Claude: The test "handles reorgs spanning epoch boundaries" timed out. Looking at
the history, this test has been flaking since yesterday. The failures correlate
with PR #19783 "adding MAXIMUM_GOSSIP_CLOCK_DISPARITY for p2p validation".
Let me check a passing run to compare. [Spawns analyze-logs for both]
User: What do you see as differences?
Claude: [Spawns analyze-logs with comparison prompt]
Summary shows: In failed run, node-2 never received epoch proof notification.
In passing run, all nodes received it within 2s of each other.
Hypothesis: The new clock disparity check may be rejecting valid messages
during epoch transitions when clocks drift slightly.
User: Can you check if there are any logs about clock validation?
Claude: [Spawns analyze-logs with specific prompt]
Found: "Rejecting gossip message: clock disparity 3.2s exceeds maximum 2s"
This confirms the hypothesis.
Claude: [Searches codebase for MAXIMUM_GOSSIP_CLOCK_DISPARITY]
Found in p2p/config.ts, default is 2 seconds. The PR added this validation
but epoch proofs can take longer to propagate. Recommend increasing the
disparity threshold for epoch-related messages.