tn-debug-e2e - SKILL.md Agent Skill

name: tn-debug-e2e
description: |
  Debug failing end-to-end tests in the telcoin-network repo (Narwhal/Bullshark consensus + Reth EVM). Trigger when the user shares e2e stdout/stderr, mentions a failing e2e test, asks about test_logs, or pastes node traces, consensus errors, or execution-engine failures. Covers panics, timeouts, races, epoch boundaries, restart failures, consensus hangs.

Debug E2E Tests - Telcoin Network

You are debugging end-to-end test failures in the telcoin-network repo, a DAG-based blockchain protocol (Narwhal/Bullshark consensus + EVM execution on Reth). The primary concern is race condition bugs, though test harness issues occasionally surface.

Overview

E2E tests live in crates/e2e-tests/. They spawn 4-6 validator processes and exercise consensus, epoch transitions, node restarts, and state sync. Logs are saved to crates/e2e-tests/test_logs/<test_name>/, with separate stdout and stderr files for each node.

Race conditions in this codebase typically stem from:

Timing gaps between synchronous consensus state updates and async engine processing
Shutdown ordering where data is saved to DB but not forwarded on channels
Channel subscription windows where messages are lost during epoch transitions
Concurrent access to shared state without proper per-entity locking
Broadcast channel lag causing slow receivers to silently drop messages

These are often symptoms of architectural complexity. Solutions should simplify, not add more coordination.
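The "saved to DB but not forwarded" gap from the list above can be made concrete with a small bookkeeping sketch. It is not code from the repo: `DeliveryTracker`, its methods, and the use of plain `u64` sequence numbers are assumptions chosen only to illustrate the failure mode.

```rust
use std::collections::BTreeSet;

/// Tracks which items were persisted and which were delivered downstream.
/// Hypothetical type: the repo's real bookkeeping may look nothing like this.
#[derive(Default)]
pub struct DeliveryTracker {
    saved: BTreeSet<u64>,
    forwarded: BTreeSet<u64>,
}

impl DeliveryTracker {
    /// Record that `seq` was written to the DB.
    pub fn mark_saved(&mut self, seq: u64) {
        self.saved.insert(seq);
    }

    /// Record that `seq` was handed to the downstream consumer.
    pub fn mark_forwarded(&mut self, seq: u64) {
        self.forwarded.insert(seq);
    }

    /// Everything persisted but never forwarded, in ascending order.
    /// A non-empty result after shutdown is the race window described above.
    pub fn pending(&self) -> Vec<u64> {
        self.saved.difference(&self.forwarded).copied().collect()
    }
}
```

If a shutdown lands between `mark_saved` and `mark_forwarded`, `pending()` names exactly the items a restart would need to redeliver.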

Step 1: Parse the Failure

From the user's stdout/stderr, extract:

Test name (e.g., epoch_boundary, restarts, reconnect, epoch_sync, late_join, observer, restarts_delayed, restarts_lagged_delayed)
Failure type: panic, assertion failure, timeout, hang, unexpected state
Error messages: the specific error text, panic location, or assertion details
Timing clues: how far into the test it failed (which epoch, which round, which node)

If the user hasn't provided enough context, ask for the full test output or point them to the log directory (crates/e2e-tests/test_logs/<test_name>/).
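A first-pass triage of the pasted output can be sketched as below. This is a heuristic, not part of the repo: the `FailureType` enum and the matched substrings are assumptions, apart from `panicked at`, which is the prefix Rust itself prints for panics. Hangs and unexpected state cannot be recognized from text alone, so they fall through to `Unknown`.

```rust
/// Failure categories from Step 1 (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum FailureType {
    AssertionFailure,
    Timeout,
    Panic,
    Unknown,
}

/// Guess the failure type from raw test output.
/// The substrings are heuristics; adjust them to the messages actually seen.
pub fn classify_failure(output: &str) -> FailureType {
    let lower = output.to_lowercase();
    // More specific checks come first: failed assertions and many
    // timeouts also surface as a panic in the test process.
    if lower.contains("assertion") {
        FailureType::AssertionFailure
    } else if lower.contains("timed out") || lower.contains("timeout") {
        FailureType::Timeout
    } else if lower.contains("panicked at") {
        FailureType::Panic
    } else {
        FailureType::Unknown
    }
}
```

Treat the result as a starting hypothesis only; the error text and timing clues still need to be read by hand.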

Step 2: Parallel Investigation

Launch subagents to investigate in parallel. Keep each subagent focused on one concern to minimize context window usage.

Subagent 1: Log Analysis

Analyze the e2e test logs for the failing test.

Read the log files in crates/e2e-tests/test_logs/<test_name>/.
Focus on:
- ERROR and WARN level messages across all nodes
- Timing of events: when did each node start, reach consensus, execute blocks
- The last meaningful events before the failure
- Any signs of: channel lag, missed messages, hung waits, timeout expiry
- Differences in event ordering between nodes (a node that's behind or ahead)

Log format: [TIMESTAMP] [LEVEL] [TARGET]: message [field=value ...]
Node logs: node<N>-run<N>.log (stdout) and .stderr.log

Report: timeline of significant events per node, anomalies, and the specific point of failure.
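The filtering described above can be sketched with a small parser. It assumes the square brackets in the format line are placeholder notation, so a real line looks like `TIMESTAMP LEVEL TARGET: message field=value`; if the brackets are literal, the splitting needs to change. `LogLine`, `parse_line`, and `warnings_and_errors` are illustrative names, not repo APIs.

```rust
/// One parsed log line, borrowing from the original text.
#[derive(Debug, PartialEq, Eq)]
pub struct LogLine<'a> {
    pub timestamp: &'a str,
    pub level: &'a str,
    pub target: &'a str,
    pub message: &'a str,
}

/// Parse `TIMESTAMP LEVEL TARGET: message`. Returns None for lines that do
/// not fit (backtraces, blank lines, wrapped output).
pub fn parse_line(line: &str) -> Option<LogLine<'_>> {
    let line = line.trim();
    let (timestamp, rest) = line.split_once(char::is_whitespace)?;
    let (level, rest) = rest.trim_start().split_once(char::is_whitespace)?;
    // Reject lines whose second token is not a known level.
    if !matches!(level, "ERROR" | "WARN" | "INFO" | "DEBUG" | "TRACE") {
        return None;
    }
    // Module paths contain "::" but not ": ", so the first ": " ends the target.
    let (target, message) = rest.trim_start().split_once(": ")?;
    Some(LogLine { timestamp, level, target, message: message.trim() })
}

/// Keep only WARN and ERROR lines, in file order.
pub fn warnings_and_errors(log: &str) -> Vec<LogLine<'_>> {
    log.lines()
        .filter_map(parse_line)
        .filter(|l| matches!(l.level, "ERROR" | "WARN"))
        .collect()
}
```

Running `warnings_and_errors` over each node's file gives the per-node anomaly list the report asks for; trailing `field=value` pairs stay inside `message`.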

Subagent 2: Test Code Analysis

Read the failing test code and the test harness.

Read:
- The specific test file in crates/e2e-tests/tests/it/ that matches the failing test
- crates/e2e-tests/src/lib.rs (test harness: node spawning, RPC setup, log configuration)
- crates/e2e-tests/tests/it/common.rs (ProcessGuard, TestSemaphore, cleanup)

Focus on:
- What the test is asserting and the sequence of operations
- Timeout values and polling intervals
- How nodes are spawned, killed, and restarted (if applicable)
- Any assumptions about ordering or timing that could be violated
- Where the test waits for conditions and what could cause those waits to fail

Report: test flow, critical timing assumptions, and potential fragility points.

Subagent 3: Relevant Source Code

Based on the error from the failing test, search the source code for the root cause.

Key areas to investigate based on the failure type:

For consensus/round issues:
- crates/consensus/primary/src/consensus_bus.rs (channel architecture, wait functions)
- crates/consensus/primary/src/network/handler.rs (vote handling, "behind" detection)

For epoch boundary issues:
- crates/node/src/manager/node/epoch.rs (epoch transitions, shutdown coordination)
- crates/consensus/executor/src/subscriber.rs (biased select, shutdown drain)

For execution/block issues:
- crates/engine/src/lib.rs (ExecutorEngine, block building)
- crates/consensus/executor/src/subscriber.rs (consensus output forwarding)

For networking issues:
- crates/network-libp2p/ (peer discovery, connection management)

For state sync issues:
- crates/state-sync/ (sync protocol)

Search for the specific error message in the codebase. Trace the code path that produces it.
Look for:
- tokio::select! without biased (potential for missed priorities)
- watch/broadcast channel send patterns (send vs send_replace)
- Lock ordering and potential deadlocks
- Missing timeout guards on wait operations
- Assumptions about channel delivery ordering

Report: the code path that leads to the failure, any concurrency hazards found.
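As an illustration of the "missing timeout guard" hazard in the list above, the sketch below bounds a blocking receive. It uses only the standard library so it stays dependency-free; in the repo's async code the corresponding tool would be `tokio::time::timeout` wrapped around the awaited future. `WaitError` and `recv_with_guard` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why a guarded wait ended without a value (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    TimedOut,
    SenderDropped,
}

/// Wait for the next value, but never longer than `limit`.
/// An unguarded `rx.recv()` would block forever if the sender stalls.
pub fn recv_with_guard<T>(rx: &Receiver<T>, limit: Duration) -> Result<T, WaitError> {
    rx.recv_timeout(limit).map_err(|e| match e {
        RecvTimeoutError::Timeout => WaitError::TimedOut,
        RecvTimeoutError::Disconnected => WaitError::SenderDropped,
    })
}
```

Distinguishing a timeout from a dropped sender matters when reading logs: the first suggests a stall, the second a shutdown-ordering problem.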

Step 3: Synthesize and Diagnose

After subagent results return, synthesize the findings:

Correlate the timeline: Match log events across nodes with the test sequence and source code flow
Identify the race window: Pinpoint the exact timing gap or ordering violation
Trace the causal chain: From the architectural decision that created the race window, through the trigger condition, to the observable failure
Check against known patterns: Compare with previously fixed race conditions in this repo (see references/race-conditions.md)
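The timeline correlation can be sketched as a merge of per-node event lists. The sketch assumes timestamps have already been converted to milliseconds and are comparable across nodes; `Event`, `merge_timeline`, and `largest_gap_ms` are illustrative names.

```rust
/// A significant event extracted from one node's log (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub timestamp_ms: u64,
    pub node: usize,
    pub message: String,
}

/// Merge per-node event lists into one timeline ordered by timestamp.
/// The sort is stable, so events with equal timestamps keep input order.
pub fn merge_timeline(per_node: Vec<Vec<Event>>) -> Vec<Event> {
    let mut all: Vec<Event> = per_node.into_iter().flatten().collect();
    all.sort_by_key(|e| e.timestamp_ms);
    all
}

/// Largest gap between consecutive events of one node, in milliseconds.
/// Expects a timeline already sorted by `merge_timeline`. Returns None when
/// the node has fewer than two events.
pub fn largest_gap_ms(timeline: &[Event], node: usize) -> Option<u64> {
    let stamps: Vec<u64> = timeline
        .iter()
        .filter(|e| e.node == node)
        .map(|e| e.timestamp_ms)
        .collect();
    stamps.windows(2).map(|w| w[1].saturating_sub(w[0])).max()
}
```

A node whose largest gap is far above its peers' is a reasonable first place to look for the race window.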

Step 4: Explain and Solve

Present findings in this structure:

Root Cause

What happened and why, explained from the architecture down to the specific code
The race window: what two (or more) concurrent operations are competing
Why the current design allows this race

Evidence

Specific log entries, timestamps, and code locations that demonstrate the issue
How the failure differs from the successful case

Solution

Solutions should follow these principles for this codebase:

Simplify over coordinate - If the fix requires adding another mutex, channel, or synchronization point, consider whether the architecture can be simplified instead. Complex coordination is the source of most race conditions here.
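Use biased select for shutdown paths - tokio::select! { biased; ... } polls its branches in declaration order instead of a random order, so placing the data-processing branch ahead of the shutdown branch lets pending data be handled before a shutdown signal is acted on. This pattern has resolved multiple issues in this codebase.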
Separate "saved" from "forwarded" tracking - Don't assume that writing to DB means the downstream consumer received the data. Track forwarding progress explicitly.
Per-entity locking over global locking - When concurrent operations on different entities are safe, use per-entity locks (e.g., HashMap<Id, TokioMutex<_>>) to maximize parallelism.
Watch channels for state, broadcast for events - Use watch with send_replace() for latest-value state. Use broadcast for event streams, but handle RecvError::Lagged explicitly, since a receiver that falls behind skips the overwritten messages.
Include committed state in "behind" calculations - When checking if a component is behind, consider all state sources (execution round, processed consensus round, committed round).
Add timeout guards on waits - Even "should never block" operations need timeouts as safety nets.
Phase 2 recovery scans - For critical data paths, add startup/recovery scans that compare DB state against forwarding state to catch anything that was persisted but not delivered.
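The per-entity locking principle above can be sketched as follows. To stay dependency-free the sketch uses `std::sync::Mutex` where the text suggests a Tokio mutex; in async code that holds a guard across an `.await`, the Tokio mutex would be the appropriate substitute. `PerEntityLocks`, the `u64` id, and the counter payload are all illustrative.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// One lock per entity, so work on different ids does not contend.
/// Hypothetical type: the payload here is just a counter.
#[derive(Default)]
pub struct PerEntityLocks {
    locks: Mutex<HashMap<u64, Arc<Mutex<u64>>>>,
}

impl PerEntityLocks {
    /// Fetch (or create) the lock for `id`. The outer map lock is held only
    /// for this lookup and is released before the entity lock is taken.
    fn slot(&self, id: u64) -> Arc<Mutex<u64>> {
        let mut map = self.locks.lock().unwrap();
        let slot = map.entry(id).or_default();
        Arc::clone(slot)
    }

    /// Increment the counter stored for `id` and return the new value.
    pub fn bump(&self, id: u64) -> u64 {
        let slot = self.slot(id);
        let mut value = slot.lock().unwrap();
        *value += 1;
        *value
    }
}
```

The short-lived outer lock is the one remaining global coordination point; if even that shows up in profiles, it is a sign the state may be better partitioned than locked.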

Provide:

The specific code changes needed (with file paths and line numbers)
Why this solution addresses the root cause rather than masking symptoms
Any broader architectural improvements that would prevent similar issues

Key Architecture Context

Read references/architecture.md for the full crate map and data flow if you need deeper context on how the system fits together.

Read references/race-conditions.md for documented patterns of previously fixed race conditions. Check whether the current issue matches a known pattern before proposing novel solutions.