dial9-red-flags

Automated health checks for dial9 Tokio runtime traces. Detects long polls, task leaks, scheduling delays, blocking calls, queue buildup, worker imbalance, CPU contention, and span anomalies. Use when you want a quick automated assessment of trace health.

By dial9-rs. Updated 5/21/2026.

Red Flags: Automated Health Checks

Run scripts/red_flag_scan.js against any trace to surface common Tokio runtime problems.

node scripts/red_flag_scan.js <trace.bin or directory>

Each finding has a severity: critical, warning, or info.

Checks performed

long-poll

A single .poll() call took too long, which blocks the worker from processing other tasks. The fixed cutoffs (>10ms warning, >50ms critical) are a coarse default, not a universal truth: "long" is relative to this runtime's own poll distribution. In a service whose p99 poll is 500µs, a 1ms poll is a severe tail outlier that these cutoffs miss entirely; in a batch job whose p99 is 40ms, a 20ms poll is normal. Calibrate against pollDurationByLoc (p50/p99 per spawn location) before trusting an absolute threshold.

Look at poll.cpuSamples and poll.schedSamples for stack traces. To root-cause why a flagged poll was long, especially an off-CPU one with no scheduling stacks, use the dial9-diagnose-long-poll skill (which thresholds on p99 by default) and dial9-zoom-window to inspect the surrounding instant.
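
The idea of a distribution-relative cutoff can be sketched as follows. This is a minimal illustration, not how red_flag_scan.js is implemented: the flat `durationsUs` array, the `relativeLongPolls` helper, and the `factor` multiplier are all assumptions made for the example, and the real shape of pollDurationByLoc may differ.

```javascript
// Nearest-rank percentile over an array of numbers (q in [0, 1]).
function percentile(values, q) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

// Flag polls that exceed `factor` times the p99 of their own distribution.
// `durationsUs` is a hypothetical flat array of poll durations in microseconds.
function relativeLongPolls(durationsUs, factor = 2) {
  const p99 = percentile(durationsUs, 0.99);
  const cutoff = p99 * factor;
  return { p99, cutoff, outliers: durationsUs.filter((d) => d > cutoff) };
}

// A mostly fast service with one 5ms poll: far below the absolute 10ms
// warning cutoff, but well above this runtime's own tail.
const samplePolls = [...Array(99).fill(500), 600, 5000];
console.log(relativeLongPolls(samplePolls));
```

A relative cutoff like this complements the absolute thresholds rather than replacing them: it catches tail outliers in fast services and avoids flagging routine polls in batch jobs.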

task-leak

Active task count grows without bound. Tasks are spawned but never complete. Check taskSpawnLocs for spawn locations of unterminated tasks.
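
One way to reason about "grows without bound" is to fit a line through the active task count over time and look at its slope. The sketch below assumes a hypothetical list of `{ tSec, active }` samples and a hypothetical `minSlope` threshold; the scanner's actual heuristic is not documented here and may differ.

```javascript
// Least-squares slope of active task count over time (tasks per second).
// `samples` is a hypothetical array of { tSec, active } points.
function activeTaskSlope(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, p) => sum + p.tSec, 0) / n;
  const meanA = samples.reduce((sum, p) => sum + p.active, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.tSec - meanT) * (p.active - meanA);
    den += (p.tSec - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Treat sustained growth above `minSlope` tasks/second as a possible leak.
function looksLikeTaskLeak(samples, minSlope = 1) {
  return activeTaskSlope(samples) > minSlope;
}

const growing = [0, 1, 2, 3, 4].map((t) => ({ tSec: t, active: 10 + 5 * t }));
console.log(activeTaskSlope(growing), looksLikeTaskLeak(growing));
```

A positive slope alone is not proof of a leak (a warm-up phase also grows), so it is worth confirming against taskSpawnLocs that the same spawn locations keep accumulating unterminated tasks.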

sched-delay

Time between Waker::wake() and the task being polled exceeds 5ms. This usually means all workers are busy when the task becomes runnable. Fix: shorter polls, more workers, or yield points.
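
The wake-to-poll measurement can be illustrated with a short sketch. The record shape (`taskId`, `wakeNs`, `pollStartNs`) and the `slowWakeups` helper are hypothetical names chosen for the example; only the 5ms limit comes from the description above.

```javascript
// 5ms limit from the check description, expressed in nanoseconds.
const SCHED_DELAY_LIMIT_NS = 5_000_000;

// `wakeups` is a hypothetical array of { taskId, wakeNs, pollStartNs }.
// Returns the wakeups whose wake-to-poll delay exceeds the limit.
function slowWakeups(wakeups, limitNs = SCHED_DELAY_LIMIT_NS) {
  return wakeups
    .map((w) => ({ taskId: w.taskId, delayNs: w.pollStartNs - w.wakeNs }))
    .filter((w) => w.delayNs > limitNs);
}

console.log(
  slowWakeups([
    { taskId: 1, wakeNs: 0, pollStartNs: 1_000_000 },
    { taskId: 2, wakeNs: 0, pollStartNs: 7_000_000 },
  ])
);
```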

blocking-calls

Scheduling samples (source=1) reveal blocking system calls (file I/O, DNS resolution, mutex contention) on the async runtime. These should use spawn_blocking or a dedicated thread.

queue-depth

Global injection queue exceeds 100 (warning) or 1000 (critical). The runtime cannot keep up with incoming work.
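
The two thresholds map onto severities as in this sketch. The function names and the `depthSamples` array are illustrative assumptions; the cutoffs (100 and 1000) are the ones stated above.

```javascript
// Map a global injection queue depth to a severity, or null if healthy.
function queueDepthSeverity(depth) {
  if (depth > 1000) return "critical";
  if (depth > 100) return "warning";
  return null;
}

// Severity of the worst sample in a hypothetical series of queue depths.
function worstQueueSeverity(depthSamples) {
  if (depthSamples.length === 0) return null;
  return queueDepthSeverity(Math.max(...depthSamples));
}

console.log(worstQueueSeverity([3, 40, 250, 12]));
```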

worker-imbalance

Poll counts differ by more than 3x across workers. Work-stealing may not be distributing evenly, or one worker is stuck on long polls.
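
A minimal sketch of the ratio test, assuming it compares the busiest worker to the least busy one (the description does not say whether the scanner uses max/min or a comparison against the mean). `pollCounts` is a hypothetical array with one poll count per worker.

```javascript
// Compare the busiest and least busy worker by poll count.
// `pollCounts` is a hypothetical array with one entry per worker.
function workerImbalance(pollCounts, maxRatio = 3) {
  if (pollCounts.length === 0) return { ratio: 1, imbalanced: false };
  const busiest = Math.max(...pollCounts);
  const idlest = Math.min(...pollCounts);
  let ratio;
  if (idlest === 0) {
    // Avoid dividing by zero: an idle worker next to a busy one is
    // treated as infinitely imbalanced; all-idle is treated as balanced.
    ratio = busiest === 0 ? 1 : Infinity;
  } else {
    ratio = busiest / idlest;
  }
  return { ratio, imbalanced: ratio > maxRatio };
}

console.log(workerImbalance([100, 120, 90, 400]));
```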

cpu-contention

Workers are active but spend less than 50% of wall time on CPU. The kernel is likely descheduling them because of CPU contention.
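
The on-CPU ratio behind this check can be sketched as below. The per-worker fields `activeWallNs` and `onCpuNs` are hypothetical names for the example; only the 50% threshold comes from the description.

```javascript
// `workers` is a hypothetical array of { worker, activeWallNs, onCpuNs }.
// Returns workers whose on-CPU share of active wall time is below minRatio.
function cpuStarvedWorkers(workers, minRatio = 0.5) {
  return workers
    .filter((w) => w.activeWallNs > 0) // skip workers that were never active
    .map((w) => ({ worker: w.worker, cpuRatio: w.onCpuNs / w.activeWallNs }))
    .filter((w) => w.cpuRatio < minRatio);
}

console.log(
  cpuStarvedWorkers([
    { worker: 0, activeWallNs: 1000, onCpuNs: 900 },
    { worker: 1, activeWallNs: 1000, onCpuNs: 300 },
  ])
);
```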

kernel-sched-wait

Worker unpark takes more than 1ms of kernel scheduling wait. Indicates CPU contention at the OS level.

many-spans-per-poll

A single poll contains more than 20 span enter/exit pairs. Usually a tight loop without yielding.

span-duration-outlier

A span whose duration exceeds 10x the P50 for its name. Flags individual slow operations.
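
The per-name comparison can be sketched like this. The `{ name, durationUs }` span shape and the helper names are assumptions for illustration; the 10x factor is the one stated above.

```javascript
// Median (P50) of a non-empty array of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// `spans` is a hypothetical array of { name, durationUs }.
// Returns spans whose duration exceeds `factor` times the P50 for their name.
function spanOutliers(spans, factor = 10) {
  const durationsByName = new Map();
  for (const s of spans) {
    if (!durationsByName.has(s.name)) durationsByName.set(s.name, []);
    durationsByName.get(s.name).push(s.durationUs);
  }
  const p50ByName = new Map();
  for (const [name, durations] of durationsByName) {
    p50ByName.set(name, median(durations));
  }
  return spans
    .filter((s) => s.durationUs > factor * p50ByName.get(s.name))
    .map((s) => ({ ...s, p50: p50ByName.get(s.name) }));
}

console.log(
  spanOutliers([
    { name: "db_query", durationUs: 10 },
    { name: "db_query", durationUs: 12 },
    { name: "db_query", durationUs: 11 },
    { name: "db_query", durationUs: 500 },
  ])
);
```

Grouping by name matters because a 500µs span may be an outlier for one operation and routine for another.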

unmatched-spans

Spans with enter but no exit. Small counts are normal at segment boundaries. Large counts may indicate task cancellation or a bug in span instrumentation.
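
Matching enters to exits can be sketched with a per-span counter. The event shape (`kind`, `spanId`) is a hypothetical simplification of whatever the trace actually records, and the sketch assumes a span may be entered more than once.

```javascript
// `events` is a hypothetical, time-ordered array of
// { kind: "enter" | "exit", spanId }.
// Returns the ids of spans that still have an enter without a matching exit.
function unmatchedSpans(events) {
  const open = new Map(); // spanId -> number of enters not yet exited
  for (const e of events) {
    const count = open.get(e.spanId) ?? 0;
    if (e.kind === "enter") {
      open.set(e.spanId, count + 1);
    } else if (e.kind === "exit" && count > 0) {
      open.set(e.spanId, count - 1);
    }
    // An exit with no prior enter is ignored here; it typically means the
    // enter fell before the start of the segment.
  }
  return [...open].filter(([, count]) => count > 0).map(([spanId]) => spanId);
}

console.log(
  unmatchedSpans([
    { kind: "enter", spanId: "a" },
    { kind: "enter", spanId: "b" },
    { kind: "exit", spanId: "a" },
  ])
);
```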

Install via CLI
npx skills add https://github.com/dial9-rs/dial9 --skill dial9-red-flags
Repository Details
Stars: 353
Forks: 28
Branch: main
Path: SKILL.md