name: dial9-s3-analysis
description: Analyze dial9 Tokio runtime traces stored in S3 buckets. Use when a user provides an S3 bucket containing dial9 traces and wants to understand runtime behavior, diagnose performance issues, or explore what data is available.
dial9 S3 Bucket Trace Analysis
Overview
This skill guides you through analyzing dial9 trace data stored in S3. The workflow has three phases:
- Discovery — explore the bucket to show what services, hosts, and time ranges are available
- Retrieval — download and decompress trace files
- Analysis — run the analysis toolkit to produce diagnostic reports
Prerequisites
- AWS CLI configured with read access to the target bucket
- dial9 CLI installed (cargo install dial9 or cargo binstall dial9)
- Node.js 14+ for running the analysis toolkit
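A quick preflight check can confirm these tools are present before starting. This is only a sketch: d9_check_prereqs and d9_node_major are hypothetical helper names, not part of dial9, and the version parsing assumes that node --version prints a string shaped like v18.19.0.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the major version from a string like "v18.19.0".
d9_node_major() {
  local v="${1#v}"          # drop the leading "v"
  printf '%s\n' "${v%%.*}"  # keep everything before the first dot
}

# Hypothetical helper: report missing tools on stderr.
# Returns non-zero if any check fails.
d9_check_prereqs() {
  local failed=0 tool major
  for tool in aws dial9 node; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      failed=1
    fi
  done
  if command -v node >/dev/null 2>&1; then
    major="$(d9_node_major "$(node --version)")"
    if [ "$major" -lt 14 ]; then
      echo "node major version $major is older than 14" >&2
      failed=1
    fi
  fi
  return "$failed"
}

d9_check_prereqs || echo "resolve the items above before continuing" >&2
```

The check only verifies that the commands exist; read access to the bucket still has to be confirmed separately.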
Phase 1: Discovery
Present the user with what's in the bucket before doing any analysis.
Discover bucket structure
# List date range
aws s3 ls s3://BUCKET/ --region REGION
# List services for a given date/hour
aws s3 ls s3://BUCKET/YYYY-MM-DD/HHMM/ --region REGION
# List all unique service/host/boot_id combinations (assumes no prefix)
aws s3 ls s3://BUCKET/ --recursive --region REGION \
| awk '{print $4}' | awk -F'/' '{if (NF>=5) print $3"/"$4"/"$5}' | sort -u
Expected key structure
dial9 S3 uploads follow this layout:
{prefix/}{YYYY-MM-DD}/{HHMM}/{service_name}/{hostname}/{boot_id}/{epoch_secs}-{segment_index}.bin.gz
| Component | Meaning |
|---|---|
| prefix | Optional. Value of DIAL9_S3_PREFIX (default: none, keys start at date). |
| YYYY-MM-DD | UTC date. |
| HHMM | UTC hour+minute bucket (rotation time determines granularity; default 60s means most keys land on the hour). |
| service_name | Value of DIAL9_SERVICE_NAME or the binary name. |
| hostname | Machine hostname (e.g. ip-10-0-3-249.ec2.internal). |
| boot_id | 4 random alpha chars + PID, generated at process start (e.g. nygg-1). Disambiguates restarts on the same host. |
| epoch_secs-segment_index | Unix timestamp of segment start + segment sequence number. |
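To make the layout concrete, the sketch below splits one key into its components. The function name d9_parse_key and the sample key are made up for illustration. Fields are counted from the right, on the assumption that the last six path components always follow the layout above, so an optional prefix does not shift them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a dial9 S3 key into named components.
# Counts fields from the right so an optional prefix (even one containing
# slashes) does not change which field is which.
d9_parse_key() {
  printf '%s\n' "$1" | awk -F'/' '
    NF < 6 { print "not a dial9 trace key: " $0 > "/dev/stderr"; exit 1 }
    {
      file = $NF
      sub(/\.bin(\.gz)?$/, "", file)   # strip the .bin or .bin.gz extension
      split(file, seg, "-")            # epoch_secs-segment_index
      printf "date=%s hhmm=%s service=%s host=%s boot_id=%s epoch=%s segment=%s\n",
             $(NF-5), $(NF-4), $(NF-3), $(NF-2), $(NF-1), seg[1], seg[2]
    }'
}

# Made-up key for illustration.
d9_parse_key "2025-01-15/1400/my-service/ip-10-0-3-249.ec2.internal/nygg-1/1736949600-0.bin.gz"
```

Only the filename is split on the hyphen, so hyphens in the hostname or boot_id are left intact.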
Present findings to user
After discovery, present:
- Date range available
- Services found
- Number of hosts (grouped by subnet if applicable)
- Approximate data density (quiet vs busy periods — check file sizes)
Ask the user which host/time period they want to investigate, or if they want a fleet-wide overview.
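One way to gather these numbers is to aggregate a recursive listing by time bucket and host. The sketch below is a hypothetical helper (d9_density is not a dial9 command). It assumes the usual four-column output of aws s3 ls --recursive (date, time, size in bytes, key) and keys without spaces; the listing lines are invented sample data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize `aws s3 ls --recursive` output read from stdin
# as "date/HHMM service/host file_count total_bytes", one line per group.
d9_density() {
  awk '
    {
      n = split($4, p, "/")
      if (n < 6) next                  # skip anything that is not a trace key
      group = p[n-5] "/" p[n-4] " " p[n-3] "/" p[n-2]
      files[group]++
      bytes[group] += $3
    }
    END {
      for (g in files) printf "%s %d %d\n", g, files[g], bytes[g]
    }' | sort
}

# Made-up listing lines standing in for real `aws s3 ls --recursive` output.
d9_density <<'EOF'
2025-01-15 14:01:05      40213 2025-01-15/1400/my-service/host-a/nygg-1/1736949600-0.bin.gz
2025-01-15 14:02:05    2310044 2025-01-15/1400/my-service/host-b/qwer-7/1736949600-0.bin.gz
2025-01-15 14:03:05    1800000 2025-01-15/1400/my-service/host-b/qwer-7/1736949660-1.bin.gz
EOF
```

With a real bucket, the listing command from the discovery step can be piped straight into the function. Groups with many files or large byte totals are the busy periods.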
Phase 2: Retrieval
Download trace files
# Single file
aws s3 cp s3://BUCKET/path/to/file.bin.gz /tmp/d9-traces/ --region REGION
# All files for a host in a time window
aws s3 cp s3://BUCKET/YYYY-MM-DD/HHMM/service/host/ /tmp/d9-traces/ \
--recursive --region REGION
Decompress
analyze.js requires decompressed .bin files:
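# Recursive downloads keep the boot_id subdirectories, so search below /tmp/d9-traces/
find /tmp/d9-traces -name '*.gz' -exec gunzip {} +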
Note: If writing custom scripts with parseTrace() directly, it handles .bin.gz files transparently — decompression is only needed for the analyze.js CLI.
Phase 3: Analysis
Extract the toolkit
dial9 agents toolkit /tmp/d9-toolkit
Run automated analysis
# Single file
node /tmp/d9-toolkit/analyze.js /tmp/d9-traces/file.bin
# All files in a directory
node /tmp/d9-toolkit/analyze.js /tmp/d9-traces/
# Large datasets: sample a subset
node /tmp/d9-toolkit/analyze.js /tmp/d9-traces/ --sample 50
Interpret results
The analyzer reports:
| Section | What to look for |
|---|---|
| Setup diagnostic | Missing data sources (scheduling events, CPU profiling) |
| Worker utilization | Imbalanced workers, low utilization (underloaded) or >95% (saturated) |
| Long polls | Polls >1ms indicate blocking work on the runtime; >10ms is critical |
| Scheduling delays | Wake-to-poll latency >1ms means tasks waiting in queue |
| Poll duration by spawn | Which code paths are slowest |
| CPU hotspots | Where CPU time is actually spent (requires CPU profiling enabled) |
| Queue depth | High global queue = workers can't keep up |
| Kernel scheduling | High kernel wait = noisy neighbors or CPU contention |
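The long-poll thresholds in the table can be applied mechanically when triaging many values. The sketch below is a hypothetical helper, not part of the toolkit. It assumes durations are already available as whole microseconds and makes no assumption about the analyzer's actual output format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label a poll duration given in whole microseconds,
# using the thresholds from the table above (>1ms blocking, >10ms critical).
d9_classify_poll_us() {
  if [ "$1" -gt 10000 ]; then
    echo critical
  elif [ "$1" -gt 1000 ]; then
    echo blocking
  else
    echo ok
  fi
}

# Made-up durations for illustration.
for us in 250 1500 12000; do
  printf '%s us: %s\n' "$us" "$(d9_classify_poll_us "$us")"
done
```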
When to use other skills
After running the automated analysis:
- dial9-trace-recipes: Answer specific diagnostic questions (task leaks, blocking calls, wake chains)
- dial9-red-flags: Quick automated health check with fix suggestions
- dial9-runtime: Understand runtime behavior from first principles
- dial9-trace-loading: Parse traces programmatically for custom analysis
dial9 agents skill dial9-trace-recipes
dial9 agents skill dial9-red-flags
Choosing what to analyze
| Goal | What to pull |
|---|---|
| "Is the service healthy?" | One recent file from any host |
| "Something happened at time X" | All files from the relevant HHMM bucket |
| "Compare hosts" | Same time period from multiple hosts |
| "Track down a latency spike" | Files from the specific hour on the affected host |
| "Fleet overview" | One file per host from the same time window |
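For the fleet overview row, picking one file per host can be done by filtering a list of keys. The sketch below is a hypothetical helper (d9_one_per_host is not a dial9 command) that keeps the first key seen for each service/host pair; the sample keys are invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read S3 keys on stdin and print the first key seen for
# each service/host pair. Fields are counted from the right so an optional
# prefix does not matter.
d9_one_per_host() {
  awk -F'/' 'NF >= 6 && !seen[$(NF-3) "/" $(NF-2)]++'
}

# Made-up keys for illustration.
d9_one_per_host <<'EOF'
2025-01-15/1400/my-service/host-a/nygg-1/1736949600-0.bin.gz
2025-01-15/1400/my-service/host-a/nygg-1/1736949660-1.bin.gz
2025-01-15/1400/my-service/host-b/qwer-7/1736949600-0.bin.gz
EOF
```

With real data, the keys could come from a recursive listing of one HHMM bucket, with the key column extracted by awk '{print $4}'. Each resulting key can then be downloaded with aws s3 cp.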
Tips
- File size indicates load: Quiet periods typically produce ~35-45KB files; busy periods produce 1-5MB+ files per segment
- Multiple segments per hour: Under load, trace rotation produces many files per time bucket; analyze them together by pointing analyze.js at the directory
- Boot IDs are per-process: The 4-char ID (e.g. nygg) is generated at process start. After a restart or deploy, the same host gets a new boot_id
- Epoch in filename: The leading number in the filename is the Unix timestamp when that segment started; use it to pick the right file for a time window
- Large time windows: For fleet-wide analysis across hundreds of files, use --sample 50 to analyze a representative subset
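The epoch tip can be sketched as two small helpers: one turns a filename timestamp into a readable UTC time, the other picks the segment that was most recently started at a given moment. Both names are hypothetical. The second helper assumes its input is limited to a single host and boot_id, and that a segment covers the time from its own start until the next segment starts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Unix timestamp as a UTC time.
# Tries the GNU date form first, then the BSD/macOS form.
d9_epoch_to_utc() {
  date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC' 2>/dev/null ||
    date -u -r "$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# Hypothetical helper: from keys on stdin, print the one whose segment start
# is the latest that is not after the target timestamp. Fails if none match.
d9_segment_for() {
  awk -F'/' -v target="$1" '
    {
      start = $NF
      sub(/-.*/, "", start)            # keep the leading epoch_secs
      if (start ~ /^[0-9]+$/ && start + 0 <= target + 0 && start + 0 >= best) {
        best = start + 0
        key = $0
      }
    }
    END { if (key == "") exit 1; print key }'
}

# Made-up keys for illustration.
d9_epoch_to_utc 1736949600
d9_segment_for 1736949700 <<'EOF'
2025-01-15/1400/my-service/host-a/nygg-1/1736949600-0.bin.gz
2025-01-15/1400/my-service/host-a/nygg-1/1736949660-1.bin.gz
2025-01-15/1400/my-service/host-a/nygg-1/1736949720-2.bin.gz
EOF
```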
Troubleshooting
- "Access Denied" or "NoSuchBucket": Verify credentials with aws sts get-caller-identity and check bucket region
- Empty bucket listings: Verify date format is YYYY-MM-DD, region is correct, and prefix matches
- dial9 not found: cargo install dial9 or cargo binstall dial9
- Analysis errors on .gz files: Decompress first; analyze.js requires raw .bin input
- "Unknown frame tag" errors: Toolkit version is older than the trace format; update dial9 with cargo install dial9