eagle3-review-logs - SKILL.md Agent Skill

name: eagle3-review-logs description: > Review EAGLE3 pipeline experiment logs from the launcher's experiments/ directory. Summarizes pass/fail status for all 4 tasks, diagnoses failures with root causes and fixes, and flags warnings. Use when the user asks to review job logs, check experiment results, or diagnose why a specific task failed. user_invocable: true

Review EAGLE3 Experiment Logs

Analyze output logs from an EAGLE3 pipeline run launched via launch.py or slurm.py.

Step 0 — Find experiment logs

Locate the experiment directory. The default is experiments/ relative to the launcher root, or wherever --job-dir was pointed.

ls -td experiments/cicd/cicd_* | head -10

If no experiments exist, ask the user for the directory.

Step 1 — Read all task logs

Each experiment has one subdirectory per task (0–3). Log filenames vary by launch mode (Slurm writes sbatch_*.out, local Docker writes *.log), so match log files generally and read the tail of each in a single Bash call — errors surface at the end:

find experiments/<exp_id>/ -type f \( -name '*.out' -o -name '*.log' \) | sort | while read -r f; do
  echo "=== $f ==="; tail -200 "$f"; echo
done

Step 2 — Analyze

For each task log, check:

Exit / cancellation: DUE TO TIME LIMIT, FAILED, signal (e.g., signal 15)
Python exceptions / tracebacks: last exception is usually the root cause
CUDA errors: OOM, NCCL timeout
Slurm state: COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY
Success indicators: "Saved N samples", "Successfully processed N conversations", training loss line, AR output

Step 3 — Produce report

Output a structured markdown report:

Summary

Overall status: PASSED / FAILED / MIXED / PARTIAL
Task breakdown: e.g., task_0 TIMEOUT, task_1 FAIL, task_2 skipped, task_3 skipped

Task Results

For each task (0–3):

Task N — <name>: PASS / FAIL / TIMEOUT

Key output: (e.g., "3277/3295 samples generated" or "Script not found")
Error (if failed): quoted error message, max 10 lines
Root cause: one-line diagnosis
Suggested fix: actionable step

Warnings

Non-fatal issues worth noting (near-OOM, tokenizer warnings, slow throughput).

Step 4 — Suggest next steps

Based on results:

If a task failed due to a known issue, suggest the fix and how to re-run from that task:

uv run launch.py --yaml examples/<Org>/<Model>/hf_offline_eagle3.yaml \
    pipeline.task_0.skip=true \
    --yes

If the failure pattern looks new, suggest capturing it in the team's internal triage tracker, and use /eagle3-triage for a deeper diagnosis.
If all tasks passed, suggest running /eagle3-validate to confirm AR meets threshold.

Known benign patterns (do NOT mark as failures)

Pattern	Explanation
vLLM server exit code 143	SIGTERM — server was killed after queries completed. Expected.
`CANCELLED AT ... DUE TO TASK FAILURE` after `exit code: 0`	Slurm cleanup of worker nodes after main task succeeded.
`destroy_process_group() was not called`	Benign PyTorch shutdown warning.
`tokenizer class ... not equal to the registered tokenizer class`	Harmless tokenizer mismatch warning.