eagle3-validate - SKILL.md Agent Skill

name: eagle3-validate description: > Validate that an EAGLE3 pipeline run completed successfully end-to-end. Checks all 4 steps produced expected artifacts, verifies acceptance rate meets threshold (>= 2.1), and produces a summary report. Use when user wants to verify a pipeline run or check benchmark results. user_invocable: true

EAGLE3 Pipeline Validation

Verify that an EAGLE3 pipeline run completed successfully and meets quality criteria.

Step 0 — Identify the experiment

Find the most recent experiment directory (or ask the user for the path):

ls -td experiments/cicd/cicd_* | head -5

Each experiment directory has one subdirectory per task (numbered 0–3), each containing a log file whose name varies by launch mode (Slurm: sbatch_*.out, local Docker: *.log).

Step 1 — Check task outcomes

Match the log files generally and read the tail of each:

find experiments/<exp_id>/ -type f \( -name '*.out' -o -name '*.log' \) | sort | while read -r f; do
  echo "=== $f ==="; tail -50 "$f"; echo
done

All 4 tasks must complete without error. Look for:

exit code: 0 or no error — success
DUE TO TIME LIMIT — timeout
FAILED / signal / exception traceback — failure

If any task failed, suggest running /eagle3-triage instead.

Step 2 — Verify artifacts exist

Check each step produced the expected output (artifacts live on the cluster at /scratchspace/). Confirm via log messages:

Step	Expected log evidence	Artifact
task_0	"Saved N samples" or progress bar completing	`/scratchspace/data/*.jsonl`
task_1	"Successfully processed N conversations"	`/scratchspace/offline_hidden_states/*.pt`
task_2	Training loss decreasing, "export complete"	`/scratchspace/eagle3/model.safetensors`, `/scratchspace/export/`
task_3	`Average Acceptance Length ... ratio: X.XX`	JSON result files

Step 3 — Check acceptance rate

In the task_3 log, find:

Average Acceptance Length {'accept': X, 'count': Y, 'ratio': Z.ZZ}

The ratio field is the acceptance rate (AR).

Criterion	Threshold	Status
AR (MT-Bench)	>= 2.1	PASS / FAIL

If the log shows AR ... < lower bound, the run already triggered a threshold failure (exit code 1).

Step 4 — Check training quality

In the task_2 log look for:

Final training loss — should be decreasing, not NaN
AR validation during training (if training.ar_validate_steps was set)
Number of training steps — confirms full training duration

Step 5 — Produce validation report

## EAGLE3 Pipeline Validation Report

**Experiment:** <exp_dir>
**Model:** <model_name>
**Date:** <date>
**Pipeline config:** <yaml_path>

### Step Status
| Step | Task | Status | Notes |
|------|------|--------|-------|
| 0 | Data synthesis | PASS/FAIL/TIMEOUT | N samples generated |
| 1 | Hidden state dump | PASS/FAIL | N .pt files |
| 2 | Training + export | PASS/FAIL | Final loss: X.XX |
| 3 | Benchmark | PASS/FAIL | AR: X.XX |

### Acceptance Rate
- MT-Bench AR: X.XX (threshold: >= 2.1) — PASS/FAIL

### Training Summary
- Final loss: X.XX
- Training steps: N
- AR during training: X.XX (if validated)

### Overall: PASS / FAIL
<one-line summary>

Step 6 — Suggest next steps

If PASS:

Record the verified result (and checkpoint path) in the team's internal triage tracker
This model is now a candidate to add as a launcher example in a dedicated PR

If FAIL:

Identify which step or metric failed
Suggest running /eagle3-triage for diagnosis
For a low AR, diagnose the specific cause from the run (training loss curve, data volume/quality, draft-head capacity, hyperparameters) and suggest fixes targeted to that scenario — low AR can have many causes, so avoid a generic checklist.