name: eagle3-validate description: > Validate that an EAGLE3 pipeline run completed successfully end-to-end. Checks all 4 steps produced expected artifacts, verifies acceptance rate meets threshold (>= 2.1), and produces a summary report. Use when user wants to verify a pipeline run or check benchmark results. user_invocable: true
EAGLE3 Pipeline Validation
Verify that an EAGLE3 pipeline run completed successfully and meets quality criteria.
Step 0 — Identify the experiment
Find the most recent experiment directory (or ask the user for the path):
ls -td experiments/cicd/cicd_* | head -5
Each experiment directory has one subdirectory per task (numbered 0–3), each containing a
log file whose name varies by launch mode (Slurm: sbatch_*.out, local Docker: *.log).
Step 1 — Check task outcomes
Match the log files generally and read the tail of each:
find experiments/<exp_id>/ -type f \( -name '*.out' -o -name '*.log' \) | sort | while read -r f; do
echo "=== $f ==="; tail -50 "$f"; echo
done
All 4 tasks must complete without error. Look for:
exit code: 0or no error — successDUE TO TIME LIMIT— timeoutFAILED/signal/ exception traceback — failure
If any task failed, suggest running /eagle3-triage instead.
Step 2 — Verify artifacts exist
Check each step produced the expected output (artifacts live on the cluster at /scratchspace/).
Confirm via log messages:
| Step | Expected log evidence | Artifact |
|---|---|---|
| task_0 | "Saved N samples" or progress bar completing | /scratchspace/data/*.jsonl |
| task_1 | "Successfully processed N conversations" | /scratchspace/offline_hidden_states/*.pt |
| task_2 | Training loss decreasing, "export complete" | /scratchspace/eagle3/model.safetensors, /scratchspace/export/ |
| task_3 | Average Acceptance Length ... ratio: X.XX |
JSON result files |
Step 3 — Check acceptance rate
In the task_3 log, find:
Average Acceptance Length {'accept': X, 'count': Y, 'ratio': Z.ZZ}
The ratio field is the acceptance rate (AR).
| Criterion | Threshold | Status |
|---|---|---|
| AR (MT-Bench) | >= 2.1 | PASS / FAIL |
If the log shows AR ... < lower bound, the run already triggered a threshold failure (exit code 1).
Step 4 — Check training quality
In the task_2 log look for:
- Final training loss — should be decreasing, not NaN
- AR validation during training (if
training.ar_validate_stepswas set) - Number of training steps — confirms full training duration
Step 5 — Produce validation report
## EAGLE3 Pipeline Validation Report
**Experiment:** <exp_dir>
**Model:** <model_name>
**Date:** <date>
**Pipeline config:** <yaml_path>
### Step Status
| Step | Task | Status | Notes |
|------|------|--------|-------|
| 0 | Data synthesis | PASS/FAIL/TIMEOUT | N samples generated |
| 1 | Hidden state dump | PASS/FAIL | N .pt files |
| 2 | Training + export | PASS/FAIL | Final loss: X.XX |
| 3 | Benchmark | PASS/FAIL | AR: X.XX |
### Acceptance Rate
- MT-Bench AR: X.XX (threshold: >= 2.1) — PASS/FAIL
### Training Summary
- Final loss: X.XX
- Training steps: N
- AR during training: X.XX (if validated)
### Overall: PASS / FAIL
<one-line summary>
Step 6 — Suggest next steps
If PASS:
- Record the verified result (and checkpoint path) in the team's internal triage tracker
- This model is now a candidate to add as a launcher example in a dedicated PR
If FAIL:
- Identify which step or metric failed
- Suggest running
/eagle3-triagefor diagnosis - For a low AR, diagnose the specific cause from the run (training loss curve, data volume/quality, draft-head capacity, hyperparameters) and suggest fixes targeted to that scenario — low AR can have many causes, so avoid a generic checklist.