name: sv-eval description: Run and analyze Security Verifiers evaluations. Use when asked to evaluate models on E1 (network-logs) or E2 (config-verification), generate metrics reports, compare model performance, or analyze eval results. metadata: author: security-verifiers version: "1.0"
Security Verifiers Evaluation
Run reproducible evaluations on E1 (network-logs) and E2 (config-verification) environments, generate metrics reports, and analyze results.
Prerequisites
Ensure environment variables are set (check .env file):
OPENAI_API_KEY- For OpenAI models (gpt-*)OPENROUTER_API_KEY- For non-OpenAI models via OpenRouterWANDB_API_KEY- For Weave logging (optional)
Running Evaluations
E1: Network Logs (Classification)
Single-turn anomaly detection with calibration and asymmetric costs.
# Basic eval (10 examples)
make eval-e1 MODELS="gpt-5-mini" N=10
# Multiple models
make eval-e1 MODELS="gpt-5-mini,gpt-4.1-mini,qwen3-14b" N=100
# Full production dataset
make eval-e1 MODELS="gpt-5-mini" N=1800 DATASET="iot23-train-dev-test-v1.jsonl"
# OOD datasets
make eval-e1 MODELS="gpt-5-mini" N=600 DATASET="cic-ids-2017-ood-v1.jsonl"
make eval-e1 MODELS="gpt-5-mini" N=600 DATASET="unsw-nb15-ood-v1.jsonl"
E2: Config Verification (Tool-Use)
Multi-turn tool-grounded auditing with KubeLinter, Semgrep, and OPA.
# Basic eval with tools (2 examples)
make eval-e2 MODELS="gpt-5-mini" N=2 INCLUDE_TOOLS=true
# Without tools (single-turn)
make eval-e2 MODELS="gpt-5-mini" N=10 INCLUDE_TOOLS=false
# Dataset options
make eval-e2 MODELS="gpt-5-mini" N=50 DATASET="k8s-labeled-v1.jsonl"
make eval-e2 MODELS="gpt-5-mini" N=50 DATASET="terraform-labeled-v1.jsonl"
make eval-e2 MODELS="gpt-5-mini" N=100 DATASET="combined" # default
Error Handling
Early stopping prevents wasted API costs:
# Stop after 5 consecutive errors (default: 3)
make eval-e1 MODELS="gpt-5-mini" N=100 MAX_CONSECUTIVE_ERRORS=5
# Disable early stopping
make eval-e1 MODELS="gpt-5-mini" N=100 MAX_CONSECUTIVE_ERRORS=0
Generating Reports
SV-Bench supports two kinds of reporting:
- Per-run contract reports:
summary.json(schema-stable) +report.md(human readable) - Cross-run comparison reports: timestamped
report-*.jsonaggregations for quick comparisons
Per-run SV-Bench report (contract-grade)
# E1
WEAVE_DISABLED=true .venv/bin/svbench_report --env e1 --input outputs/evals/sv-env-network-logs--gpt-5-mini/<run_id> --strict
# E2
WEAVE_DISABLED=true .venv/bin/svbench_report --env e2 --input outputs/evals/sv-env-config-verification--gpt-5-mini/<run_id> --strict
E1 comparison report (legacy-friendly JSON table)
# All non-archived runs
make report-network-logs
# Specific runs
make report-network-logs RUN_IDS="run_abc123 run_def456"
# Custom output path
make report-network-logs OUTPUT="reports/e1-comparison.json"
E2 comparison report (legacy-friendly JSON table)
make report-config-verification
make report-config-verification RUN_IDS="run_abc123"
Understanding Results
Output Structure
outputs/evals/sv-env-{name}--{model}/{run_id}/
├── metadata.json # Run config, versions, git SHA
├── results.jsonl # Per-example results
├── summary.json # SV-Bench schema summary (contract-grade)
└── report.md # Human-readable report
Note: OpenRouter models create nested directories:
outputs/evals/sv-env-network-logs--qwen/qwen-2.5-7b-instruct/{run_id}/
outputs/evals/sv-env-network-logs--meta-llama/llama-3.1-8b-instruct/{run_id}/
The report scripts use recursive glob to find all runs regardless of nesting depth.
Key E1 Metrics (network-logs)
| Metric | Description | Target |
|---|---|---|
| Accuracy | Overall classification accuracy | Higher is better |
| ECE | Expected Calibration Error | Lower is better |
| FN% | False negative rate (missed threats) | Minimize |
| FP% | False positive rate | Minimize |
| Abstain% | Abstention rate | Context-dependent |
Key E2 Metrics (config-verification)
| Metric | Description | Target |
|---|---|---|
| MeanReward | Average episode reward | Higher is better |
| FormatSuccess% | Valid JSON output rate | 100% |
| AvgTools | Tool calls per episode | Lower is efficient |
| AvgTurns | Turns per episode | Lower is efficient |
Model Routing
Model names are auto-resolved via scripts/model_router.py:
- OpenAI models:
gpt-5-mini,gpt-4.1-mini,o1-mini - OpenRouter models:
qwen3-14b→qwen/qwen3-14b,llama-3.1-8b→meta-llama/llama-3.1-8b-instruct
Comparing Models
- Run evals with same parameters:
make eval-e1 MODELS="gpt-5-mini,gpt-4.1-mini,qwen3-14b" N=500
- Generate report:
make report-network-logs
- Review summary.json in each run directory for per-model metrics.
Troubleshooting
Rate limits: Reduce N or use MAX_CONSECUTIVE_ERRORS.
Missing API key: Check .env has correct key for model provider.
Model not found: Use full OpenRouter path (e.g., openai/gpt-5-mini).
FormatSuccess% = 0%
If a model shows 0% format success, common causes:
Markdown code blocks: Many models (gpt-4o-mini, qwen, llama) wrap JSON in
```json...```blocks. The parsers handle this automatically viaextract_json_from_markdown()insv_shared/parsers.py.Model outputs prose instead of JSON: Some models explain what they would do rather than returning the required JSON format. This is model behavior, not a parsing bug. Check
results.jsonlcompletions.Empty prompts: If
promptfield in results.jsonl is empty, the dataset loading may have failed. Check that:- Hub datasets have correct schema (
questionfield) - Local JSONL files exist in
environments/sv-env-*/data/ HF_TOKENis set for Hub access
- Hub datasets have correct schema (
Investigating Failed Runs
# Check a specific result file
cat outputs/evals/sv-env-*--{model}/{run_id}/results.jsonl | head -1 | python -m json.tool
# Look for empty prompts (indicates dataset loading issue)
grep '"prompt": ""' outputs/evals/sv-env-*--{model}/{run_id}/results.jsonl
# Check completion format
cat outputs/evals/sv-env-*--{model}/{run_id}/results.jsonl | python -c "
import json, sys
for line in sys.stdin:
r = json.loads(line)
print(f'Index {r[\"index\"]}: format_reward={r[\"rewards\"][\"format_reward\"]}')
print(f' completion: {repr(r[\"completion\"][:100])}...')
"