name: sv-data
description: Build and manage Security Verifiers datasets. Use when asked to build E1 or E2 datasets, create test fixtures, validate data, or manage dataset files for network-logs or config-verification environments.
metadata:
  author: security-verifiers
  version: "1.0"
Security Verifiers Data Management
Build, validate, and manage datasets for E1 (network-logs) and E2 (config-verification) environments.
Dataset Types
| Type | Purpose | Location | Committed |
|---|---|---|---|
| Production | Full training/eval data | environments/sv-env-*/data/ | No (private) |
| Test fixtures | CI/unit tests | environments/sv-env-*/data/ | Yes (small) |
| HuggingFace | Remote dataset access | HF Hub | Private repos |
E1: Network Logs Datasets
Build Production Dataset (IoT-23)
# Full dataset (1800 examples, 60/20/20 split)
make data-e1
# Custom limit
make data-e1 LIMIT=3000
Outputs: environments/sv-env-network-logs/data/iot23-train-dev-test-v1.jsonl
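As a quick check on the numbers above, a 60/20/20 split of 1800 examples works out to 1080/360/360. The sketch below only does that arithmetic. The helper name `split_sizes` is hypothetical, and how the build script actually assigns examples to splits is not covered here.

```python
def split_sizes(total: int, percents=(60, 20, 20)) -> dict:
    """Return train/dev/test counts for a given total (illustrative helper)."""
    train = total * percents[0] // 100
    dev = total * percents[1] // 100
    # Give any rounding remainder to the test split so the counts sum to total.
    test = total - train - dev
    return {"train": train, "dev": dev, "test": test}


print(split_sizes(1800))
print(split_sizes(3000))
```

With `LIMIT=3000`, the same ratios would give 1800/600/600, assuming the limit applies to the total before splitting.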
Build OOD (Out-of-Distribution) Datasets
# CIC-IDS-2017 and UNSW-NB15 (600 examples each)
make data-e1-ood
# Custom count
make data-e1-ood N=1000
Outputs:
- cic-ids-2017-ood-v1.jsonl
- unsw-nb15-ood-v1.jsonl
Build Test Fixtures
# Small datasets for CI (20-30 examples)
make data-e1-test
E2: Config Verification Datasets
Clone Source Repositories
E2 datasets are built from real Kubernetes and Terraform configs:
# Clone recommended repos
make clone-e2-sources
Clones to scripts/data/sources/:
- kubernetes/ - K8s YAML manifests
- terraform/ - Terraform HCL configs
Build Production Dataset
# Using cloned sources
make data-e2-local
# Using custom paths
make data-e2 K8S_ROOT=/path/to/k8s TF_ROOT=/path/to/terraform
Outputs:
- environments/sv-env-config-verification/data/k8s-labeled-v1.jsonl
- environments/sv-env-config-verification/data/terraform-labeled-v1.jsonl
Build Test Fixtures
# Requires clone-e2-sources first
make clone-e2-sources
make data-e2-test
Build All Datasets
# All E1 production datasets
make data-all
# All test fixtures (for CI)
make data-test-all
Data Validation
Validate datasets with Pydantic before pushing them to HuggingFace:
# Validate E1 splits
make validate-e1-data
# Validate E2 splits
make validate-e2-data
# Validate all
make validate-data
Dataset Schema
Note: Examples below show schema structure only. Actual benchmark data is gated to prevent training contamination. See plans/ROADMAP-Q1-2026.md for the benchmark integrity policy.
E1 Schema (network-logs)
Hub/local JSONL format:
{
  "question": "<network log entry - content gated>",
  "answer": "Benign|Malicious",
  "meta": {
    "source": "<dataset source>",
    "scenario": "<capture scenario>",
    "attack_family": "<attack type if malicious>",
    "hash": "<content hash>",
    "split": "train|dev|test"
  }
}
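To make the E1 schema concrete, here is a minimal standard-library sketch that checks one JSONL line against it. It is not the project's Pydantic model. The function name `check_e1_record` is hypothetical, `Benign|Malicious` and `train|dev|test` are read as "one of", and the choice of which fields to treat as required is an assumption.

```python
import json

ALLOWED_ANSWERS = {"Benign", "Malicious"}
ALLOWED_SPLITS = {"train", "dev", "test"}


def check_e1_record(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    record = json.loads(line)
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    problems = []
    if not isinstance(record.get("question"), str):
        problems.append("question must be a string")
    if record.get("answer") not in ALLOWED_ANSWERS:
        problems.append("answer must be Benign or Malicious")
    meta = record.get("meta")
    if not isinstance(meta, dict):
        problems.append("meta must be an object")
        return problems
    if meta.get("split") not in ALLOWED_SPLITS:
        problems.append("meta.split must be train, dev, or test")
    # Assumption: every record carries a non-empty content hash.
    if not meta.get("hash"):
        problems.append("meta.hash is missing")
    return problems
```

A check like this can be run line by line over a local JSONL file as a quick sanity pass before the full `make validate-e1-data` step.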
E2 Schema (config-verification)
Hub/local JSONL format:
{
  "question": "<k8s/terraform config - content gated>",
  "info": {
    "violations": [
      {
        "tool": "kube-linter|semgrep|opa",
        "rule_id": "<rule identifier>",
        "severity": "low|medium|high",
        "msg": "<violation message>",
        "loc": "<file:line if available>"
      }
    ],
    "patch": "<optional unified diff>"
  },
  "meta": {
    "lang": "k8s|terraform",
    "source": "<source repository>",
    "hash": "<content hash>"
  }
}
Schema Conversion
When datasets are loaded, the environment converts them to internal format:
- E1: question → prompt, answer → expected label
- E2: question → prompt, info → answer (JSON string with oracle violations)
The E2 conversion happens in _convert_e2_format() in sv_env_config_verification.py.
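The E2 mapping can be sketched roughly as follows. This is an illustration of the idea, not the body of `_convert_e2_format()`. The function name `convert_e2_record` is hypothetical, and serializing the whole `info` object (rather than only `violations`) is an assumption.

```python
import json


def convert_e2_record(record: dict) -> dict:
    """Map a Hub/local E2 record to a prompt/answer pair (illustrative)."""
    info = record.get("info", {})
    return {
        "prompt": record["question"],
        # The oracle data travels as a JSON string, not as a nested object.
        "answer": json.dumps(info, sort_keys=True),
    }


example = {
    "question": "<config text>",
    "info": {
        "violations": [
            {"tool": "opa", "rule_id": "<rule identifier>", "severity": "high"}
        ]
    },
}
converted = convert_e2_record(example)
# Consumers decode the string to get the oracle violations back.
violations = json.loads(converted["answer"])["violations"]
```

Keeping the answer as a string means anything that scores against the oracle needs to decode it with `json.loads` first.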
Dataset Locations
| Environment | Data Directory |
|---|---|
| E1 network-logs | environments/sv-env-network-logs/data/ |
| E2 config-verification | environments/sv-env-config-verification/data/ |
Loading Datasets
Environments support multi-tier loading:
- Local: JSONL files in the data/ directory
- Hub: HuggingFace (requires HF_TOKEN)
- Synthetic: Built-in test fixtures (fallback)
import verifiers as vf
# Auto mode (tries local → hub → synthetic)
env = vf.load_environment("sv-env-network-logs")
# Explicit source
env = vf.load_environment("sv-env-network-logs", dataset_source="local")
env = vf.load_environment("sv-env-network-logs", dataset_source="hub")
env = vf.load_environment("sv-env-network-logs", dataset_source="synthetic")
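The auto-mode ordering can be pictured with a small decision function. This is a sketch of the local → hub → synthetic order only, not the loader's actual code. `pick_dataset_source` and its parameters are hypothetical, and the real loader may also fall through when a Hub download fails.

```python
from pathlib import Path
from typing import Optional


def pick_dataset_source(data_dir: str, hf_token: Optional[str]) -> str:
    """Choose a source in the order local -> hub -> synthetic (illustrative)."""
    directory = Path(data_dir)
    # Local wins when the data directory holds at least one JSONL file.
    if directory.is_dir() and any(directory.glob("*.jsonl")):
        return "local"
    # Hub access needs a token because the datasets are gated.
    if hf_token:
        return "hub"
    # Otherwise fall back to the built-in fixtures.
    return "synthetic"
```

Passing `dataset_source` explicitly, as in the calls above, skips this kind of decision entirely, which helps when debugging which tier is actually being used.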
Troubleshooting
HF_TOKEN required: Set in .env for gated dataset access.
Missing sources: Run make clone-e2-sources before E2 data builds.
Validation fails: Check schema matches expected Pydantic models in scripts/data/validate_splits_*.py.