name: sv-hf
description: Manage HuggingFace datasets for Security Verifiers. Use when asked to push datasets to HuggingFace, manage metadata, configure gated access, or set up user HF repositories for E1/E2 datasets.
metadata:
  author: security-verifiers
  version: "1.0"
Security Verifiers HuggingFace Management
Push, validate, and manage datasets on HuggingFace Hub for E1 (network-logs) and E2 (config-verification) environments.
Repository Structure
| Repo Type | E1 Repo | E2 Repo | Access |
|---|---|---|---|
| Public metadata | {org}/security-verifiers-e1-metadata | {org}/security-verifiers-e2-metadata | Public |
| Private canonical | {org}/security-verifiers-e1 | {org}/security-verifiers-e2 | Gated |
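The naming pattern in the table can be expressed as a small helper. This is a minimal sketch: the function name `repo_ids` is hypothetical and not part of the project's scripts; only the repo name patterns come from the table above.

```python
def repo_ids(org: str, env: str) -> dict[str, str]:
    """Return the public-metadata and gated canonical repo IDs for one environment."""
    if env not in ("e1", "e2"):
        raise ValueError(f"unknown environment: {env!r}")
    # Canonical repo: {org}/security-verifiers-{env}
    canonical = f"{org}/security-verifiers-{env}"
    # Metadata repo: the canonical name with a "-metadata" suffix
    return {"metadata": f"{canonical}-metadata", "canonical": canonical}


print(repo_ids("your-org", "e1"))
```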
Prerequisites
Set environment variables in .env:
HF_TOKEN=hf_your_token_here
E1_HF_REPO=your-org/security-verifiers-e1
E2_HF_REPO=your-org/security-verifiers-e2
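Before pushing, it can help to confirm that these variables are actually visible to the process. The sketch below is one way to check. It assumes the values from .env have already been exported into the environment (for example by the Makefile or a dotenv loader); the helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names taken from the .env example above
REQUIRED_VARS = ("HF_TOKEN", "E1_HF_REPO", "E2_HF_REPO")


def missing_env_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required variables are set.")
```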
Quick Reference
# Build metadata locally
make hf-e1-meta
make hf-e2-meta
# Push to PUBLIC repos (metadata only)
make hf-e1-push HF_ORG=your-org
make hf-e2-push HF_ORG=your-org
# Push to PRIVATE repos (canonical splits with Features)
make hf-e1p-push-canonical HF_ORG=your-org
make hf-e2p-push-canonical HF_ORG=your-org
# Validate before push
make validate-data
# Push all metadata
make hf-push-all HF_ORG=your-org
Metadata Push (Public Repos)
Metadata repos provide Dataset Viewer compatibility without exposing sensitive data.
Build Metadata Locally
make hf-e1-meta # → build/hf/e1/meta.jsonl
make hf-e2-meta # → build/hf/e2/meta.jsonl
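A quick sanity check on the generated file can catch a malformed export before it is pushed. The sketch below assumes meta.jsonl holds one JSON object per line, as the .jsonl extension suggests; the helper `count_jsonl_rows` is hypothetical and does not replace the project's own validation targets.

```python
import json
from pathlib import Path


def count_jsonl_rows(path):
    """Count rows in a JSONL file, raising ValueError on the first malformed line."""
    rows = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            rows += 1
    return rows
```

For example, `count_jsonl_rows("build/hf/e1/meta.jsonl")` returns the number of metadata rows produced by `make hf-e1-meta`.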
Push to Public Repos
# Default org: intertwine-ai
make hf-e1-push
make hf-e2-push
# Custom org
make hf-e1-push HF_ORG=your-org
make hf-e2-push HF_ORG=your-org
Canonical Push (Private Repos)
Canonical repos contain full datasets with explicit HuggingFace Features schema.
Validate First
make validate-e1-data
make validate-e2-data
# or
make validate-data # both
Push Canonical Splits
# E1 canonical (train/dev/test splits)
make hf-e1p-push-canonical HF_ORG=your-org
# E2 canonical
make hf-e2p-push-canonical HF_ORG=your-org
Warning: Canonical push uses --force, which deletes and recreates the repo. Use it only when a schema change requires it.
Dry Run
make hf-e1p-push-canonical-dry HF_ORG=your-org
make hf-e2p-push-canonical-dry HF_ORG=your-org
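Because the real push is destructive, one option is to script the dry run as a gate in front of it. The sketch below builds the make invocations from the target names above. The helper names are hypothetical, and it assumes the dry-run target exits non-zero when something is wrong, which the source does not confirm.

```python
import subprocess


def canonical_push_command(env, org, dry_run=True):
    """Build the make command for a canonical push (dry run by default)."""
    if env not in ("e1", "e2"):
        raise ValueError(f"unknown environment: {env!r}")
    target = f"hf-{env}p-push-canonical"
    if dry_run:
        target += "-dry"
    return ["make", target, f"HF_ORG={org}"]


def dry_run_then_push(env, org):
    """Run the dry-run target first; push only if it exits successfully."""
    # check=True raises CalledProcessError on a non-zero exit,
    # so the real push below is never reached after a failed dry run.
    subprocess.run(canonical_push_command(env, org, dry_run=True), check=True)
    subprocess.run(canonical_push_command(env, org, dry_run=False), check=True)
```

Calling `dry_run_then_push("e1", "your-org")` would run `make hf-e1p-push-canonical-dry HF_ORG=your-org` followed by the real push.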
User Dataset Setup
For users deploying their own Security Verifiers instances:
1. Build Datasets Locally
make data-e1 data-e1-ood
make clone-e2-sources && make data-e2-local
2. Configure HF Repos
export HF_TOKEN=hf_your_token
export E1_HF_REPO=your-org/security-verifiers-e1-private
export E2_HF_REPO=your-org/security-verifiers-e2-private
3. Push Datasets
make hub-push-datasets
4. Test Loading
make hub-test-datasets
Gated Access
Private repos use manual gated access to prevent training contamination:
- Go to repo Settings → Access
- Enable "Gated repository"
- Set to "Manual approval"
- Users must request access and set HF_TOKEN
Template READMEs for gated repos are in scripts/hf/templates/.
Dataset Loading in Code
import os
from datasets import load_dataset
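# Read the token from the environment (set HF_TOKEN in .env or your shell);
# avoid hardcoding it in source files
token = os.environ["HF_TOKEN"]
# Load the train split from the private (gated) repo
dataset = load_dataset(
    "your-org/security-verifiers-e1",
    split="train",
    token=token,
)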
Environment Loading Modes
Environments automatically handle dataset loading:
import verifiers as vf
# Auto: tries local → hub → synthetic
env = vf.load_environment("sv-env-network-logs")
# Explicit hub loading
env = vf.load_environment("sv-env-network-logs", dataset_source="hub")
# Synthetic fallback (for testing)
env = vf.load_environment("sv-env-network-logs", dataset_source="synthetic")
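The local → hub → synthetic order in auto mode is a fallback chain. The sketch below illustrates the general pattern only. It is not the environments' actual implementation: `load_with_fallback` and the three stand-in loaders are hypothetical.

```python
def load_with_fallback(loaders):
    """Try each (name, loader) pair in order; return the first that succeeds."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # record the failure and try the next source
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all dataset sources failed: " + "; ".join(errors))


# Hypothetical stand-ins for the three sources
def load_local():
    raise FileNotFoundError("no local dataset built")


def load_hub():
    raise PermissionError("HF_TOKEN missing or access not granted")


def load_synthetic():
    return [{"text": "synthetic example", "label": 0}]


source, rows = load_with_fallback(
    [("local", load_local), ("hub", load_hub), ("synthetic", load_synthetic)]
)
print(source, len(rows))  # prints "synthetic 1"
```

Passing dataset_source explicitly, as in the calls above, corresponds to using a single source instead of walking the chain.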
Troubleshooting
401 Unauthorized: Check that HF_TOKEN is set and that the token has write access.
Gated access denied: Request access on the repo's HuggingFace page, then set HF_TOKEN.
Schema mismatch: Run make validate-data before pushing.
Force push warning: Canonical push deletes and recreates the repo; use it only for schema updates.
File Locations
| Purpose | Location |
|---|---|
| HF push scripts | scripts/hf/ |
| Metadata export | scripts/hf/export_metadata_flat.py |
| Canonical push | scripts/hf/push_canonical_with_features.py |
| Validation scripts | scripts/data/validate_splits_e1.py, validate_splits_e2.py |
| Gated README templates | scripts/hf/templates/ |