name: seed-generator
description: Generate labeled training examples for the BERT canonicalization classifier. Use to collect seed data from public APIs (OpenAPI specs), LLM tool-use datasets (ToolBench, API-Bank), or create synthetic variations. Outputs stratified JSONL with action, resource_type, and sensitivity labels.
compatibility: Requires Python 3.10+, internet access for external datasets
metadata:
  author: guard-team
  version: "1.0"
Seed Generator Skill
Generate labeled training examples for the canonicalization classifier. This skill enables systematic collection of seed data from diverse sources to build a robust, unbiased training dataset for BERT-based vocabulary canonicalization.
When to Use This Skill
Use this skill when you need to:
- Build initial seed dataset for the BERT canonicalization classifier (~50K-100K labeled examples)
- Collect examples from a specific source (OpenAPI specs, ToolBench, API-Bank, synthetic generation)
- Maintain stratified distribution across canonical labels (action, resource_type, sensitivity)
- Generate synthetic variations to handle uncommon synonyms and context variations
- Validate and curate examples before adding to the training set
Overview: The 5 Data Sources
The skill leverages five complementary sources to build an unbiased, comprehensive seed dataset:
| Source | Weight | Count | Best For |
|---|---|---|---|
| OpenAPI Specs | 30% | ~15K examples | Real-world API patterns, diverse vocabularies |
| ToolBench | 20% | ~10K examples | LLM tool-use instructions, agent patterns |
| API-Bank | 20% | ~10K examples | API calling patterns in dialogue context |
| Synthetic Variations | 20% | ~10K examples | Synonyms, context variations, edge cases |
| Manual Curation | 10% | ~5K examples | Domain expertise, corner cases, ambiguities |
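The weights in the table translate directly into per-source quotas. The following is a minimal sketch of that arithmetic; `source_quotas` is a hypothetical helper, not one of the skill's scripts, and the source keys reuse the `source` values from the output schema.

```python
# Weights copied from the table above; keys reuse the `source` values
# from the output schema. The helper name is hypothetical.
SOURCE_WEIGHTS = {
    "openapi-spec": 0.30,
    "toolbench": 0.20,
    "api-bank": 0.20,
    "synthetic": 0.20,
    "manual": 0.10,
}


def source_quotas(total: int, weights: dict[str, float] = SOURCE_WEIGHTS) -> dict[str, int]:
    """Return how many examples to collect from each source."""
    quotas = {name: round(total * weight) for name, weight in weights.items()}
    # Assign any rounding remainder to the largest source so quotas sum to total.
    largest = max(weights, key=weights.get)
    quotas[largest] += total - sum(quotas.values())
    return quotas


for name, count in source_quotas(50_000).items():
    print(f"{name:>12}: {count}")
```

Calling `source_quotas(100_000)` gives the quotas for the upper end of the target range.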
Workflow: Step-by-Step Process
Agent Workflow Options
When using this skill, you have two approaches for generating labeled examples:
Option A: Script-Assisted Generation (Recommended for Agents)
Use fetch_openapi.py to extract raw examples, then apply labels in a second pass.
Steps:
- Run: python scripts/fetch_openapi.py <spec_url> --output examples_<datetime>.jsonl
- The script outputs examples with labels: {action: null, resource_type: null, sensitivity: null}
- Run: python scripts/label_inplace.py examples_<datetime>.jsonl
- Review low-confidence labels and adjust using VOCABULARY.md
- Update any remaining labels and keep the file as your final output
Note: This approach requires two passes, but standardizes spec fetching and parsing.
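After the second pass it can help to list the entries that still lack an action label, since the schema allows null only for resource_type and sensitivity. This is a sketch under that assumption; `needs_labels` is a hypothetical helper, and the sample lines are made up for illustration.

```python
import json


def needs_labels(lines):
    """Return the ids of entries whose action label is still null."""
    pending = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        labels = entry.get("labels") or {}
        if labels.get("action") is None:
            pending.append(entry.get("id"))
    return pending


# Illustrative records only; real ids come from the generated file.
sample = [
    '{"id": "ex-1", "labels": {"action": null, "resource_type": null, "sensitivity": null}}',
    '{"id": "ex-2", "labels": {"action": "read", "resource_type": "api", "sensitivity": null}}',
]
print(needs_labels(sample))
```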
Option B: Direct Generation
Generate JSONL examples directly without using the helper scripts. Use this when you cannot run the scripts or need tighter control over extraction.
Steps:
- Fetch the OpenAPI spec using web fetch tools
- Parse the JSON/YAML to extract operations
- For each operation:
  - Generate raw_text from the summary/description
  - Apply labeling rules from VOCABULARY.md to determine action, resource_type, sensitivity
  - Create the complete JSONL entry with all fields populated
- Output valid JSONL
Example - Complete workflow for one operation:
// Input: GitHub API operation
{
"path": "/repos/{owner}/{repo}/issues",
"method": "POST",
"summary": "Create an issue",
"description": "Creates a new issue in the specified repository"
}
// Step 1: Generate raw_text
"create an issue in the specified repository"
// Step 2: Apply labeling rules (from VOCABULARY.md)
// - "create" keyword → action: "write"
// - External API endpoint → resource_type: "api"
// - "issue" is project data, not PII → sensitivity: "internal"
// Step 3: Output complete JSONL entry
{"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "raw_text": "create an issue in the specified repository", "context": {"tool_name": "github-api", "tool_method": "POST /repos/{owner}/{repo}/issues", "resource_location": null}, "labels": {"action": "write", "resource_type": "api", "sensitivity": "internal"}, "source": "openapi-spec", "source_detail": "github-rest-api-v2024", "reviewed": false}
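The final step of the workflow above, assembling the entry and serializing it as one line, could look like the sketch below. `build_entry` is a hypothetical helper; the field names and values are taken from the example entry, and the id is generated with the standard `uuid` module.

```python
import json
import uuid


def build_entry(raw_text, tool_name, tool_method, labels, source, source_detail):
    """Assemble one seed example using the field names shown in this document."""
    return {
        "id": str(uuid.uuid4()),  # random UUID v4, unique per example
        "raw_text": raw_text,
        "context": {
            "tool_name": tool_name,
            "tool_method": tool_method,
            "resource_location": None,
        },
        "labels": labels,
        "source": source,
        "source_detail": source_detail,
        "reviewed": False,
    }


entry = build_entry(
    raw_text="create an issue in the specified repository",
    tool_name="github-api",
    tool_method="POST /repos/{owner}/{repo}/issues",
    labels={"action": "write", "resource_type": "api", "sensitivity": "internal"},
    source="openapi-spec",
    source_detail="github-rest-api-v2024",
)
# json.dumps emits a single line by default, which is what JSONL requires.
print(json.dumps(entry))
```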
Step 1: Choose Your Data Source
Decide which source to target:
If you want: → Choose:
Real-world API operations → OpenAPI Specs (Stripe, GitHub, AWS, etc.)
LLM tool-use patterns → ToolBench dataset
API calling in dialogue → API-Bank dataset
Cover edge cases & synonyms → Synthetic generation
High-confidence baseline → Manual curation
For a complete seed dataset, you'll cycle through all sources.
Step 2: Extract Raw Text + Context
Extract the relevant information from your chosen source.
From OpenAPI specs:
operation_verb: "POST"
operation_path: "/repositories/{id}/issues"
description: "Create a new issue in the repository"
Raw text to label: "create a new issue in the repository"
Context: {
"tool_name": "github-api",
"tool_method": "POST /repositories/{id}/issues"
}
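For OpenAPI sources, the extraction above amounts to walking the spec's paths object. The sketch below assumes the spec has already been parsed into a Python dict; `extract_operations` is a hypothetical helper and does not describe how fetch_openapi.py is implemented.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def extract_operations(spec: dict, tool_name: str) -> list[dict]:
    """Collect raw_text/context pairs from a parsed OpenAPI document."""
    examples = []
    for path, path_item in spec.get("paths", {}).items():
        for method, operation in path_item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            text = operation.get("summary") or operation.get("description") or ""
            if not text.strip():
                continue  # nothing to label
            examples.append({
                "raw_text": text.strip().lower(),
                "context": {
                    "tool_name": tool_name,
                    "tool_method": f"{method.upper()} {path}",
                    "resource_location": None,
                },
            })
    return examples


# Tiny in-memory spec mirroring the example above.
sample_spec = {
    "paths": {
        "/repositories/{id}/issues": {
            "parameters": [],
            "post": {"description": "Create a new issue in the repository"},
        }
    }
}
operations = extract_operations(sample_spec, "github-api")
print(operations[0]["raw_text"])
```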
From ToolBench:
instruction: "retrieve all users from the customer database"
available_functions: [...]
Raw text to label: "retrieve all users from the customer database"
Context: {
"tool_name": (inferred from available functions)
}
From API-Bank:
user_utterance: "show me all active accounts"
ai_response: "[GetAccounts(status='active')]"
Raw text to label: "show me all active accounts" + "active accounts"
Context: {
"tool_name": "GetAccounts"
}
From Synthetic Generation: Generate variations of canonical examples using templates:
Base example: "read user data from the database"
Variations:
- "fetch user data from postgres"
- "query the users table"
- "retrieve user records"
- "select all users"
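One simple way to produce such variations is to combine interchangeable verbs, objects, and locations. The sketch below is illustrative only: the word lists and the `synthetic_variations` helper are assumptions, and it presumes every generated phrase keeps the labels of the base example, which should be spot-checked against VOCABULARY.md.

```python
from itertools import product

# Hypothetical word lists; all verbs are treated as synonyms of "read".
READ_VERBS = ["read", "fetch", "query", "retrieve", "select"]
OBJECTS = ["user data", "user records"]
LOCATIONS = ["from the database", "from postgres"]


def synthetic_variations(base, verbs, objects, locations):
    """Return unique verb/object/location combinations, excluding the base text."""
    seen = {base}
    variations = []
    for verb, obj, location in product(verbs, objects, locations):
        text = f"{verb} {obj} {location}"
        if text not in seen:
            seen.add(text)
            variations.append(text)
    return variations


variations = synthetic_variations(
    "read user data from the database", READ_VERBS, OBJECTS, LOCATIONS
)
print(len(variations), variations[:3])
```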
Step 3: Apply Labeling Rules
Use the detailed labeling rules in VOCABULARY.md to assign canonical labels.
Three fields to label:
action - What operation is being performed?
- read: Retrieve/access data without modification
- write: Create new data
- update: Modify existing data
- delete: Remove data
- execute: Run functions/processes
- export: Extract data to external destination
resource_type - What kind of resource is being accessed?
- database: SQL, NoSQL, structured data stores
- storage: Files, blobs, object storage (S3, GCS, etc.)
- api: External service endpoints
- queue: Message queues (SQS, Kafka, etc.)
- cache: Caching systems (Redis, Memcached)
- null: Unknown or context-dependent
sensitivity - How sensitive is the data likely to be?
- public: Publicly accessible data
- internal: Organization-only data
- secret: Highly sensitive data (PII, credentials, etc.)
- null: Cannot be determined from context
Decision Tree Examples:
Q: "query the users table"
├─ Action: "query" → "read" (reading data)
├─ Resource: "users table" + "table" keyword → "database"
└─ Sensitivity: "users" (personal data) → "internal"
Q: "upsert records in mongodb"
├─ Action: "upsert" (create or update) → "write" (treat as write)
├─ Resource: "mongodb" → "database"
└─ Sensitivity: "records" (unknown type) → "null", or "secret" if PII-like
Q: "invoke payment webhook"
├─ Action: "invoke" (trigger execution) → "execute"
├─ Resource: "webhook" (external) → "api"
└─ Sensitivity: "payment" (sensitive) → "secret"
Q: "list files in s3 bucket"
├─ Action: "list" → "read"
├─ Resource: "s3 bucket" → "storage"
└─ Sensitivity: depends on bucket content → "null" or infer from bucket name
See VOCABULARY.md for complete labeling rules and ambiguous case handling.
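The decision trees above can be approximated with keyword lookups. The sketch below is a rough illustration, not the rule set from VOCABULARY.md or the logic of label_inplace.py: the keyword tables contain only words that appear in the examples above, and `first_match` and `label` are hypothetical names.

```python
# Keyword tables built only from the decision-tree examples above.
# VOCABULARY.md remains the authoritative source for real rules.
ACTION_KEYWORDS = {
    "read": ["query", "list"],
    "write": ["upsert"],
    "execute": ["invoke"],
}
RESOURCE_KEYWORDS = {
    "database": ["table", "mongodb"],
    "storage": ["s3", "bucket"],
    "api": ["webhook"],
}
SENSITIVITY_KEYWORDS = {
    "secret": ["payment"],
    "internal": ["users"],
}


def first_match(text, table):
    """Return the first label whose keyword appears as a whole word, else None."""
    words = text.lower().split()
    for label, keywords in table.items():
        if any(keyword in words for keyword in keywords):
            return label
    return None


def label(text):
    """Apply the three keyword tables; None means 'could not determine'."""
    return {
        "action": first_match(text, ACTION_KEYWORDS),
        "resource_type": first_match(text, RESOURCE_KEYWORDS),
        "sensitivity": first_match(text, SENSITIVITY_KEYWORDS),
    }


print(label("invoke payment webhook"))
```

A None result is a signal to consult VOCABULARY.md or flag the example for review rather than a final label.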
Step 4: Generate JSONL Output
Format each labeled example as JSON and append to JSONL file (one JSON object per line).
Required schema:
{
"id": "unique-uuid-v4",
"raw_text": "the raw text to classify",
"context": {
"tool_name": "string or null",
"tool_method": "string or null",
"resource_location": "string or null"
},
"labels": {
"action": "read|write|update|delete|execute|export",
"resource_type": "database|storage|api|queue|cache|null",
"sensitivity": "public|internal|secret|null"
},
"source": "openapi-spec|toolbench|api-bank|synthetic|manual",
"source_detail": "stripe-api-v2024 or toolbench-2024-01 etc.",
"reviewed": false
}
Example valid entries:
{"id": "seed-001", "raw_text": "fetch all users from postgres", "context": {"tool_name": "database_query", "tool_method": "query", "resource_location": null}, "labels": {"action": "read", "resource_type": "database", "sensitivity": "internal"}, "source": "openapi-spec", "source_detail": "postgres-rest-api", "reviewed": false}
{"id": "seed-002", "raw_text": "create new payment transaction", "context": {"tool_name": "stripe-api", "tool_method": "POST /charges", "resource_location": null}, "labels": {"action": "write", "resource_type": "api", "sensitivity": "secret"}, "source": "openapi-spec", "source_detail": "stripe-api-v2024", "reviewed": false}
{"id": "seed-003", "raw_text": "list all active subscriptions", "context": {"tool_name": null, "tool_method": null, "resource_location": null}, "labels": {"action": "read", "resource_type": null, "sensitivity": null}, "source": "toolbench", "source_detail": "toolbench-2024-01", "reviewed": false}
See OUTPUT_FORMAT.md for complete schema validation rules.
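A quick self-check against the schema above can be written in a few lines. This is a sketch based only on the fields and label values listed in this section; `validate_line` is a hypothetical helper, and OUTPUT_FORMAT.md and validate_examples.py may enforce additional rules.

```python
import json

# Allowed values copied from the schema above; None stands for JSON null.
ACTIONS = ("read", "write", "update", "delete", "execute", "export")
RESOURCE_TYPES = ("database", "storage", "api", "queue", "cache", None)
SENSITIVITIES = ("public", "internal", "secret", None)
SOURCES = ("openapi-spec", "toolbench", "api-bank", "synthetic", "manual")
REQUIRED_FIELDS = {"id", "raw_text", "context", "labels", "source", "source_detail", "reviewed"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(entry, dict):
        return ["line is not a JSON object"]
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        return [f"missing fields: {sorted(missing)}"]

    errors = []
    labels = entry["labels"] if isinstance(entry["labels"], dict) else {}
    if labels.get("action") not in ACTIONS:
        errors.append(f"non-canonical action: {labels.get('action')!r}")
    if labels.get("resource_type") not in RESOURCE_TYPES:
        errors.append(f"non-canonical resource_type: {labels.get('resource_type')!r}")
    if labels.get("sensitivity") not in SENSITIVITIES:
        errors.append(f"non-canonical sensitivity: {labels.get('sensitivity')!r}")
    if entry["source"] not in SOURCES:
        errors.append(f"unknown source: {entry['source']!r}")
    if not isinstance(entry["raw_text"], str) or not entry["raw_text"].strip():
        errors.append("raw_text is empty")
    if not isinstance(entry["reviewed"], bool):
        errors.append("reviewed must be true or false")
    return errors


good = json.dumps({
    "id": "seed-003", "raw_text": "list all active subscriptions",
    "context": {"tool_name": None, "tool_method": None, "resource_location": None},
    "labels": {"action": "read", "resource_type": None, "sensitivity": None},
    "source": "toolbench", "source_detail": "toolbench-2024-01", "reviewed": False,
})
print(validate_line(good))
```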
Step 5: Validate & Stratify
Use the provided Python scripts to validate and analyze your generated dataset:
Validate examples:
python scripts/validate_examples.py data/seed/my_examples.jsonl
This checks:
- ✓ Valid JSON format (one object per line)
- ✓ All required fields present
- ✓ Label values are canonical
- ✓ No duplicate IDs
- ✓ No empty raw_text
Check category distribution:
python scripts/category_stats.py data/seed/my_examples.jsonl
Output shows distribution across all categories. Target: roughly equal examples per canonical label (~17% per action across the six actions, ~20% per resource_type, ~33% per sensitivity).
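The distribution check itself is a counting exercise. The sketch below shows one way to compute label shares and flag thin categories; `label_distribution`, `underrepresented`, and the 0.5 tolerance are assumptions for illustration and do not describe category_stats.py.

```python
from collections import Counter

ACTIONS = ["read", "write", "update", "delete", "execute", "export"]


def label_distribution(entries, field):
    """Return the share of each label value for one field of `labels`."""
    counts = Counter(entry["labels"].get(field) for entry in entries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def underrepresented(entries, field, expected_labels, tolerance=0.5):
    """List labels whose share is below tolerance times the uniform share."""
    shares = label_distribution(entries, field)
    uniform = 1 / len(expected_labels)
    return sorted(
        value for value in expected_labels
        if shares.get(value, 0.0) < tolerance * uniform
    )


# Illustrative batch: three reads and one write.
batch = [{"labels": {"action": "read"}}] * 3 + [{"labels": {"action": "write"}}]
print(label_distribution(batch, "action"))
print(underrepresented(batch, "action", ACTIONS))
```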
Step 6: Human Review & Marking
After validation, review flagged examples:
- Ambiguous labels: Examples with multiple valid interpretations
- Edge cases: Examples at category boundaries
- Low confidence: Examples where the label is uncertain
Mark reviewed examples by updating the reviewed field to true:
{"id": "seed-001", ..., "reviewed": true}
Reviewed examples become part of the high-confidence baseline for model training.
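Marking a batch as reviewed can be done by rewriting the affected lines. A minimal sketch, assuming the reviewer keeps a set of ids they have checked; `mark_reviewed` is a hypothetical helper.

```python
import json


def mark_reviewed(lines, reviewed_ids):
    """Return JSONL lines with reviewed set to true for the given ids."""
    updated = []
    for line in lines:
        if not line.strip():
            continue  # drop blank lines
        entry = json.loads(line)
        if entry["id"] in reviewed_ids:
            entry["reviewed"] = True
        updated.append(json.dumps(entry))
    return updated


# Illustrative records only.
pending = [
    '{"id": "seed-001", "reviewed": false}',
    '{"id": "seed-002", "reviewed": false}',
]
result = mark_reviewed(pending, {"seed-001"})
print(result)
```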
Detailed Labeling Rules
See VOCABULARY.md for:
- Complete canonical vocabulary
- Explicit edge case rules (upsert, query, backup, etc.)
- Resource type inference from tool names
- Sensitivity inference from keywords
- Examples for each category
Data Source Guides
See DATA_SOURCES.md for:
- How to access each source (URLs, credentials)
- Parsing instructions for each format
- Example extraction walkthroughs
- Tips for handling each source efficiently
Output Format Reference
See OUTPUT_FORMAT.md for:
- Complete JSONL schema
- Validation rules
- Valid/invalid examples
- Tips for quality examples
Practical Examples
Example 1: Generate from OpenAPI Specs
Goal: Generate 200 examples from the GitHub API
Steps:
1. Access GitHub OpenAPI spec (see DATA_SOURCES.md)
2. Extract operation verbs and descriptions:
- GET /repos/{owner}/{repo}/issues → "retrieve repository issues"
- POST /repos/{owner}/{repo}/issues → "create a new issue"
- PATCH /repos/{owner}/{repo}/issues/{issue_number} → "update an issue"
- DELETE /repos/{owner}/{repo}/issues/{issue_number} → "delete an issue"
3. Apply labeling rules:
- GET → action: "read"
- POST → action: "write"
- PATCH → action: "update"
- DELETE → action: "delete"
- /repos → resource_type: "api"
4. Generate JSONL with 200 entries (stratified)
5. Validate with: python scripts/validate_examples.py
6. Check distribution with: python scripts/category_stats.py
7. Output: data/seed/github_api_200.jsonl
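The method-to-action rules in step 3 can be expressed as a lookup table. This sketch covers only the four mappings listed above and treats the result as a first guess: the operation summary should still be checked against VOCABULARY.md, since a method alone does not always describe the operation. `action_prior` is a hypothetical name.

```python
# Only the four mappings listed in step 3; other methods return None.
METHOD_TO_ACTION = {
    "GET": "read",
    "POST": "write",
    "PATCH": "update",
    "DELETE": "delete",
}


def action_prior(tool_method: str):
    """Guess an action from a 'METHOD /path' string, or None if unmapped."""
    method = tool_method.split(" ", 1)[0].upper()
    return METHOD_TO_ACTION.get(method)


print(action_prior("PATCH /repos/{owner}/{repo}/issues/{issue_number}"))
```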
Example 2: Generate Synthetic Variations
Goal: Create 100 synthetic variations to cover edge cases
Base examples (manually curated):
- "read data from database"
- "write data to file storage"
- "delete old records"
Variations for "read":
- "query the database"
- "fetch data from postgres"
- "retrieve user records"
- "select all items"
- "search the index"
- "lookup customer info"
Generate ~15 variations per base example
With ~6-7 carefully selected bases → ~100 synthetic variations
Example 3: Combine Sources for Balanced Dataset
Target: 50K total examples with balanced distribution
Plan:
- OpenAPI specs: 15K (30%)
- 3K per source: Stripe, GitHub, AWS, Google Cloud, Twilio
- ToolBench: 10K (20%)
- API-Bank: 10K (20%)
- Synthetic: 10K (20%)
- Manual curation: 5K (10%)
Process:
1. Generate from each source separately
2. Use category_stats.py after each batch to track distribution
3. Adjust subsequent batches to balance underrepresented categories
4. Combine all outputs: cat data/seed/*.jsonl > data/seed/combined_50k.jsonl
5. Final validation: python scripts/validate_examples.py data/seed/combined_50k.jsonl
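When combining batches, duplicate ids can slip in if the same source was processed twice. The sketch below merges files while keeping the first occurrence of each id; `dedupe_by_id` and `merge_jsonl` are hypothetical helpers, and silently dropping repeats is a design choice (raising an error instead would be equally reasonable).

```python
import json
from pathlib import Path


def dedupe_by_id(lines):
    """Yield JSONL lines, skipping blanks and any entry whose id was already seen."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry_id = json.loads(line)["id"]
        if entry_id in seen:
            continue
        seen.add(entry_id)
        yield line


def merge_jsonl(input_paths, output_path):
    """Concatenate several JSONL files into output_path, dropping repeated ids."""
    output_path = Path(output_path)

    def all_lines():
        for path in map(Path, input_paths):
            if path.resolve() == output_path.resolve():
                continue  # never feed the combined file back into itself
            with path.open(encoding="utf-8") as handle:
                yield from handle

    # Read every input before writing, so the output is created last.
    merged = list(dedupe_by_id(all_lines()))
    output_path.write_text("".join(line + "\n" for line in merged), encoding="utf-8")
    return len(merged)
```

A call such as `merge_jsonl(sorted(Path("data/seed").glob("*.jsonl")), "data/seed/combined_50k.jsonl")` skips the combined file if an earlier run already created it.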
Complete Worked Example: GitHub API
This example shows the full workflow from fetching a spec to outputting labeled examples.
Step 1: Fetch the OpenAPI Spec
Fetch from: https://raw.githubusercontent.com/github/rest-api-description/main/descriptions/api.github.com/api.github.com.json
Step 2: Extract 5 Operations
From the spec, extract operations like:
| Method | Path | Summary |
|---|---|---|
| GET | /repos/{owner}/{repo}/issues | List repository issues |
| POST | /repos/{owner}/{repo}/issues | Create an issue |
| PATCH | /repos/{owner}/{repo}/issues/{issue_number} | Update an issue |
| DELETE | /repos/{owner}/{repo}/issues/{issue_number}/lock | Unlock an issue |
| GET | /user | Get the authenticated user |
Step 3: Apply Labeling Rules
For each operation, apply the decision trees from VOCABULARY.md:
Example 1: GET /repos/{owner}/{repo}/issues
- Raw text: "list repository issues"
- Action: "list" → read (retrieval operation)
- Resource: GitHub API endpoint → api
- Sensitivity: "issues" are project data → internal
Example 2: POST /repos/{owner}/{repo}/issues
- Raw text: "create an issue"
- Action: "create" → write (creating new data)
- Resource: GitHub API endpoint → api
- Sensitivity: "issue" is project data → internal
Example 3: GET /user
- Raw text: "get the authenticated user"
- Action: "get" → read
- Resource: GitHub API endpoint → api
- Sensitivity: "authenticated user" contains user info → secret
Step 4: Generate JSONL Output
{"id": "gh-001", "raw_text": "list repository issues", "context": {"tool_name": "github-api", "tool_method": "GET /repos/{owner}/{repo}/issues", "resource_location": null}, "labels": {"action": "read", "resource_type": "api", "sensitivity": "internal"}, "source": "openapi-spec", "source_detail": "github-rest-api-2024", "reviewed": false}
{"id": "gh-002", "raw_text": "create an issue", "context": {"tool_name": "github-api", "tool_method": "POST /repos/{owner}/{repo}/issues", "resource_location": null}, "labels": {"action": "write", "resource_type": "api", "sensitivity": "internal"}, "source": "openapi-spec", "source_detail": "github-rest-api-2024", "reviewed": false}
{"id": "gh-003", "raw_text": "get the authenticated user", "context": {"tool_name": "github-api", "tool_method": "GET /user", "resource_location": null}, "labels": {"action": "read", "resource_type": "api", "sensitivity": "secret"}, "source": "openapi-spec", "source_detail": "github-rest-api-2024", "reviewed": false}
Step 5: Validate
Run validation to check your output:
python scripts/validate_examples.py data/seed/github_examples.jsonl
Quality Checklist
Before outputting your seed dataset, ensure:
- All examples have valid JSON format
- No duplicate IDs across the entire dataset
- raw_text is non-empty and meaningful
- Labels use only canonical values (from VOCABULARY.md)
- Distribution is roughly stratified (check with category_stats.py)
- At least 10% of examples have been manually reviewed
- No sensitive data (API keys, credentials) in raw_text
- Each example has proper source attribution
- Output is stored in the data/seed/ directory
Helpful Scripts
The skill includes four helper scripts:
validate_examples.py
python scripts/validate_examples.py <jsonl_file>
Validates each example in JSONL file. Reports:
- Schema validation errors
- Invalid label values
- Duplicate IDs
- Empty fields
- Summary statistics
category_stats.py
python scripts/category_stats.py <jsonl_file>
Analyzes category distribution. Reports:
- Count per action label
- Count per resource_type label
- Count per sensitivity label
- Percentage balance
- Warnings for underrepresented categories
fetch_openapi.py
python scripts/fetch_openapi.py <spec_url> [--output <output_file>]
Fetches and parses OpenAPI spec. Extracts:
- Operation verbs (GET, POST, PUT, DELETE, PATCH)
- Endpoint paths
- Operation descriptions
- Parameter information
Outputs raw text examples ready for labeling.
label_inplace.py
python scripts/label_inplace.py <jsonl_file> [--dry-run] [--backup] [--overwrite]
Applies heuristic labeling rules in-place using raw_text and context fields.
Prints low-confidence warnings so you can review and adjust before validation.
See individual scripts for detailed usage.
Tips & Best Practices
- Start small: Generate 100-200 examples from one source first to get the feel for labeling rules
- Use scripts early: Run validate_examples.py frequently during generation to catch errors early
- Check distribution: Run category_stats.py after each batch to ensure stratification
- Mix sources: Don't rely on a single source; diversity prevents overfitting to one API style
- Trust the vocabulary: When in doubt, refer back to VOCABULARY.md labeling rules
- Mark reviews: Always update reviewed: true when you manually curate an example
- Batch output: Generate 100-500 examples per batch for easier review and tracking
- Document sources: Keep source and source_detail fields accurate for traceability
Next Steps
After generating your seed dataset:
- Combine batches: Merge all JSONL files into single dataset
- Final validation: Run full validation and distribution check
- Create train/val/test split: Use 80/10/10 split for training
- Train BERT classifier: Use output as training data for canonicalization model
- Production logging: Monitor model on real intents, iterate with Phase 2 learning loop
For implementation details, see the documentation in the references/ folder.
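The 80/10/10 split mentioned above can be sketched as a seeded shuffle followed by slicing. This is a plain random split under assumed defaults; `split_dataset` is a hypothetical helper, and a stratified split per label may be preferable if some categories are rare.

```python
import random


def split_dataset(entries, seed=0, percents=(80, 10, 10)):
    """Shuffle deterministically and split into train/val/test lists."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    total = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = total * percents[0] // 100
    n_val = total * percents[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```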