name: hugging-face-datasets
description: Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.
Overview
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by providing dataset editing and querying capabilities.
Integration with HF MCP Server
Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
Use This Skill for: Dataset creation, content editing, SQL queries, data transformation, and structured data formatting
Version
2.1.0
Dependencies
huggingface_hub
duckdb (for SQL queries)
datasets (for pushing query results to Hub)
json (built-in)
time (built-in)
Core Capabilities
1. Dataset Lifecycle Management
Initialize: Create new dataset repositories with proper structure
Configure: Store detailed configuration including system prompts and metadata
Stream Updates: Add rows efficiently without downloading entire datasets
2. SQL-Based Dataset Querying (NEW)
Query any Hugging Face dataset using DuckDB SQL via scripts/sql_manager.py:
Direct Queries: Run SQL on datasets using the hf:// protocol
Schema Discovery: Describe dataset structure and column types
Data Sampling: Get random samples for exploration
Aggregations: Count, histogram, unique values analysis
Transformations: Filter, join, reshape data with SQL
Export & Push: Save results locally or push to new Hub repos
3. Multi-Format Dataset Support
Supports diverse dataset types through a template system:
Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
Text Classification: Sentiment analysis, intent detection, topic classification
Question-Answering: Reading comprehension, factual QA, knowledge bases
Text Completion: Language modeling, code completion, creative writing
Tabular Data: Structured data for regression/classification tasks
Custom Formats: Flexible schema definition for specialized needs
4. Quality Assurance Features
JSON Validation: Ensures data integrity during uploads
Batch Processing: Efficient handling of large datasets
Error Recovery: Graceful handling of upload failures and conflicts
Usage Instructions
The skill includes two Python scripts:
scripts/dataset_manager.py - Dataset creation and management
scripts/sql_manager.py - SQL-based dataset querying and transformation
Prerequisites
huggingface_hub library: uv add huggingface_hub
duckdb library (for SQL): uv add duckdb
datasets library (for pushing): uv add datasets
HF_TOKEN environment variable must be set with a Write-access token
Activate virtual environment:
source .venv/bin/activate
SQL Dataset Querying (sql_manager.py)
Query, transform, and push Hugging Face datasets using DuckDB SQL. The hf:// protocol provides direct access to any public dataset, and to private datasets when a token is supplied.
Quick Start
# Query a dataset
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"
# Get dataset schema
python scripts/sql_manager.py describe --dataset "cais/mmlu"
# Sample random rows
python scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5
# Count rows with filter
python scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
SQL Query Syntax
Use data as the table name in your SQL - it gets replaced with the actual hf:// path:
-- Basic select
SELECT * FROM data LIMIT 10
-- Filtering
SELECT * FROM data WHERE subject='nutrition'
-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC
-- Column selection and transformation
-- (DuckDB lists are 1-based; MMLU's answer column is a 0-based index, hence answer + 1)
SELECT question, choices[answer + 1] AS correct_answer FROM data
-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')
-- String functions ('g' replaces every match, not just the first)
SELECT regexp_replace(question, '\n', '', 'g') AS cleaned FROM data
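The data placeholder substitution described above can be sketched roughly as follows. This is only an illustration, not the code in scripts/sql_manager.py: the helper name substitute_data_table is hypothetical, and it assumes the placeholder only needs replacing where it directly follows FROM or JOIN.

```python
import re


def substitute_data_table(sql: str, hf_path: str) -> str:
    """Replace the `data` placeholder after FROM/JOIN with a quoted hf:// path.

    Hypothetical helper for illustration; the real script may work differently.
    """
    pattern = re.compile(r"\b(FROM|JOIN)\s+data\b", flags=re.IGNORECASE)
    # A function replacement avoids any backslash interpretation in hf_path.
    return pattern.sub(lambda m: f"{m.group(1)} '{hf_path}'", sql)


path = "hf://datasets/cais/mmlu@~parquet/default/train/*.parquet"
print(substitute_data_table("SELECT * FROM data WHERE subject='nutrition' LIMIT 10", path))
```

Matching only after FROM or JOIN leaves a column that happens to be named data untouched, which a plain string replace would not.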
Common Operations
1. Explore Dataset Structure
# Get schema
python scripts/sql_manager.py describe --dataset "cais/mmlu"
# Get unique values in column
python scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"
# Get value distribution
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
2. Filter and Transform
# Complex filtering with SQL
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"
# Using transform command
python scripts/sql_manager.py transform \
--dataset "cais/mmlu" \
--select "subject, COUNT(*) as cnt" \
--group-by "subject" \
--order-by "cnt DESC" \
--limit 10
3. Create Subsets and Push to Hub
# Query and push to new dataset
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject='nutrition'" \
--push-to "username/mmlu-nutrition-subset" \
--private
# Transform and push
python scripts/sql_manager.py transform \
--dataset "ibm/duorc" \
--config "ParaphraseRC" \
--select "question, answers" \
--where "LENGTH(question) > 50" \
--push-to "username/duorc-long-questions"
4. Export to Local Files
# Export to Parquet
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject='nutrition'" \
--output "nutrition.parquet" \
--format parquet
# Export to JSONL
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data LIMIT 100" \
--output "sample.jsonl" \
--format jsonl
5. Working with Dataset Configs/Splits
# Specify config (subset)
python scripts/sql_manager.py query \
--dataset "ibm/duorc" \
--config "ParaphraseRC" \
--sql "SELECT * FROM data LIMIT 5"
# Specify split
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--split "test" \
--sql "SELECT COUNT(*) FROM data"
# Query all splits
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--split "*" \
--sql "SELECT * FROM data LIMIT 10"
6. Raw SQL with Full Paths
For complex queries or joining datasets:
python scripts/sql_manager.py raw --sql "
SELECT a.*, b.*
FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
ON a.id = b.id
LIMIT 100
"
Python API Usage
from sql_manager import HFDatasetSQL
sql = HFDatasetSQL()
# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")
# Get schema
schema = sql.describe("cais/mmlu")
# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)
# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")
# Histogram
dist = sql.histogram("cais/mmlu", "subject")
# Filter and transform
results = sql.filter_and_transform(
"cais/mmlu",
select="subject, COUNT(*) as cnt",
group_by="subject",
order_by="cnt DESC",
limit=10
)
# Push to Hub
url = sql.push_to_hub(
"cais/mmlu",
"username/nutrition-subset",
sql="SELECT * FROM data WHERE subject='nutrition'",
private=True
)
# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")
sql.close()
HF Path Format
DuckDB uses the hf:// protocol to access datasets:
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
Examples:
hf://datasets/cais/mmlu@~parquet/default/train/*.parquet
hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet
The @~parquet revision provides auto-converted Parquet files for any dataset format.
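The path format above can be assembled with a small helper like the one below. This is a sketch under assumptions: the function name build_hf_parquet_path is hypothetical, and the defaults ("default" config, "train" split) are taken from the examples above rather than from the script itself.

```python
def build_hf_parquet_path(
    dataset_id: str,
    config: str = "default",
    split: str = "train",
    revision: str = "~parquet",
) -> str:
    """Build an hf:// glob for the auto-converted Parquet files of a dataset.

    Hypothetical helper; default values are assumptions based on the examples.
    """
    return f"hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet"


print(build_hf_parquet_path("cais/mmlu"))
print(build_hf_parquet_path("ibm/duorc", config="ParaphraseRC", split="test"))
```

The resulting string can be quoted and used directly as a table in the raw command.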
Useful DuckDB SQL Functions
-- String functions
LENGTH(column) -- String length
regexp_replace(col, '\n', '', 'g') -- Regex replace ('g' replaces all matches)
regexp_matches(col, 'pattern') -- Regex match
LOWER(col), UPPER(col) -- Case conversion
-- Array functions
choices[1] -- List indexing (1-based in DuckDB)
array_length(choices) -- Array length
unnest(choices) -- Expand array to rows
-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition
-- Sampling
USING SAMPLE 10 -- Random sample
USING SAMPLE 10 (RESERVOIR, 42) -- Reproducible sample
-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
Dataset Creation (dataset_manager.py)
Recommended Workflow
1. Discovery (Use HF MCP Server):
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
2. Creation (Use This Skill):
# Initialize new dataset
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
# Configure with detailed system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
3. Content Management (Use This Skill):
# Quick setup with any template
python scripts/dataset_manager.py quick_setup \
--repo_id "your-username/dataset-name" \
--template classification
# Add data with template validation
python scripts/dataset_manager.py add_rows \
--repo_id "your-username/dataset-name" \
--template qa \
--rows_json "$(cat your_qa_data.json)"
Template-Based Data Structures
1. Chat Template (--template chat)
{
"messages": [
{"role": "user", "content": "Natural user request"},
{"role": "assistant", "content": "Response with tool usage"},
{"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
],
"scenario": "Description of use case",
"complexity": "simple|intermediate|advanced"
}
2. Classification Template (--template classification)
{
"text": "Input text to be classified",
"label": "classification_label",
"confidence": 0.95,
"metadata": {"domain": "technology", "language": "en"}
}
3. QA Template (--template qa)
{
"question": "What is the question being asked?",
"answer": "The complete answer",
"context": "Additional context if needed",
"answer_type": "factual|explanatory|opinion",
"difficulty": "easy|medium|hard"
}
4. Completion Template (--template completion)
{
"prompt": "The beginning text or context",
"completion": "The expected continuation",
"domain": "code|creative|technical|conversational",
"style": "description of writing style"
}
5. Tabular Template (--template tabular)
{
"columns": [
{"name": "feature1", "type": "numeric", "description": "First feature"},
{"name": "target", "type": "categorical", "description": "Target variable"}
],
"data": [
{"feature1": 123, "target": "class_a"},
{"feature1": 456, "target": "class_b"}
]
}
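Template validation for add_rows could look roughly like the sketch below. The set of required fields per template is an assumption inferred from the structures above, and validate_rows is a hypothetical name; the actual checks in scripts/dataset_manager.py may be stricter or looser.

```python
import json

# Assumed required fields per template; the real script may enforce a different set.
REQUIRED_FIELDS = {
    "chat": ["messages"],
    "classification": ["text", "label"],
    "qa": ["question", "answer"],
    "completion": ["prompt", "completion"],
}


def validate_rows(rows_json: str, template: str) -> list:
    """Parse a JSON array of rows and check each row for the template's required fields."""
    rows = json.loads(rows_json)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(rows, list):
        raise ValueError("rows_json must be a JSON array of objects")
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"row {i} is not a JSON object")
        missing = [field for field in REQUIRED_FIELDS[template] if field not in row]
        if missing:
            raise ValueError(f"row {i} is missing fields: {missing}")
    return rows


rows = validate_rows('[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]', "qa")
print(len(rows))
```

Running a check like this locally before calling add_rows catches malformed rows without a round trip to the Hub.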
Advanced System Prompt Template
For high-quality training data generation:
You are an AI assistant expert at using MCP tools effectively.
## MCP SERVER DEFINITIONS
[Define available servers and tools]
## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]
## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]
## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
Example Categories & Templates
The skill includes diverse training examples beyond just MCP usage:
Available Example Sets:
training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
diverse_training_examples.json - Broader scenarios including:
Educational Chat - Explaining programming concepts, tutorials
Git Workflows - Feature branches, version control guidance
Code Analysis - Performance optimization, architecture review
Content Generation - Professional writing, creative brainstorming
Codebase Navigation - Legacy code exploration, systematic analysis
Conversational Support - Problem-solving, technical discussions
Using Different Example Sets:
# Add MCP-focused examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(cat examples/training_examples.json)"
# Add diverse conversational examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(cat examples/diverse_training_examples.json)"
# Mix both for comprehensive training data
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
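Where jq is not available, the same concatenation can be done in Python. The sketch below assumes, as the jq expression does, that each file holds a top-level JSON array; merge_example_sets is a hypothetical helper name.

```python
import json
from pathlib import Path


def merge_example_sets(*paths: str) -> list:
    """Concatenate several JSON array files into one list, preserving order."""
    merged = []
    for path in paths:
        rows = json.loads(Path(path).read_text(encoding="utf-8"))
        if not isinstance(rows, list):
            raise ValueError(f"{path} does not contain a JSON array")
        merged.extend(rows)
    return merged
```

Calling json.dumps on the result of merge_example_sets("examples/training_examples.json", "examples/diverse_training_examples.json") yields a string suitable for --rows_json.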
Commands Reference
List Available Templates:
python scripts/dataset_manager.py list_templates
Quick Setup (Recommended):
python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
Manual Setup:
# Initialize repository
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
# Configure with system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"
# Add data with validation
python scripts/dataset_manager.py add_rows \
--repo_id "your-username/dataset-name" \
--template qa \
--rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
View Dataset Statistics:
python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
Error Handling
Repository exists: Script will notify and continue with configuration
Invalid JSON: Clear error message with parsing details
Network issues: Automatic retry for transient failures
Token permissions: Validation before operations begin
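The retry behaviour for transient network failures might be implemented along these lines. This is a minimal sketch: the source only states that retries happen, so the exponential backoff, the attempt count, the exception types, and the name with_retries are all assumptions.

```python
import time


def with_retries(operation, max_attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on transient errors.

    Hypothetical helper; the real script's retry policy is not documented.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay


calls = {"count": 0}


def flaky_upload():
    """Simulated upload that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "uploaded"


print(with_retries(flaky_upload, base_delay=0.01))
```

Non-transient errors such as invalid JSON or missing token permissions are deliberately not in retry_on, since repeating those calls would not help.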
Combined Workflow Examples
Example 1: Create Training Subset from Existing Dataset
# 1. Explore the source dataset
python scripts/sql_manager.py describe --dataset "cais/mmlu"
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"
# 2. Query and create subset
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
--push-to "username/mmlu-medical-subset" \
--private
Example 2: Transform and Reshape Data
# Transform MMLU to QA format with correct answers extracted
# (answer is a 0-based index and DuckDB lists are 1-based, hence answer + 1)
python scripts/sql_manager.py query \
--dataset "cais/mmlu" \
--sql "SELECT question, choices[answer + 1] AS correct_answer, subject FROM data" \
--push-to "username/mmlu-qa-format"
Example 3: Merge Multiple Dataset Splits
# Export multiple splits and combine
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--split "*" \
--output "mmlu_all.parquet"
Example 4: Quality Filtering
# Filter for high-quality examples
python scripts/sql_manager.py query \
--dataset "rajpurkar/squad" \
--sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
--push-to "username/squad-filtered"
Example 5: Create Custom Training Dataset
# 1. Query source data
python scripts/sql_manager.py export \
--dataset "cais/mmlu" \
--sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
--output "nutrition_source.jsonl" \
--format jsonl
# 2. Process with your pipeline (add answers, format, etc.)
# 3. Push processed data
python scripts/dataset_manager.py init --repo_id "username/nutrition-training"
python scripts/dataset_manager.py add_rows \
--repo_id "username/nutrition-training" \
--template qa \
--rows_json "$(cat processed_data.json)"
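Step 2 in Example 5 is left to your own pipeline. As one possible shape for it, the sketch below turns exported JSONL lines into rows matching the QA template. Everything here is illustrative: jsonl_to_qa_rows and answer_for are hypothetical names, the sample record is made up, and putting the subject into the context field is just one choice.

```python
import json


def jsonl_to_qa_rows(lines, answer_for):
    """Turn exported JSONL lines into QA-template rows.

    `answer_for` is a caller-supplied function mapping a question to its answer;
    it stands in for whatever pipeline actually produces the answers.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append({
            "question": record["question"],
            "answer": answer_for(record["question"]),
            "context": f"Subject: {record['subject']}",
        })
    return rows


# Made-up sample record standing in for a line of nutrition_source.jsonl.
sample = ['{"question": "Placeholder question?", "subject": "nutrition"}', ""]
rows = jsonl_to_qa_rows(sample, answer_for=lambda question: "TODO")
print(json.dumps(rows, indent=2))
```

On real data, pass an open file handle for nutrition_source.jsonl as lines and write the returned list to processed_data.json with json.dump before running add_rows.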