name: dv-data-preparation
description: Ingest, clean, validate, transform, and shape data into visualization-ready outputs. Use when the user types $dv-data-preparation or when the task needs ETL, schema design, SQL/NoSQL queries, migrations, database tuning, or dashboard-ready dataset preparation.
license: MIT
argument-hint: "[goal / task / brief]"
metadata:
author: data-visualization-kit
version: "2.0.0"
DV Data Preparation
Treat $dv-data-preparation as the canonical data preparation entrypoint and the canonical data-preparation methodology for Data Visualization Kit.
Production-ready guide for ingestion, data cleaning, validation, transformation, schema design, ETL patterns, and database-backed preparation work that feeds exactly one selected visualization path.
Codex Adaptation
- Use this skill directly for
$dv-data-preparationinstead of delegating to a separate data-preparation workflow or prompt layer. - Keep all resulting notes, exports, and prepared datasets inside the active project
docs/tree when the task is project-scoped. - Require SQL and/or Python whenever implementation details, transformations, or reproducible prep steps are produced.
- Produce clean, validated outputs that are ready for exactly one selected visualization path.
When to Use
Use when:
- ingesting source data from files, databases, APIs, or mixed pipelines
- cleaning and validating raw business data before visualization
- designing schemas, models, or storage structures for prepared data
- writing SQL or MongoDB queries, joins, aggregations, and transformation logic
- building incremental ETL or watermark-based refresh flows
- optimizing database performance, indexes, migrations, backup, or operational reliability
- shaping final datasets for chart-ready, dashboard-ready, or semantic-model-ready use
Working Order
- Confirm the business goal and selected visualization path
- Identify source systems, refresh expectations, and trust issues
- Profile the raw data and prove data quality problems before rewriting logic
- Choose the right preparation surface:
- SQL-first relational shaping
- document-oriented modeling
- incremental ETL
- schema redesign
- Produce reproducible transformation steps
- Validate the prepared outputs against downstream visualization needs
- Record assumptions, transformations, and unresolved data risks in project docs
Reference Navigation
Core Preparation References
db-design.md- schema design, transactional vs analytics modeling, fact/dimension planningtransactional.md- OLTP-oriented modeling rulesanalytics.md- OLAP and star-schema oriented rulesincremental-etl.md- watermark, append/update, idempotent load patterns
MongoDB References
references/mongodb-crud.md- CRUD operations and query patternsreferences/mongodb-aggregation.md- aggregation pipelines and transformation patternsreferences/mongodb-indexing.md- index strategy and performance tuningreferences/mongodb-atlas.md- Atlas setup, operations, and hosted patterns
PostgreSQL References
references/postgresql-queries.md- SELECT, JOIN, CTE, and window-function patternsreferences/postgresql-psql-cli.md-psqlcommands and scriptingreferences/postgresql-performance.md- EXPLAIN, tuning, vacuum, and query optimizationreferences/postgresql-administration.md- backup, replication, maintenance, user administration
Stack-Specific References
stacks/postgres.mdstacks/mysql.mdstacks/sqlite.mdstacks/bigquery.mdstacks/d1_cloudflare.md
Python Utilities
Preparation and database utility scripts in scripts/:
db_migrate.py- generate and apply migrations for MongoDB and PostgreSQLdb_backup.py- backup and restore MongoDB and PostgreSQLdb_performance_check.py- analyze slow queries and recommend indexes
python scripts/db_migrate.py --db mongodb --generate "add_user_index"
python scripts/db_backup.py --db postgres --output /backups/
python scripts/db_performance_check.py --db mongodb --threshold 100ms
Data Preparation Mindset
- Prepared data is a product surface, not a temporary byproduct
- Fix semantics before polishing visualization
- Every transformation must be explainable and reproducible
- Schema choice should follow access pattern, not habit
- Incremental refresh logic must be explicit
- Performance matters once the dataset leaves toy scale
- Data quality checks belong inside the workflow
- One canonical metric definition beats many local copies
- Downstream chart needs should shape output tables
- If the prep is not trustworthy, the dashboard is not trustworthy
Best Practices
Preparation Strategy:
- Start from business questions and downstream charts, not just source tables
- Prefer stable prepared models over repeated ad hoc transformations
- Keep data contracts explicit: grain, keys, freshness, null policy, metric definitions
Relational Work:
- Normalize for transactional systems, denormalize deliberately for analytics outputs
- Index foreign keys and common filter paths
- Use EXPLAIN ANALYZE when query cost matters
Document and Mixed Data:
- Use embedding for tight 1-to-few relationships
- Use references for large fan-out or reuse-heavy structures
- Use aggregation pipelines when document-native transformation is the right fit
ETL and Quality:
- Make incremental rules idempotent
- Validate row counts, duplicates, null spikes, and key integrity after each major transform
- Record assumptions whenever source data is patched or inferred
Quick Decision Matrix
| Need | Choose |
|---|---|
| General ingestion, cleaning, validation | $dv-data-preparation |
| Schema or warehouse-ready model design | db-design.md + stack references |
| Relational query shaping | PostgreSQL references |
| MongoDB query or aggregation shaping | MongoDB references |
| Incremental refresh logic | incremental-etl.md |
| Database reliability, tuning, migrations | database agent + this skill |
Output Rules
- Keep project-scoped notes in
projects/<slug>/docs/ - State source systems, transformations, validation checks, and resulting prepared outputs
- Ensure prepared outputs are aligned to one selected visualization path
- Route to
$dv-data-visualizeafter the prepared surface is trustworthy