name: validate-data
description: Validate schema and data after a successful dlt pipeline load. Use when the user wants to check if loaded data looks correct, inspect table schemas, fix data types, flatten nested structures, or refine the data shape.
argument-hint: "[pipeline-name] [concerns]"
Validate loaded data
After a successful pipeline load, verify the schema and data make sense. Fix data types, nested structures, and missing columns as needed.
Parse $ARGUMENTS:
- pipeline-name (optional): the dlt pipeline name. If omitted, infer from session context. If ambiguous, ask the user and stop.
- hints (optional, after --): specific validation concerns
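The argument rules above can be sketched as a small parser. This is only an illustration under assumptions: the helper `parse_arguments` and the `ParsedArgs` type are hypothetical names, and the sketch assumes the raw argument string uses shell-style quoting.

```python
import shlex
from typing import List, NamedTuple, Optional


class ParsedArgs(NamedTuple):
    pipeline_name: Optional[str]  # None means "infer from session context"
    hints: List[str]              # validation concerns given after "--"


def parse_arguments(raw: str) -> ParsedArgs:
    """Split a raw argument string into a pipeline name and hints (hypothetical helper)."""
    tokens = shlex.split(raw)
    if "--" in tokens:
        split_at = tokens.index("--")
        head, hints = tokens[:split_at], tokens[split_at + 1:]
    else:
        head, hints = tokens, []
    # Only the first token before "--" is treated as the pipeline name.
    pipeline_name = head[0] if head else None
    return ParsedArgs(pipeline_name, hints)
```

A `pipeline_name` of `None` is the signal to fall back to session context, or to ask the user if that is ambiguous.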
1. Inspect schema
Export schema as mermaid
uv run dlthub local pipeline schema <pipeline_name> --format mermaid
Show the mermaid diagram to the user. This gives a quick overview of tables, columns, types, and relationships (parent/child).
2. View the data
For the human: Workspace Dashboard
Tell the user to launch the Workspace Dashboard:
uv run dlthub local pipeline show <pipeline_name>
This opens a browser with table schemas, row counts, and sample data.
For the agent: set up pipeline MCP server to query the data
You have an MCP server with the right set of tools available.
3. Review with user
Ask the user if the schema and data look right. Common issues to address:
Data type fixes
Use processing_steps in the resource config to transform data before loading. Available steps: map, filter, yield_map.
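from decimal import Decimal

"processing_steps": [
    # Go through str() so a float input does not carry its binary rounding error into the Decimal
    {"map": lambda item: {**item, "amount": Decimal(str(item["amount"]))}},
]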
IMPORTANT: NEVER convert monetary amounts or precision-sensitive values to float. Always use Decimal.
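To see why the rule matters, here is a minimal sketch using only the Python standard library. The function name `amount_to_decimal` and the `amount` field are illustrative assumptions. A named function like this could be used in a map step in place of a lambda.

```python
from decimal import Decimal


def amount_to_decimal(item: dict) -> dict:
    """Return a copy of item with "amount" converted to Decimal."""
    # str() first: Decimal(0.1) would capture the float's binary rounding error.
    return {**item, "amount": Decimal(str(item["amount"]))}


# Float arithmetic drifts, while Decimal arithmetic on decimal strings stays exact.
float_total = 0.1 + 0.2
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.3"))   # True
```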
Nested structures
dlt auto-unnests nested arrays into child tables (e.g., results inside a response becomes <resource>__results). This is often fine for analytics. If the user wants a flat structure, use yield_map to flatten, or adjust data_selector to point deeper into the response.
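A flattening step might look like the sketch below. It assumes the nested array lives under a `results` key, as in the example above. The function name `flatten_results` and the `result_` column prefix are illustrative choices. The generator yields one flat row per nested entry, which is the shape a yield_map step would presumably expect.

```python
from typing import Any, Dict, Iterator


def flatten_results(item: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per entry in item["results"]."""
    parent = {key: value for key, value in item.items() if key != "results"}
    children = item.get("results") or []
    if not children:
        # Keep the parent row so items without results are not dropped.
        yield parent
        return
    for child in children:
        # Prefix child fields to avoid clashing with parent column names.
        yield {**parent, **{f"result_{key}": value for key, value in child.items()}}
```

With this shape, every row carries the parent fields, so no child table is needed. The trade-off is that parent values are repeated on each row.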
Missing columns
Columns that are all-null on first load won't have inferred types. Options:
- Add columns hints to the resource config: "columns": {"field": {"data_type": "text"}}
- Add group_by or other API params to populate the columns
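For the first option, a small helper can find which columns need hints in a sample of rows. This is a sketch under assumptions: `all_null_columns` and `text_hints` are hypothetical helpers, not part of dlt. Defaulting every column to `text` is only a placeholder, and the real type should come from the API documentation or the user.

```python
from typing import Any, Dict, Iterable, List


def all_null_columns(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return column names whose value is None in every row, in first-seen order."""
    only_null: Dict[str, bool] = {}  # column name -> True while only None was seen
    for row in rows:
        for name, value in row.items():
            only_null[name] = only_null.get(name, True) and value is None
    return [name for name, is_null in only_null.items() if is_null]


def text_hints(column_names: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Build a columns hint mapping that types each named column as text."""
    return {name: {"data_type": "text"} for name in column_names}


sample = [
    {"id": 1, "note": None, "tag": None},
    {"id": 2, "note": None, "tag": "x"},
]
print(text_hints(all_null_columns(sample)))  # {'note': {'data_type': 'text'}}
```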
4. Iterate
Re-run the pipeline after changes (dev_mode gives a fresh dataset each time). Use debug-pipeline to inspect traces and load packages after each run. Inspect again with MCP or dlthub local pipeline schema <pipeline_name> --format mermaid. Repeat until the user is happy with the schema.
Next steps
- User is happy with data → suggest new-endpoint for more resources, view-data for querying, or the data-exploration toolkit for interactive notebooks and reports
- Need to fix pipeline code → edit and re-run with debug-pipeline
- User wants to see the data → Workspace Dashboard with the command above