debug-transformation

star 49

Debug dlthub transformation failures. Use when a transformation fails on a different destination than it was developed on, SQL dialect errors occur after deployment, pipeline recovery is needed after a failed run, or columns are silently dropped from output.

dlt-hub By dlt-hub schedule Updated 6/3/2026

name: debug-transformation description: Debug dlthub transformation failures. Use when a transformation fails on a different destination than it was developed on, SQL dialect errors occur after deployment, pipeline recovery is needed after a failed run, or columns are silently dropped from output.

Debug transformation

Diagnose and fix dlthub transformation failures. Two main failure classes: SQL dialect incompatibility (transformation works on dev destination, fails on production) and pipeline state errors (stale packages, schema drift, failed jobs).

When to use this skill

  • Transformation works on DuckDB locally but fails on BigQuery, Snowflake, or Postgres after deployment
  • Pipeline fails with SQL syntax errors or "unsupported function" messages
  • You want to verify SQL portability before deploying (proactive check)
  • Pipeline is stuck in a failed/retry loop
  • Columns are missing from output after the first run (NULL-only or computed/derived columns)
  • You want to inspect what the transformation actually produced after a successful run

1. SQL dialect compatibility

1a. Static SQL compatibility checks

The toolkit includes a bundled dialect checker that reads your transformation file directly — no pipeline connection needed, and no script to write. It parses each @dlt.hub.transformation function's SQL in the dev dialect and transpiles it to the target dialect, catching portability issues before deployment.

Tell the user: "I'm going to run a static SQL compatibility check on your transformation file to catch any SQL dialect issues before you move to prod."

You must pass the dev and prod destination types explicitly via --read and --write. If you already have this from context (e.g. from create-transformation), use it directly. If not, call get_local_pipeline_state via MCP to retrieve the destination type for the pipeline, then ask the user for the prod destination if unknown.

Run the checker yourself and read the output:

uv run python ${CLAUDE_PLUGIN_ROOT}/tools/check_dialect.py transformations/<dataset>_to_cdm.py --read <dev_dialect> --write <prod_dialect>

Expected output looks like:

Dialects: duckdb -> bigquery
Checking 5 transformation(s) from transformations/hubspot_to_cdm.py

[dim_company] OK
[dim_person] WARN: double-quoted identifier "email"; verify destination quoting in bigquery
[fact_activity] ERROR: transpile duckdb->bigquery failed: ...
...

SUMMARY
  warnings: 1
  errors: 1

Walk the user through the results:

  • OK — safe to deploy as-is
  • WARN — portability risk worth inspecting, not a guaranteed failure; common cause is double-quoted identifiers that BigQuery/Snowflake handle differently
  • ERROR — the SQL will likely fail on the target destination; rewrite to ANSI SQL (see section 1b)
  • skipped — SQL is dynamically constructed and can't be statically analysed; inspect those functions manually

This check catches the most common issues, but does not replace inspecting Relation.to_sql() output if deployment still fails after fixes:

If either dlt destination is not covered by SQLGlot, the script stops and prints the available SQLGlot dialects. In that case, inspect dlt's actual Relation.to_sql() output or run a target-destination test pipeline instead. SQLGlot supports 31+ dialects — see https://sqlglot.com/sqlglot.html for dialect names.

1b. Common dialect-specific patterns to fix

Rewrite these to ANSI SQL so SQLGlot can transpile them to any destination:

Dialect-specific pattern Portable ANSI alternative
x::TEXT, x::INT (cast shorthand) CAST(x AS VARCHAR), CAST(x AS INTEGER)
IFNULL(a, b) COALESCE(a, b)
ILIKE LOWER(x) LIKE LOWER(y)
INT64 / STRING / FLOAT64 as type names BIGINT / VARCHAR / DOUBLE
EPOCH_MS(), STRFTIME(), LIST_AGG(), etc. No ANSI equivalent — use query_dialect (see below)
QUALIFY clause Wrap in subquery with WHERE on the window result

1c. When dialect-specific SQL is unavoidable

If a transformation genuinely requires a dialect-specific function with no ANSI equivalent, declare the source dialect with query_dialect so SQLGlot knows how to transpile:

@dlt.hub.transformation
def my_transform(dataset: dlt.Dataset):
    yield dataset(
        "SELECT STRFTIME(created_at, '%Y-%m') AS month FROM events",
        query_dialect="duckdb",  # tells dlt this SQL is DuckDB dialect
    )

If SQLGlot raises UnsupportedError or logs warnings at unsupported_level, the construct has no mapping to the target dialect and must be rewritten to an ANSI equivalent or handled in application code.

References:

2. Pipeline failure recovery

Use this escalation order. Do not skip steps.

Step 1: Inspect failures

uv run dlthub local pipeline failed-jobs <pipeline_name>
uv run dlthub local pipeline trace <pipeline_name>

Read the error messages before taking any recovery action. Most failures are fixable without touching destination state.

Step 2: Clear stale packages

If a prior run failed mid-load and left pending packages that keep retrying old (broken) SQL:

uv run dlthub local pipeline drop-pending-packages <pipeline_name>

Re-run after clearing. If the underlying SQL was fixed, this is often all that is needed.

Step 3: Reconcile local state

If local pipeline state has drifted from the destination (e.g. after a partial load or schema change):

uv run dlthub local pipeline sync <pipeline_name>

If no recoverable destination state exists, sync may not resolve partial retries — use drop-pending-packages first.

Step 4: Selective drop (last resort)

DuckDB only: if the failure is "Parser Error: Adding columns with constraints not yet supported", dlthub local pipeline drop will not fix it — go to section 3's DuckDB workaround instead.

Only if the steps above do not resolve the failure and incorrect schema or tables were already loaded to the destination:

uv run dlthub local pipeline drop <pipeline_name> <resource>   # drop a specific resource
uv run dlthub local pipeline drop <pipeline_name> --drop-all   # only with explicit user confirmation

Safety rules before dropping:

  • Prefer dropping specific resources over --drop-all
  • Confirm pipeline name, destination, dataset, and which resources will be dropped before executing
  • drop removes destination tables and resets matching local state — this forces a full reload and may remove good data alongside bad
  • If uncertain which resources are safe to drop, stop and ask the user before executing
  • After drop: re-run transformations and validate schema/tables before further loads

References:

3. Missing columns and schema issues

dlthub silently drops columns it cannot type-infer — no error, no warning. Two root causes:

3a. NULL-only columns

When a column is NULL-only on the first run and no columns= hint was provided, dlthub strips the column from the schema. Subsequent runs write data but the column is absent.

3b. Computed / derived columns

When a transformation uses derived expressions — md5(), strftime(), TRY_CAST, CASE WHEN ... END, aggregates with aliases, or function chains — dlthub cannot infer the output type from the SQL alone. Without a columns= hint the column is silently dropped.

Diagnose: compare expected vs actual columns using the MCP get_table_schema tool, or inspect via:

uv run dlthub local pipeline show <pipeline_name>

Identify which computed columns are absent from the output schema. Each missing column needs an explicit columns= hint.

Fix: add columns= hints for every affected column:

@dlt.hub.transformation(
    write_disposition="replace",
    columns={
        "company_sk":   {"data_type": "text",      "nullable": False},
        "joined_at":    {"data_type": "timestamp",  "nullable": True},
        "email_hash":   {"data_type": "text",       "nullable": True},   # md5() result
        "month_bucket": {"data_type": "text",       "nullable": True},   # strftime() result
        "event_count":  {"data_type": "bigint",     "nullable": True},   # COUNT() alias
    },
)
def dim_company(dataset: dlt.Dataset):
    ...

data_type values must match the key type contract established during create-transformation (consistently text or bigint for surrogate keys).

Apply columns= hints for:

  • Any column from a LEFT JOIN (lookup may return NULL)
  • Any cast from string to typed value where the source may be empty
  • Any column that was NULL-only in a prior run
  • Any computed or derived column: md5(), strftime(), TRY_CAST, CASE WHEN, aggregate aliases, function chains

Known DuckDB failure: "Parser Error: Adding columns with constraints not yet supported"

This error surfaces when re-running a transformation with new or modified columns= hints against DuckDB — DuckDB's ALTER TABLE cannot add constrained columns. The workaround is to drop the dataset and re-run from scratch:

# DuckDB only — this drops all tables in the dataset
duckdb <path_to_db_file> "DROP SCHEMA <dataset_name> CASCADE;"

Then re-run the transformation script. This is DuckDB-specific behavior and will not occur on cloud destinations (BigQuery, Snowflake, Postgres). Do not use dlthub local pipeline drop for this error — it does not clear the DuckDB schema and will not resolve the constraint conflict.

Reference: https://dlthub.com/docs/hub/transformations.md

4. Validate transformation output

After fixing an issue and re-running, validate the output using the same checks as the happy path — see create-transformation Step 9.

Install via CLI
npx skills add https://github.com/dlt-hub/dlthub-ai-workbench --skill debug-transformation
Repository Details
star Stars 49
call_split Forks 3
navigation Branch main
article Path SKILL.md
More from Creator