duckdb-transformation - SKILL.md Agent Skill

name: DuckDB Transformation description: > Expert knowledge for writing, optimizing, and migrating DuckDB transformations in Keboola. Covers SQL dialect, block orchestration, dynamic backends, Parquet format, case sensitivity, Snowflake-to-DuckDB migration, type casting patterns, and best practices. Use when the user is writing DuckDB SQL, migrating from Snowflake, debugging DuckDB transformations, or needs guidance on DuckDB-specific syntax and configuration in Keboola. version: 1.0.0

DuckDB Transformation Knowledge

Provide expertise on writing, configuring, and optimizing DuckDB transformations in Keboola Connection.

Overview

DuckDB (keboola.duckdb-transformation) is an in-process analytical database (OLAP) with columnar storage, designed for fast SQL analytics. It is a cost-effective alternative to Snowflake for small-to-medium datasets.

Status: BETA --- breaking changes may occur.

Key advantages:

In-process execution (no external database server)
Columnar storage optimized for analytical queries
Block-based orchestration with automatic dependency analysis
Parallel execution of independent scripts
Rich SQL dialect with modern quality-of-life extensions

Configuration

Component ID

keboola.duckdb-transformation

Configuration Structure

{
  "parameters": {
    "blocks": [
      {
        "name": "Block Name",
        "codes": [
          {
            "name": "Script Name",
            "script": [
              "CREATE TABLE \"output\" AS",
              "SELECT * FROM \"input_table\";"
            ]
          }
        ]
      }
    ],
    "threads": 4,
    "max_memory_mb": 2048,
    "image_tag": "parquet",
    "dtypes_infer": false
  },
  "storage": {
    "input": {
      "tables": [
        {
          "source": "in.c-bucket.table",
          "destination": "input_table"
        }
      ]
    },
    "output": {
      "tables": [
        {
          "source": "output",
          "destination": "out.c-bucket.result"
        }
      ]
    }
  }
}

Settings

Timeout --- maximum execution time (default: 1 hour)
Backend size --- memory allocation (see Dynamic Backends)
DuckDB version --- latest (default) or pinned (e.g., 1.5.2, 1.4.4)
Automatic data types --- auto-assign types to output tables
Use parquet for input tables --- Parquet instead of CSV (recommended for >1 GB)
Infer input table data types --- infer types from input tables (useful for non-typed Storage)
Debug mode --- enable debug logging

Block-Based Orchestration

Blocks execute sequentially (one after another)
Scripts within a block execute in parallel (when no dependencies)
Uses SQLGlot for automatic SQL analysis
Builds a DAG (Directed Acyclic Graph) of dependencies
Execution order is automatically optimized

Dynamic Backends

Backend Size	Memory	Recommended For
XSmall	8 GB	Small datasets, testing
Small (default)	16 GB	Most use cases
Medium	32 GB	Large datasets (5 GB+)
Large	113.6 GB	Very large datasets (10 GB+)

Start with Small, scale up as needed. Dynamic backends are not available on Free/Pay As You Go plans.

DuckDB automatically detects available CPU and memory. Manual override via threads and max_memory_mb parameters.

Parquet Format

Enable Parquet for significantly better performance with large datasets:

Faster processing than CSV
Lower memory usage
Columnar storage optimized for analytics

Recommendation: Always use Parquet for datasets > 1 GB.

Configuration: set "image_tag": "parquet" or toggle Use parquet for input tables in UI.

Infer Input Table Data Types

Keboola Storage tables can be non-typed (all columns as VARCHAR). When enabled, DuckDB infers actual types (INTEGER, FLOAT, DATE, etc.) so aggregate functions and type-specific operations work without manual casting.

Without inference: SUM("amount") fails because amount is VARCHAR. With inference: SUM("amount") works because DuckDB detects it as numeric.

Sync Actions

Four debugging actions available without running the full transformation:

syntax_check --- validate SQL syntax without execution
lineage_visualization --- markdown diagram of data dependencies
execution_plan_visualization --- planned execution order with blocks and batches
expected_input_tables --- list of expected input tables based on SQL analysis

SQL Writing Rules

Semicolons Between Statements

Each SQL statement MUST be terminated with a semicolon (;). Multiple statements in a single script must be properly separated:

-- Correct: each statement ends with a semicolon
CREATE TABLE "output_a" AS SELECT * FROM "input_a";

CREATE TABLE "output_b" AS SELECT * FROM "input_b";

Missing semicolons cause syntax errors.

Case Sensitivity (CRITICAL)

Table names:

Unquoted → lowercase (MyTable → mytable)
Quoted → case-sensitive ("MyTable" stays MyTable)

Column names:

Always case-sensitive regardless of quoting
SELECT columnName and SELECT ColumnName refer to different columns

This is different from Snowflake (unquoted → uppercase, quoted columns case-insensitive).

Best practice: Use consistent lowercase naming or always quote identifiers.

Type Casting Patterns

For non-typed Storage tables, use safe casting patterns:

-- Safe: handles empty strings without errors
TRY_CAST(NULLIF("column", '') AS DATE)
TRY_CAST(NULLIF("column", '') AS DOUBLE)
TRY_CAST(NULLIF("column", '') AS INT)

-- Shorthand type cast
"column"::BOOLEAN
"column"::INTEGER

-- Regular cast
CAST("column" AS DECIMAL)

Pattern: TRY_CAST(NULLIF("col", '') AS TYPE) --- converts empty strings to NULL before casting, avoiding errors on empty values.

DuckDB SQL Extensions

Quality-of-life extensions not available in all SQL dialects:

-- GROUP BY ALL: auto-groups by all non-aggregated columns
SELECT product, category, SUM(sales) FROM orders GROUP BY ALL;

-- EXCLUDE: select all columns except specific ones
SELECT * EXCLUDE (password, ssn) FROM users;

-- ASOF JOIN: time-series data with non-matching timestamps
SELECT s.player_id, s.score, w.temperature
FROM scores s ASOF JOIN weather w ON s.score_time >= w.timestamp;

-- SUMMARIZE: quick data profiling
SUMMARIZE SELECT * FROM my_table;

Memory Management for Large Datasets

For datasets > 10 GB, configure on-disk processing:

PRAGMA memory_limit='8GB';
PRAGMA temp_directory='/tmp/duckdb_temp';
PRAGMA threads=4;
PRAGMA enable_object_cache;

Snowflake to DuckDB Migration

Data Type Mapping

Snowflake	DuckDB	Notes
`VARIANT`	`JSON` or `VARCHAR`	Semi-structured data
`ARRAY`	`LIST`	Native arrays
`OBJECT`	`STRUCT` or `JSON`	Nested objects
`INTEGER`	`INTEGER`	Same
`VARCHAR`	`VARCHAR`	Same
`TIMESTAMP`	`TIMESTAMP`	Same

SQL Function Differences

Category	Snowflake	DuckDB
Window functions	`QUALIFY ROW_NUMBER() OVER (...) = 1`	Subquery with `WHERE rn = 1`
Null handling	`NVL(a, b)`	`COALESCE(a, b)`
Conditionals	`IFF(cond, a, b)`	`CASE WHEN cond THEN a ELSE b END`
Date add	`DATEADD(day, 7, date)`	`date + INTERVAL '7 days'`
Date diff	`DATEDIFF(day, d1, d2)`	`date_diff('day', d1, d2)`
String agg	`LISTAGG(...) WITHIN GROUP (ORDER BY ...)`	`STRING_AGG(...)`
Transient tables	`CREATE TRANSIENT TABLE`	Not supported (use regular or TEMP)

Case Sensitivity Migration

Snowflake:

Unquoted → UPPERCASE (MyTable → MYTABLE)
Quoted → case-sensitive ("MyTable" ≠ "MYTABLE")

DuckDB:

Unquoted → lowercase (MyTable → mytable)
Quoted tables → case-sensitive
Columns → always case-sensitive regardless of quoting

Migration tip: Use consistent lowercase naming: CREATE TABLE mytable ...

Window Functions Migration

-- Snowflake
SELECT * FROM table QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) = 1;

-- DuckDB
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
    FROM table
) WHERE rn = 1;

Temporary Table Differences

Snowflake: Temp tables can be in any schema
DuckDB: Temp tables ALWAYS in temp.main schema (cannot specify another)
Both: Auto-drop after session end
DuckDB name collision: Temp table has priority over regular table with same name. Access regular table via memory.main.table_name

Automatic Migration

SQLGlot can automatically convert ~85% of Snowflake SQL to DuckDB syntax. However, complex queries, Snowflake-specific functions, and edge cases may not convert correctly.

Transformation Migration Component (keboola.app-transformation-migration):

Experimental component that automates Snowflake → DuckDB migration
Handles SQL dialect translation via SQLGlot
Preserves input/output table mappings and runtime settings
If translation fails for a code block, original SQL is preserved as comments

Important: The automatic migration is not 100% accurate. Roughly 25% of migrated transformations will require manual adjustments --- typically:

Case sensitivity issues
Unsupported Snowflake-specific functions
Data type differences
Column alias handling in GROUP BY

Always review and test migrated configurations before using in production.

Functions That Work the Same

SUBSTRING(str, start, len)
POSITION('x' IN str)
CONCAT(a, b) (both handle NULL gracefully)
a || b (both return NULL if any input is NULL)
LIMIT n
FETCH FIRST n ROWS
CREATE TEMP TABLE / CREATE TEMPORARY TABLE
CREATE OR REPLACE TABLE
CREATE TABLE IF NOT EXISTS
CREATE TABLE AS SELECT (CTAS)

Real-World Example: CRM Data Transformation

Typical transformation processing CRM data (e.g., HubSpot):

/* companies */
CREATE TABLE "out_companies" AS
SELECT
  "companyId",
  "name",
  "website",
  TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate",
  "isDeleted"::BOOLEAN AS "isDeleted"
FROM "companies";

/* contacts */
CREATE TABLE "out_contacts" AS
SELECT
  "canonical_vid",
  "firstname",
  "lastname",
  "email",
  TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate",
  "hs_analytics_source" AS "email_source",
  "associatedcompanyid",
  "lifecyclestage"
FROM "contacts";

/* deals */
CREATE TABLE "out_deals" AS
SELECT
  "dealId",
  "isDeleted"::BOOLEAN AS "isDeleted",
  "dealname",
  TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate",
  TRY_CAST(NULLIF("closedate", '') AS DATE) AS "closedate",
  "dealtype",
  TRY_CAST(NULLIF("amount", '') AS DOUBLE) AS "amount",
  "pipeline",
  "dealstage",
  "hubspot_owner_id",
  "hs_analytics_source"
FROM "deals";

/* pipeline stages */
CREATE TABLE "out_stages" AS
SELECT
  "stageId",
  "label",
  TRY_CAST(NULLIF("displayOrder", '') AS INT) AS "displayOrder",
  TRY_CAST(NULLIF("probability", '') AS DOUBLE) AS "probability",
  "closedWon"::BOOLEAN AS "closedWon"
FROM "pipeline_stages";

/* deals-contacts association */
CREATE TABLE "out_deals_contacts" AS
SELECT
  "contact_vid",
  "dealId"
FROM "deals_contacts_list";

/* activities */
CREATE TABLE "out_activities" AS
SELECT
  "engagement_id",
  "metadata_subject",
  TRY_CAST(NULLIF("engagement_createdAt", '') AS DATE) AS "engagement_createdAt",
  "metadata_durationMilliseconds",
  "associations_contactIds",
  "associations_dealIds",
  "associations_ownerIds"
FROM "activities";

Best Practices

Start with Small backend, scale up as needed
Use Parquet for datasets > 1 GB
Terminate every statement with ; when multiple statements in a script
Use consistent casing for identifiers (prefer lowercase)
Quote identifiers with mixed case or special characters
Filter and project early --- apply WHERE close to source, select only needed columns
Use EXPLAIN to analyze execution plans for expensive queries
Split complex transformations into smaller steps, each producing one output table
Use consistent naming for output tables (e.g., stg_, fact_, dim_ prefixes)
Use sync actions for debugging (syntax_check, lineage_visualization, execution_plan)
Test migrations carefully when moving from Snowflake
Use TRY_CAST(NULLIF(...) AS TYPE) pattern for safe type conversion of potentially empty values

When to Use DuckDB vs. Snowflake

Choose DuckDB for:

Ad-hoc analysis and small-to-medium datasets (up to a few TB)
Rapid prototyping of transformations
Projects with limited budgets
Development and testing

Choose Snowflake for:

Very large datasets (TB+)
Complex enterprise workloads
Sharing warehouses across multiple processes
Maximum scalability
Advanced Snowflake-specific features

DuckDB is NOT for transactional workloads. It is an OLAP database optimized for SELECT statements. Avoid frequent INSERT/UPDATE operations.

Reference Links

DuckDB Official
SQLGlot --- SQL transpilation
DuckDB Snowflake Extension --- for local development
Component README
Migration Component README
Keboola DuckDB Documentation