name: duckdb-patterns description: "DuckDB patterns for embedded OLAP analytics: in-process SQL on Parquet/CSV/JSON files, window functions, ASOF joins, dbt-duckdb integration, Dagster assets, Python/Node.js APIs, and performance tuning. Zero infrastructure required — analytical power without a server."
DuckDB Patterns
Scope: Embedded analytical SQL — local analytics, ELT pipelines, file querying, and data transformation without infrastructure. For transactional database patterns, see postgres-patterns. For orchestrating pipelines on top of DuckDB, see data-engineering.
When to Activate
- Running analytical queries without a database server
- Querying Parquet, CSV, or JSON files directly with SQL
- Building local or CI-friendly data pipelines
- Using dbt with DuckDB as the analytical backend
- Writing Dagster assets that produce or consume DuckDB tables
- Replacing a heavyweight OLAP database for single-node workloads
Overview
DuckDB is an embedded, in-process OLAP database — SQLite for analytics. It runs inside your Python, Node.js, or CLI process with no server, no cloud dependency, and no operational overhead.
[Source files] [DuckDB] [Output]
Parquet / CSV ─────► SQL queries ────► DataFrame
JSON / S3 ─────► window fns ────► Parquet
Postgres (FDW) ─────► aggregations ────► Visualization
When to choose DuckDB over Postgres for analytics:
- Queries scan many rows but few columns (columnar wins)
- No persistent server in the environment (local scripts, CI, notebooks)
- Direct file analysis without import/ingest step
- Sub-second aggregations on hundreds of millions of rows locally
Installation
# Python
pip install duckdb
# Node.js
npm install duckdb
# CLI
brew install duckdb # macOS
Core SQL Patterns
In-memory vs persisted database
import duckdb
# In-memory (default) — disappears when process ends
conn = duckdb.connect()
# Persisted — survives restarts, sharable
conn = duckdb.connect("analytics.duckdb")
Direct file querying (no import needed)
-- Query Parquet directly
SELECT date_trunc('month', event_date) AS month,
count(*) AS events
FROM read_parquet('data/events/*.parquet')
GROUP BY 1
ORDER BY 1;
-- Query CSV
SELECT * FROM read_csv('exports/users.csv', header=true, auto_detect=true);
-- Query JSON
SELECT json_extract(payload, '$.user_id') AS user_id
FROM read_ndjson('logs/events.ndjson');
-- Query S3 (with httpfs extension)
INSTALL httpfs; LOAD httpfs;
SELECT * FROM read_parquet('s3://my-bucket/data/*.parquet');
Window functions
-- Rolling 7-day retention
SELECT
user_id,
event_date,
count(*) OVER (
PARTITION BY user_id
ORDER BY event_date
RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
) AS events_last_7d,
-- First event ever (for cohort analysis)
min(event_date) OVER (PARTITION BY user_id) AS first_seen
FROM events;
-- Percentile distribution
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM request_logs;
ASOF join (time-series matching)
-- Match each event to the most recent price at or before event_time
SELECT e.user_id, e.event_time, p.price
FROM events e
ASOF JOIN prices p
ON e.symbol = p.symbol AND e.event_time >= p.price_time;
Pivot / unpivot
-- Pivot: rows → columns
PIVOT monthly_revenue
ON month
USING sum(revenue)
GROUP BY product;
-- Unpivot: columns → rows
UNPIVOT metrics
ON (clicks, impressions, conversions)
INTO NAME metric VALUE value;
Python Integration
With pandas
import duckdb
import pandas as pd
conn = duckdb.connect()
# Read from DataFrame — no copy, zero-overhead
df = pd.read_parquet("events.parquet")
result = conn.execute("""
SELECT date_trunc('day', ts) AS day, count(*) AS n
FROM df
GROUP BY 1
ORDER BY 1
""").df() # returns a DataFrame
# Write query result to Parquet
conn.execute("""
COPY (SELECT * FROM df WHERE event = 'purchase')
TO 'purchases.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
With Polars (zero-copy Arrow)
import duckdb
import polars as pl
conn = duckdb.connect()
lf = pl.scan_parquet("events/*.parquet")
# DuckDB reads Polars LazyFrame directly
result = conn.execute("SELECT * FROM lf WHERE amount > 100").pl()
Relation API (composable queries)
conn = duckdb.connect()
events = conn.read_parquet("events/*.parquet")
result = (
events
.filter("event_type = 'purchase'")
.aggregate("user_id, sum(amount) AS total", "user_id")
.order("total DESC")
.limit(100)
)
result.show()
Node.js Integration
import Database from 'duckdb';
import { promisify } from 'util';
const db = new Database(':memory:'); // or 'analytics.duckdb'
const conn = db.connect();
const all = promisify(conn.all.bind(conn));
// Query Parquet from Node.js
const rows = await all(`
SELECT user_id, count(*) AS events
FROM read_parquet('./data/events.parquet')
GROUP BY user_id
ORDER BY events DESC
LIMIT 10
`);
console.log(rows);
db.close();
dbt Integration
DuckDB is the recommended backend for local dbt development and CI pipelines.
profiles.yml
analytics:
target: dev
outputs:
dev:
type: duckdb
path: "analytics.duckdb"
threads: 4
ci:
type: duckdb
path: ":memory:" # ephemeral for CI, fastest possible
threads: 2
Install
pip install dbt-duckdb
dbt debug # verify connection
dbt build # run models + tests
dbt model reading external Parquet
-- models/staging/stg_events.sql
-- dbt-duckdb can reference external files directly
{{ config(materialized='view') }}
SELECT *
FROM read_parquet('{{ env_var("DATA_PATH") }}/events/*.parquet')
Incremental model with DuckDB
-- models/marts/fct_daily_active_users.sql
{{ config(
materialized='incremental',
unique_key='day',
incremental_strategy='delete+insert'
) }}
SELECT
date_trunc('day', event_time) AS day,
count(DISTINCT user_id) AS dau
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
WHERE event_time >= (SELECT max(day) FROM {{ this }})
{% endif %}
GROUP BY 1
Dagster Integration
# assets/duckdb_assets.py
from dagster import asset, AssetExecutionContext
from dagster_duckdb import DuckDBResource
import pandas as pd
@asset(
group_name="analytics",
required_resource_keys={"duckdb"},
)
def daily_active_users(context: AssetExecutionContext, duckdb: DuckDBResource) -> None:
with duckdb.get_connection() as conn:
conn.execute("""
CREATE OR REPLACE TABLE dau AS
SELECT date_trunc('day', event_time) AS day,
count(DISTINCT user_id) AS dau
FROM read_parquet('/data/events/*.parquet')
GROUP BY 1
""")
count = conn.execute("SELECT count(*) FROM dau").fetchone()[0]
context.log.info(f"Computed DAU for {count} days")
# resources (definitions.py)
from dagster_duckdb import DuckDBResource
duckdb_resource = DuckDBResource(database="analytics.duckdb")
Performance Tuning
Threading and memory
import duckdb
conn = duckdb.connect()
# Use all cores for large queries (default: all)
conn.execute("SET threads = 8")
# Cap memory usage (default: 80% of RAM)
conn.execute("SET memory_limit = '4GB'")
# Temp directory for spill-to-disk on large joins
conn.execute("SET temp_directory = '/tmp/duckdb'")
Persistent database for repeated queries
# Create a persistent database from Parquet once
conn = duckdb.connect("analytics.duckdb")
conn.execute("""
CREATE TABLE IF NOT EXISTS events AS
SELECT * FROM read_parquet('s3://bucket/events/*.parquet')
""")
# Index frequently filtered columns
conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON events(user_id)")
Export to Parquet (efficient handoff)
conn.execute("""
COPY (
SELECT * FROM events WHERE event_date >= '2026-01-01'
)
TO 'recent_events.parquet'
(FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)
""")
Common Anti-Patterns
| Anti-pattern | Problem | Fix |
|---|---|---|
| Importing CSV into DuckDB before querying | Unnecessary step | Use read_csv() directly |
| Using DuckDB for OLTP (many small writes) | Not optimized for row-level updates | Use PostgreSQL for transactional data |
| Single-threaded queries on large files | Slow | DuckDB uses all cores by default — don't force SET threads = 1 |
| Storing DuckDB file in tmpfs/ramdisk | Data loss on restart | Use persistent path for anything not ephemeral |
| Running DuckDB as a server for multiple writers | DuckDB is single-writer | Use Postgres for multi-process write access |
Related
- data-engineering — dbt + Dagster pipeline orchestration
- postgres-patterns — transactional database patterns
- analytics-workflow — dashboards and product metrics on top of these pipelines