duckdb - SKILL.md Agent Skill

name: duckdb description: This skill should be used when users need to work with the DuckDB CLI for querying, analyzing, or transforming data. Use this skill for executing SQL queries, importing/exporting data, configuring DuckDB settings, or performing database operations at the command line.

DuckDB CLI Skill

This skill provides comprehensive guidance for using DuckDB's command-line interface, a fast, in-process SQL OLAP database management system designed for analytical workloads.

About DuckDB CLI

DuckDB CLI is a single, dependency-free executable that provides a full-featured SQL interface for data analysis. It excels at analytical queries on structured data formats (CSV, Parquet, JSON) and can handle datasets ranging from in-memory operations to out-of-core processing of files larger than available RAM.

Key Capabilities

SQL Analytics: Full SQL support with PostgreSQL-compatible syntax
File Format Support: Native reading of CSV, Parquet, JSON, Excel, and more
In-Memory & Persistent: Work with temporary in-memory databases or persistent files
Zero-Copy Integration: Direct querying of files without importing
Advanced Features: Window functions, CTEs, PIVOT/UNPIVOT, aggregations
Data Import/Export: Seamless conversion between formats
Performance: Columnar storage, vectorized execution, and intelligent query optimization

Version Information

Current stable version: 1.4.1 (Ossivalis)

Available for Windows, macOS, and Linux as a single executable with no dependencies.

When to Use This Skill

Use this skill when users:

Need to query CSV, Parquet, JSON, or Excel files using SQL
Want to perform data analysis at the command line
Need to transform, aggregate, or join datasets
Want to convert between data formats (CSV to Parquet, JSON to CSV, etc.)
Ask about database operations, aggregations, or analytics
Need to work with both small and large datasets efficiently
Want to create persistent databases for analytical workloads
Need to execute SQL queries from scripts or command line

Installation

Download and Setup

Download DuckDB CLI from the installation page under the CLI tab
Available as pre-built binaries for:
- Windows (x64, ARM64)
- macOS (x64, ARM64/M1)
- Linux (x64, ARM64)
Extract the executable (single file, no installation needed)
Run from terminal: duckdb (or ./duckdb in PowerShell/POSIX shells)

Package Managers

# Homebrew (macOS/Linux)
brew install duckdb

# Chocolatey (Windows)
choco install duckdb

# Scoop (Windows)
scoop install duckdb

Getting Started

Launching DuckDB

In-Memory Database (Temporary):

duckdb

Creates a temporary database that won't persist after closing. Ideal for quick analysis.

Persistent Database:

duckdb my_database.duckdb

Creates or opens a database file. All changes are automatically saved.

Opening Existing Database:

duckdb path/to/existing.duckdb

Exiting DuckDB

Press Ctrl+D, Ctrl+C, or type:

.exit

Persistent databases automatically checkpoint (save) when closing.

SQL Query Execution

Interactive SQL

Enter SQL statements followed by semicolons:

SELECT 42 AS answer;

Multi-line Statements:

SELECT
    column1,
    column2
FROM my_table
WHERE condition = true;

Non-Interactive Execution

Execute SQL directly from command line:

duckdb :memory: "SELECT 42 AS answer"

Pipe SQL from file:

duckdb < commands.sql

Execute and output results:

duckdb my_db.duckdb "SELECT * FROM table" > results.csv

Reading Data Files

DuckDB can query files directly without importing:

CSV Files

-- Read CSV file
SELECT * FROM read_csv('data.csv');

-- With options
SELECT * FROM read_csv('data.csv',
    header = true,
    delimiter = ',',
    auto_detect = true
);

-- Create table from CSV
CREATE TABLE my_table AS
SELECT * FROM read_csv('data.csv');

Parquet Files

-- Read Parquet file
SELECT * FROM read_parquet('data.parquet');

-- Read multiple Parquet files
SELECT * FROM read_parquet('data/*.parquet');

-- Hive-partitioned Parquet
SELECT * FROM read_parquet('data/year=*/month=*/*.parquet');

JSON Files

-- Read JSON file
SELECT * FROM read_json('data.json');

-- NDJSON (newline-delimited JSON)
SELECT * FROM read_json('data.ndjson', format='newline_delimited');

Excel Files

-- Read Excel file
SELECT * FROM read_excel('data.xlsx');

-- Specific sheet
SELECT * FROM read_excel('data.xlsx', sheet='Sheet2');

Reading from stdin

cat data.csv | duckdb -c "SELECT * FROM read_csv('/dev/stdin')"

Common SQL Operations

Basic Queries

-- Select specific columns
SELECT name, age, city FROM users;

-- Filter rows
SELECT * FROM users WHERE age > 25;

-- Sort results
SELECT * FROM users ORDER BY age DESC;

-- Limit results
SELECT * FROM users LIMIT 10;

Aggregations

-- Count rows
SELECT COUNT(*) FROM users;

-- Group by with aggregation
SELECT city, COUNT(*) as count, AVG(age) as avg_age
FROM users
GROUP BY city;

-- DuckDB-friendly GROUP BY ALL
SELECT city, COUNT(*) as count
FROM users
GROUP BY ALL;

Joins

-- Inner join
SELECT u.name, o.order_id
FROM users u
JOIN orders o ON u.user_id = o.user_id;

-- Left join
SELECT u.name, o.order_id
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;

-- Multiple joins
SELECT u.name, o.order_id, p.product_name
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
LEFT JOIN products p ON o.product_id = p.product_id;

Window Functions

-- Row number
SELECT name, age,
    ROW_NUMBER() OVER (ORDER BY age) as rank
FROM users;

-- Partition by
SELECT name, department, salary,
    AVG(salary) OVER (PARTITION BY department) as dept_avg
FROM employees;

-- Quantiles with NTILE
SELECT name, score,
    NTILE(4) OVER (ORDER BY score) as quartile
FROM students;

Pivoting

-- UNPIVOT (wide to long)
UNPIVOT sales
ON jan, feb, mar
INTO
    NAME month
    VALUE amount;

-- PIVOT (long to wide)
PIVOT sales
ON month
USING SUM(amount);

Common Table Expressions (CTEs)

WITH high_earners AS (
    SELECT * FROM employees WHERE salary > 100000
),
departments AS (
    SELECT dept_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT h.name, d.avg_salary
FROM high_earners h
JOIN departments d ON h.dept_id = d.dept_id;

Data Export

Export to CSV

COPY (SELECT * FROM my_table) TO 'output.csv' (HEADER, DELIMITER ',');

Export to Parquet

COPY (SELECT * FROM my_table) TO 'output.parquet' (FORMAT PARQUET);

Export to JSON

COPY (SELECT * FROM my_table) TO 'output.json';

Dot Commands

Dot commands provide CLI-specific functionality and start with a period (.):

Essential Dot Commands

.help                    -- Show all available dot commands
.tables                  -- List all tables in database
.schema [table_name]     -- Show table schema
.open path/to/db.duckdb  -- Switch to different database
.read file.sql           -- Execute SQL from file
.exit                    -- Exit DuckDB

Output Formatting

.mode duckbox           -- Default table format
.mode csv               -- CSV output
.mode json              -- JSON output
.mode markdown          -- Markdown tables
.mode latex             -- LaTeX tables
.mode insert            -- SQL INSERT statements
.mode line              -- One value per line

Example:

.mode csv
SELECT * FROM users;
.mode duckbox

Output Redirection

-- Redirect all output to file
.output results.txt
SELECT * FROM users;
.output stdout         -- Reset to terminal

-- Redirect next query only
.once results.csv
SELECT * FROM users;

Custom Prompts

.prompt 'duckdb> '
.prompt '> '

Configuration

Startup Configuration

Create ~/.duckdbrc for automatic configuration on startup:

.mode duckbox
.prompt 'duck> '

Skip default config:

duckdb -init /dev/null

Use custom config:

duckdb -init my_config.sql

Session Settings

-- Enable/disable progress bar
SET enable_progress_bar = true;

-- Set memory limit
SET memory_limit = '4GB';

-- Set threads
SET threads = 4;

-- Enable external access (for environment variables)
SET enable_external_access = true;

Advanced Features

Environment Variables

-- Get environment variable
SELECT getenv('HOME') AS home;
SELECT getenv('USER') AS username;

Note: Requires enable_external_access = true (default).

Prepared Statements

PREPARE my_query AS
SELECT * FROM users WHERE age > $1;

EXECUTE my_query(25);

Query Explanation

-- Show query plan
EXPLAIN SELECT * FROM users WHERE age > 25;

-- Detailed analysis
EXPLAIN ANALYZE SELECT * FROM users WHERE age > 25;

Sampling

-- Random sample (1%)
SELECT * FROM large_table USING SAMPLE 1%;

-- Sample with seed (reproducible)
SELECT * FROM large_table USING SAMPLE 1000 ROWS (SEED 42);

Creating Views

CREATE VIEW adult_users AS
SELECT * FROM users WHERE age >= 18;

SELECT * FROM adult_users;

DuckDB-Specific SQL Enhancements

Friendly Syntax Features

FROM-first syntax:

FROM users
SELECT name, age
WHERE age > 25;

GROUP BY ALL:

SELECT city, department, COUNT(*)
FROM employees
GROUP BY ALL;

ORDER BY ALL:

SELECT name, age, salary
FROM employees
ORDER BY ALL;

List and Struct Types

-- Create list column
SELECT [1, 2, 3] AS my_list;

-- Create struct column
SELECT {'name': 'John', 'age': 30} AS person;

-- Unnest list
SELECT UNNEST([1, 2, 3]) AS value;

Working with Large Datasets

Out-of-Core Processing

DuckDB can process datasets larger than RAM by using disk-backed databases:

duckdb large_db.duckdb

-- Create table from large CSV
CREATE TABLE large_table AS
SELECT * FROM read_csv('very_large_file.csv');

-- Query with partitioning benefits
SELECT * FROM read_parquet('data/year=*/month=*/*.parquet')
WHERE year = 2024 AND month = 1;

Performance Tips

Use Parquet for large datasets: More efficient than CSV
Leverage partitioning: Hive-partitioned files reduce I/O
Filter early: DuckDB pushes filters down to file reading
Use sampling for exploration: Test queries on samples first
Select only needed columns: Columnar storage makes this very efficient

Practical Examples

Data Cleaning

-- Remove duplicates
CREATE TABLE clean_users AS
SELECT DISTINCT * FROM users;

-- Handle missing values
SELECT
    COALESCE(name, 'Unknown') as name,
    COALESCE(age, 0) as age
FROM users;

-- String cleaning
SELECT
    TRIM(name) as name,
    LOWER(email) as email,
    REGEXP_REPLACE(phone, '[^0-9]', '') as phone
FROM contacts;

Data Transformation

-- Create derived columns
SELECT
    name,
    age,
    CASE
        WHEN age < 18 THEN 'Minor'
        WHEN age < 65 THEN 'Adult'
        ELSE 'Senior'
    END as age_group
FROM users;

-- Date operations
SELECT
    order_date,
    EXTRACT(YEAR FROM order_date) as year,
    EXTRACT(MONTH FROM order_date) as month,
    DATE_TRUNC('month', order_date) as month_start
FROM orders;

Multi-File Analysis

-- Combine multiple CSV files
SELECT * FROM read_csv('sales_*.csv');

-- Join files from different formats
SELECT
    u.name,
    o.order_id,
    o.amount
FROM read_csv('users.csv') u
JOIN read_parquet('orders.parquet') o
    ON u.user_id = o.user_id;

Data Quality Checks

-- Check for duplicates
SELECT user_id, COUNT(*)
FROM users
GROUP BY user_id
HAVING COUNT(*) > 1;

-- Find missing values
SELECT
    COUNT(*) as total_rows,
    COUNT(name) as name_count,
    COUNT(*) - COUNT(name) as name_missing,
    COUNT(email) as email_count,
    COUNT(*) - COUNT(email) as email_missing
FROM users;

-- Value distribution
SELECT age, COUNT(*) as frequency
FROM users
GROUP BY age
ORDER BY frequency DESC;

Integration with Python and R

While this skill focuses on the CLI, DuckDB integrates seamlessly with programming languages:

Python:

import duckdb
con = duckdb.connect('my_db.duckdb')
result = con.execute("SELECT * FROM users").df()  # Returns pandas DataFrame

library(DBI)
con <- dbConnect(duckdb::duckdb(), "my_db.duckdb")
result <- dbGetQuery(con, "SELECT * FROM users")

Troubleshooting

Common Issues

"Table does not exist" errors:

Verify file path is correct
Use absolute paths or ensure working directory is correct
Check file format matches read function (read_csv vs read_parquet)

Memory errors:

Use persistent database instead of :memory:
Reduce dataset size with WHERE clauses or sampling
Set memory limit: SET memory_limit = '2GB';

Performance issues:

Use EXPLAIN to check query plan
Ensure proper indexes on join columns
Use appropriate file formats (Parquet > CSV for large data)
Consider partitioning large files

Character encoding issues:

Specify encoding in read_csv: encoding='UTF-8'
Use normalize_names=true for column name cleanup

Best Practices

Start with exploration: Use LIMIT and sampling for initial analysis
Use appropriate formats: Parquet for large datasets, CSV for small/human-readable
Leverage partitioning: Organize large datasets by date, category, etc.
Test queries incrementally: Build complex queries step-by-step
Save intermediate results: Create tables/views for multi-step analysis
Use CTEs for readability: Break complex queries into named steps
Configure .duckdbrc: Set preferred defaults for consistent experience
Check query plans: Use EXPLAIN for performance-critical queries
Version control SQL: Save reusable queries in .sql files
Document complex queries: Add comments with -- or /* */

Resources and References

Official Documentation

Additional Learning

Grant McDermott's DuckDB Guide
DuckDB Blog for latest features and use cases
Community examples on GitHub

Quick Reference Card

-- Launch CLI
duckdb                           # In-memory
duckdb mydb.duckdb              # Persistent

-- Read files
SELECT * FROM read_csv('file.csv');
SELECT * FROM read_parquet('file.parquet');
SELECT * FROM read_json('file.json');

-- Export data
COPY (...) TO 'output.csv' (HEADER);
COPY (...) TO 'output.parquet' (FORMAT PARQUET);

-- Dot commands
.help                           # Show help
.tables                         # List tables
.schema table_name              # Show schema
.mode csv                       # CSV output
.output file.txt                # Redirect output
.read script.sql                # Run SQL file
.exit                           # Quit

-- Useful settings
SET threads = 4;
SET memory_limit = '4GB';
SET enable_progress_bar = true;

-- Query optimization
EXPLAIN query;                  # Show plan
USING SAMPLE 1%;               # Sample data
GROUP BY ALL;                   # Auto-group