dlt-expert

Expert guidance for dlt (data load tool), the Python library for building data pipelines. Create pipelines from REST APIs, SQL databases, cloud storage, and Python data structures to destinations like DuckDB, BigQuery, Snowflake, and Postgres. Covers rest_api_source, RESTClient, authentication, pagination, incremental loading, schema evolution, and pipeline deployment. Use when working with dlt, data pipelines, ETL, ELT, data loading, REST API integration, or when user mentions dlt, data load tool, pipeline creation, or data extraction.

By DecisioNaut, updated 2/18/2026

name: dlt-expert
description: Expert guidance for dlt (data load tool), the Python library for building data pipelines. Create pipelines from REST APIs, SQL databases, cloud storage, and Python data structures to destinations like DuckDB, BigQuery, Snowflake, and Postgres. Covers rest_api_source, RESTClient, authentication, pagination, incremental loading, schema evolution, and pipeline deployment. Use when working with dlt, data pipelines, ETL, ELT, data loading, REST API integration, or when user mentions dlt, data load tool, pipeline creation, or data extraction.
license: Apache-2.0
compatibility: Requires Python 3.9-3.14, pip or uv package manager

dlt Expert

Expert guidance for building data pipelines with dlt (data load tool), the Python library that automates tedious data loading tasks.

Core Concepts

What is dlt?

dlt (data load tool) is a mature, open-source Python library for building data pipelines. It's:

  • A library, not a platform - Add it to your code; it respects existing workflows
  • Lightweight - No backends, containers, or black boxes required
  • Pythonic - Clean interfaces, human-readable formats, no side effects
  • LLM-native - Built from the ground up to work with AI assistants

Key Features:

  • Schema inference and evolution
  • Automatic data normalization for nested structures
  • Incremental loading with state management
  • 40+ verified sources and destinations
  • Supports Python 3.9-3.14
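
To make "automatic data normalization for nested structures" more concrete, here is a dependency-free sketch of the general idea: nested dictionaries become flattened columns joined by a double underscore, and nested lists become rows in a child table. The function name `flatten_record` and the simplified `_parent_id` link column are assumptions made for illustration; dlt's own normalizer generates its own keys and covers many more cases.

```python
from typing import Any, Dict, List


def flatten_record(
    record: Dict[str, Any],
    table: str,
    tables: Dict[str, List[dict]],
    parent_id: Any = None,
) -> None:
    """Flatten one record into `tables`, splitting nested lists into child tables."""
    row: Dict[str, Any] = {}
    if parent_id is not None:
        row["_parent_id"] = parent_id  # simplified link back to the parent row

    def walk(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            # Nested dicts become columns named parent__child
            for key, inner in value.items():
                walk(f"{prefix}__{key}" if prefix else key, inner)
        elif isinstance(value, list):
            # Nested lists become rows in a child table named table__field
            child_table = f"{table}__{prefix}"
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                flatten_record(child, child_table, tables, parent_id=record.get("id"))
        else:
            row[prefix] = value

    for key, value in record.items():
        walk(key, value)
    tables.setdefault(table, []).append(row)


tables: Dict[str, List[dict]] = {}
flatten_record(
    {"id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["a", "b"]},
    "users",
    tables,
)
print(tables["users"])        # one flat row with an address__city column
print(tables["users__tags"])  # one child row per list element
```

The point of the sketch is only the shape of the output: one wide parent table plus one child table per nested list.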

dlt Principles

Core philosophy:

  1. Multiply, don't add - dlt does more work so users do less
  2. No black boxes - Everything is transparent and inspectable
  3. Respect existing workflows - Integrate seamlessly with existing code
  4. Mature and trusted - Used by thousands of engineers in production

Quick Start

Installation

pip install dlt
# or with uv
uv add dlt

Simple Example

Load chess player data from an API:

import dlt
from dlt.sources.helpers import requests

# Create pipeline
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='duckdb',
    dataset_name='player_data'
)

# Grab data from API
data = []
for player in ['magnuscarlsen', 'rpragchess']:
    response = requests.get(f'https://api.chess.com/pub/player/{player}')
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load
pipeline.run(data, table_name='player')

Building Data Pipelines

Step 1: Choose Your Approach

dlt offers multiple patterns based on your data source:

For REST APIs:

  • Use rest_api_source() for standard APIs (declarative)
  • Use RESTClient for complex/custom APIs (programmatic)
  • See REST_API.md for detailed patterns

For SQL Databases:

import dlt
from dlt.sources.sql_database import sql_database

source = sql_database(
    "mysql://user:password@localhost/db",
    table_names=["users", "orders"]
)

For Cloud Storage (S3, GCS, Azure):

from dlt.sources.filesystem import filesystem

resource = filesystem(
    bucket_url="s3://example-bucket",
    file_glob="*.csv"
)

For Python Data Structures:

import dlt

# A resource is a generator function; each yielded item (or list of items) becomes rows
@dlt.resource(write_disposition="replace")
def my_data():
    yield [{"id": 1, "name": "Item 1"}]

Step 2: Configure Authentication

See AUTHENTICATION.md for complete patterns:

  • API Keys (query parameter or header)
  • Bearer Tokens
  • HTTP Basic Auth
  • OAuth 2.0 Client Credentials

Step 3: Handle Pagination

For APIs with multiple pages, dlt provides built-in paginators:

  • JSONLinkPaginator - Follows "next" links
  • PageNumberPaginator - Page-based pagination
  • OffsetPaginator - Offset/limit pagination
  • JSONResponseCursorPaginator - Cursor-based pagination

See PAGINATION.md for details.
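
To show what "following next links" means in practice, here is a minimal, dependency-free sketch of the loop that a next-link paginator performs. The fake API data, the `paginate_by_next_link` helper, and the `next`/`data` key names are all hypothetical; this is a model of the technique, not dlt's implementation.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Fake API: maps a URL to its JSON response (hypothetical data for illustration)
FAKE_API: Dict[str, dict] = {
    "/users": {"data": [{"id": 1}, {"id": 2}], "next": "/users?page=2"},
    "/users?page=2": {"data": [{"id": 3}], "next": None},
}


def paginate_by_next_link(
    fetch: Callable[[str], dict],
    start_url: str,
    next_key: str = "next",
    data_key: str = "data",
) -> Iterator[List[dict]]:
    """Yield one page of records at a time until the response has no next link."""
    url: Optional[str] = start_url
    while url:
        body = fetch(url)
        yield body[data_key]
        url = body.get(next_key)  # None or missing ends the loop


pages = list(paginate_by_next_link(FAKE_API.__getitem__, "/users"))
print(pages)
```

The other paginators follow the same loop shape and differ only in how the next request is derived: by incrementing a page number, advancing an offset, or passing a cursor from the previous response.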

Step 4: Implement Incremental Loading

For large datasets, load only new/changed data:

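import dlt
from dlt.sources.helpers import requests

@dlt.resource(
    write_disposition="append",
    primary_key="id"
)
def events(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # Only fetch records updated since the last run
    params = {"updated_since": updated_at.last_value}
    # Placeholder endpoint and parameter name; replace with your API's
    response = requests.get("https://api.example.com/events", params=params)
    response.raise_for_status()
    yield response.json()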

See INCREMENTAL_LOADING.md for advanced patterns.
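
The mechanism behind incremental loading is a stored cursor value that survives between runs. The sketch below models that idea with a plain dictionary as the state store. The `load_incrementally` helper and the sample records are hypothetical, and the strict `>` comparison is a simplification; dlt keeps the cursor in pipeline state and handles records at the boundary itself.

```python
from typing import Dict, Iterable, List


def load_incrementally(
    records: Iterable[dict],
    state: Dict[str, str],
    cursor_field: str = "updated_at",
    initial_value: str = "2024-01-01T00:00:00Z",
) -> List[dict]:
    """Return only records newer than the stored cursor, then advance the cursor."""
    last_value = state.get("last_value", initial_value)
    # ISO-8601 timestamps in the same format compare correctly as strings
    new_records = [r for r in records if r[cursor_field] > last_value]
    if new_records:
        state["last_value"] = max(r[cursor_field] for r in new_records)
    return new_records


state: Dict[str, str] = {}
api_data = [
    {"id": 1, "updated_at": "2024-02-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-03-01T00:00:00Z"},
]
first_run = load_incrementally(api_data, state)   # both records are new

api_data.append({"id": 3, "updated_at": "2024-04-01T00:00:00Z"})
second_run = load_incrementally(api_data, state)  # only the record added since
print(len(first_run), len(second_run), state["last_value"])
```

Filtering on the client side, as here, still downloads everything; passing the cursor to the API as a query parameter is what actually reduces the data fetched.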

Step 5: Choose Destination

Common destinations:

  • DuckDB - Local analytics, fast testing
  • BigQuery - Google Cloud warehouse
  • Snowflake - Enterprise data warehouse
  • Postgres - PostgreSQL database
  • Redshift - AWS data warehouse

See DESTINATIONS.md for configuration details.

Step 6: Run Pipeline

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset"
)

info = pipeline.run(source)
print(f"Loaded {info}")

Write Dispositions

Control how data is loaded:

| Disposition | Behavior | Use Case |
|-------------|----------|----------|
| replace | Truncate and reload all data | Full snapshots, dimension tables |
| append | Add new records without deduplication | Immutable logs, events |
| merge | Upsert based on primary key | Updating records, CDC |
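
The difference between the three dispositions is easiest to see on a tiny in-memory table. The following sketch is a simplified model only: the `apply_load` helper and the sample rows are hypothetical, and a real destination performs these operations in SQL or its own storage format.

```python
from typing import Dict, List


def apply_load(
    table: List[dict], batch: List[dict], disposition: str, primary_key: str = "id"
) -> List[dict]:
    """Return the table contents after loading `batch` with the given disposition."""
    if disposition == "replace":
        return list(batch)      # existing rows are discarded
    if disposition == "append":
        return table + batch    # rows are added as-is, duplicates included
    if disposition == "merge":
        merged: Dict[object, dict] = {row[primary_key]: row for row in table}
        for row in batch:
            merged[row[primary_key]] = row  # update on key match, insert otherwise
        return list(merged.values())
    raise ValueError(f"Unknown write disposition: {disposition}")


existing = [{"id": 1, "name": "old"}, {"id": 2, "name": "keep"}]
batch = [{"id": 1, "name": "new"}, {"id": 3, "name": "added"}]

for disposition in ("replace", "append", "merge"):
    print(disposition, apply_load(existing, batch, disposition))
```

With these inputs, replace leaves two rows, append leaves four (id 1 appears twice), and merge leaves three with id 1 updated.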

Secrets Management

Never hardcode credentials! Use .dlt/secrets.toml:

# .dlt/secrets.toml
api_key = "your_api_key_here"

[sources.my_api]
base_url = "https://api.example.com"
api_key = "source_specific_key"

[sources.my_api.credentials]
client_id = "your_client_id"
client_secret = "your_client_secret"

Access in code:

api_key = dlt.secrets["api_key"]
# Nested keys use dotted paths that mirror the TOML sections
source_api_key = dlt.secrets["sources.my_api.api_key"]

Environment variables: dlt automatically maps env vars:

SOURCES__MY_API__API_KEY=your_key
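
The naming rule in the example above is mechanical: the dotted key path is upper-cased and its parts are joined with double underscores. A small sketch of that conversion, where `secret_key_to_env_var` is a hypothetical helper written for illustration and not part of dlt:

```python
def secret_key_to_env_var(key_path: str) -> str:
    """Convert a dotted secrets path into the matching environment variable name."""
    return "__".join(part.upper() for part in key_path.split("."))


print(secret_key_to_env_var("sources.my_api.api_key"))
print(secret_key_to_env_var("sources.my_api.credentials.client_id"))
```

This is handy when translating a working .dlt/secrets.toml into the environment variables a deployment target expects.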

Testing and Inspection

Test Locally with DuckDB

pipeline = dlt.pipeline(
    pipeline_name="test_pipeline",
    destination="duckdb",
    dataset_name="test_data"
)

# Run with limited data (add_limit caps how many items each resource yields)
info = pipeline.run(source.add_limit(10))

Inspect Data

# Open a SQL client on the pipeline's destination
with pipeline.sql_client() as client:
    # execute_sql returns the result rows as a list
    rows = client.execute_sql("SELECT COUNT(*) FROM users")
    print(f"Total users: {rows[0][0]}")

CLI Commands

# View schema
dlt pipeline <name> schema

# Show data
dlt pipeline <name> show

# Inspect pipeline
dlt pipeline <name> info

Common Patterns

Pattern 1: Simple REST API Pipeline

For standard REST APIs with predictable structure:

import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com",
        "auth": {
            "type": "bearer",
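            # Read the token explicitly; dlt.secrets.value only works as a function argument default
            "token": dlt.secrets["sources.my_api.api_key"]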
        }
    },
    "resources": [
        {
            "name": "users",
            "endpoint": {
                "path": "users",
                "paginator": "json_link",
                "data_selector": "data"
            }
        }
    ]
})

pipeline = dlt.pipeline(
    pipeline_name="api_pipeline",
    destination="duckdb"
)
pipeline.run(source)

Pattern 2: Custom REST API with RESTClient

For complex APIs needing custom logic:

import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import JSONLinkPaginator

@dlt.resource(write_disposition="replace")
def users():
    client = RESTClient(base_url="https://api.example.com")

    # Follow the next-page URL found at pagination.next in each response
    for page in client.paginate(
        "/users",
        paginator=JSONLinkPaginator(next_url_path="pagination.next")
    ):
        yield page

@dlt.source
def my_api_source():
    return [users()]

Pattern 3: Incremental Loading with Merge

For updating existing records:

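import dlt
from dlt.sources.helpers.rest_client import RESTClient

@dlt.resource(
    write_disposition="merge",
    primary_key="id"
)
def users(
    # initial_value gives the first run a cursor; without it last_value is None
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):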
    client = RESTClient(base_url="https://api.example.com")
    
    for page in client.paginate(
        "/users",
        params={"updated_since": updated_at.last_value}
    ):
        yield page

Development Workflow

1. Start with DuckDB Destination

Fast local iteration:

pipeline = dlt.pipeline(destination="duckdb")

2. Test with Limited Data

source.add_limit(10)  # Stop each resource after 10 yielded items (pages)

3. Verify Schema

# Check inferred schema
dlt pipeline <name> schema

4. Test Incremental Loading

Run pipeline twice to verify state:

# First run
pipeline.run(source)

# Second run - should only load new data
pipeline.run(source)

5. Deploy to Production

Common deployment options:

  • Airflow - Orchestration with DAGs
  • AWS Lambda - Serverless functions
  • Google Cloud Functions - Serverless on GCP
  • Any Python environment - dlt runs anywhere Python runs

See DEPLOYMENT.md for deployment patterns.

Troubleshooting

Authentication Failures:

  • Verify secrets in .dlt/secrets.toml
  • Check token expiration for OAuth
  • Ensure correct auth type (bearer vs api_key)

Pagination Issues:

  • Inspect API response structure
  • Verify paginator configuration matches API
  • Test with add_limit(10) first

Schema Evolution:

  • dlt handles schema changes automatically
  • Use schema contracts for strict validation
  • See SCHEMA_EVOLUTION.md
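
As a rough mental model of "handles schema changes automatically", the sketch below infers a column list from the first batch of rows and extends it when a later batch introduces a new field. The `evolve_schema` helper is hypothetical and records Python type names rather than dlt data types; it leaves out type conflicts and schema contracts entirely.

```python
from typing import Any, Dict, Iterable


def evolve_schema(schema: Dict[str, str], rows: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Return a copy of `schema` extended with any columns first seen in `rows`."""
    evolved = dict(schema)
    for row in rows:
        for column, value in row.items():
            if value is None:
                continue  # a null carries no type information
            # Existing columns keep their type; new ones are added
            evolved.setdefault(column, type(value).__name__)
    return evolved


schema_v1 = evolve_schema({}, [{"id": 1, "name": "Ada"}])
schema_v2 = evolve_schema(schema_v1, [{"id": 2, "name": "Bob", "signup_ts": 1700000000.5}])
print(schema_v1)
print(schema_v2)
```

A schema contract is, in this picture, a rule that decides whether the new `signup_ts` column is accepted, dropped, or treated as an error.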

Performance:

  • Use incremental loading for large datasets
  • Configure appropriate batch sizes
  • Consider parallelization for independent resources

Best Practices

  1. Start Simple - Use rest_api_source() for standard APIs
  2. Use RESTClient for Custom Logic - When you need more control
  3. Always Use Secrets Management - Never hardcode credentials
  4. Test Locally with DuckDB - Fast iteration and debugging
  5. Implement Incremental Loading - For large or frequently updated datasets
  6. Choose Appropriate Write Disposition - Based on data mutability
  7. Specify Primary Keys - Required for merge disposition
  8. Use Test-Driven Development - Write tests for your pipelines
  9. Follow dlt Principles - No black boxes, respect existing workflows
  10. Leverage Built-ins - Use dlt's auth, paginators, and retry logic

See BEST_PRACTICES.md for comprehensive guidance.

Reference Files

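Detailed documentation for specific topics, as referenced throughout this guide:

  • REST_API.md - REST API patterns
  • AUTHENTICATION.md - Authentication patterns
  • PAGINATION.md - Paginator details
  • INCREMENTAL_LOADING.md - Advanced incremental loading patterns
  • DESTINATIONS.md - Destination configuration
  • SCHEMA_EVOLUTION.md - Schema changes and contracts
  • DEPLOYMENT.md - Deployment patterns
  • BEST_PRACTICES.md - Comprehensive best-practice guidance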

Contributing

Improvements welcome! Check the dlt repository for contribution guidelines:

  • Bugfixes and improvements are always welcome
  • New destinations unlikely to be merged (high maintenance cost)
  • Focus on SQLAlchemy destination for new database dialects
Install via CLI
npx skills add https://github.com/DecisioNaut/dlt-expert --skill dlt-expert