dlt

Master routing skill for data load tool (dlt). Use this to understand dlt rules and determine which specialized dlt workbench skill to invoke.

By cianfhoghlaim · Updated 6/1/2026

name: dlt
description: Master routing skill for data load tool (dlt). Use this to understand dlt rules, decide which sub-skill to invoke, and apply the Cianfhoghlaim dlt conventions (DuckLake/DuckDB destination, USE_LOCAL_SCRAPES offline fallback, relative imports only, type-safe BAML-driven pipelines, multi-destination fan-out to LanceDB / Memgraph / Graphiti, and Dagster dlt_assets wrapping).

DLT Master Router & Rules (Cianfhoghlaim)

You are operating within the cianfhoghlaim stack which uses dlt (data load tool) for extracting and loading data. This skill is the router + decision tree + project rules for all dlt operations.

1. Project rules (PRESERVED from the original skill, with one fix)

When assuming the data-engineer persona, use these rules:

  • Destinations: dlt.pipeline(..., destination="ducklake") for production (MotherDuck / DuckLake) or destination="duckdb" for local dev. Set USE_DUCKLAKE=true to switch to MotherDuck.
  • Tests: disable plugins during testing by setting DLT_DISABLE_PLUGINS=true.
  • Source location: dlt_sources lives at sruth/oideachais/dlt_sources/ (NOT sruth/oideachais/data_platform/dlt_sources/, which is a deprecated path mentioned in the old skill).
  • Imports: All oideachais.data_platform... absolute imports have been removed; use relative or local dlt_sources imports (e.g. from dlt_sources.ireland...).
  • Offline fallback: USE_LOCAL_SCRAPES=true routes extraction to stedding/ingest_queue/ (the curated local cache) instead of live web scraping (avoids API rate limits and credit drain).
  • Absolute namespaces (per project AGENTS.md): NEVER import oideachais.data_platform... from within the data platform — use relative imports.
  • Ingestion cache (per project AGENTS.md): Test with USE_LOCAL_SCRAPES=true before live web scraping to avoid API rate limits.
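
The destination and offline-fallback rules above can be resolved from the environment in one place. The sketch below is a minimal illustration, not the project's actual code: the helper names (`env_flag`, `resolve_destination`, `resolve_scrape_source`), the set of accepted truthy strings, and the `"live"` marker are assumptions. Only the variable names USE_DUCKLAKE and USE_LOCAL_SCRAPES, the two destination names, and the stedding/ingest_queue/ path come from the rules.

```python
import os
from typing import Mapping

# Assumption: these strings count as "true"; the project may parse flags differently.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Return True when the named environment variable holds a truthy string."""
    return env.get(name, "").strip().lower() in _TRUTHY


def resolve_destination(env: Mapping[str, str] = os.environ) -> str:
    """Pick the dlt destination name: DuckLake when USE_DUCKLAKE is set, else local DuckDB."""
    return "ducklake" if env_flag("USE_DUCKLAKE", env) else "duckdb"


def resolve_scrape_source(env: Mapping[str, str] = os.environ) -> str:
    """Return the local cache path when USE_LOCAL_SCRAPES is set, else the marker 'live'."""
    return "stedding/ingest_queue/" if env_flag("USE_LOCAL_SCRAPES", env) else "live"
```

The resolved name would then be passed on as `dlt.pipeline(..., destination=resolve_destination())`, so local dev and production share one code path.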

2. Decision tree → sub-skill or reference

When tasked with dlt operations or data exploration, use this guide to invoke the most appropriate resource:

Data exploration & notebooks (sruth/oideachais/notebooks)

  • explore-data: Use to analyze datasets and create an analysis_plan.md artifact
  • build-notebook: Use to assemble or regenerate a marimo notebook from an analysis_plan.md

Pipeline creation & maintenance

  • create-filesystem-pipeline: Use to build pipelines that read from local files. Highly relevant for the USE_LOCAL_SCRAPES offline fallback pattern
  • add-incremental-loading: Use to add state and incremental extraction to a filesystem pipeline
  • create-rest-api-pipeline: Use for generic REST / HTTP API sources
  • dlt-init-openapi (3rd-party): Use to auto-generate a verified dlt source from any OpenAPI spec

Type-safe pipelines (BAML → dlt)

  • baml-dlt-integration — see references/type-safe-pipeline.md for the canonical BAML → Pydantic → columns=... pattern

Destinations

  • LanceDB — see references/destinations-lancedb.md (the lancedb_adapter(source, embed=[...]) pattern)
  • Cognee + Memgraph — see references/destinations-cognee-memgraph.md (knowledge-graph destination)
  • Graphiti — see references/destinations-graphiti.md (temporal knowledge graph)

Performance & optimisation

  • Parallelised resources, add_limit, file rotation — see references/performance-optimisation.md

Dagster integration

  • @dlt_assets wrapping — see references/dagster-dlt-assets.md (the canonical pattern for scheduling dlt inside Dagster)

Transformations (dlt → dlt)

  • @dlt.transformer, SQL-based, Ibis aggregation — see references/dlt-transformations.md

Deployment

  • Dagster asset with scheduling — see the dagster skill
  • Serverless webhook (HTTP-triggered) — see references/deploy-gcp-cloud-function-webhook.md
  • Serverless scheduled — see references/deploy-modal.md
  • DAG-transformed downstream — hand off to the sqlmesh skill via sqlmesh init -t dlt --dlt-pipeline <name> dialect (see references/sqlmesh-init.md)

Search & RAG

  • Vectorise for RAG — see references/destinations-lancedb.md (the lancedb_adapter(source, embed=[...]) pattern)
  • Knowledge graph — see references/destinations-cognee-memgraph.md

OpenAPI source generation

  • dlt-init-openapi — see references/openapi-generator.md for auto-generated verified sources from any OpenAPI spec

Browser scraping

  • crawl4ai — see references/crawl4ai-dlt-summary.md (alternative to Firecrawl for JS-heavy sites)
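
The decision tree above amounts to a keyword lookup. The sketch below shows one way to encode it, assuming a simple substring match. The `ROUTES` table and the `route` helper are hypothetical and cover only a subset of the entries; the targets are copied from the tree.

```python
# Task keyword -> sub-skill or reference file, copied from the decision tree.
# The keywords themselves are an assumption made for this illustration.
ROUTES = {
    "explore": "explore-data",
    "notebook": "build-notebook",
    "filesystem": "create-filesystem-pipeline",
    "incremental": "add-incremental-loading",
    "rest api": "create-rest-api-pipeline",
    "openapi": "references/openapi-generator.md",
    "baml": "references/type-safe-pipeline.md",
    "lancedb": "references/destinations-lancedb.md",
    "memgraph": "references/destinations-cognee-memgraph.md",
    "graphiti": "references/destinations-graphiti.md",
    "dagster": "references/dagster-dlt-assets.md",
    "sqlmesh": "references/sqlmesh-init.md",
    "crawl4ai": "references/crawl4ai-dlt-summary.md",
}


def route(task: str) -> list[str]:
    """Return every target whose keyword appears in the task description."""
    lowered = task.lower()
    return [target for keyword, target in ROUTES.items() if keyword in lowered]
```

A task can match several routes, for example a filesystem pipeline that also needs incremental loading, so the helper returns a list rather than a single target.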

3. Project-specific recipes (KCG patterns)

Type-safe pipeline (BAML → dlt → oRPC → MCP)

from typing import Iterator

import dlt
from baml_client import b  # generated BAML client
from baml_client.types import PrimaryLearningOutcome  # auto-generated
from pydantic import BaseModel

# extract_pdf_text is a project helper that is not defined in this skill.

class PrimaryOutcomeRow(BaseModel):
    """Mirror of the BAML PrimaryLearningOutcome class, with dlt column types."""
    stage: str
    curriculum_area: str
    learning_outcome: str

@dlt.resource(
    name="primary_outcomes",
    write_disposition="merge",
    primary_key=["stage", "curriculum_area", "learning_outcome"],
    columns=PrimaryOutcomeRow,  # dlt derives the table schema from the Pydantic model
)
def primary_outcomes(pdf_path: str) -> Iterator[PrimaryOutcomeRow]:
    """Extract primary learning outcomes from an NCCA PDF via BAML."""
    text = extract_pdf_text(pdf_path)
    outcomes = b.ExtractPrimaryLearningOutcomes(text)
    for o in outcomes:
        # Copy the BAML result into the row model that dlt loads
        yield PrimaryOutcomeRow(
            stage=o.stage,
            curriculum_area=o.curriculum_area,
            learning_outcome=o.learning_outcome,
        )

pipeline = dlt.pipeline(
    pipeline_name="ireland_primary_curriculum",
    destination="ducklake",
    dataset_name="oideachais.education.ie",
)
load_info = pipeline.run(primary_outcomes("ncca_primary.pdf"))
print(load_info)

The BAML class is the single source of truth — both the Pydantic BaseModel and the dlt primary_key derive from it.
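
The resource above pairs write_disposition="merge" with a composite primary_key, so reloading the same PDF should not duplicate rows. The sketch below illustrates that intent in plain Python. It is not how dlt implements merge (dlt performs the merge in the destination); `merge_rows` is a hypothetical helper and the sample rows are invented for the illustration.

```python
from typing import Iterable

# Same composite key as the primary_outcomes resource
PRIMARY_KEY = ("stage", "curriculum_area", "learning_outcome")


def merge_rows(
    existing: Iterable[dict],
    incoming: Iterable[dict],
    primary_key: tuple = PRIMARY_KEY,
) -> list[dict]:
    """Upsert incoming rows into existing rows, keyed on the primary-key columns."""
    merged = {}
    for row in list(existing) + list(incoming):
        key = tuple(row[col] for col in primary_key)
        merged[key] = row  # a later row with the same key replaces the earlier one
    return list(merged.values())


# Invented sample data: the second load repeats the first row and adds one more.
first_load = [
    {"stage": "Stage 1", "curriculum_area": "Mathematics", "learning_outcome": "Count to 10"},
]
second_load = first_load + [
    {"stage": "Stage 1", "curriculum_area": "Mathematics", "learning_outcome": "Count to 20"},
]

table = merge_rows(merge_rows([], first_load), second_load)
```

After both loads `table` holds two rows, not three: the repeated row is matched on its key and replaced in place.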

Multi-destination fan-out (DuckDB + LanceDB + Memgraph)

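import dlt
from dlt.destinations.adapters import lancedb_adapter

# extract_pdf_text, chunk_text and cognee_destination are project helpers that are
# not defined in this skill; see references/destinations-cognee-memgraph.md.

@dlt.resource(name="curriculum_chunks")
def chunks(pdf_path: str):
    text = extract_pdf_text(pdf_path)
    for chunk in chunk_text(text):
        yield {"text": chunk, "source": pdf_path}

# A dlt pipeline is bound to one destination, so fan out with one pipeline per destination.
duckdb_pipeline = dlt.pipeline(
    pipeline_name="curriculum_duckdb", destination="duckdb", dataset_name="curriculum"
)
duckdb_pipeline.run(chunks("ncca.pdf"))                                    # → DuckDB

# lancedb_adapter only adds embedding hints; they take effect on a LanceDB destination.
lancedb_pipeline = dlt.pipeline(
    pipeline_name="curriculum_lancedb", destination="lancedb", dataset_name="curriculum"
)
lancedb_pipeline.run(lancedb_adapter(chunks("ncca.pdf"), embed=["text"]))  # → LanceDB

cognee_destination(chunks("ncca.pdf"))                                     # → Cognee

One resource, three destinations. Each pipeline keeps its own state and its own last_trace, so inspect load results per destination.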

Dagster asset wrapping (@dlt_assets)

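import dlt
from dagster import AssetExecutionContext, define_asset_job, schedule
from dagster_dlt import DagsterDltResource, dlt_assets

# ireland_curriculum_source is assumed to come from the local dlt_sources package
# (import omitted here).

@dlt_assets(
    dlt_source=ireland_curriculum_source(),
    dlt_pipeline=dlt.pipeline(
        pipeline_name="ireland_curriculum",
        destination="ducklake",
        dataset_name="oideachais.education.ie",
    ),
)
def ireland_curriculum_assets(context: AssetExecutionContext, dlt_run_resource: DagsterDltResource):
    # The parameter name must match the resource key registered in Definitions
    yield from dlt_run_resource.run(context=context)

# Job that materialises the dlt assets, so the schedule has a target
ireland_curriculum_assets_job = define_asset_job(
    "ireland_curriculum_assets_job", selection=[ireland_curriculum_assets]
)

# Schedule
@schedule(cron_schedule="0 2 * * *", job=ireland_curriculum_assets_job)  # 02:00 UTC daily
def ireland_curriculum_schedule():
    return {}  # empty run config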

See references/dagster-dlt-assets.md for the full pattern with multiprocess_executor, parallel assets, and incremental loading.

4. Performance & anti-patterns

Do:

  • Use parallelized=True for any resource fetching > 1k rows
  • Use add_limit(N) to cap a source for testing
  • Use file rotation for > 1M rows by setting file_max_items or file_max_bytes under [data_writer] in the dlt config; progress="log" only reports progress and does not rotate files
  • Use write_disposition="merge" with an explicit primary_key for upserts
  • Use columns=PydanticModel for type-safe pipelines (the BAML pattern)
  • Pre-validate inputs before the API call (catches bad PDFs early)
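
The pre-validation rule can be as simple as checking the file signature before any text extraction or BAML call. The sketch below is a minimal illustration; `is_pdf_header` and `validate_pdf` are hypothetical helper names, and a real check might also cap file size or try to open the document.

```python
from pathlib import Path

# Every PDF file starts with this signature.
PDF_MAGIC = b"%PDF-"


def is_pdf_header(head: bytes) -> bool:
    """Return True when the leading bytes carry the PDF signature."""
    return head.startswith(PDF_MAGIC)


def validate_pdf(path: str) -> bool:
    """Return True only for an existing, non-empty file that starts like a PDF."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    with p.open("rb") as fh:
        return is_pdf_header(fh.read(len(PDF_MAGIC)))
```

Calling `validate_pdf(pdf_path)` at the top of a resource and raising on False keeps a bad download from spending an LLM call.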

Don't:

  • Fetch all data in a single fetch_all() call (OOM risk for > 1M rows)
  • Use write_disposition="merge" without a primary_key (silently appends duplicates)
  • Import oideachais.data_platform.dlt_sources from within sruth/oideachais/ (use relative imports)
  • Hand-write DDL for the destination (let dlt infer the schema from the resource yield)
  • Run live web scraping without USE_LOCAL_SCRAPES=true first (drains API credits and risks rate limits)
  • Add a BAML client inline in a function (use a named client in baml_src/clients.baml)
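
As an alternative to a single fetch_all() call, a resource can pull and yield one page at a time so memory stays bounded. The sketch below shows the generator shape in plain Python. `iter_pages` and the offset/limit signature of `fetch_page` are assumptions, and `fake_fetch_page` stands in for a real API client.

```python
from typing import Callable, Iterator


def iter_pages(
    fetch_page: Callable[[int, int], list],
    page_size: int = 1000,
) -> Iterator[list]:
    """Yield one page of rows at a time until the source returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield page
        offset += len(page)


# Stand-in for a real API: serves slices of an in-memory list.
_FAKE_ROWS = [{"id": i} for i in range(2500)]


def fake_fetch_page(offset: int, limit: int) -> list:
    return _FAKE_ROWS[offset:offset + limit]


page_sizes = [len(page) for page in iter_pages(fake_fetch_page, page_size=1000)]
```

Inside a dlt resource the same loop becomes `yield from iter_pages(...)`, since a resource may yield lists of rows as well as single rows.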

5. Reference index

The 3 reference files in references/ that were previously orphaned are now linked from here:

New references (added by the sync-skills-from-docs change):

6. Cross-references

  • The dlt skill is consumed by: data-engineer agent (the primary user)
  • The dlt skill collaborates with: baml skill (type-safe pipelines), dagster skill (@dlt_assets wrapping), sqlmesh skill (sqlmesh init -t dlt), lancedb skill (vector destination), cognee skill (knowledge-graph destination), motherduck skill (MotherDuck destination)
  • The dlt skill feeds into: explore-data (analysis_plan.md) and build-notebook (marimo notebooks) for downstream visualisation

7. Examples

See ./examples/ for upstream dlt reference

2026-06 updates (from the upstream-package-monitoring openspec change)

  • dltHub Pro launched 2026-04-14. The Pro tier adds 9,700+ known source contexts that dlt can pull from in one call. The KCG dev plan is tracked by openspec/changes/dlt-pro-source-registry/.

  • Cortex Code (Snowflake's AI assistant, launched ~9 weeks before dltHub Pro) integrates directly with the dlt Pro source registry.

  • ADE-Bench (the AI data-engineer benchmark) reported 65% task success on Snowflake via Cortex Code vs 58% for Claude Code. The paper's key finding: "without the workbench, the agent leaked credentials" — directly validates KCG's strict-secret-hydration mandate (see docs/secrets/secrets_management_plan.md for the Infisical + Locket + mise three-way contract).

  • dlthub upstream monitor: dlthub_blog.yml in infrastructure/firecrawl/monitors/upstream_packages/ is the Firecrawl monitor that detects new source-context additions, ADE-Bench results, and Cortex Code integration updates via the LLM-judge --goal filter. The n8n workflow infrastructure/stacks/n8n/workflows/upstream-blog-monitor.json writes the payload to s3://oideachais-upstream-webhooks/dlthub/...jsonl and triggers the Dagster asset upstream_blog_monitor_ingest.

Notebooks:

  • data_engineering_dlt_small-data-sf-2025_elvis.ipynb — Dremio "small data" workshop (SF 2025): dlt pipelines for small files, REST API ingestion patterns, and DuckDB destination examples.

Install via CLI
npx skills add https://github.com/cianfhoghlaim/kings_college_galway --skill dlt