init-dlthub-workspace - SKILL.md Agent Skill

name: init-dlthub-workspace description: ALWAYS read and follow this skill before acting. setup

setup

On new session verify: is uv available? is Python running in a uv venv? uv run dlthub --version? If anything is missing, set it up in place:
- Preferred — you (the agent) run uvx dlthub-init@latest. It is non-interactive and AI-aware, so an agent can run it directly. This is also how you set up a clean new dlthub project (uvx dlthub-init@latest <dir> scaffolds into a new directory; bare uvx dlthub-init@latest sets up in place). It scaffolds a dlthub workspace with AI support, collision-safe, in one step. Re-check uv run dlthub ai status when done.
- Fallback (if dlthub-init is unavailable or errors) — run uvx --from "dlt[hub]" dlthub init (equivalent to uv init + uv add "dlt[hub]" + uv run dlthub init), then uv run dlthub ai init. Re-check uv run dlthub ai status when done.
Onboarding exception — only when the user asks to be onboarded or to be taught how to use dltHub (e.g. "onboard me to dltHub", "I want to learn how to use dltHub"): point them to uvx dlthub-start@latest. It scaffolds a fresh playground workspace (installs uv if needed, syncs dlt[hub]) — an onboarding/playground experience, not where production workflows should be built. Do not suggest it just because prerequisites are missing in a project; for that, use the in-place setup above.
- NEVER run uvx dlthub-start yourself, and do NOT use ! mode for it. It must be run by a human because it requires interaction for authentication; it only works in a real terminal — ! mode does not work for it. Ask the user to run uvx dlthub-start@latest in their own terminal, then re-check uv run dlthub ai status once they confirm it finished. (For agent-driven setup of a clean new project, use uvx dlthub-init@latest above instead.)

communication

Before each major step, briefly explain to the user what you are about to do and why, in one sentence.
After completing a major step, summarize what was accomplished and clearly present the most relevant next action to the user.

how we work

You are a data engineering agent that builds pipelines, transformations and deploys them with dlthub.
You build pipelines for others, so understanding the context of your work is required.
use web search: Strongly prefer authoritative references ie. use stripe web site to learn about stripe api. avoid 3rd party resellers and proxies

dlthub reference

read OSS docs index : https://dlthub.com/docs/llms.txt and use it to find docs relevant for given task
read dlthub docs index: https://dlthub.com/docs/hub/llms.txt for dlthub related information (deployment, transformations, data quality)

dltHub workspace

ALWAYS run all commands with cwd in the project root. dlthub uses cwd to find .dlt location ie. uv run python pipelines/my_pipeline.py.
use uv run to run anything Python
ALWAYS pass --non-interactive when running dlthub commands (e.g. uv run dlthub --non-interactive pipeline init ...). This prevents prompts that block execution.
PREFER dlt-workspace-mcp mcp server over using cli for data inspection, secrets handling and pipeline debugging. If an MCP tool call fails more than 2 times in a row, stop retrying and fall back to the equivalent dlthub ai CLI command instead.
ALWAYS VERIFY workspace with uv run dlthub ai status when session starts

command line interface

use command line to inspect pipelines, load packages and run traces POST MORTEM: https://dlthub.com/docs/hub/command-line-interface.md
use dlthub local for scripts, pipelines, jobs present in local environment/machine. this is similar to former dlt command
use bare dlthub for pipelines, jobs, logs, runs deployed on dltHub platform

handle secrets with care!

NEVER read user secrets from any file containing secrets.toml.
NEVER run shell commands that output secret values into the conversation (e.g. gh auth token, env | grep KEY, printenv SECRET, cat credentials.json, aws configure get). If a secret appears in conversation context it is compromised — do not copy or use it.
USE dlt-workspace-mcp secrets tools (secrets_list, secrets_view_redacted, secrets_update_fragment) when credentials need to be configured, checked, or debugged. Fall back to dlthub ai secrets CLI if MCP is not connected. See setup-secrets skill for the full workflow.
DO NOT WRITE CODE THAT READS SECRET FILES — no toml.load(), Path().read_text(), open(), or any other file access on *.secrets.toml. Use dlt.secrets["key"] in Python instead (see setup-secrets skill, section 6 on how to write SAFE scripts).
REFUSE to handle secrets that user ie. pasted you to context windows. Instead mention secrets handling practices user should adopt.

toolkits

toolkits are data engineering workflows automated via skills, commands and rules.
each toolkit has a workflow rule that you must follow. you must start with workflow entry skill if available
workflows end with handover to other workflows, also the dlthub-router skill may be helpful
NEVER assume a handover target toolkit is installed — before following any handover, always run uv run dlthub --non-interactive ai toolkit install <toolkit-name> first, then invoke the entry skill. Do NOT run web research, manual code edits but use the entry skill.
DO NOT start data engineering work if no workflow toolkit is installed - see dlthub ai status output!

toolkits — match intent → install → open the entry skill (no discovery round-trip needed)

This index is authoritative for shipped toolkits. Match the user's intent, run the install command, then hand over to the entry skill. No MCP call needed for these.

intent                                                  → toolkit                | install                                                            | entry skill
ingest from REST / HTTP APIs — production-grade pipeline → rest-api-pipeline     | dlthub --non-interactive ai toolkit install rest-api-pipeline      | find-source
ingest from SQL databases (Postgres, MySQL, Snowflake…) → sql-database-pipeline  | dlthub --non-interactive ai toolkit install sql-database-pipeline  | find-source
load files (CSV/Parquet/JSONL) from disk/S3/GCS/Azure/SFTP → filesystem-pipeline | dlthub --non-interactive ai toolkit install filesystem-pipeline    | create-filesystem-pipeline
explore & profile loaded data, build charts & dashboards → data-exploration      | dlthub --non-interactive ai toolkit install data-exploration       | explore-data
transform & model loaded data (dimensional / Kimball)   → transformations        | dlthub --non-interactive ai toolkit install transformations        | annotate-sources
add data quality checks (column expectations, validation rules) → data-quality   | dlthub --non-interactive ai toolkit install data-quality           | setup-data-quality
deploy / schedule pipelines on the dltHub platform      → dlthub-platform        | dlthub --non-interactive ai toolkit install dlthub-platform        | setup-runtime
guided end-to-end tour, ingest to dashboard (uses the real toolkits) → quick-start | dlthub --non-interactive ai toolkit install quick-start          | quick-start
test/try dlthub end-to-end — minimal pipeline + educational test deploy, NOT production → one-shot       | dlthub --non-interactive ai toolkit install one-shot               | deploy-run-sample-pipeline

one-shot vs rest-api-pipeline: one-shot is for testing / trying dlthub / onboarding / a quick demo — a minimal single-endpoint, row-limited pipeline on local DuckDB plus an educational test deploy. Educational examples only, NOT production-grade. For a real or production REST pipeline (auth, incremental, multiple endpoints, production deploy), use rest-api-pipeline. quick-start is the guided tour that walks the real toolkits end-to-end.
After installing, run uv run dlthub ai status to confirm, then continue in the same session — load the new toolkit's entry skill + workflow rule via toolkit_info (or read the installed files) and proceed. No restart needed (toolkits reuse the already-running dlt-workspace-mcp); don't lose the user's context.
The dlthub-router skill wraps this flow and is the fallback for needs not covered above (it uses live list_toolkits to discover newer toolkits).
DO NOT start data engineering work if no workflow toolkit is installed.