name: dbt-workflow description: "Load at Step 1 before exploring the project. Covers output shape inference, incremental model handling, and what to trust in YML." type: skill
dbt Workflow Skill — Explore and Plan
0. Load Knowledge Base Context FIRST
Project-specific conventions, decisions, and quirks live in the Knowledge Base. Always consult before exploring the project.
Step 0a — get_knowledge
Call once at the start of every task with a 1-line task_description. Returns the always-loaded baseline (org/project understanding + conventions) plus up to 5 task-relevant decisions/debugging/quirks. Treat the returned ## title blocks as authoritative for naming, grain, and known traps.
Step 0b — search_knowledge(query=...)
Call when you hit something unexpected — column meaning unclear, ambiguous join, surprising row count. Pass a 2–4 word query. It is a pure read — no side effects.
Step 0c — propose_knowledge
Call ONLY after you have completed work and verified a finding. Use it to record:
category="decisions"for choices made (auto-accepted).category="debugging"for root-cause traps you hit and resolved (auto-accepted).category="quirks"(scope=connection) for connector/dialect oddities (auto-accepted).- Do NOT propose
understanding— humans only. Do NOT proposeconventionsordomain-rulesas part of automated runs unless explicitly asked (these queue for human review). - Title must be a slug (
^[a-z0-9-]+$, ≤120 chars). Body is markdown. - On duplicate-title: re-call with
overwrite=trueonly if the prior doc is genuinely outdated.
What NOT to do
- Do not paste raw KB text back into model SQL comments — reference the doc title instead.
- Do not call
propose_knowledgemid-exploration — only after success.
1. Output Shape — Read YML Description BEFORE Writing SQL
Extract from description: field:
- ENTITY: "for each customer/driver/order" → one row per qualifying entity
- QUALIFIER: "due to returned items" / "with at least one order" → filter or INNER JOIN
- RANK CONSTRAINT: "top N" / "ranks the top N" → exactly N output rows. Filter
with
ROW_NUMBER() ... <= Nusing a deterministic tiebreaker (add primary key to ORDER BY). Do NOT use DENSE_RANK for filtering — it can return more than N rows. - TEMPORAL SCOPE: "rolling window", "MoM", "WoW", or "month-over-month" in the
description → ONE output date (latest), not all historical dates. Filter with
WHERE date_col = (SELECT MAX(date_col) FROM source). - PERIOD-OVER-PERIOD: If the description mentions MoM, WoW, YoY comparisons
AND you are writing this model from scratch (stub/missing), the comparison column
must be
CAST(NULL AS DOUBLE)— see rule below.
How to read YML descriptions: Descriptions tell you what the data MEANS, not what code to write. Use them to:
- Identify which source columns to use (e.g. "starting from first position on
the grid" → use the
gridcolumn, not qualifying position) - Understand the business meaning of each column
- Pick the right aggregation logic
But do NOT treat descriptions as literal computation instructions. They may describe steady-state behavior that doesn't apply on first build, or use imprecise language. After reading the description, always verify your logic against the actual source data — query the source tables to confirm which columns and values produce the expected result.
Write at top of SQL: -- EXPECTED SHAPE: <row count or formula> — REASON: <quote>
1b. Snapshot Reference Tables BEFORE Building
The starting database contains pre-computed reference tables with correct output.
dbt run will overwrite them. Before your first dbt run, for each target
model that already exists as a table in the database:
SELECT COUNT(*) FROM <model_name>
Record the row count in your -- EXPECTED SHAPE comment. If your rebuilt model's
row count doesn't match after dbt run, you MUST diff against this reference to
find which rows differ.
2. Incremental Models and Period-Over-Period Columns
When a dbt project uses materialized="incremental" models, the project is
designed to accumulate state over multiple runs. On a first run (full refresh,
no prior state), incremental models build from scratch.
If you are writing a new model that includes period-over-period metrics (MoM, WoW, YoY) and the project has not been run incrementally before:
- Output rows for the latest date only:
WHERE date_col = (SELECT MAX(date_col) FROM source) - Period-over-period columns must be
CAST(NULL AS DOUBLE)— there is no prior aggregated state to compare against. Computing these from raw historical data would produce values that don't match the expected first-run output.
If the model SQL already exists (not a stub):
- Read the
{% if is_incremental() %}block to understand the filter logic. - The code outside that block runs on full refresh.
3. What to Trust in YML
Trust YML for: column names (exact match required), column descriptions (what each column represents), ref dependencies (what tables to join).
YML not_null tests on key/dimension columns (IDs, names, dates, categories)
imply a WHERE col IS NOT NULL filter on input data. Do NOT implement this as an
INNER JOIN — use an explicit WHERE clause. not_null on metric/aggregate columns
(counts, averages, totals) just asserts the output shouldn't be NULL — don't filter
inputs for those, fix the aggregation instead.
Do NOT trust YML for: grain/row count. YML unique and not_null tests are
assertions that may be aspirational or wrong. Do NOT use not_null tests to decide
join type.
Derive the grain from these signals (in priority order):
Unique key structure: If the YML defines a unique key or surrogate key column, examine what it's composed of. A key like
concat(ticker, timestamp)means the grain is (ticker, timestamp) — not (ticker, date). The key tells you exactly what combination of values identifies one row.Column list: The columns themselves reveal the grain. If a model has both a header-level key AND a detail-level key as separate columns, the grain is at the detail level.
Upstream model grain: Check existing upstream models that feed into yours. If
bar_executionsproduces one row per (ticker, timestamp), your model that depends on it likely has the same or coarser grain — not finer.Source cardinality: Before writing SQL, query the source tables to check how many rows your model should produce:
SELECT COUNT(DISTINCT key_col) FROM source_tableIf your model produces dramatically fewer rows than upstream, your GROUP BY is too coarse.Sibling model row counts: Check complete models at the same level.
Do NOT deduplicate with ROW_NUMBER to force a unique test to pass — if the
data naturally has multiple rows per key, keep them all.