data-modeling-dimensional

name: data-modeling-dimensional description: > Dimensional data modeling guide for the Fabricks platform — covers the full pipeline from staging through raw, transf, core, and semantic layers; fact and dimension table design; key strategy; SCD types; bridge tables; degenerate dimensions; and config. Use whenever designing or reviewing fact/dimension tables, deciding where logic belongs in the pipeline, choosing SCD1/SCD2, constructing business/surrogate keys, modeling many-to-many relationships, or understanding layer architecture. Trigger on: "design fact table", "design dimension", "SCD type", "slowly changing dimension", "surrogate key", "business key", "bridge table", "layer architecture", "transf vs core", "udf_identity", "udf_key", "__valid_from", "__is_current", "new dim table", "new fact table", "data model", "bronze", "silver", "gold", "semantic layer", "staging".

Dimensional Data Modeling — Fabricks Platform

This guide describes how dimensional modeling works in Fabricks.Runtime, traced from the production sales_order pipeline. It reflects how the platform actually works — not generic theory.

Full Pipeline Architecture

Every entity flows forward through six stages. Never build backward dependencies.

Source System (SQL Server / ERP)
  ↓
bronze/staging/          ← Extract to ADLS via notebooks (_config.onetrade.yml)
  ↓
silver/raw/              ← Register as Delta tables with CDC metadata
  ↓                        (raw.onetrade_stordh, raw.onetrade_storrh, ...)
gold/transf/             ← Business logic, consolidation, historization
  ├─ scd2/              ← Full history: __valid_from / __valid_to per record
  ├─ merger/             ← Union + enrich multiple raw sources (40+ ref joins)
  ├─ fact/              ← Complex calculations: margins, costs, flags, dates
  └─ dim/               ← Lightweight attribute lookups (current-only)
  ↓
gold/core/               ← Conformed layer — all surrogate keys resolved
  ├─ dim/               ← Final dims: __identity, __key, __is_current
  └─ fact/              ← Final facts: all *_sk FKs, 300+ columns
  ↓
semantic/semantic/       ← BI layer — rename, filter, partition, display UDFs
  ├─ fact/
  └─ dim/

What Each Layer Does

Bronze / Staging

Purpose: Extract source data to the data lake.

Databricks notebooks query source systems (SQL Server, APIs, etc.)
Config-driven: one _config.<source>.yml per source system, one entry per source table
Output: Delta files in Azure Data Lake — no SQL table definitions at this stage
Column names at this stage are raw/abbreviated as they come from the source

Silver / Raw

Purpose: Register extracted Delta files as queryable tables with CDC metadata.

Config-driven registration — no CREATE TABLE SQL written by hand
Mode: change_data_capture: scd1 (register current state, track changes)
Every raw table gains framework metadata columns:

__valid_from    -- when this version became active
__valid_to      -- when this version was superseded
__is_current    -- true for the active row
__is_deleted    -- soft delete from source CDC
__key           -- source business key

Source column names are raw/abbreviated (e.g., ordnr, kltnr, artnr, vrmant). No renaming or business logic at this layer.

Gold / Transf / SCD2

Purpose: Capture full attribute history for slowly changing entities.

Joins header + line tables using temporal validity windows
Produces one row per version of a record (not one row per current record)
Adds __operation (upsert/delete) and preserves __valid_from/__valid_to
Output: transf.scd2_<entity> tables
Used as input to merger/ — not consumed directly by core/

Gold / Transf / Merger

Purpose: Consolidate multiple raw sources, rename columns, enrich with reference data.

This is where source abbreviations become business names (ordnr → order_id).

UNION ALL of all source variants for the same entity
Adds __source column to identify origin (e.g., "sws", "onprem", "calenso")
Joins to 40+ reference tables to add derived attributes:
- article master → product_group_l1_id, brand_id, supplier_id
- customer master → customer_to_bill_id, cluster_id
- business unit mapping → business_unit_id
Adds derived flags: is_transport, is_special_order, is_direct_order
Output: transf.merger_<entity> (~190 columns for sales_order)

No surrogate keys yet. Business IDs only.

Gold / Transf / Fact

Purpose: All complex business calculations. The heaviest layer.

Multi-CTE pipeline (450+ lines for sales_order)
Typical CTEs: stock transfer correction → liquidation flags → date corrections → account assignment → cost calculations → margin → due date logic → bonus/IVP
Adds measures that require cross-joining multiple reference tables:
- account_profit_id, account_loss_id (depends on order_type + article class)
- standard_cost, margin (supplier price lookup)
- due_date, overdue_days, is_due_date_compliant (working-day calendar logic)
- is_order_backlog, is_order_invoiced (status-code logic)
Output: transf.fact_<entity> (~170 columns)

Still no surrogate keys. All joins use business IDs.

Gold / Transf / Dim

Purpose: Lightweight current-state lookups that don't need full history.

Simple deduplication + attribute selection
Uses QUALIFY row_number() over (...) = 1 to get one row per key
Output: small dimension tables used by core/fact/ for enrichment

Gold / Core / Dim

Purpose: Final conformed dimensions with stable surrogate keys.

Adds __identity (surrogate key) via udf_identity_v2()
Adds __key (composite business key string)
SCD2 tables split into dim_<entity> (current-only) and dim_<entity>_hist (all versions)
__is_current = true on current row; historical rows accessible via _hist table

Gold / Core / Fact

Purpose: Resolve all business keys to surrogate keys. Final analytical table.

Joins transf.fact_<entity> to all dimension tables
Converts every *_id → *_sk using udf_identity_v2()
Adds temporal joins for SCD2 dimensions (join on ID + date range)
Adds any remaining aggregates that require cross-row logic (e.g., days_to_status_change)
Validated by min_rows: check in config
Output: core.fact_<entity> (300+ columns for sales_order)

Semantic / Semantic

Purpose: BI-facing presentation layer only. Zero business logic.

Filters to relevant records (e.g., invoiced orders only, last 5 years or backlog)
Renames columns to display names (with backtick-quoted spaces: `Order Nr`)
Applies display UDFs: udf_combi_field(), udf_partition_from_date_sk_v2()
Conditionally nulls keys that don't apply: if(is_invoiced, sales_date_sk, null)
Declares partitioning: partition_by: [partition_year, _partition_v2]
Declares parents: dependencies in config for correct build order

Rule of Thumb: Where Does This Logic Belong?

What you're doing	Layer
Source extraction config	`bronze/staging/`
Registering a new source table	`silver/raw/` config
Track full attribute history	`gold/transf/scd2/`
Union multiple source variants	`gold/transf/merger/`
Rename source columns to business names	`gold/transf/merger/`
Enrich with reference/lookup data	`gold/transf/merger/` (if simple) or `gold/transf/fact/` (if complex)
Calculate margins, costs, dates, status flags	`gold/transf/fact/`
Lightweight current-state lookup	`gold/transf/dim/`
Resolve surrogate keys	`gold/core/fact/`
Join to conformed dimensions	`gold/core/fact/`
Conformed dim with surrogate key	`gold/core/dim/`
Column renames, BI filters, partitioning	`semantic/semantic/`

Put logic in the earliest layer where all required data is available. If you find yourself adding a join in core/ that isn't for key resolution, it probably belongs in transf/fact/.

Key Strategy

Three key types exist for every dimension entity. Understand all three before building.

1. Business Key (`_key` in dim, `_bk` in fact)

The natural identifier from the source system. Built using udf_key() when composite.

-- Simple: single source field
cu.no_ as __key                          -- in transf/scd2/customer.sql

-- Composite: multiple fields joined with * separator, nulls become -1
udf_key(array(
    fact.article_id::string,
    fact.row_type_id::string
)) as article_key                        -- in core/fact/sales_order.sql

Stored on the fact table as <dim>_bk for debugging (e.g., item_bk, customer_bk)
If the business key is a concatenation of multiple source fields, keep it denormalized on the fact — don't collapse it
Business keys in dimensions are named _key (e.g., item_key, cost_center_key)

2. Surrogate Key (`_sk` / `__identity`)

The numeric foreign key used for joins and aggregation. Generated by udf_identity_v2().

udf_identity_v2(fact.article_key, 'article') as article_sk
udf_identity_v2(fact.customer_id, 'customer') as customer_sk

udf_identity_v2(key, table_name) dispatches differently per entity — some use numeric offset ranges, others hash, others have special logic (dates, clusters). Never compute surrogates manually — always use the UDF so the range strategy is consistent.

Surrogate key ranges (deliberate, not random):

Entity	Range	Strategy
`customer`	1–150,000	`udf_identity_numeric(key, 150000)`
`article`	1–4,390,000	`udf_identity_numeric(key, 4390000)`
`date`	YYYYMMDD format	`date_format(key, "yyyyMMdd")`
`time`	HHmm format	`date_format(key, "HHmm")`
Default	xxhash64	`xxhash64(key)`

Ranges are deliberate — they let you infer entity type from a surrogate value, which helps debug cross-table join issues.

3. Historized Surrogate Key (`_hsk`)

Points to a specific historical version of a SCD2 dimension record. Used when the fact needs to reflect what the dimension looked like at a point in time.

udf_identity_v2(deh.__identity, 'employee_hist') as employee_hist_sk

Use _hsk (not _sk) as the suffix when the join is temporal. This makes it clear that the FK points into a historized table, not the current-state view.

Sentinel Values

Value	Meaning
`-1`	Business key was NULL in the source — unknown/unmatched dimension member
`-2`	Surrogate key join failed — dimension record does not exist yet
Special positive value	Hardcoded for specific business cases (e.g., `business_unit_sk = 5` for transport orders)

Use udf_sentinel(business_key, surrogate_key) for optional dimension joins:

udf_sentinel(concat(fact.article_id, fact.location_id), asi.__identity)
    as article_stock_information_sk

Fact Table Design

Grain

Define the grain before writing a single line of SQL. The grain is the lowest level of detail in the fact — one row represents exactly one thing (e.g., one sales order line). Every measure and dimension must be consistent with that grain.

What goes on a fact table

Column type	Example	Notes
Surrogate keys	`customer_sk`, `article_sk`	One per dimension — used for joins in BI
Historized SK	`employee_hist_sk`	When dimension is SCD2 and temporal accuracy matters
Business keys	`customer_bk`, `item_bk`	Keep for debugging alongside surrogate
Degenerate dimensions	`row_type_id`, `order_status_id`	Low-cardinality operational codes — no dim table needed
Measures	`sales_amount`, `cost_amount`	Additive facts
Flags	`is_inter_company`, `is_transfer_order`	Boolean, `is_` prefix
Date keys	`sales_date_sk`, `creation_date_sk`	Integer, joins to `dim_date`

Surrogate key resolution pattern (in `core/fact/`)

-- Step 1: carry the business key forward from transf
fact.customer_id,

-- Step 2: build composite key if needed
udf_key(array(fact.article_id::string, fact.row_type_id::string)) as article_key,

-- Step 3: resolve to surrogate
udf_identity_v2(fact.customer_id, 'customer') as customer_sk,
udf_identity_v2(fact.article_key, 'article') as article_sk,

-- Step 4: LEFT JOIN to dim for attributes you need on the fact (rare)
left join core.dim_article da
    on da.id = fact.article_id
    and da.row_type_id = fact.row_type_id

All surrogate key resolution happens in core/fact/, not in transf/ or semantic/.

Temporal joins for SCD2 dimensions

When a fact must reflect the dimension state at the time of the transaction:

left join core.dim_employee_hist deh
    on deh.login = fact.sales_agent_login
    and fact.sales_date between deh.__valid_from and deh.__valid_to

Dimension Table Design

SCD Type 1 — current state only

Use when historical accuracy is not required (e.g., reference data, lookup tables). Built in transf/dim/ and core/dim/. One row per business key. The __identity is stable across loads.

-- core/dim/ structure
__key           -- business key string
__identity      -- surrogate key (bigint)
id, name, ...   -- attributes (current values only)

SCD Type 2 — full history

Use when attribute changes must be tracked over time (customers, articles, employees, pricing). Built in transf/scd2/. Every version of a record gets its own row.

-- transf/scd2/ output structure
__key           -- business key
__identity      -- surrogate for this specific version
__valid_from    -- start of this version's validity (timestamp)
__valid_to      -- end of this version's validity (timestamp)
__is_current    -- true for the active version
__operation     -- 'upsert' or 'delete' (CDC)
__is_deleted    -- soft-delete flag
id, name, ...   -- attributes (values as of this version)

Contiguous date ranges: Every SCD2 dimension must have no gaps between __valid_from and __valid_to. If no data exists yet, create a dummy record at the known start date to ensure the range is contiguous.

Semantic layer filter: Semantic dim views filter to __is_current = true so BI tools see a simple current-state table. The historized table (dim_*_hist) is available separately for temporal analysis.

Bridge Tables (Many-to-Many)

Use a bridge table when an entity has a many-to-many relationship with the fact (e.g., an article can have multiple suppliers; a sales order line can link to multiple articles per supplier).

-- bridge: dim_article_supplier
article_id, supplier_id, __identity, ...attributes

-- on the fact: sentinel handles the "no supplier for this article" case
udf_key(array(fact.article_id::string, fact.location_id::string))
    as article_stock_information_key,
udf_sentinel(concat(fact.article_id, fact.location_id), asi.__identity)
    as article_stock_information_sk,

left join core.dim_article_stock_information asi
    on asi.article_id = fact.article_id
    and asi.location_id = fact.location_id
    and fact.row_type_id = 0   -- bridge only valid for certain row types

Degenerate Dimensions

Store low-cardinality operational codes directly on the fact. Do not create a dimension table for them.

Use degenerate dimensions when:

Small, stable set of values (< ~20)
Operational metadata, not analytical attribute
Not useful as a filter or group-by on its own in BI

-- Good degenerate dimensions on fact_sales_order:
row_type_id        -- 0=regular, 3=service, 4=complement
order_status_id    -- operational status code
sales_group_id     -- code, not a user-facing attribute

Config Files

Every layer folder has a _config.*.yml that controls job execution. Each layer uses slightly different options.

# gold/transf/merger/_config.merger.yml
- job:
    step: transf
    topic: merger
    item: sales_order
    options:
      mode: complete
    check_options:
      post_run: true

# gold/transf/fact/_config.fact.yml
- job:
    step: transf
    topic: fact
    item: sales_order
    options:
      mode: complete
    tags: [sales_order]
    check_options:
      post_run: true

# gold/core/fact/_config.fact.yml
- job:
    step: core
    topic: fact
    item: sales_order
    options:
      mode: complete
    tags: [sales_order]
    check_options:
      post_run: true
      min_rows: 68818365      # fail if row count drops below this

# semantic/semantic/fact/_config.fact.yml
- job:
    step: semantic
    topic: fact
    item: sales_order
    tags: [sqlai_sales]
    options:
      mode: complete
      parents:
        - core.fact_sales_order_invoiced   # explicit dependency
    check_options:
      post_run: true
      min_rows: 100000
    table_options:
      partition_by:
        - partition_year
        - _partition_v2

Option	Values	When to use
`mode: complete`	full rebuild every run	Fact tables, most dimensions
`mode: update`	incremental merge	Large SCD2 tables with CDC
`change_data_capture: scd1`	merge current state	Type 1 dims
`change_data_capture: scd2`	insert new versions	Type 2 dims
`tags`	list of strings	Dependency ordering across jobs
`parents`	list of table names	Explicit upstream dependency declaration
`min_rows`	integer	Fail the job if output drops below this count
`post_run`	true/false	Run `<item>.post_run.sql` after load

Post-run SQL (.post_run.sql) runs after the main load — use for reconciliation checks, derived metrics, or index hints.

Metadata Columns

Column	Type	Meaning
`__key`	string	Business key (composite with `*` separator)
`__identity`	bigint	Surrogate key
`__valid_from`	timestamp	Start of record validity (SCD2)
`__valid_to`	timestamp	End of record validity (SCD2)
`__is_current`	boolean	True for the current/active version (SCD2)
`__is_deleted`	boolean	Soft-delete from source CDC
`__operation`	string	`upsert` or `delete` (CDC type)
`__source`	string	Source system identifier (e.g., `"sws"`, `"onprem"`)
`_metadata`	struct	Framework-managed audit metadata — do not overwrite

Semantic Layer Rules

The semantic layer is purely a presentation layer. It must not contain business logic.

What the semantic layer does:

Renames columns to user-friendly display names (with backticks for spaces)
Filters rows (e.g., WHERE is_order_invoiced = true, exclude inter-company)
Applies display UDFs: udf_combi_field(id, name) → "Geberit Vertriebs AG (00008465)"
Applies partitioning: udf_partition_from_date_sk_v2(sales_date_sk)
Filters SCD2 dims to __is_current = true
Conditionally nulls out keys when they don't apply: if(is_invoiced, sales_date_sk, null)

What the semantic layer does NOT do:

Join to other tables
Calculate measures
Apply business rules

Common UDFs Reference

UDF	Purpose	Example
`udf_key(array<string>)`	Build composite business key, `*` separator, `-1` for nulls	`udf_key(array(article_id::string, row_type_id::string))`
`udf_identity_v2(key, table)`	Resolve business key to surrogate	`udf_identity_v2(customer_id, 'customer')`
`udf_sentinel(bk, sk)`	Handle nullable dim joins: -1 if bk null, -2 if sk null	`udf_sentinel(fact.location_id, loc.__identity)`
`udf_combi_field(id, name)`	Semantic display: `"Name (ID)"`	`udf_combi_field(customer_id, customer_name)`
`udf_partition_from_date_sk_v2(date_sk)`	Convert date_sk to partition column	In semantic layer only
`udf_time_identity(timestamp)`	Get numeric time key	For time dimension FK