name: 05-schema-validation
description: Runtime schema validation patterns for Gold layer merge scripts. Use when creating merge scripts to ensure DataFrame columns match target DDL schemas, validate column mappings before MERGE operations, and catch schema issues before deployment. Includes the DDL-first workflow, validate_merge_schema() helper, explicit column mapping, DDL schema reader, and pre-deployment validation script. Addresses the "three sources of truth" problem by enforcing DDL as runtime truth.
license: Apache-2.0
clients: [ide_cli, genie_code]
bundle_resource: jobs
deploy_verb: bundle_deploy
deploy_note: "Pattern consumed by the Gold merge jobs (notebook_task); deploy via bundle deploy --target dev (runDatabricksCli on Genie Code)."
coverage: full
metadata:
author: prashanth subrahmanyam
version: "1.0.0"
domain: gold
role: worker
pipeline_stage: 4
pipeline_stage_name: gold-implementation
called_by:
- gold-layer-setup
standalone: true
last_verified: "2026-02-19"
volatility: low
upstream_sources: []
Merge Schema Validation (Runtime)
Overview
Gold layer merge scripts must align perfectly with DDL-defined schemas. Mismatches between source DataFrame columns and target table schemas cause 59% of Gold layer bugs. This skill provides runtime validation patterns to catch schema issues before deployment.
Key Principle: DDL (setup script) is the runtime source of truth, not YAML design specifications.
Companion skill: For design-time cross-validation of YAML ↔ ERD ↔ Lineage, see design-workers/07-design-validation/SKILL.md.
When to Use This Skill
- Creating Gold layer merge scripts (Silver → Gold)
- Ensuring DataFrame columns match target table schemas
- Validating column mappings before MERGE operations
- Debugging
UNRESOLVED_COLUMNerrors in merge operations - Running pre-deployment schema validation
Core Problem: Three Sources of Truth
Gold layer development involves three schema definitions:
- YAML Design (
gold_layer_design/yaml/*.yaml) - Human-readable specifications - DDL Scripts (
src/gold/setup/*.py) - CREATE TABLE statements - Merge Scripts (
src/gold/merge/*.py) - DataFrame transformations
Problem: Manual transcription between these creates mismatches.
Common Failure Pattern
# YAML Design: user_email
# DDL Script: email (renamed during implementation)
# Merge Script: .select("user_email") # ❌ Still references old name!
# Runtime Error: [UNRESOLVED_COLUMN]
Root Cause: DDL was updated, but merge script wasn't.
Critical Rules
Rule 1: DDL is Runtime Source of Truth
Always develop merge scripts against actual created tables, not YAML designs.
Workflow:
1. YAML Design (Planning)
↓
2. DDL Creation (Reality) ← SOURCE OF TRUTH
↓
3. Run Setup Job (Create Tables)
↓
4. Validate Schema (Read actual table)
↓
5. Write Merge Script (Against actual schema)
↓
6. Pre-Deployment Validation (Compare DataFrame vs DDL)
Rule 2: Always Validate Before Merge
Use schema validation helper before every merge operation:
from helpers import validate_merge_schema
updates_df = prepare_transformations(silver_df)
validate_merge_schema(
spark, updates_df, catalog, gold_schema, "dim_user",
raise_on_mismatch=True
)
delta_gold.merge(updates_df, ...)
Rule 3: Explicit Column Mapping
Never assume Silver column names match Gold DDL. Always use explicit mapping:
updates_df = (
silver_df
.withColumn("store_id", col("store_number"))
.withColumn("product_id", col("upc_code"))
.select("store_id", "product_id", ...)
)
Quick Reference
Schema Validation Helper
validate_merge_schema(
spark: SparkSession,
updates_df: DataFrame,
catalog: str,
schema: str,
table_name: str,
raise_on_mismatch: bool = True
) -> dict
Returns validation result with:
valid: Boolean indicating if schemas matchmissing_in_df: Columns in DDL but not in DataFrameextra_in_df: Columns in DataFrame but not in DDLtype_mismatches: Columns with type differences
See references/validation-patterns.md for complete implementation with full code examples.
Pre-Merge Validation Checklist
Before writing any Gold merge script:
Pre-Development:
- Read YAML design to understand intent
- Run setup job to create actual tables
- Read DDL schema from actual table (
DESCRIBE table) - Document column mappings (Silver → Gold)
- Identify any renamed columns
During Development:
- Use explicit
.withColumn()for all renames - Use
.select()with explicit column list (no*) - Add comments documenting mappings
- Cast data types explicitly if DDL differs from source
- Include all DDL columns in final DataFrame
Pre-Deployment:
- Run schema validation helper
- Verify no extra columns in DataFrame
- Verify no missing columns from DDL
- Verify data types match DDL
- Test merge with small sample data
Core Patterns
Pattern 1: Schema Inspector Helper
Validate DataFrame against target table schema before merge:
from helpers import validate_merge_schema
validate_merge_schema(spark, updates_df, catalog, schema, table_name)
Benefits:
- Catches mismatches before Delta merge attempt
- Clear error messages with exact column differences
- Prevents cryptic Delta errors
- Validates data types, not just column names
Pattern 2: Explicit Column Mapping
Document and implement explicit Silver → Gold column mappings:
def merge_dim_store(spark, catalog, silver_schema, gold_schema):
"""
Column Mapping (Silver → Gold):
- store_number → store_number (same)
- company_rcn → company_retail_control_number (renamed)
- processed_timestamp → record_updated_timestamp (renamed)
"""
updates_df = (
silver_df
.withColumn("company_retail_control_number", col("company_rcn"))
.withColumn("record_updated_timestamp", col("processed_timestamp"))
.select("store_number", "company_retail_control_number", ...)
)
Pattern 3: DDL Schema Reader
Read actual table schema before writing merge script:
schema = spark.table(f"{catalog}.{gold_schema}.dim_user").schema
print([field.name for field in schema.fields])
updates_df = df.select("email") # Matches DDL, not YAML
Common Mistakes to Avoid
Mistake 1: Trusting YAML Over DDL
# ❌ BAD: Using YAML-defined column name
updates_df = df.select("user_email") # Doesn't match DDL!
Fix: Always verify against actual DDL:
# ✅ GOOD: Check actual table schema first
schema = spark.table("catalog.schema.dim_user").schema
updates_df = df.select("email")
Mistake 2: Implicit Column Selection
# ❌ BAD: Selecting with *
updates_df = silver_df.select("*") # Includes all Silver columns
Fix: Explicit selection:
# ✅ GOOD: Only select columns that exist in Gold DDL
updates_df = silver_df.select("column1", "column2", "column3")
Mistake 3: No Pre-Merge Validation
# ❌ BAD: Merge without validation
delta_gold.merge(updates_df, ...) # Fails at runtime with cryptic error
Fix: Validate first:
# ✅ GOOD: Validate before merge
validate_merge_schema(spark, updates_df, catalog, schema, table_name)
delta_gold.merge(updates_df, ...)
Reference Files
- Validation Patterns — Complete
validate_merge_schema()implementation, DDL schema reader, column mapping patterns, pre-merge validation workflows, error handling, and testing patterns - Validation Rules — Complete schema inspector implementation with error handling
- Validate Schema Script — Pre-deployment validation script for comparing all merge DataFrames against DDL schemas
Scripts
validate_schema.py
Pre-deployment validation script that compares all merge DataFrames against DDL schemas. Run before databricks bundle deploy to catch schema mismatches.
Usage:
python scripts/validate_schema.py \
jane_doe_catalog \
dev_jane_doe_system_gold
See scripts/validate_schema.py for complete implementation.
Related Skills
- Design Consistency Validation (Design):
design-workers/07-design-validation/SKILL.md— YAML↔ERD↔Lineage cross-validation during design - Grain Validation:
pipeline-workers/04-grain-validation/SKILL.md— Pre-merge grain validation for fact tables - Gold Merge Deduplication:
pipeline-workers/03-deduplication/SKILL.md— Deduplication patterns before MERGE
References
- Delta Lake Merge - Official documentation
- Rule Improvement Case Study - Complete Gold layer debugging methodology
Pattern Origin: 88 bugs across 7 Gold domains, 26% schema mismatch errors Key Lesson: DDL is runtime truth, not YAML. Always validate DataFrame against actual table schema before merge. Impact: Prevents 48% of schema-related bugs through pre-merge validation
Inputs
- Validated DataFrame (from
pipeline-workers/04-grain-validationor directly after deduplication for dimensions) - Target Gold table DDL schema (read via
spark.table(gold_table).schema) - Gold table name from YAML (
meta["table_name"])
Outputs
- Schema validation report: valid/invalid, missing columns, extra columns, type mismatches
- If valid: DataFrame is safe to MERGE
- If invalid: detailed error report with exact column differences
Pipeline Notes to Carry Forward
- DDL is runtime source of truth — not YAML design specifications
- Missing columns in DataFrame = likely a Silver-to-Gold mapping issue
- Extra columns in DataFrame = likely forgot to
.select()only Gold columns - Type mismatches = may need explicit
.cast()in column mapping
Next Step
If schema validation passes, execute the MERGE operation. If it fails, fix column mappings and re-validate before merging.