clickzetta-dw-modeling - SKILL.md Agent Skill

name: clickzetta-dw-modeling description: | ClickZetta Lakehouse data warehouse modeling wizard. Autonomously explores data landscape, then presents justified recommendations — never asks questionnaire-style questions. Covers three layering patterns: Traditional DW (ODS/DWD/DWS/ADS), Medallion (Bronze/Silver/Gold), and hybrid. Pipeline and modeling designed together — DDL and pipeline config output simultaneously. Core principle: use Dynamic Table for aggregation layers, not Materialized View. Trigger when the user says: "data warehouse modeling", "layering design", "ODS/DWD/DWS", "Medallion", "Bronze/Silver/Gold", "fact table", "dimension table", "star schema", "layered architecture", "modeling wizard", "build a data warehouse", "DW architecture", "design my data layers". Keywords: data warehouse, modeling, star schema, medallion, ODS, DWD, DWS, ADS, layering

ClickZetta Data Warehouse Modeling Wizard

See references/modeling-patterns.md for layer definitions, naming conventions, DDL templates, scheduling DAGs, and common pitfalls for all three layering patterns.

Working Mode: Explore First, Then Recommend

Do not ask questionnaire-style questions. Look at the data first, then present justified choices.

The user should answer at most 2 questions:

Choose from the presented options (A/B/C)
Provide information that cannot be inferred (business purpose, query patterns)

Phase 1: Autonomous Data Exploration

On receiving a modeling request, immediately run the following — do not ask the user anything first:

-- Step 1: List all schemas
SHOW SCHEMAS;

-- Step 2: List tables in schemas that look like business data
SHOW TABLES IN <schema>;

-- Step 3: Check table sizes and row counts
SELECT table_schema, table_name, table_type,
       ROUND(bytes/1024.0/1024/1024, 2) AS size_gb,
       row_count, last_modify_time
FROM information_schema.tables
WHERE table_type = 'MANAGED_TABLE'
ORDER BY bytes DESC NULLS LAST LIMIT 20;

-- Step 4: Sample the 2-3 largest tables
SELECT * FROM <schema>.<table> LIMIT 5;

Inference logic:

Observed characteristic	Inference
Table name contains order/user/product/trade	Business DB raw data → ODS/Bronze
Table name contains log/event/track/click	Tracking/log data, high volume, needs partitioning
Table name contains dw/ods/dwd/dws/ads	Already layered — evaluate existing structure
Table name contains tmp/temp/bak	Temporary — exclude from modeling scope
Fields contain _op/_ts/binlog	CDC-synced data
Single table > 10GB	Needs partitioning + bucketing

Phase 2: Present Justified Recommendations

Based on exploration results, present three sections:

1. Data landscape summary (self-summarize, do not ask)

Here's what I found:
- `raw` schema: orders (2.3GB / 12M rows), users (450MB)
  → Fields look like MySQL business DB sync; orders has _op/_ts (CDC ingestion)
- `events` schema: user_events (18GB / 800M rows)
  → Fields include event_time, event_type — tracking/log data
- No existing layered structure found

2. Option choices (A/B or A/B/C, no more than 3)

Based on the above, two directions:

A. Traditional DW (ODS → DWD → DWS → ADS)
   Reuse existing data as ODS. New: DWD (cleanse) + DWS (aggregate, Dynamic Table) + ADS (output)
   Best for: BI reporting as primary use case, well-defined metric system

B. Medallion (Bronze → Silver → Gold)
   Reuse existing data as Bronze. New: Silver (standardize) + Gold (metrics, Dynamic Table)
   Best for: multi-scenario reuse — both BI and data science

3. Ask only one question

What do you primarily use this data for?
- BI reporting (fixed reports, well-defined metrics) → recommend A
- Multi-scenario (reporting + analytics + data science) → recommend B
- Real-time dashboards (minute-level latency) → let me know, the plan will adjust

4. Flag cost considerations when confirming

Dynamic Table refresh frequency: match business freshness needs, not maximum frequency. T+1 → 1 DAY, hourly → 1 HOUR, minute-level → 10~30 MINUTE
Number of layers: each additional DT layer adds refresh cost — evaluate whether all layers are necessary
VCluster size: start small, scale up based on actual consumption

Phase 3: Complete Output After Confirmation

After the user selects a direction, deliver the complete plan in one pass. Read references/modeling-patterns.md for DDL templates and scheduling DAG patterns.

Layer structure

Traditional DW:

Layer	Type	Notes
ODS	Regular table	Raw, no transformation
DWD	Regular table	Cleanse and standardize
DWS	Dynamic Table	Incremental aggregation, auto-refresh
ADS	Dynamic Table	Application-facing

Medallion:

Layer	Type	Notes
Bronze	Regular table	Zero transformation, preserve raw
Silver	Regular table or Dynamic Table	Cleanse and standardize
Gold	Dynamic Table	Aggregated metrics, auto-refresh

Use Dynamic Table for aggregation layers — not Materialized View. Dynamic Table uses CBO incremental compute, only refreshes changed partitions, and supports Time Travel.

Data ingestion pipeline

Based on data source characteristics found during exploration:

Data source characteristic	Recommended pipeline	Skill
Has _op/_ts fields (CDC)	CDC sync	`clickzetta-cdc-sync-pipeline`
Kafka message data	Kafka Pipe	`clickzetta-kafka-ingest-pipeline`
OSS/S3 files	OSS Pipe	`clickzetta-oss-ingest-pipeline`
Regular DB tables (no CDC markers)	Batch sync	`clickzetta-batch-sync-pipeline`

Partition and bucketing

Single table < 1GB:    no partitioning
1GB – 100GB:           PARTITIONED BY (days(event_date))
> 100GB:               PARTITIONED BY (days(event_date))
                       CLUSTERED BY (user_id) INTO 32 BUCKETS

Note: ClickZetta uses PARTITIONED BY (days(col)), not PARTITIONED BY (col).

Separation of DDL and scheduling

Task type	Scheduling config	Studio state
DDL (CREATE TABLE, CREATE SCHEMA)	No Cron, no dependencies	DRAFT
ETL transform / data sync	Cron + upstream dependencies	PUBLISHED
DWS/ADS aggregation (Dynamic Table)	No Studio task — system auto-refreshes	—

Common mistake: configuring Cron on DDL tasks causes repeated execution and scheduling conflicts. DDL tasks must stay DRAFT.

Studio task directory

<domain>_dw/
├── 00_sync_<source>_to_ods    ← data sync (Cron, runs first)
├── 01_ddl_ods                 ← ODS DDL (DRAFT, no scheduling)
├── 02_ddl_dwd                 ← DWD DDL (DRAFT, no scheduling)
├── 03_ddl_dws_ads             ← DWS/ADS dynamic table DDL (DRAFT, no scheduling)
├── 04_transform_ods_to_dwd    ← ODS→DWD cleansing (Cron, depends on 00)
└── 05_dqc_check               ← data quality check (optional)

Data quality checkpoints

Layer	Key checks
ODS/Bronze	NULL rate, CDC _op distribution, row count matches source
DWD/Silver	Uniqueness, LEFT JOIN match rate, key field non-null rate
DWS/Gold/ADS	Metric anomalies, Dynamic Table refresh history = SUCCESS

Delivery checklist

Row counts match expectations at each layer
Dynamic Table refresh history shows SUCCESS
Key field NULL rate within acceptable range
No redundant scheduled tasks for DWS/ADS layer
All DDL tasks in DRAFT state

Core Principles

Explore data first, then recommend — no questionnaires
Give choices, not blank forms — user picks A/B
Dynamic Table for aggregation layers, not Materialized View
Modeling and pipeline together — DDL and pipeline config output simultaneously
ODS/Bronze zero transformation — preserve raw data for traceability
Separate DDL from scheduling — DDL tasks DRAFT, DWS/ADS no scheduled tasks
REFRESH immediately after creating Dynamic Table — reset refresh baseline

Routing

Scenario	Route to
Want to use dbt for modeling instead of raw SQL	`clickzetta-dbt-project-setup`
Need SQL pipeline objects only (DT/MV/Stream/Pipe)	`clickzetta-sql-pipeline-manager`
Need to sync data from MySQL/PostgreSQL in real-time	`clickzetta-cdc-sync-pipeline`
Need to sync data from any DB in batch	`clickzetta-batch-sync-pipeline`
Not sure which ingestion method to use	`clickzetta-data-ingest-pipeline`
Need comprehensive pipeline health check	`clickzetta-pipeline-review`