adhoc-exploration-notebooks - SKILL.md Agent Skill

name: adhoc-exploration-notebooks description: Create dual-format ad-hoc exploration notebooks for Databricks workspace (.py) and local Jupyter (.ipynb) with Databricks Connect. Use when building data exploration tools, debugging data quality issues, or creating interactive analysis notebooks. Supports widget fallback patterns, helper functions for table discovery, and proper Spark session initialization for both environments. clients: [ide_cli, genie_code] bundle_resource: jobs deploy_verb: bundle_deploy deploy_note: "The local-Jupyter path (Databricks Connect + profile-based CLI auth) is the IDE/local branch only. On Genie Code there is no local toolchain: explore directly in the workspace on serverless compute (pre-authenticated — no profile setup, no local Connect session), and deploy the exploration job via `bundle deploy --target dev` + `bundle run` through runDatabricksCli (see `skills/genie-code-environment`). On Genie Code, write generated notebooks under the cloned repo root (`{REPO_ROOT}` = `state_file_root` from `skills/vibecoding-state`), not a bare relative path — relative paths resolve against the page CWD (§8)." coverage: full metadata: author: prashanth subrahmanyam version: "2.0" domain: exploration role: utility standalone: true last_verified: "2026-02-07" volatility: low upstream_sources: [] # Internal notebook patterns

Ad-Hoc Exploration Notebook Patterns

Overview

Every data product should include exploration notebooks in two formats to support different development workflows: Databricks workspace format (.py) for interactive use in Databricks UI, and Jupyter format (.ipynb) for local development with Databricks Connect. Magic commands (%pip, %sql, dbutils.library.restartPython()) only work in Databricks workspace, not with Databricks Connect or local execution.

Quick Start (30 minutes)

Goal: Create interactive notebooks for data exploration (works both in Databricks and locally).

What You'll Create:

adhoc_exploration.py — Databricks workspace version (with magic commands)
adhoc_exploration.ipynb — Local Jupyter version (Databricks Connect)
Standard helper functions (list tables, explore, check quality, compare, show properties)

Before You Begin: Fill in the Requirements Template with your catalog, schemas, and key tables.

Fast Track: Databricks Workspace

# 1. Upload adhoc_exploration.py to Databricks workspace
# 2. Run setup cells (installs databricks-sdk)
# 3. Use helper functions:

list_tables(catalog, "my_schema")
explore_table(catalog, "my_schema", "fact_sales_daily")
check_data_quality(catalog, "my_schema", "fact_sales_daily")
compare_tables(catalog, "silver_schema", "silver_transactions",
               catalog, "gold_schema", "fact_sales_daily")

Fast Track: Local Jupyter

# 1. Install requirements
pip install -r requirements.txt

# 2. Configure Databricks Connect — authenticate per PRE-REQUISITES §11 (IDE/CLI only; Genie Code skips this path — see callout below)

# 3. Launch Jupyter and open adhoc_exploration.ipynb
jupyter lab

Genie Code: there is no local toolchain — skip the Jupyter / Databricks Connect path and explore directly in the workspace on serverless compute (pre-authenticated). Write generated notebooks under the cloned repo root ({REPO_ROOT}), never /tmp. See skills/genie-code-environment §8–§9.

Key Differences:

Databricks (.py): Uses magic commands (%pip, %sql), dbutils.widgets.text() for params
Local (.ipynb): Pure Python, direct variable assignment, requires Databricks Connect

When to Use This Skill

Building data exploration tools for Bronze/Silver/Gold layers
Debugging data quality issues
Creating interactive analysis notebooks
Validating schema changes
Testing query patterns before production
Testing Metric Views and monitoring tables
Cross-layer data flow validation (Bronze → Silver → Gold)

Critical Rules

1. Dual-Format Requirement

CRITICAL: Create both .py (Databricks workspace) and .ipynb (local Jupyter) formats.

Why: Magic commands (%pip, %sql, dbutils.library.restartPython()) only work in Databricks workspace, not with Databricks Connect.

2. Spark Session Initialization

CRITICAL: Must specify either .serverless() or .clusterId() — default will fail:

# ✅ CORRECT
spark = DatabricksSession.builder.serverless().profile("profile").getOrCreate()
# OR
spark = DatabricksSession.builder.clusterId("cluster-id").getOrCreate()

# ❌ WRONG
spark = DatabricksSession.builder.getOrCreate()  # Exception: Cluster id or serverless required

3. Widget Fallback Pattern

Always use widget fallback pattern for cross-environment compatibility:

# Default values
catalog = "default_catalog"

# Try widgets (Databricks workspace), fall back to defaults
try:
    dbutils.widgets.text("catalog", catalog, "Catalog")
    catalog = dbutils.widgets.get("catalog")
except Exception:
    # Widgets not available (Databricks Connect)
    pass

Quick Reference

File Structure

src/exploration/
├── adhoc_exploration.py              # Databricks workspace format
├── adhoc_exploration.ipynb           # Local Jupyter format
├── requirements.txt                  # Python dependencies
├── README.md                         # Full documentation
└── QUICKSTART.md                     # 5-minute setup guide

Standard Helper Functions

Every exploration notebook should include:

Table Discovery: list_tables(catalog_name, schema_name) — List all tables with metadata
Table Exploration: explore_table(catalog_name, schema_name, table_name, limit) — Schema + sample data + row count
Data Quality: check_data_quality(catalog_name, schema_name, table_name) — Null counts, duplicates
Comparison: compare_tables(catalog1, schema1, table1, catalog2, schema2, table2) — Compare across layers
Governance: show_table_properties(catalog_name, schema_name, table_name) — View metadata, tags, TBLPROPERTIES

See Notebook Patterns for complete implementations.

Core Patterns

Databricks Workspace Format (.py)

Key Features:

# Databricks notebook source header
# COMMAND ---------- cell separators
# MAGIC %md for markdown cells
Widget support with fallback pattern
Magic commands work (%pip, %sql)

See Notebook Patterns for complete template.

Local Jupyter Format (.ipynb)

Key Features:

Standard Jupyter notebook format
No Databricks-specific headers
No magic commands
Direct variable assignment (no widgets)
Setup instructions in markdown cell

Requirements:

databricks-sdk[notebook]>=0.28.0
pyspark>=3.5.0

Multiple Execution Options

Option 1: Run via Databricks UI

Open the notebook in Databricks workspace
Attach to a cluster
Adjust widget parameters if needed
Run cells interactively

Option 2: Run via Asset Bundle

Critical: Include .py extension in notebook path:

notebook_task:
  notebook_path: ../src/exploration/adhoc_exploration.py  # ✅ Include .py extension
  base_parameters:
    catalog: ${var.catalog}

Client note: IDE runs these in a terminal; Genie Code runs the databricks bundle … commands via runDatabricksCli (be on the bundle's page; generated files anchor to {REPO_ROOT}). See skills/genie-code-environment.

# Deploy and run
databricks bundle deploy -t dev
databricks bundle run -t dev adhoc_exploration_job

See Notebook Patterns for complete YAML example.

Option 3: Direct Execution via %run

# In any Databricks notebook, import and use the functions:
%run ./exploration/adhoc_exploration

# Then use the helper functions
list_tables(catalog, bronze_schema)
df = explore_table(catalog, gold_schema, "fact_sales_daily")

Common Mistakes to Avoid

❌ Mistake 1: Missing .serverless() or .clusterId()

# ❌ DON'T: Use default builder
spark = DatabricksSession.builder.getOrCreate()
# Exception: Cluster id or serverless are required

✅ Correct: Specify execution environment

spark = DatabricksSession.builder.serverless().profile("profile").getOrCreate()

❌ Mistake 2: Forgetting File Extension in Bundle

# ❌ DON'T: Omit .py extension
notebook_path: ../src/exploration/adhoc_exploration
# Error: notebook not found

✅ Correct: Include .py extension

notebook_path: ../src/exploration/adhoc_exploration.py

❌ Mistake 3: Using Magic Commands Locally

# ❌ DON'T: Use magic commands
%pip install 'databricks-sdk[notebook]'
# SyntaxError: invalid syntax

✅ Correct: Install packages before running

pip install -r requirements.txt

Validation Checklist

File Structure

Both .py and .ipynb formats created
requirements.txt with dependencies
README.md with full documentation
QUICKSTART.md with 5-minute setup

.py File (Databricks Workspace)

Has # Databricks notebook source header
Uses # COMMAND ---------- cell separators
Markdown cells use # MAGIC %md
Initializes Spark with .serverless() or .clusterId()
Uses widget fallback pattern
All helper functions defined

.ipynb File (Local Jupyter)

Standard Jupyter notebook format
No Databricks-specific headers
No magic commands
Direct variable assignment (no widgets)
Setup instructions in markdown cell
Same helper functions as .py

Helper Functions

list_tables() — Table discovery
explore_table() — Schema + samples
check_data_quality() — DQ checks
compare_tables() — Layer comparison
show_table_properties() — Governance metadata

Reference Files

Notebook Patterns

references/notebook-patterns.md — Complete code patterns for both .py and .ipynb formats, helper function implementations, widget fallback patterns, Spark session initialization, dev/prod configuration, %run import patterns, and Asset Bundle integration.

Analysis Workflows

references/analysis-workflows.md — Step-by-step analysis workflows for data discovery, data quality investigation, schema validation, Metric View testing, monitoring metrics review, and cross-layer comparison.

Tips & Troubleshooting

references/tips-and-troubleshooting.md — Visualization tips, summary statistics, performance analysis, extending notebooks with custom functions and matplotlib/plotly, troubleshooting common errors (table not found, permission denied, slow queries, connection issues), and best practices.

Assets

Exploration Notebook Template

assets/templates/exploration-notebook.py — Starter notebook template with all standard helper functions and widget fallback patterns.

Requirements Template

assets/templates/requirements-template.md — Fill-in-the-blank planning template for catalog configuration, environment settings, tables to explore, DQ focus areas, and cross-layer comparisons.

References

Official Databricks Documentation

Related Patterns

Databricks Asset Bundles — Deployment patterns
Python File Imports — Sharing code between notebooks
DLT Expectations — DQ rules and validation
Metric Views — Metric View testing patterns
Observability Setup — Monitoring table patterns