---
name: databricks-autonomous-operations
description: >
  End-to-end autonomous deployment and operations skill. Deploys Databricks Asset Bundles,
  runs jobs and pipelines, polls for completion, diagnoses failures, applies fixes, redeploys,
  and verifies, all without human intervention. Also serves as SDK/CLI/REST API reference.
  Use when deploying bundles, running jobs, monitoring pipelines, troubleshooting ANY failure,
  or operating in a self-healing deploy-fix-redeploy cycle. Triggers on "deploy", "bundle deploy",
  "run job", "run pipeline", "make it work", "job failed", "troubleshoot", "fix and redeploy".
metadata:
  author: prashanth subrahmanyam
  version: "3.2"
  domain: operations
  role: shared
  used_by_stages: [1, 2, 3, 4, 5, 6, 7, 8, 9]
  called_by:
    - bronze/00-bronze-layer-setup
    - silver/00-silver-layer-setup
    - gold/01-gold-layer-setup
    - semantic-layer/00-semantic-layer-setup
    - monitoring/00-observability-setup
    - ml/00-ml-pipeline-setup
    - genai-agents/00-course-orchestrator
  dependencies:
    - admin/self-improvement
  triggers:
    - "deploy"
    - "bundle deploy"
    - "bundle run"
    - "run job"
    - "run pipeline"
    - "make it work"
    - "deploy and run"
    - "deploy and fix"
    - "job failed"
    - "monitor job"
    - "troubleshoot"
    - "get-run"
    - "run output"
    - "task failed"
    - "redeploy"
    - "job status"
    - "pipeline failed"
    - "DLT error"
    - "cluster issue"
    - "monitor failed"
    - "alert failed"
    - "deploy failed"
    - "databricks-sdk"
    - "databricks-connect"
    - "CLI"
    - "INTERNAL_ERROR"
    - "UNRESOLVED_COLUMN"
    - "ModuleNotFoundError"
    - "TABLE_OR_VIEW_NOT_FOUND"
    - "ResourceAlreadyExists"
    - "PERMISSION_DENIED"
    - "self-heal"
    - "autonomous"
  last_verified: "2026-06-02"
  volatility: medium
  clients: [ide_cli, genie_code]  # CLI surface accessed via local shell (IDE) or runDatabricksCli (Genie Code)
  deploy_verb: "bundle deploy --target dev"  # deploy mechanics owned by databricks-asset-bundles (the spine)
  deploy_note: "operations/CLI reference: on Genie Code every databricks command routes via runDatabricksCli FROM THE BUNDLE EDITOR (dp_bundle_root); a blocked bundle deploy/run is a page-context signal, NEVER substitute SDK/REST creation (RULE_10); see genie-code-environment"
  coverage: all_stages
  upstream_sources:
    - name: "ai-dev-kit"
      repo: "databricks-solutions/ai-dev-kit"
      paths:
        - "databricks-skills/databricks-python-sdk/SKILL.md"
        - "databricks-skills/databricks-jobs/SKILL.md"
      relationship: "extended"
      last_synced: "2026-02-20"
      sync_commit: "latest"
---
Databricks Autonomous Operations
1. Overview
Context Loading Rule: Load this skill at deployment time, not during planning. During earlier phases, retain only this 3-step summary in working memory:
Deploy (bundle deploy) → Poll (get-run / pipelines get) → Verify (get-run-output).
Full skill activation should happen when the first bundle deploy or bundle run is invoked.
This skill is both an SDK/CLI/Connect reference and an autonomous operations playbook. It teaches the AI agent to operate as an SRE — independently deploying, monitoring, diagnosing failures, applying fixes, redeploying, and verifying results across all Databricks resource types.
Core Loop: Deploy → Poll → Diagnose → Fix → Redeploy → Verify (max 3 iterations before escalation to user).
When to Activate This Skill
- Deploying or running Databricks Asset Bundles (bundle deploy, bundle run)
- Monitoring job or pipeline runs for completion
- Troubleshooting ANY failure: jobs, DLT pipelines, monitors, alerts, clusters, Genie Spaces
- Using the Databricks Python SDK, CLI, Connect, or REST API
- Encountering error messages from Databricks services
- Operating in a self-healing deploy-fix-redeploy cycle
- Checking job/task/pipeline status or retrieving run output
SDK Docs: https://databricks-sdk-py.readthedocs.io/en/latest/
GitHub: https://github.com/databricks/databricks-sdk-py
Runnable Examples: See examples/ directory for complete, copy-pasteable scripts:
- examples/1-authentication.py: All auth patterns (env vars, profiles, Azure SP, AccountClient, notebook context)
- examples/2-clusters-and-jobs.py: Cluster auto-selection, autoscaling, job CRUD, submit_and_wait, run_now_and_wait
- examples/3-sql-and-warehouses.py: Parameterized queries, chunked results, query_to_dataframe helper
- examples/4-unity-catalog.py: Tables, schemas, catalogs, volumes, file operations, pattern matching
- examples/5-serving-and-vector-search.py: Endpoint creation with TrafficConfig, chat/embedding queries, vector search with filters
- examples/6-autonomous-operations.py: Job monitoring, multi-task output retrieval, failure diagnosis, self-healing loop
2. Environment & Authentication
Client routing (RULE_1/2/4; read once, applies to every command in this skill). This skill is the CLI/SDK operations surface; the same operations run on both clients via different channels. Deploy mechanics are owned by databricks-asset-bundles (the spine); reference it rather than re-deriving them.

- IDE (Cursor): the local databricks CLI; auth via databricks auth login / ~/.databrickscfg; databricks-connect is available for local Spark.
- Genie Code has three execution paths. Try them in order; "blocked on one path ≠ impossible":
  1. runDatabricksCli: the allow-listed, pre-authenticated CLI path (no auth login). The path for bundle validate / summary / deploy --target dev and read verbs. --version / help / auth token / aitools / apps validate / apps manifest are hard-blocked; apps deploy is unreliable here (page-dependent + CWD-defeated). Use a bundle validate behavior probe instead of a numeric --version compare.
  2. Python SDK (WorkspaceClient via executeCode): the most capable path. It bypasses the CLI allow-list and is the reliable way to w.apps.deploy(...), retrieve w.config.token, and poll deployment/run state. Caveat: the SDK has no bundle-deploy equivalent (bundle deploy is a composite client-side operation), so keep bundle deploy on runDatabricksCli.
  3. Native tools (createAsset / readTable / …) for governed asset operations.

No local Spark on Genie Code (use workspace serverless; databricks-connect is IDE-only). Full allow-list / CWD / FUSE / escape-hatch detail is in the genie-code-environment skill; load it on demand. The CLI command examples below run via path 1 on Genie Code (local shell on the IDE); where a verb is CLI-blocked, reach for the SDK (path 2).

⛔ The carve-out the "fall to the SDK" rule does NOT cover: bundle deploy/run and resource creation. "Blocked ≠ impossible, try the SDK" is for read-only / ad-hoc ops (polling, inspecting schemas, lineage, token retrieval, apps deploy). It is NOT a license to substitute the SDK/REST for the bundle. When bundle deploy/run is blocked, that is a page-context signal, not a dead end: the verb is gated to the bundle-folder page, which on Genie Code you reach by opening the "Open in bundle editor" affordance on the dp_bundle_root folder (the bundle editor's CWD is the bundle root, where validate / deploy / run are pre-approved). FIELD-CONFIRMED: the same bundle deploy that returned "blocked by safety guardrails" / "databricks.yml not found" from a file/notebook page returned "Deployment complete!" and bundle run … SUCCESS from the bundle editor.

So the fix is to navigate to the bundle editor, never w.jobs.create() / POST /api/2.1/jobs/create / POST /api/2.0/pipelines / CREATE TABLE via executeCode. Creating jobs/pipelines/tables directly is the RULE_10 authoring-discipline violation: it produces live, un-versioned state that diverges from the bundle and is the exact regression this spine prevents. If bundle deploy/run still fails from the bundle editor, STOP and report the blocker; the SDK/REST creation route is an escape hatch only on explicit operator authorization.

Surface a clickable bundle-editor link:
- file_id = w.workspace.get_status("<dp_bundle_root>/databricks.yml").object_id
- folder_id = w.workspace.get_status("<dp_bundle_root>").object_id
- link = {w.config.host}/editor/files/{file_id}?o={w.get_workspace_id()}&contextId=folder%3A{folder_id}

Detail in genie-code-environment §3/§8.
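The link template above can be sketched as a small helper. This is a minimal illustration, not part of the skill's tooling: the function name `bundle_editor_link` and the sample IDs are hypothetical, and in a real session the IDs would come from `w.workspace.get_status(...).object_id` and `w.get_workspace_id()` as described above.

```python
from urllib.parse import quote


def bundle_editor_link(host: str, workspace_id: int, file_id: int, folder_id: int) -> str:
    """Fill in the bundle-editor URL template from this section (hypothetical helper)."""
    # quote() percent-encodes the colon, so "folder:333" becomes "folder%3A333".
    context = quote(f"folder:{folder_id}", safe="")
    # Stripping a trailing slash from the host is an assumption to avoid "//" in the URL.
    return f"{host.rstrip('/')}/editor/files/{file_id}?o={workspace_id}&contextId={context}"


# Made-up host and IDs, for illustration only.
print(bundle_editor_link("https://example.cloud.databricks.com/", 111, 222, 333))
```

Keeping the URL assembly in one pure function makes it easy to check the encoding without touching the workspace.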
Setup
- SDK: uv pip install databricks-sdk; Connect (IDE-only): uv pip install databricks-connect (not used on Genie Code: serverless, no local Spark)
- CLI version: >= 0.278.0 on the IDE (databricks --version); on Genie Code the version is not introspectable (use a bundle validate behavior probe)
- Config (IDE): ~/.databrickscfg or env vars DATABRICKS_HOST, DATABRICKS_TOKEN; Genie Code is pre-authenticated in-session
Quick Auth
from databricks.sdk import WorkspaceClient
w = WorkspaceClient() # Auto-detect (env, config file, or notebook)
w = WorkspaceClient(profile="MY_PROFILE") # Named profile
- Token expired? (IDE path) databricks auth login --host <url> --profile <name> (N/A on Genie Code, which is pre-authenticated)
- Profile-based CLI: DATABRICKS_CONFIG_PROFILE=<name> databricks <command>
- Full auth patterns (Azure SP, AccountClient, etc.): see references/sdk-api-reference.md
3. SDK API Quick Reference
Full code examples and patterns: references/sdk-api-reference.md
Runnable scripts: examples/ directory (see file listing in Section 1)
| API | Key Operations | Critical Notes |
|---|---|---|
| Clusters | list, get, create_and_wait, ensure_cluster_is_running, start, stop | Use select_spark_version() and select_node_type() |
| Jobs | list, run_now_and_wait, submit_and_wait, get_run, get_run_output, cancel_run | get_run_output needs task run_id, not parent job run_id |
| SQL Execution | execute_statement with wait_timeout | Use StatementParameterListItem for parameterized queries |
| Warehouses | list, start, stop, create_and_wait | Use auto_stop_mins to control costs |
| Unity Catalog | tables.list, tables.get, tables.exists, catalogs.list, schemas.list | list_summaries for fast pattern matching |
| Volumes/Files | files.upload, files.download, files.list_directory_contents | Path format: /Volumes/catalog/schema/volume/file |
| Serving | create_and_wait, query (custom/chat/embeddings), get_open_ai_client | scale_to_zero_enabled for cost control |
| Vector Search | query_index, upsert_data_vector_index, sync_index | Use filters_json for filtered queries |
| Pipelines | list_pipelines, get, start_update, stop_and_wait, list_pipeline_events | Events contain error details for troubleshooting |
| Monitors | quality_monitors.get, delete, update, run_refresh | Update MUST include ALL custom_metrics |
| Alerts V2 | alerts_v2.create_alert, update_alert, delete_alert | Update requires full update_mask |
| Secrets | create_scope, put_secret, get_secret | Use for API keys, tokens |
Critical Patterns
- Async apps (FastAPI): SDK is synchronous; wrap calls with asyncio.to_thread()
- Long-running ops: Use _and_wait() variants with timeout=timedelta(...)
- Error handling: from databricks.sdk.errors import NotFound, PermissionDenied, ResourceAlreadyExists
- Direct REST: w.api_client.do(method="GET", path="/api/2.0/...")
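The async pattern in the first bullet can be sketched without a workspace. In this minimal example `blocking_sdk_call` is a hypothetical stand-in for a synchronous SDK call such as `w.jobs.get_run(run_id)`; only `asyncio.to_thread` and `asyncio.gather` are real library calls.

```python
import asyncio
import time


def blocking_sdk_call(run_id: int) -> dict:
    """Hypothetical stand-in for a synchronous SDK call such as w.jobs.get_run(run_id)."""
    time.sleep(0.05)  # simulates network latency
    return {"run_id": run_id, "life_cycle_state": "TERMINATED"}


async def get_run_async(run_id: int) -> dict:
    # asyncio.to_thread runs the blocking call in a worker thread,
    # so the event loop stays free to serve other requests.
    return await asyncio.to_thread(blocking_sdk_call, run_id)


async def main() -> list:
    # The two lookups overlap instead of running back to back.
    return await asyncio.gather(get_run_async(1), get_run_async(2))


results = asyncio.run(main())
print(results)
```

In a FastAPI handler the same `await asyncio.to_thread(...)` line would wrap the real SDK call, so a slow workspace response does not stall other requests.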
4. CLI Operations Reference
Command Matrix
| Category | Command | Purpose |
|---|---|---|
| Bundle | databricks bundle validate | Pre-deploy validation (catches ~80% of errors) |
| Bundle | databricks bundle deploy -t <target> | Deploy all resources |
| Bundle | databricks bundle run -t <target> <job> | Start a job run |
| Bundle | databricks bundle destroy -t <target> | Cleanup all resources |
| Jobs | databricks jobs get-run <RUN_ID> --output json | Get run status |
| Jobs | databricks jobs get-run-output <TASK_RUN_ID> --output json | Get task notebook output |
| Jobs | databricks jobs list-runs --job-id <JOB_ID> --output json | List recent runs |
| Jobs | databricks jobs cancel-run <RUN_ID> | Cancel running job |
| Pipelines | databricks pipelines get <PIPELINE_ID> --output json | Get pipeline status |
| Pipelines | databricks pipelines get-update <PID> <UID> --output json | Get update details |
| Pipelines | databricks pipelines list-pipeline-events <PID> --output json | Get events/errors |
| Pipelines | databricks pipelines start-update <PIPELINE_ID> | Start pipeline update |
| Clusters | databricks clusters get <CLUSTER_ID> --output json | Get cluster status |
| Clusters | databricks clusters events <CLUSTER_ID> --output json | Get cluster events |
| Warehouses | databricks warehouses get <WH_ID> --output json | Get warehouse status |
| Auth (IDE only) | databricks auth login --host <url> --profile <name> | Re-authenticate (Genie Code is pre-authenticated) |
| Workspace | databricks workspace export <path> --format SOURCE | Export notebook |
| Apps | databricks apps logs <app-name> | App deployment/runtime logs |
Key jq Patterns
See references/cli-jq-patterns.md for the complete catalog. Most critical:
# Job state
databricks jobs get-run <RUN_ID> --output json | jq '.state'
# Task summary for multi-task jobs
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | {task: .task_key, run_id: .run_id, result: .state.result_state}'
# Failed tasks only
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message, url: .run_page_url}'
# Task output (MUST use task run_id, not parent job run_id)
databricks jobs get-run-output <TASK_RUN_ID> --output json | jq -r '.notebook_output.result // "No output"'
5. Autonomous Deploy-Run-Fix Playbook
This is the core workflow. Follow these steps whenever deploying a bundle, running a job, or operating autonomously.
Step 0: Discover Bundle Resources
Before deploying, understand what's in the bundle:
# Read databricks.yml to find targets and included resources
# Then read resource YAML files under resources/ directory
# Identify: jobs (atomic/composite/orchestrator), pipelines, dashboards, alerts
Determine the target (dev, staging, prod) from the user or default to dev.
Step 0.5: Resolve Workspace Identity
Before ANY deployment, confirm you are targeting the correct workspace:
- Check databricks.yml for workspace.host and profile under the target
- Check active profile: databricks auth env --profile <name>, and verify the host matches intent
- If multiple profiles exist in ~/.databrickscfg, do NOT auto-select; present options to the user
- If databricks.yml has no profile set, ask the user which workspace to target
- Scan repo context before creating a new databricks.yml: Glob for **/databricks.yml across the repository. If an existing bundle config specifies host or profile, surface those as the suggested default in your question to the user. Example: "Your existing app config at apps_lakebase/databricks.yml uses profile azure-demo → host https://adb-xxx.azuredatabricks.net/. Should I use that same workspace for this bundle?"
- When bundle validate reports "multiple profiles matched" the host, include the repo-context hint in your question rather than listing all candidates blindly. Example: "Multiple profiles match this host. Your apps_lakebase/databricks.yml uses profile azure-demo. Should I use that, or a different one?"
CLI profile inheritance warning: Raw databricks api calls do NOT inherit the profile from databricks.yml. Always pass -p <profile> explicitly when using databricks api.
If deployed to wrong workspace → see Step 7.5 for cleanup.
Step 1: Validate & Deploy
Client note: IDE runs these in a terminal. Genie Code runs databricks bundle … via runDatabricksCli from the bundle editor: open the "Open in bundle editor" affordance on the dp_bundle_root folder (the icon that appears next to databricks.yml) before deploying; the bundle editor's CWD is the bundle root, where validate / deploy / run are pre-approved. Surface the clickable editor link for the operator (see §2 carve-out). See skills/genie-code-environment §3.
databricks bundle validate -t <target> # Pre-flight — catches ~80% of errors
databricks bundle deploy -t <target> # Deploy all resources
If validate fails: Read the error, fix the YAML (see Section 6 + skills/databricks-asset-bundles skill), re-validate.
If deploy is BLOCKED (≠ failed — Genie Code): symptoms are "blocked by safety guardrails," "not in the
allow-list," or "databricks.yml not found." This is a page-context signal, not a YAML/code defect and not
a job failure. Do not enter the diagnose-fix loop and do not create resources via SDK/REST/CREATE as a
substitute (RULE_10 violation). Navigate to the bundle editor on dp_bundle_root and retry. Only on explicit
operator authorization is the escape hatch (§2) permitted.
If deploy FAILS (executes, returns an error): Common causes: auth expired (403), path resolution errors, invalid task types. See Section 6.
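The blocked-versus-failed distinction can be sketched as a small classifier over the deploy output. This is an illustration under assumptions: the function name `classify_deploy` and its return labels are hypothetical, and the marker strings are simply the symptoms quoted above, so real messages may be worded differently.

```python
# Symptom strings quoted in this section; actual wording in a session may vary.
BLOCKED_MARKERS = (
    "blocked by safety guardrails",
    "not in the allow-list",
    "not in allow-list",
    "databricks.yml not found",
)


def classify_deploy(exit_ok: bool, output: str) -> str:
    """Return 'ok', 'blocked', or 'failed' for one bundle deploy attempt."""
    if exit_ok:
        return "ok"
    text = output.lower()
    if any(marker in text for marker in BLOCKED_MARKERS):
        # Page-context signal: go to the bundle editor, do not start diagnosing.
        return "blocked"
    # The command executed and returned an error: enter the diagnose-fix loop.
    return "failed"


print(classify_deploy(False, "Error: blocked by safety guardrails"))
print(classify_deploy(False, "Error: 403 Invalid access token"))
```

Routing on this label first keeps a blocked deploy from consuming one of the three fix iterations.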
--force clarification: --force handles Terraform state drift only (e.g., resource deleted outside bundle). It does NOT fix API name-uniqueness conflicts. For "pipeline name already used" or "resource already exists": list and delete the conflicting resource, then redeploy without --force.
Step 2: Run
For jobs:
databricks bundle run -t <target> <job_name>
# Output includes a Run URL like: https://<host>/#job/<JOB_ID>/run/<RUN_ID>
# EXTRACT the RUN_ID from this URL — you need it for polling
For DLT pipelines:
databricks bundle run -t <target> <pipeline_name>
# Output includes: Pipeline URL with PIPELINE_ID
# Also shows UPDATE_ID for this specific update
Cursor-specific: Use block_until_ms: 0 to background bundle run immediately. Then read the terminal output file to capture the Run URL. Parse the RUN_ID or PIPELINE_ID from the URL.
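Extracting the IDs from the Run URL can be sketched with a regular expression. The sketch assumes the URL shape shown above (`https://<host>/#job/<JOB_ID>/run/<RUN_ID>`); other CLI versions may print a different shape, in which case the pattern needs adjusting. The function name `parse_run_ids` and the sample host are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the "#job/<JOB_ID>/run/<RUN_ID>" fragment of the Run URL shown above.
RUN_URL = re.compile(r"#job/(?P<job_id>\d+)/run/(?P<run_id>\d+)")


def parse_run_ids(output: str) -> Optional[Tuple[int, int]]:
    """Return (job_id, run_id) from bundle run output, or None if no Run URL is present."""
    match = RUN_URL.search(output)
    if match is None:
        return None
    return int(match["job_id"]), int(match["run_id"])


sample = "Run URL: https://example.cloud.databricks.com/#job/123/run/456"
print(parse_run_ids(sample))
```

Returning None instead of raising lets the caller re-read the terminal output file when the URL has not been printed yet.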
Step 3: Poll Until Terminal State
Jobs — poll with exponential backoff (30s → 60s → 120s):
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.life_cycle_state'
# PENDING → RUNNING → TERMINATED (also: INTERNAL_ERROR, SKIPPED)
When life_cycle_state == TERMINATED, check the result:
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.result_state'
# SUCCESS → go to Step 4a FAILED → go to Step 4b
DLT Pipelines — poll the latest update state:
databricks pipelines get <PIPELINE_ID> --output json | jq -r '.latest_updates[0].state'
# COMPLETED → success FAILED → get events for errors
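The job-side decision in this step can be sketched as a pure function over the two state fields. The function name `next_step` and its return labels are hypothetical. Treating QUEUED, BLOCKED, WAITING_FOR_RETRY, and TERMINATING as non-terminal is an assumption based on the Jobs API state names as I understand them, and routing SKIPPED to diagnosis is a judgment call rather than something this skill prescribes.

```python
from typing import Optional

# States in which the run has not finished yet (partly an assumption, see lead-in).
NON_TERMINAL = {"QUEUED", "PENDING", "RUNNING", "TERMINATING", "BLOCKED", "WAITING_FOR_RETRY"}


def next_step(life_cycle_state: str, result_state: Optional[str] = None) -> str:
    """Map a run's state fields to the next playbook step."""
    if life_cycle_state in NON_TERMINAL:
        return "keep_polling"
    if life_cycle_state == "TERMINATED" and result_state == "SUCCESS":
        return "step_4a_verify"
    # TERMINATED with any other result, INTERNAL_ERROR, SKIPPED, or an unknown state.
    return "step_4b_diagnose"


print(next_step("RUNNING"))
print(next_step("TERMINATED", "SUCCESS"))
print(next_step("INTERNAL_ERROR"))
```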
Step 4a: On SUCCESS — Verify & Report
# For multi-task jobs, verify all tasks succeeded:
databricks jobs get-run <RUN_ID> --output json \
| jq '.tasks[] | {task: .task_key, result: .state.result_state}'
# Get task output (structured JSON from dbutils.notebook.exit):
databricks jobs get-run-output <TASK_RUN_ID> --output json \
| jq -r '.notebook_output.result // "No output"'
Report success to user with the Run URL for verification. If this was part of a multi-job deployment, proceed to the next job.
Step 4b: On FAILURE — Diagnose
CRITICAL for multi-task jobs: get-run-output only works on the task run_id, NOT the parent job run_id.
# Step A: Find failed tasks and their task-level run_ids
databricks jobs get-run <JOB_RUN_ID> --output json \
| jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, run_id: .run_id, error: .state.state_message}'
# Step B: Get detailed output for EACH failed task
databricks jobs get-run-output <TASK_RUN_ID> --output json \
| jq -r '.notebook_output.result // .error // "No output"'
For DLT pipeline failures:
databricks pipelines list-pipeline-events <PID> --output json \
| jq '[.events[] | select(.level == "ERROR")] | .[0:5]'
Match the error against Section 6 (decision tree) and references/error-solution-matrix.md.
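For agents that parse the get-run JSON in Python rather than jq, the Step A filter can be mirrored as below. This is a sketch: `failed_tasks` is a hypothetical helper, and `sample_run` is made-up data shaped like the fields the jq filter reads (`tasks[].task_key`, `run_id`, `state.result_state`, `state.state_message`).

```python
def failed_tasks(run: dict) -> list:
    """Return task key, task-level run_id, and error for every FAILED task."""
    failures = []
    for task in run.get("tasks", []):
        state = task.get("state", {})
        if state.get("result_state") == "FAILED":
            failures.append(
                {
                    "task": task.get("task_key"),
                    "run_id": task.get("run_id"),  # task run_id, the one get-run-output needs
                    "error": state.get("state_message"),
                }
            )
    return failures


# Made-up payload for illustration only.
sample_run = {
    "run_id": 1000,
    "tasks": [
        {"task_key": "setup", "run_id": 1001, "state": {"result_state": "SUCCESS"}},
        {
            "task_key": "merge",
            "run_id": 1002,
            "state": {"result_state": "FAILED", "state_message": "TABLE_OR_VIEW_NOT_FOUND"},
        },
    ],
}
print(failed_tasks(sample_run))
```

Each returned `run_id` is what Step B passes to get-run-output, never the parent run's 1000.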
Step 5: Fix
- Read the source file(s) identified from the stack trace or error message
- Apply fix using editor tools (StrReplace, Write)
- If YAML/config issue → fix the DAB YAML (consult skills/databricks-asset-bundles skill)
- If dependency issue → deploy upstream assets first (see Section 6 ordering)
- Same-class fix rule: Before redeploying, grep ALL files from the same generation batch for the same error pattern. If the bug was in one notebook, check sibling notebooks for identical issues. Fix all instances in one pass.
Step 6: Redeploy & Re-run (Loop)
databricks bundle deploy -t <target> # Redeploy with fix
databricks bundle run -t <target> <job> # Re-run
# → Return to Step 3 (Poll)
Maximum 3 iterations. Track each iteration's error and fix.
Step 7: Escalation (After 3 Failed Attempts)
Present to the user:
- All errors encountered — with run IDs, task keys, error messages
- All fixes attempted — what was changed and why
- Root cause hypothesis — best guess based on evidence
- Run page URLs — direct links to the Databricks UI
Step 7.5: Cleanup After Wrong-Workspace Deploy
If resources were deployed to the wrong workspace:
- Inventory what was deployed: Run databricks bundle summary -t <target> in the wrong workspace
- Destroy cleanly: databricks bundle destroy -t <target> (requires user confirmation)
- Warn about persistent artifacts: Tables, schemas, and data created by jobs are NOT removed by bundle destroy; these must be dropped manually
- Reconcile state: Delete the .databricks/ state directory if switching workspaces in the same repo
- Redeploy to correct workspace: Update profile/target, then restart from Step 0.5
Trigger this step proactively whenever the user says "use profile X" / "switch to workspace Y" / "actually deploy to Z" AND you have already deployed in the current session. Before re-deploying to the new workspace, inventory what is running in the old workspace and offer bundle destroy on it. Do not silently re-deploy — orphaned jobs in the previous workspace are exactly the failure mode this step exists to prevent.
Step 8: Capture Learnings
After every resolved failure (or escalation), trigger admin/self-improvement:
- Add new error → fix mappings to references/error-solution-matrix.md
- Update "Top 10" table in Section 6 if this error is frequent
- Always capture if fix took 2+ iterations or was escalated
6. Master Troubleshooting Decision Tree
Full error-solution tables: references/error-solution-matrix.md (7 categories, 40+ patterns)
DLT-specific troubleshooting: references/dlt-pipeline-troubleshooting.md
Diagnostic SQL queries: references/diagnostic-queries.md
Quick Decision Guide
- Deployment error? → Check auth (403), YAML syntax, task type (notebook_task not python_task), path resolution
- Job runtime error? → Get task-level output (use task run_id), match error to solution matrix
- DLT pipeline error? → Get events via CLI, check upstream tables exist, verify schema
- Monitor/Alert error? → Check ResourceAlreadyExists (delete+recreate), schema existence, update_mask
- Dependency error? → Follow required order: Bronze → Gold Setup → Gold Merge → Semantic → Monitoring → Alerts → Genie
- Cluster error? → Check events via CLI, look for OOM/unreachable/timeout patterns
Top 10 Most Common Errors (Inline)
| Error | Quick Fix |
|---|---|
| ModuleNotFoundError | Add to %pip install or DAB environment spec |
| TABLE_OR_VIEW_NOT_FOUND | Run setup job first; check 3-part catalog.schema.table path |
| DELTA_MULTIPLE_SOURCE_ROW_MATCHING | Deduplicate source before MERGE |
| Invalid access token (403) | IDE: databricks auth login --host <url> --profile <name> (Genie Code is pre-authenticated; re-check the page/profile) |
| ResourceAlreadyExists | Delete + recreate (monitors, alerts) |
| python_task not recognized | Use notebook_task with notebook_path |
| PARSE_SYNTAX_ERROR | Read failing SQL file, fix syntax, redeploy |
| Parameter not found | Use base_parameters dict, not CLI-style parameters |
| run_job_task vs job_task | Use run_job_task (not job_task) |
| Genie INTERNAL_ERROR | Deploy semantic layer (TVFs + Metric Views) first |
| bundle deploy "blocked by safety guardrails" / "databricks.yml not found" (Genie Code) | Page context, not a code bug: open the bundle editor on dp_bundle_root and retry; never substitute SDK/REST job creation (RULE_10) |
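A first-pass lookup against this table can be sketched as a substring match. The sketch covers only the rows whose error text is a stable identifier; `QUICK_FIXES` and `quick_fix` are hypothetical names, and the full matrix in references/error-solution-matrix.md remains the authoritative source.

```python
from typing import Optional

# Subset of the table above: rows keyed by a stable error identifier.
QUICK_FIXES = [
    ("ModuleNotFoundError", "Add to %pip install or DAB environment spec"),
    ("TABLE_OR_VIEW_NOT_FOUND", "Run setup job first; check 3-part catalog.schema.table path"),
    ("DELTA_MULTIPLE_SOURCE_ROW_MATCHING", "Deduplicate source before MERGE"),
    ("ResourceAlreadyExists", "Delete + recreate (monitors, alerts)"),
    ("PARSE_SYNTAX_ERROR", "Read failing SQL file, fix syntax, redeploy"),
]


def quick_fix(error_message: str) -> Optional[str]:
    """Return the first quick fix whose error identifier appears in the message."""
    for pattern, fix in QUICK_FIXES:
        if pattern in error_message:
            return fix
    return None  # not in the inline table: consult the full error-solution matrix


print(quick_fix("ModuleNotFoundError: No module named 'pandas'"))
print(quick_fix("some unfamiliar error"))
```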
7. Job Design for Monitorability
Full patterns: references/job-design-patterns.md — structured output, progress messages, fail-fast, partial success, 3-layer hierarchy.
Key rules:
- Every notebook must dbutils.notebook.exit(json.dumps({...})) with structured status
- Print [Step N/M] progress messages so the agent can locate failures
- Fail fast: include actionable error messages (e.g., "Run bronze_setup_job first")
- 3-layer hierarchy: atomic (notebook_task) → composite (run_job_task) → orchestrator (run_job_task)
- Partial success: ≥90% items succeeding = OK; fix individual failures
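The first two rules can be sketched as two small helpers. The names `progress` and `exit_payload`, and the payload fields other than a status, are hypothetical; the skill only requires that the exit value be structured JSON. `dbutils` exists only inside a Databricks notebook, so the sketch builds the payload string and leaves the exit call as a comment.

```python
import json


def progress(step: int, total: int, message: str) -> str:
    """Format a [Step N/M] progress line the agent can search for in run output."""
    return f"[Step {step}/{total}] {message}"


def exit_payload(status: str, **details) -> str:
    """Serialize a structured status object for dbutils.notebook.exit()."""
    return json.dumps({"status": status, **details}, sort_keys=True)


print(progress(1, 3, "Validating upstream tables"))
payload = exit_payload("SUCCESS", rows_written=1200, tables=["dim_customer"])
print(payload)
# In a notebook, the final line would be: dbutils.notebook.exit(payload)
```

Because the payload is plain JSON, the `.notebook_output.result` value returned by get-run-output can be parsed directly with `json.loads`.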
8. Agent Behavioral Patterns
Background Command Pattern (Cursor-Specific)
For databricks bundle run or any long-running command:
- Execute with block_until_ms: 0 to background immediately
- Read the terminal output file to check for completion
- Parse RUN_ID from the output URL
Exponential Backoff for Polling
- Normal jobs: 30s → 60s → 120s (no state change = increase interval)
- Long-running training jobs: start at 60s, max 300s
- DLT pipelines: 30s (typically faster)
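The interval schedules above can be sketched as a generator. Two simplifications are assumptions: the interval grows on every poll rather than only when the state is unchanged, and the doubling factor for long-running jobs is a guess, since the text gives only their start (60s) and cap (300s). The name `backoff_intervals` is hypothetical.

```python
from itertools import islice
from typing import Iterator


def backoff_intervals(start: int = 30, cap: int = 120, factor: int = 2) -> Iterator[int]:
    """Yield poll intervals in seconds, multiplying by `factor` until `cap` is reached."""
    delay = start
    while True:
        yield delay
        delay = min(delay * factor, cap)


print(list(islice(backoff_intervals(), 5)))         # [30, 60, 120, 120, 120]
print(list(islice(backoff_intervals(60, 300), 5)))  # [60, 120, 240, 300, 300]
```

A polling loop would call `time.sleep(next(intervals))` between status checks; a fixed 30s DLT schedule is simply `backoff_intervals(30, 30)`.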
Context Capture on Every Failure
On each failure, record and maintain:
- run_id and task_run_id
- job_name and task_key
- Target environment (dev, staging, prod)
- Full error message
- run_page_url for UI access
- Iteration count (which fix attempt this is)
Self-Healing Loop
Iteration 1: Deploy → Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 2: Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 3: Run → Monitor → [FAIL] → ESCALATE TO USER
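The three-iteration loop can be sketched with the deploy, run, and fix actions injected as callables, so the control flow is testable without a workspace. The names `self_heal`, `deploy`, `run_and_wait`, and `fix` are hypothetical; `run_and_wait` is assumed to return None on success or an error string on failure.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 3


def self_heal(
    deploy: Callable[[], None],
    run_and_wait: Callable[[], Optional[str]],
    fix: Callable[[str], None],
) -> dict:
    """Run the deploy-run-fix loop; escalate after the third failed run."""
    history = []
    deploy()
    for iteration in range(1, MAX_ITERATIONS + 1):
        error = run_and_wait()  # None means the run succeeded
        if error is None:
            return {"status": "SUCCESS", "iterations": iteration, "history": history}
        history.append({"iteration": iteration, "error": error})
        if iteration == MAX_ITERATIONS:
            break  # third failure: escalate instead of attempting another fix
        fix(error)
        deploy()  # redeploy with the fix before the next run
    return {"status": "ESCALATE", "iterations": MAX_ITERATIONS, "history": history}


# Fake actions: the first run fails, the second succeeds.
errors = iter(["ModuleNotFoundError: No module named 'foo'", None])
result = self_heal(lambda: None, lambda: next(errors), lambda error: None)
print(result["status"], result["iterations"])
```

The `history` list carries the per-iteration errors that the escalation report needs.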
Dependency Awareness
Before deploying, verify prerequisites exist:
- Genie Spaces require: TVFs + Metric Views (semantic layer)
- Monitoring requires: Gold tables (populated)
- Gold merge requires: Gold tables (setup job)
- Silver DLT requires: Bronze tables
- Alerts require: Monitoring output tables
- Always follow: Bronze → Gold → Semantic → Monitoring → Alerts → Genie
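A prerequisite check over these rules can be sketched as a walk of a small dependency map. The stage names and the `REQUIRES` map are hypothetical encodings of the bullets above; in particular, mapping "Gold tables (populated)" to the gold merge stage is an assumption.

```python
# Direct prerequisites per stage, encoded from the bullets above (names are hypothetical).
REQUIRES = {
    "silver": ["bronze"],
    "gold_merge": ["gold_setup"],
    "monitoring": ["gold_merge"],  # assumption: "populated" Gold tables means merge has run
    "alerts": ["monitoring"],
    "genie": ["semantic"],
}


def missing_prerequisites(stage: str, deployed: set) -> list:
    """Return undeployed prerequisites of `stage`, earliest first, including transitive ones."""
    missing = []
    for dep in REQUIRES.get(stage, []):
        # Resolve the prerequisite's own prerequisites before the prerequisite itself.
        missing.extend(missing_prerequisites(dep, deployed))
        if dep not in deployed and dep not in missing:
            missing.append(dep)
    return missing


print(missing_prerequisites("alerts", {"gold_setup"}))
print(missing_prerequisites("genie", {"semantic"}))
```

The returned list doubles as the deploy order for whatever is still missing.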
Partial Success Handling
Some jobs use ≥90% success thresholds:
- Alerting deployment: 27/30 alerts succeeding = OK, fix the 3 individually
- Monitoring setup: 7/8 monitors = OK, debug the 1 failure
- Do NOT fail the entire workflow for 1 out of N items failing
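The threshold check can be sketched with integer arithmetic so that the boundary case is exact. The threshold is a parameter because the two examples above imply slightly different cutoffs: 27/30 is exactly 90%, while 7/8 is 87.5% and would not pass a strict 90% check. The name `partial_success` and the `threshold_pct` parameter are hypothetical.

```python
def partial_success(succeeded: int, total: int, threshold_pct: int = 90) -> bool:
    """Return True when at least threshold_pct percent of items succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer comparison avoids floating-point rounding at the exact boundary.
    return succeeded * 100 >= threshold_pct * total


print(partial_success(27, 30))                   # True: exactly 90%
print(partial_success(7, 8))                     # False: 87.5% is below 90%
print(partial_success(7, 8, threshold_pct=85))   # True with a lower cutoff
```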
CLI Limitations Awareness
Some diagnostics require the Databricks Workspace UI:
- Full notebook execution logs (when get-run-output returns truncated or empty)
- DLT pipeline data flow visualization
- Detailed Spark execution plans
Always provide run_page_url to the user when CLI output is insufficient.
Learning Capture Pattern (Self-Improvement Integration)
After every troubleshooting session (successful or escalated), run this checklist:
Post-Troubleshooting Self-Improvement:
1. Was this error already in references/error-solution-matrix.md?
├── YES → Was the documented fix accurate? If not, update it.
└── NO → Add the error pattern and fix to the matrix.
2. Did the initial diagnosis match the actual root cause?
├── YES → No action needed.
└── NO → Document the misleading signals in the relevant skill.
3. How many iterations were needed?
├── 1 → Capture only if error was novel or non-obvious.
├── 2-3 → ALWAYS update the relevant skill with the fix.
└── Escalated → Document what the user found; update skills with user's fix.
4. Is this error likely to recur?
├── YES → Ensure it's in Top 10 table (if frequent) or error-solution matrix.
└── NO → Document as a one-off note in the session, skip skill update.
Skill priority for updates (search in this order):
1. references/error-solution-matrix.md: error-to-fix mapping
2. common/databricks-autonomous-operations: troubleshooting decision tree
3. skills/databricks-asset-bundles: if deployment-related
4. Domain-specific skill (e.g., gold/pipeline-workers/02-merge-patterns for MERGE errors)
5. New skill: only if all above are irrelevant (see admin/self-improvement justification checklist)
Safety: Never Retry Destructive Operations
Without explicit user confirmation, NEVER retry:
- databricks bundle destroy
- DROP TABLE / DROP SCHEMA
- w.quality_monitors.delete()
- w.alerts_v2.delete_alert()
Safety: Never Substitute Direct Creation for a Blocked Bundle Deploy (RULE_10)
A blocked bundle deploy/run (Genie Code "safety guardrails" / "not in allow-list" / "databricks.yml
not found") is a page-context signal, never a license to create the deliverable another way. NEVER, as a
workaround for a blocked deploy:
- w.jobs.create() / POST /api/2.1/jobs/create
- w.pipelines.create() / POST /api/2.0/pipelines / createAsset(pipeline)
- w.schemas.create() / CREATE SCHEMA / CREATE VOLUME / CREATE TABLE to provision the deliverable
The fix is navigate to the bundle editor on dp_bundle_root and redeploy. The SDK/REST creation route is an
escape hatch only on explicit operator authorization. Likewise, a failed job is fixed by editing the
bundle source file under dp_bundle_root and redeploying — never by patching the live job via API/UI.
References
Reference Documents (Load on Demand)
- references/sdk-api-reference.md: Full SDK API with code examples for all services
- references/error-solution-matrix.md: Error tables for all 7 resource categories (40+ patterns)
- references/diagnostic-queries.md: SQL queries for investigation
- references/dlt-pipeline-troubleshooting.md: DLT pipeline failure patterns
- references/cli-jq-patterns.md: jq patterns for parsing CLI JSON output
- references/cli-quick-reference.md: CLI vs SDK side-by-side table
- references/job-design-patterns.md: Structured output, progress messages, 3-layer hierarchy
Runnable Examples
- examples/1-authentication.py through examples/6-autonomous-operations.py (see Section 1)
Scripts
- scripts/monitor_multitask_job.sh: Multi-task job monitoring shell script
External Docs
- SDK: https://databricks-sdk-py.readthedocs.io/en/latest/
- GitHub: https://github.com/databricks/databricks-sdk-py
See Also
- Authoritative upstream: databricks-agent-skills / databricks-core, a back-up reference for canonical Databricks CLI / auth / data exploration patterns.