spark-operations-cli


Diagnose failed Spark jobs, unhealthy Livy sessions, and performance bottlenecks in Microsoft Fabric via read-only CLI triage. Use when the user wants to: (1) diagnose why a Spark job, notebook run, or Lakehouse job failed, (2) triage stuck or dead Livy sessions, (3) identify OOM, shuffle spill, or data skew, (4) retrieve driver and executor logs or Spark Advisor findings, (5) copy event logs and start a local Spark History Server, (6) diagnose all Spark activities within a failed pipeline run. Triggers: "diagnose my failed notebook", "why did my spark job fail", "triage spark failure", "diagnose pipeline run failure", "why did my pipeline fail", "livy session stuck in starting", "spark executor OOM", "check spark advisor findings", "shuffle spill diagnosis", "why did my lakehouse job fail", "diagnose lakehouse table load", "data skew diagnosis", "open spark history server locally", "analyze spark failure logs", "spark job triage".

By microsoft · Updated 6/3/2026


Update Check: ONCE PER SESSION (mandatory)

The first time this skill is used in a session, run the check-updates skill before proceeding.

  • GitHub Copilot CLI / VS Code: invoke the check-updates skill.
  • Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
  • Skip if the check was already performed earlier in this session.

CRITICAL NOTES

  1. To find workspace details (including the ID) from a workspace name: list all workspaces, then filter with JMESPath.
  2. To find item details (including the ID) from a workspace ID, item type, and item name: list all items of that type in that workspace, then filter with JMESPath.
  3. Skill disambiguation: spark-operations-cli is for read-only triage and diagnosis of existing jobs and sessions. For creating notebooks, running new jobs, or Spark development, use spark-authoring-cli. For interactive PySpark analysis and Livy session creation, use spark-consumption-cli.
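As a rough sketch of notes 1 and 2, the helpers below build (but do not run) read-only `az rest` commands that list workspaces or items and filter by name with JMESPath. The `/v1` base URL, the `displayName` and `id` response fields, and the `type` query parameter on the items endpoint are assumptions here, and the helper names are hypothetical; COMMON-CLI.md remains the authoritative recipe. Pagination is ignored.

```python
import shlex

# Audience that az rest needs via --resource for Fabric control-plane calls.
FABRIC_RESOURCE = "https://api.fabric.microsoft.com"
# Assumption: public-cloud v1 base URL; COMMON-CLI.md may set a different $FABRIC_API_BASE.
FABRIC_API_BASE = f"{FABRIC_RESOURCE}/v1"


def jmespath_literal(value: str) -> str:
    """Wrap a name in a JMESPath raw string literal, escaping single quotes."""
    return "'" + value.replace("'", "\\'") + "'"


def az_rest_get(url: str, query: str) -> list[str]:
    """Build (but do not run) a read-only az rest GET with a JMESPath filter."""
    return [
        "az", "rest", "--method", "get",
        "--url", url,
        "--resource", FABRIC_RESOURCE,
        "--query", query,
        "--output", "tsv",
    ]


def workspace_id_command(workspace_name: str) -> list[str]:
    """List all workspaces, then keep the ID of the one whose name matches."""
    query = f"value[?displayName=={jmespath_literal(workspace_name)}].id | [0]"
    return az_rest_get(f"{FABRIC_API_BASE}/workspaces", query)


def item_id_command(workspace_id: str, item_type: str, item_name: str) -> list[str]:
    """List items of one type in a workspace, then keep the matching item's ID."""
    query = f"value[?displayName=={jmespath_literal(item_name)}].id | [0]"
    url = f"{FABRIC_API_BASE}/workspaces/{workspace_id}/items?type={item_type}"
    return az_rest_get(url, query)


if __name__ == "__main__":
    # Print a copy-pasteable command line for a POSIX shell.
    print(shlex.join(workspace_id_command("Production")))
```

Returning an argument list rather than a shell string sidesteps most of the quoting problems that JMESPath filters cause when pasted into different shells.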

Spark Operations — CLI Skill

This skill provides diagnostics for Microsoft Fabric Spark job failures, Livy session health, and performance bottlenecks using Fabric REST APIs and CLI tools (az rest). All diagnostic operations are read-only; session cleanup (e.g., stopping zombie sessions) requires explicit user confirmation. For Spark development and notebook authoring, use spark-authoring-cli. For interactive PySpark analysis, use spark-consumption-cli.

Table of Contents

The TOC is grouped by purpose. Start at Diagnostic Workflows when triaging an active failure; the earlier sections are foundational references.

1. Fabric Foundations (concepts)

| Task | Reference | Notes |
| --- | --- | --- |
| Fabric Topology & Key Concepts | COMMON-CORE.md § Fabric Topology & Key Concepts | |
| Environment URLs | COMMON-CORE.md § Environment URLs | |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication & Token Acquisition | Wrong audience = 401; read before any auth issue |
| Core Control-Plane REST APIs | COMMON-CORE.md § Core Control-Plane REST APIs | |
| Pagination | COMMON-CORE.md § Pagination | |
| Long-Running Operations (LRO) | COMMON-CORE.md § Long-Running Operations (LRO) | |
| Rate Limiting & Throttling | COMMON-CORE.md § Rate Limiting & Throttling | |
| Job Execution | COMMON-CORE.md § Job Execution | |
| Capacity Management | COMMON-CORE.md § Capacity Management | |
| Gotchas & Troubleshooting | COMMON-CORE.md § Gotchas & Troubleshooting | |
| Best Practices | COMMON-CORE.md § Best Practices | |

2. CLI Setup & Authentication

| Task | Reference | Notes |
| --- | --- | --- |
| Tool Selection Rationale | COMMON-CLI.md § Tool Selection Rationale | |
| Finding Workspaces and Items in Fabric | COMMON-CLI.md § Finding Workspaces and Items in Fabric | Mandatory: READ link first (needed to find a workspace ID by its name, or an item ID by its name, item type, and workspace ID) |
| Authentication Recipes | COMMON-CLI.md § Authentication Recipes | az login flows and token acquisition |
| Fabric Control-Plane API via az rest | COMMON-CLI.md § Fabric Control-Plane API via az rest | Always pass --resource https://api.fabric.microsoft.com or az rest fails |
| Pagination Pattern | COMMON-CLI.md § Pagination Pattern | |
| Long-Running Operations (LRO) Pattern | COMMON-CLI.md § Long-Running Operations (LRO) Pattern | |
| Gotchas & Troubleshooting (CLI-Specific) | COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific) | az rest audience, shell escaping, token expiry |
| Quick Reference: az rest Template | COMMON-CLI.md § Quick Reference: az rest Template | |
| Quick Reference: Token Audience / CLI Tool Matrix | COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix | Which --resource + tool for each service |

3. Spark Sessions, Notebooks & Jobs (background)

| Task | Reference | Notes |
| --- | --- | --- |
| Livy Session Management | SPARK-CONSUMPTION-CORE.md § Livy Session Management | Session creation, states, lifecycle, termination |
| Interactive Data Exploration | SPARK-CONSUMPTION-CORE.md § Interactive Data Exploration | Statement execution, output retrieval, data discovery |
| Notebook Execution & Job Management | SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management | |

4. Spark Monitoring APIs (primary triage surface)

| Task | Reference | Notes |
| --- | --- | --- |
| Spark Monitoring API Overview | SPARK-MONITORING-CORE.md § Overview | GA monitoring APIs; no active session required |
| Workspace & Item Session Listing | SPARK-MONITORING-CORE.md § Workspace and Item-Level Session Listing | List Spark apps across workspace with filtering |
| Spark Advisor API | SPARK-MONITORING-CORE.md § Spark Advisor API | Key: automated skew detection, task errors, recommendations |
| Open-Source Spark History Server APIs | SPARK-MONITORING-CORE.md § Open-Source Spark History Server APIs | Jobs, stages, executors, SQL queries via REST |
| Driver and Executor Log APIs | SPARK-MONITORING-CORE.md § Driver and Executor Log APIs | Direct log retrieval without active session |
| Livy Log API | SPARK-MONITORING-CORE.md § Livy Log API | Session-level log with byte-offset pagination |
| Resource Usage API | SPARK-MONITORING-CORE.md § Resource Usage API | vCore timeline, idle/running cores, efficiency metrics |
| Monitoring Diagnostic Workflow | SPARK-MONITORING-CORE.md § Diagnostic Workflow Using Monitoring APIs | Step-by-step triage using monitoring APIs |

5. Diagnostic Workflows (start here for active triage)

| Task | Reference | Notes |
| --- | --- | --- |
| Automated Diagnostic Workflow (full) | automated-diagnostic-workflow.md | Steps 1–7: resolve → route by state → failure/perf/resource/health → report. Includes Step 1b expired-data fallback and report templates |
| Diagnostic Tiers | diagnostic-workflow.md § Diagnostic Tiers | Tier 1 (online REST) vs Tier 2 (local SHS) |
| Key Diagnostic Patterns | diagnostic-workflow.md § Key Diagnostic Patterns | Symptom → first check → likely cause lookup |
| Severity Thresholds | diagnostic-workflow.md § Severity Thresholds | Metric thresholds for classifying findings |
| Manual CLI Recipes | diagnostic-workflow.md § Manual CLI Recipes | Ad-hoc diagnostic commands for manual use |
| Pipeline Run Diagnosis | pipeline-diagnosis.md | Diagnose all Spark activities within a pipeline run (Steps P1–P6) |

6. Job Failure Diagnostics

| Task | Reference | Notes |
| --- | --- | --- |
| Failure Triage Workflow | job-diagnostics.md § Failure Triage Workflow | Step-by-step decision tree for diagnosing failures |
| Job Failure Classification | job-diagnostics.md § Failure Classification | OOM, shuffle, timeout, dependency, configuration errors |
| Reading Spark Logs via REST | job-diagnostics.md § Reading Spark Logs via REST | Driver/executor log retrieval from Livy |
| Job Instance History | job-diagnostics.md § Job Instance History | Query recent runs, compare durations, detect regressions |

7. Livy Session Health

| Task | Reference | Notes |
| --- | --- | --- |
| Session Health Assessment | session-health.md § Livy Session Lifecycle | Session states, transitions, expected durations |
| Idle and Zombie Session Detection | session-health.md § Idle and Zombie Session Detection | Find and clean up leaked sessions |
| Session Resource Monitoring | session-health.md § Session Resource Monitoring | Memory and executor usage via Livy |
| Session Recovery Patterns | session-health.md § Session Recovery Patterns | Restart strategies and session replacement |

8. Performance Diagnostics

| Task | Reference | Notes |
| --- | --- | --- |
| Performance Anti-Patterns | performance-patterns.md § Anti-Patterns | Spill, shuffle, skew, small files, collect misuse |
| Stage and Task Analysis | performance-patterns.md § Stage and Task Analysis | Reading Spark UI metrics via REST |
| Optimization Recipes | performance-patterns.md § Optimization Recipes | Partition tuning, broadcast joins, caching |
| Capacity and Resource Diagnostics | performance-patterns.md § Capacity and Resource Diagnostics | CU consumption, throttling detection |

9. Offline / Deep-Dive Tools

| Task | Reference | Notes |
| --- | --- | --- |
| JobInsight Event Log Copy | jobinsight-api.md § LogUtils.copyEventLog | Copy event logs from Fabric to OneLake for offline analysis |
| Local Spark History Server | spark-history-server.md § Overview | Start local SHS for full Spark UI (DAG, tasks, SQL plans) |

Must/Prefer/Avoid

MUST DO

  • Always retrieve job/session status before attempting remediation
  • Use workspace and item discovery from COMMON-CLI.md — never hardcode IDs
  • Check Livy session state before submitting diagnostic statements
  • Follow the Failure Triage Workflow for systematic diagnosis
  • Always check the Spark Advisor API before reading raw logs — it often identifies the root cause immediately
  • Use monitoring APIs (no active session required) before attempting Livy-based diagnostics
  • Poll job/session status at 10–30 second intervals; time out diagnostics after 30 minutes
  • Always include the Notebook Snapshot URL in diagnostic output — it has the longest retention and enables cell-level inspection in the Fabric UI
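The polling rule in the list above (10–30 second intervals, give up after 30 minutes) could look roughly like the sketch below. `fetch_status` is a hypothetical callable that would wrap whatever status call is in use, and the terminal state names are taken from the states this skill mentions rather than from an API reference, so treat both as assumptions.

```python
import time
from typing import Callable, FrozenSet

# Assumption: terminal state names as they appear elsewhere in this skill.
DEFAULT_TERMINAL_STATES = frozenset({"Succeeded", "Failed", "Cancelled"})


def poll_until_terminal(
    fetch_status: Callable[[], str],
    *,
    interval_s: float = 15.0,
    timeout_s: float = 30 * 60,
    terminal_states: FrozenSet[str] = DEFAULT_TERMINAL_STATES,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a status callable until it reports a terminal state or time runs out."""
    if not 10 <= interval_s <= 30:
        raise ValueError("interval_s should stay within 10-30 seconds")
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        # Stop before sleeping past the deadline instead of overshooting it.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s:.0f} s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time; in normal use the defaults apply.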

PREFER

  • Querying job instance history to establish baseline before declaring a regression
  • Reusing existing idle sessions for diagnostic queries instead of creating new ones
  • Checking capacity utilization when jobs are slow before blaming the Spark code
  • Using az rest with JMESPath filtering to extract specific fields from large API responses
  • The Spark Advisor API over manual log parsing for skew, task errors, and timeout detection
  • Resource Usage API coreEfficiency metric to quantify cluster utilization before recommending scaling
  • Job instance history comparison (last 5 runs) to detect regressions before deep-diving
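To illustrate quantifying utilization before recommending scaling, here is a minimal sketch that applies the two resource thresholds listed under Steps at a Glance (coreEfficiency < 0.3 → HIGH, idleTime/duration > 0.4 → MEDIUM). The function name is hypothetical, and it assumes the three values have already been read from a Resource Usage API response in consistent units.

```python
from typing import List, Tuple


def classify_resource_usage(
    core_efficiency: float, idle_time: float, duration: float
) -> List[Tuple[str, str]]:
    """Return (severity, message) findings for one Spark application."""
    findings: List[Tuple[str, str]] = []
    if core_efficiency < 0.3:
        findings.append(("HIGH", f"coreEfficiency {core_efficiency:.2f} < 0.3"))
    # Guard against a zero duration so the ratio is always defined.
    if duration > 0 and idle_time / duration > 0.4:
        ratio = idle_time / duration
        findings.append(("MEDIUM", f"idleTime/duration {ratio:.2f} > 0.4"))
    return findings
```

An empty list means neither threshold was crossed, which suggests looking at the Spark code or capacity throttling before adding cores.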

AVOID

  • Killing sessions without checking if they have active statements
  • Creating new sessions for every diagnostic query (reuse idle sessions)
  • Assuming OOM without checking actual memory metrics from Livy
  • Hardcoding workspace or item IDs in diagnostic scripts
  • Diagnosing performance without first checking capacity throttling via the Admin API
  • Submitting diagnostic statements to sessions in busy state

Examples

Example 1: Diagnose a Failed Notebook

User prompt: "Why did my notebook ETL_Daily fail in workspace Production?"

Agent workflow:

  1. Resolves workspace → workspaceId, item → itemId (Notebook)
  2. Lists recent Livy sessions, auto-picks the Failed session
  3. Queries Spark Advisor → finds TaskError: OutOfMemoryError on executor
  4. Queries /stages → confirms data skew (12× max/median ratio in stage 5)
  5. Presents report with HIGH findings + fix recommendations
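The max/median ratio in step 4 can be computed as in the sketch below. It assumes the ratio is taken over per-task durations within one stage (the skill does not say whether duration or shuffle bytes is used) and applies the 3× threshold from Steps at a Glance; the function names are hypothetical.

```python
from statistics import median
from typing import Sequence

SKEW_RATIO_THRESHOLD = 3.0  # "skew max/median > 3x" from Steps at a Glance


def skew_ratio(task_durations_ms: Sequence[float]) -> float:
    """Ratio of the slowest task to the median task within one stage."""
    if not task_durations_ms:
        raise ValueError("stage has no task metrics")
    mid = median(task_durations_ms)
    longest = max(task_durations_ms)
    if mid <= 0:
        # All-zero stages are not skewed; a zero median with real work is extreme skew.
        return float("inf") if longest > 0 else 1.0
    return longest / mid


def is_skewed(task_durations_ms: Sequence[float]) -> bool:
    """True when the stage crosses the skew threshold."""
    return skew_ratio(task_durations_ms) > SKEW_RATIO_THRESHOLD
```

For instance, four 100 ms tasks plus one 1200 ms task give a ratio of 12, the same magnitude as the stage in the example above.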

Example 2: Triage Stuck Livy Session

User prompt: "My Livy session abc-1234 is stuck in starting state"

Agent workflow:

  1. Uses session ID directly, queries session state
  2. Lists all workspace sessions → detects 8 concurrent sessions (capacity pressure)
  3. Checks Livy log → no errors, just queued
  4. Reports: capacity contention, recommends waiting or cancelling idle sessions

Example 3: Pipeline Failure Root Cause

User prompt: "Diagnose pipeline run 5678 in workspace Analytics"

Agent workflow:

  1. Resolves pipeline, calls queryActivityRuns for run 5678
  2. Finds 2 Notebook activities: one Succeeded, one Failed
  3. Extracts output.result.error.{ename, evalue, traceback} from failed activity
  4. Constructs Notebook Snapshot URL for cell-level inspection
  5. Presents error details + snapshot link + suggested fix
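Step 3 of this workflow amounts to walking a nested structure defensively. The sketch below assumes each activity run is a dict with `activityName`, `status`, and `output` keys; only the `output.result.error.{ename, evalue, traceback}` path comes from this skill, and the other field names and the helper name are illustrative assumptions.

```python
from typing import Any, Dict, List


def extract_notebook_errors(activity_runs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect ename/evalue/traceback from every failed activity run."""
    errors: List[Dict[str, Any]] = []
    for run in activity_runs:
        if run.get("status") != "Failed":
            continue
        # Each level may be missing or null, so fall back to an empty dict.
        output = run.get("output") or {}
        result = output.get("result") or {}
        error = result.get("error") or {}
        errors.append(
            {
                "activity": run.get("activityName"),
                "ename": error.get("ename"),
                "evalue": error.get("evalue"),
                "traceback": error.get("traceback"),
            }
        )
    return errors
```

A failed activity with no error payload still yields an entry with `None` fields, which signals that the fallback sources (Job Instance failureReason, Notebook Snapshot) are worth checking.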

Quick Start

Environment Setup

Apply environment detection from COMMON-CLI.md to set:

  • $FABRIC_API_BASE and $FABRIC_RESOURCE_SCOPE
  • $FABRIC_API_URL and $LIVY_API_PATH for Livy operations

Authentication: Use token acquisition from COMMON-CLI.md § Authentication Recipes.


Automated Diagnostic Workflow

When the user provides a simple prompt (e.g., "Diagnose my notebook ETL_Pipeline", "What's wrong with Spark application abc-123", "Check workspace Production for issues"), follow this fast-path summary. For full procedure, edge cases (expired data, pipeline-only sessions), report templates, and retention details, see references/automated-diagnostic-workflow.md.

Entry Points (what the user provides)

| User provides | Agent resolves |
| --- | --- |
| Workspace name | workspaceId (via workspace list + name filter) |
| Notebook / SJD / Lakehouse name | itemId (via item list + name/type filter) |
| Pipeline name + run ID | Child Spark activities → see pipeline-diagnosis.md |
| Livy session ID or Spark app ID | Use directly |
| Nothing specific | Ask for workspace name + item name |

Item-Type API Paths

| Item Type | Livy Sessions Path | Job Instances Path |
| --- | --- | --- |
| Notebook | /notebooks/{id}/livySessions | /items/{id}/jobs/instances |
| Spark Job Definition | /sparkJobDefinitions/{id}/livySessions | /items/{id}/jobs/instances |
| Lakehouse | /lakehouses/{id}/livySessions | /lakehouses/{id}/jobs/instances |

All session API paths follow: $FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions/$livyId/applications/$appId/<endpoint> — see SPARK-MONITORING-CORE.md.
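The path template above can be assembled with a small helper such as the one below. It is only a sketch: the function name is hypothetical, the item-type keys mirror the labels in the Item-Type API Paths table, and the `api_url` value in any call is whatever `$FABRIC_API_URL` resolves to in your environment.

```python
# Path segment per item type, taken from the Item-Type API Paths table.
ITEM_TYPE_PATHS = {
    "Notebook": "notebooks",
    "Spark Job Definition": "sparkJobDefinitions",
    "Lakehouse": "lakehouses",
}


def session_endpoint_url(
    api_url: str,
    workspace_id: str,
    item_type: str,
    item_id: str,
    livy_id: str,
    app_id: str,
    endpoint: str,
) -> str:
    """Build the URL of one monitoring endpoint for a Spark application."""
    try:
        segment = ITEM_TYPE_PATHS[item_type]
    except KeyError:
        raise ValueError(f"unsupported item type: {item_type!r}") from None
    return (
        f"{api_url.rstrip('/')}/workspaces/{workspace_id}/{segment}/{item_id}"
        f"/livySessions/{livy_id}/applications/{app_id}/{endpoint.lstrip('/')}"
    )
```

Rejecting unknown item types early avoids sending requests to paths that can only return 404, which would otherwise look like expired monitoring data.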

Steps at a Glance

| Step | When | Action | Auto-flag rule |
| --- | --- | --- | --- |
| 1. Resolve & Discover | Always | Resolve workspace → item → list recent Livy sessions; auto-pick if unambiguous, else prompt user | |
| 1b. Fallback | Session 404 / Spark Monitoring data expired | Try queryActivityRuns (pipeline) → Job Instance failureReason → construct Notebook Snapshot URL | See reference § Step 1b |
| 2. Route by state | After Step 1 | Failed → 3+4+5 · Succeeded/InProgress → 4+5 · Cancelled → log+3 · idle/busy/starting → 6 · dead/killed/error → 3+6 | |
| 3. Failure analysis | Failed / Cancelled / dead | Query in order: Spark Advisor → driver stderr → Job Instance → executor logs → Livy log → Resource Usage. Stop when root cause clear. | Match against job-diagnostics.md § Quick Reference Table |
| 4. Performance | Always (except 1b path) | /stages, /allexecutors | skew max/median > 3× · spill diskBytesSpilled > 0 · GC jvmGcTime/executorRunTime > 20% · shuffle > 1 GB · tasks < 100 ms |
| 5. Resource utilization | Always (except 1b path) | /resourceUsage | coreEfficiency < 0.3 → HIGH · idleTime/duration > 0.4 → MEDIUM |
| 6. Session health | Idle/zombie checks | GET /workspaces/$workspaceId/spark/livySessions | idle + no recent statements → zombie · starting beyond expected → capacity |
| 7. Compile report | Final | Severity-ordered findings table + Notebook Snapshot link + suggested fixes | See reference § Step 7 for template |
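The Step 2 routing rule can be expressed as a simple lookup. The sketch below treats state names case-insensitively and keeps "log" as its own marker for the Cancelled route, since the summary does not spell out what that step involves; both choices, and the function name, are assumptions.

```python
from typing import Dict, Tuple

# State → follow-up steps, transcribed from Step 2 (keys are casefolded).
ROUTES: Dict[str, Tuple[str, ...]] = {
    "failed": ("3", "4", "5"),
    "succeeded": ("4", "5"),
    "inprogress": ("4", "5"),
    "cancelled": ("log", "3"),
    "idle": ("6",),
    "busy": ("6",),
    "starting": ("6",),
    "dead": ("3", "6"),
    "killed": ("3", "6"),
    "error": ("3", "6"),
}


def route_by_state(state: str) -> Tuple[str, ...]:
    """Return the ordered follow-up steps for a job or session state."""
    key = state.strip().casefold()
    try:
        return ROUTES[key]
    except KeyError:
        # Unknown states are surfaced rather than silently routed anywhere.
        raise ValueError(f"unrecognised state: {state!r}") from None
```

Raising on an unknown state is a deliberate choice in this sketch: it forces a look at the full reference instead of guessing a route.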

Key principle: Always check Spark Advisor first — it's pre-computed and identifies most root causes without log parsing. Pipeline runs have the richest error data via queryActivityRuns (ename, evalue, traceback, cell/line) — see pipeline-diagnosis.md.

Data retention warning: Spark Monitoring API data (logs, stages, advisor) typically expires within minutes to hours of the session ending. Diagnose failures promptly. If the APIs return 404, jump to Step 1b in the reference.

Tier 2 escalation: For truncated data, HTTP 408/504, or DAG/SQL plan visualization, suggest the offline Spark History Server workflow.

Install via CLI
npx skills add https://github.com/microsoft/skills-for-fabric --skill spark-operations-cli