spark-operations-cli

name: spark-operations-cli
description: >
  Diagnose failed Spark jobs, unhealthy Livy sessions, and performance bottlenecks in Microsoft Fabric via read-only CLI triage. Use when the user wants to: (1) diagnose why a Spark job, notebook run, or Lakehouse job failed, (2) triage stuck or dead Livy sessions, (3) identify OOM, shuffle spill, or data skew, (4) retrieve driver and executor logs or Spark Advisor findings, (5) copy event logs and start a local Spark History Server, (6) diagnose all Spark activities within a failed pipeline run. Triggers: "diagnose my failed notebook", "why did my spark job fail", "triage spark failure", "diagnose pipeline run failure", "why did my pipeline fail", "livy session stuck in starting", "spark executor OOM", "check spark advisor findings", "shuffle spill diagnosis", "why did my lakehouse job fail", "diagnose lakehouse table load", "data skew diagnosis", "open spark history server locally", "analyze spark failure logs", "spark job triage".

Update Check: ONCE PER SESSION (mandatory)

The first time this skill is used in a session, run the check-updates skill before proceeding.

GitHub Copilot CLI / VS Code: invoke the check-updates skill.

Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.

Skip if the check was already performed earlier in this session.

CRITICAL NOTES

To find workspace details (including its ID) from a workspace name: list all workspaces, then filter the result with JMESPath.

To find item details (including its ID) from a workspace ID, item type, and item name: list all items of that type in that workspace, then filter the result with JMESPath.
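The two lookups above can be sketched as small helpers that build the `az rest` argument list. This is a minimal sketch, not the recipe from COMMON-CLI.md: the base URL, the `value`/`displayName`/`id` response fields, the `type` query parameter, and the helper names are assumptions made for illustration. It also only filters the first page of a listing, so pagination still has to be handled separately.

```python
# Assumed values: the resource comes from this skill's --resource note;
# the /v1 base URL and the response field names are assumptions.
FABRIC_RESOURCE = "https://api.fabric.microsoft.com"
FABRIC_API_URL = "https://api.fabric.microsoft.com/v1"


def jmespath_literal(text: str) -> str:
    """Quote text as a JMESPath raw string literal, escaping single quotes."""
    return "'" + text.replace("'", "\\'") + "'"


def _az_rest_get(url: str, query: str) -> list[str]:
    """Build an `az rest` GET command that prints the filtered value as text."""
    return [
        "az", "rest",
        "--method", "get",
        "--resource", FABRIC_RESOURCE,
        "--url", url,
        "--query", query,
        "--output", "tsv",
    ]


def workspace_id_command(workspace_name: str) -> list[str]:
    """List all workspaces, then keep the ID of the one with a matching name."""
    query = f"value[?displayName=={jmespath_literal(workspace_name)}] | [0].id"
    return _az_rest_get(f"{FABRIC_API_URL}/workspaces", query)


def item_id_command(workspace_id: str, item_type: str, item_name: str) -> list[str]:
    """List items of one type in a workspace, then keep the matching item's ID."""
    query = f"value[?displayName=={jmespath_literal(item_name)}] | [0].id"
    url = f"{FABRIC_API_URL}/workspaces/{workspace_id}/items?type={item_type}"
    return _az_rest_get(url, query)
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` avoids shell quoting problems with the JMESPath expression.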

Skill disambiguation: spark-operations-cli is for read-only triage and diagnosis of existing jobs and sessions. For creating notebooks, running new jobs, or Spark development, use spark-authoring-cli. For interactive PySpark analysis and Livy session creation, use spark-consumption-cli.

Spark Operations — CLI Skill

This skill provides diagnostics for Microsoft Fabric Spark job failures, Livy session health, and performance bottlenecks using Fabric REST APIs and CLI tools (az rest). All diagnostic operations are read-only; session cleanup (e.g., stopping zombie sessions) requires explicit user confirmation. For Spark development and notebook authoring, use spark-authoring-cli. For interactive PySpark analysis, use spark-consumption-cli.

The TOC is grouped by purpose. Start at Diagnostic Workflows when triaging an active failure; the earlier sections are foundational references.

1. Fabric Foundations (concepts)

Task	Reference	Notes
Fabric Topology & Key Concepts	COMMON-CORE.md § Fabric Topology & Key Concepts
Environment URLs	COMMON-CORE.md § Environment URLs
Authentication & Token Acquisition	COMMON-CORE.md § Authentication & Token Acquisition	Wrong audience = 401; read before any auth issue
Core Control-Plane REST APIs	COMMON-CORE.md § Core Control-Plane REST APIs
Pagination	COMMON-CORE.md § Pagination
Long-Running Operations (LRO)	COMMON-CORE.md § Long-Running Operations (LRO)
Rate Limiting & Throttling	COMMON-CORE.md § Rate Limiting & Throttling
Job Execution	COMMON-CORE.md § Job Execution
Capacity Management	COMMON-CORE.md § Capacity Management
Gotchas & Troubleshooting	COMMON-CORE.md § Gotchas & Troubleshooting
Best Practices	COMMON-CORE.md § Best Practices

2. CLI Setup & Authentication

Task	Reference	Notes
Tool Selection Rationale	COMMON-CLI.md § Tool Selection Rationale
Finding Workspaces and Items in Fabric	COMMON-CLI.md § Finding Workspaces and Items in Fabric	Mandatory: read this reference first. It is needed to find a workspace ID by name, or an item ID by name, item type, and workspace ID
Authentication Recipes	COMMON-CLI.md § Authentication Recipes	`az login` flows and token acquisition
Fabric Control-Plane API via `az rest`	COMMON-CLI.md § Fabric Control-Plane API via az rest	Always pass `--resource https://api.fabric.microsoft.com` or `az rest` fails
Pagination Pattern	COMMON-CLI.md § Pagination Pattern
Long-Running Operations (LRO) Pattern	COMMON-CLI.md § Long-Running Operations (LRO) Pattern
Gotchas & Troubleshooting (CLI-Specific)	COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific)	`az rest` audience, shell escaping, token expiry
Quick Reference: `az rest` Template	COMMON-CLI.md § Quick Reference: az rest Template
Quick Reference: Token Audience / CLI Tool Matrix	COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix	Which `--resource` + tool for each service

3. Spark Sessions, Notebooks & Jobs (background)

Task	Reference	Notes
Livy Session Management	SPARK-CONSUMPTION-CORE.md § Livy Session Management	Session creation, states, lifecycle, termination
Interactive Data Exploration	SPARK-CONSUMPTION-CORE.md § Interactive Data Exploration	Statement execution, output retrieval, data discovery
Notebook Execution & Job Management	SPARK-AUTHORING-CORE.md § Notebook Execution & Job Management

4. Spark Monitoring APIs (primary triage surface)

Task	Reference	Notes
Spark Monitoring API Overview	SPARK-MONITORING-CORE.md § Overview	GA monitoring APIs — no active session required
Workspace & Item Session Listing	SPARK-MONITORING-CORE.md § Workspace and Item-Level Session Listing	List Spark apps across workspace with filtering
Spark Advisor API	SPARK-MONITORING-CORE.md § Spark Advisor API	Key — automated skew detection, task errors, recommendations
Open-Source Spark History Server APIs	SPARK-MONITORING-CORE.md § Open-Source Spark History Server APIs	Jobs, stages, executors, SQL queries via REST
Driver and Executor Log APIs	SPARK-MONITORING-CORE.md § Driver and Executor Log APIs	Direct log retrieval without active session
Livy Log API	SPARK-MONITORING-CORE.md § Livy Log API	Session-level log with byte-offset pagination
Resource Usage API	SPARK-MONITORING-CORE.md § Resource Usage API	vCore timeline, idle/running cores, efficiency metrics
Monitoring Diagnostic Workflow	SPARK-MONITORING-CORE.md § Diagnostic Workflow Using Monitoring APIs	Step-by-step triage using monitoring APIs

5. Diagnostic Workflows (start here for active triage)

Task	Reference	Notes
Automated Diagnostic Workflow (full)	automated-diagnostic-workflow.md	Steps 1–7: resolve → route by state → failure/perf/resource/health → report. Includes Step 1b expired-data fallback and report templates
Diagnostic Tiers	diagnostic-workflow.md § Diagnostic Tiers	Tier 1 (online REST) vs Tier 2 (local SHS)
Key Diagnostic Patterns	diagnostic-workflow.md § Key Diagnostic Patterns	Symptom → first check → likely cause lookup
Severity Thresholds	diagnostic-workflow.md § Severity Thresholds	Metric thresholds for classifying findings
Manual CLI Recipes	diagnostic-workflow.md § Manual CLI Recipes	Ad-hoc diagnostic commands for manual use
Pipeline Run Diagnosis	pipeline-diagnosis.md	Diagnose all Spark activities within a pipeline run (Steps P1–P6)

6. Job Failure Diagnostics

Task	Reference	Notes
Failure Triage Workflow	job-diagnostics.md § Failure Triage Workflow	Step-by-step decision tree for diagnosing failures
Job Failure Classification	job-diagnostics.md § Failure Classification	OOM, shuffle, timeout, dependency, configuration errors
Reading Spark Logs via REST	job-diagnostics.md § Reading Spark Logs via REST	Driver/executor log retrieval from Livy
Job Instance History	job-diagnostics.md § Job Instance History	Query recent runs, compare durations, detect regressions

7. Livy Session Health

Task	Reference	Notes
Session Health Assessment	session-health.md § Livy Session Lifecycle	Session states, transitions, expected durations
Idle and Zombie Session Detection	session-health.md § Idle and Zombie Session Detection	Find and clean up leaked sessions
Session Resource Monitoring	session-health.md § Session Resource Monitoring	Memory and executor usage via Livy
Session Recovery Patterns	session-health.md § Session Recovery Patterns	Restart strategies and session replacement

8. Performance Diagnostics

Task	Reference	Notes
Performance Anti-Patterns	performance-patterns.md § Anti-Patterns	Spill, shuffle, skew, small files, collect misuse
Stage and Task Analysis	performance-patterns.md § Stage and Task Analysis	Reading Spark UI metrics via REST
Optimization Recipes	performance-patterns.md § Optimization Recipes	Partition tuning, broadcast joins, caching
Capacity and Resource Diagnostics	performance-patterns.md § Capacity and Resource Diagnostics	CU consumption, throttling detection

9. Offline / Deep-Dive Tools

Task	Reference	Notes
JobInsight Event Log Copy	jobinsight-api.md § LogUtils.copyEventLog	Copy event logs from Fabric to OneLake for offline analysis
Local Spark History Server	spark-history-server.md § Overview	Start local SHS for full Spark UI (DAG, tasks, SQL plans)

Must/Prefer/Avoid

MUST DO

Always retrieve job/session status before attempting remediation
Use workspace and item discovery from COMMON-CLI.md — never hardcode IDs
Check Livy session state before submitting diagnostic statements
Follow the Failure Triage Workflow for systematic diagnosis
Always check the Spark Advisor API before reading raw logs — it often identifies the root cause immediately
Use monitoring APIs (no active session required) before attempting Livy-based diagnostics
Poll job/session status at 10–30 second intervals; time out diagnostics after 30 minutes
Always include the Notebook Snapshot URL in diagnostic output — it has the longest retention and enables cell-level inspection in the Fabric UI
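The polling rule in the list above (10–30 second intervals, 30 minute timeout) can be sketched as a loop around any status lookup. This is a minimal sketch: `fetch_state` is a hypothetical callable that returns the current state string, and the set of terminal states is an assumption assembled from the state names this skill uses when routing.

```python
import time
from typing import Callable

# Assumption: states treated as final, taken from this skill's routing table.
TERMINAL_STATES = {"Failed", "Succeeded", "Cancelled", "dead", "killed", "error"}


def poll_status(
    fetch_state: Callable[[], str],
    interval_s: float = 15.0,
    timeout_s: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until a terminal state is seen or the timeout would be exceeded."""
    if not 10 <= interval_s <= 30:
        raise ValueError("interval_s should stay within 10-30 seconds")
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        # Give up if the next wait would run past the deadline.
        if clock() + interval_s > deadline:
            raise TimeoutError(f"still '{state}' after {timeout_s:.0f} s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time.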

PREFER

Querying job instance history to establish baseline before declaring a regression
Reusing existing idle sessions for diagnostic queries instead of creating new ones
Checking capacity utilization when jobs are slow before blaming the Spark code
Using az rest with JMESPath filtering to extract specific fields from large API responses
The Spark Advisor API over manual log parsing for skew, task errors, and timeout detection
Using the Resource Usage API coreEfficiency metric to quantify cluster utilization before recommending scaling
Job instance history comparison (last 5 runs) to detect regressions before deep-diving
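One way to read the job instance history comparison above is to compare the latest run against the median of up to five earlier runs. This is a minimal sketch under stated assumptions: the 1.5× factor and the function name are hypothetical, since this skill does not define a regression threshold here, and durations are assumed to be already extracted from the job instance list in seconds.

```python
from statistics import median


def is_regression(durations_s: list[float], factor: float = 1.5, window: int = 5) -> bool:
    """Return True if the newest run is much slower than its recent baseline.

    durations_s is ordered oldest to newest; the last entry is the run under
    diagnosis. Raises ValueError if the list is empty.
    """
    *history, latest = durations_s
    baseline_runs = history[-window:]  # up to the last `window` earlier runs
    if not baseline_runs:
        return False  # no baseline yet, so nothing to compare against
    # factor is an illustrative assumption, not a threshold from this skill
    return latest > factor * median(baseline_runs)
```

Using the median rather than the mean keeps a single earlier outlier run from hiding or faking a regression.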

AVOID

Killing sessions without checking if they have active statements
Creating new sessions for every diagnostic query (reuse idle sessions)
Assuming OOM without checking actual memory metrics from Livy
Hardcoding workspace or item IDs in diagnostic scripts
Diagnosing performance without first checking capacity throttling via the Admin API
Submitting diagnostic statements to sessions in busy state

Examples

Example 1: Diagnose a Failed Notebook

User prompt: "Why did my notebook ETL_Daily fail in workspace Production?"

Agent workflow:

Resolves workspace → workspaceId, item → itemId (Notebook)
Lists recent Livy sessions, auto-picks the Failed session
Queries Spark Advisor → finds TaskError: OutOfMemoryError on executor
Queries /stages → confirms data skew (12× max/median ratio in stage 5)
Presents report with HIGH findings + fix recommendations
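The skew figure in this example is a max/median ratio over the tasks of one stage. A minimal sketch of that calculation follows, assuming the ratio is taken over per-task durations in milliseconds (this example does not say whether duration or bytes read is used) and using a hypothetical function name.

```python
from statistics import median


def skew_ratio(task_durations_ms: list[float]) -> float:
    """Return max/median of the task durations for a single stage."""
    if not task_durations_ms:
        raise ValueError("stage has no tasks")
    mid = median(task_durations_ms)
    longest = max(task_durations_ms)
    if mid == 0:
        # Avoid dividing by zero when most tasks finish instantly.
        return float("inf") if longest > 0 else 1.0
    return longest / mid


# Four tasks take 100 ms and one straggler takes 1200 ms.
ratio = skew_ratio([100, 100, 100, 100, 1200])
print(f"{ratio:.0f}x max/median")  # prints "12x max/median"
```

A ratio of 12 is well above the 3× skew threshold this skill uses when flagging stages.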

Example 2: Triage Stuck Livy Session

User prompt: "My Livy session abc-1234 is stuck in starting state"

Agent workflow:

Uses session ID directly, queries session state
Lists all workspace sessions → detects 8 concurrent sessions (capacity pressure)
Checks Livy log → no errors, just queued
Reports: capacity contention, recommends waiting or cancelling idle sessions

Example 3: Pipeline Failure Root Cause

User prompt: "Diagnose pipeline run 5678 in workspace Analytics"

Agent workflow:

Resolves pipeline, calls queryActivityRuns for run 5678
Finds 2 Notebook activities: one Succeeded, one Failed
Extracts output.result.error.{ename, evalue, traceback} from failed activity
Constructs Notebook Snapshot URL for cell-level inspection
Presents error details + snapshot link + suggested fix

Quick Start

Environment Setup

Apply environment detection from COMMON-CLI.md to set:

$FABRIC_API_BASE and $FABRIC_RESOURCE_SCOPE
$FABRIC_API_URL and $LIVY_API_PATH for Livy operations

Authentication: Use token acquisition from COMMON-CLI.md § Authentication Recipes.

Automated Diagnostic Workflow

When the user provides a simple prompt (e.g., "Diagnose my notebook ETL_Pipeline", "What's wrong with Spark application abc-123", "Check workspace Production for issues"), follow this fast-path summary. For the full procedure, edge cases (expired data, pipeline-only sessions), report templates, and retention details, see references/automated-diagnostic-workflow.md.

Entry Points (what the user provides)

User provides	Agent resolves
Workspace name	→ `workspaceId` (via workspace list + name filter)
Notebook / SJD / Lakehouse name	→ `itemId` (via item list + name/type filter)
Pipeline name + run ID	→ child Spark activities → see pipeline-diagnosis.md
Livy session ID or Spark app ID	→ Use directly
Nothing specific	→ Ask for workspace name + item name

Item-Type API Paths

Item Type	Livy Sessions Path	Job Instances Path
Notebook	`/notebooks/{id}/livySessions`	`/items/{id}/jobs/instances`
Spark Job Definition	`/sparkJobDefinitions/{id}/livySessions`	`/items/{id}/jobs/instances`
Lakehouse	`/lakehouses/{id}/livySessions`	`/lakehouses/{id}/jobs/instances`

All session API paths follow: $FABRIC_API_URL/workspaces/$workspaceId/<itemTypePath>/$itemId/livySessions/$livyId/applications/$appId/<endpoint> — see SPARK-MONITORING-CORE.md.
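The path pattern above can be sketched as a small URL builder. This is a minimal sketch: the mapping from item type to path segment is read off the table above, while the function name and the example IDs are hypothetical.

```python
# Path segments taken from the Item-Type API Paths table.
ITEM_TYPE_PATHS = {
    "Notebook": "notebooks",
    "Spark Job Definition": "sparkJobDefinitions",
    "Lakehouse": "lakehouses",
}


def monitoring_url(
    api_url: str,
    workspace_id: str,
    item_type: str,
    item_id: str,
    livy_id: str,
    app_id: str,
    endpoint: str,
) -> str:
    """Build a per-application monitoring URL for one Livy session."""
    segment = ITEM_TYPE_PATHS[item_type]  # KeyError for unsupported item types
    return (
        f"{api_url.rstrip('/')}/workspaces/{workspace_id}"
        f"/{segment}/{item_id}"
        f"/livySessions/{livy_id}/applications/{app_id}"
        f"/{endpoint.lstrip('/')}"
    )


# Placeholder IDs for illustration only.
print(monitoring_url("https://example.invalid/v1", "ws-1", "Notebook",
                     "nb-1", "livy-1", "app-1", "stages"))
```

Failing fast on an unknown item type is deliberate here, so that a typo does not turn into a confusing 404 later in the workflow.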

Steps at a Glance

Step	When	Action	Auto-flag rule
1. Resolve & Discover	Always	Resolve workspace → item → list recent Livy sessions; auto-pick if unambiguous, else prompt user	—
1b. Fallback	Session 404 / Spark Monitoring data expired	Try `queryActivityRuns` (pipeline) → Job Instance `failureReason` → construct Notebook Snapshot URL	See reference § Step 1b
2. Route by state	After Step 1	`Failed` → 3+4+5 · `Succeeded`/`InProgress` → 4+5 · `Cancelled` → log+3 · `idle`/`busy`/`starting` → 6 · `dead`/`killed`/`error` → 3+6	—
3. Failure analysis	Failed / Cancelled / dead	Query in order: Spark Advisor → driver stderr → Job Instance → executor logs → Livy log → Resource Usage. Stop when root cause clear.	Match against job-diagnostics.md § Quick Reference Table
4. Performance	Always (except 1b path)	`/stages`, `/allexecutors`	skew `max/median > 3×` · spill `diskBytesSpilled > 0` · GC `jvmGcTime/executorRunTime > 20%` · shuffle `> 1 GB` · tasks `< 100ms`
5. Resource utilization	Always (except 1b path)	`/resourceUsage`	`coreEfficiency < 0.3` → HIGH · `idleTime/duration > 0.4` → MEDIUM
6. Session health	Idle/zombie checks	`GET /workspaces/$workspaceId/spark/livySessions`	`idle` + no recent statements → zombie · `starting` beyond expected → capacity
7. Compile report	Final	Severity-ordered findings table + Notebook Snapshot link + suggested fixes	See reference § Step 7 for template
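The routing in Step 2 and the auto-flag rules in Steps 4 and 5 can be sketched as plain lookups and comparisons. This is a minimal sketch: the thresholds are copied from the table above, but the input field names `max_task_ms`, `median_task_ms`, and `shuffle_bytes`, the reading of "1 GB" as 2**30 bytes, and the reading of "tasks < 100ms" as a median task duration are assumptions. The `"log"` entry is copied verbatim from the table; its meaning is defined in the reference.

```python
GIB = 1024 ** 3  # assumption: "1 GB" read as 2**30 bytes

# Step 2: which steps to run for each observed state (copied from the table).
ROUTES = {
    "Failed": ["3", "4", "5"],
    "Succeeded": ["4", "5"],
    "InProgress": ["4", "5"],
    "Cancelled": ["log", "3"],
    "idle": ["6"], "busy": ["6"], "starting": ["6"],
    "dead": ["3", "6"], "killed": ["3", "6"], "error": ["3", "6"],
}


def route(state: str) -> list[str]:
    """Return the follow-up steps for a state; KeyError if the state is unknown."""
    return ROUTES[state]


def performance_flags(m: dict) -> list[str]:
    """Step 4: apply the auto-flag thresholds to one stage's metrics."""
    flags = []
    if m["median_task_ms"] > 0 and m["max_task_ms"] / m["median_task_ms"] > 3:
        flags.append("skew")
    if m["diskBytesSpilled"] > 0:
        flags.append("spill")
    if m["executorRunTime"] > 0 and m["jvmGcTime"] / m["executorRunTime"] > 0.20:
        flags.append("gc")
    if m["shuffle_bytes"] > GIB:
        flags.append("shuffle")
    if m["median_task_ms"] < 100:
        flags.append("tiny-tasks")
    return flags


def resource_flags(core_efficiency: float, idle_time: float, duration: float) -> list[tuple[str, str]]:
    """Step 5: map resource usage metrics to (severity, finding) pairs."""
    flags = []
    if core_efficiency < 0.3:
        flags.append(("HIGH", "low core efficiency"))
    if duration > 0 and idle_time / duration > 0.4:
        flags.append(("MEDIUM", "high idle time"))
    return flags
```

Keeping the rules as data and simple predicates makes it easy to compare them line by line against the table when the thresholds change.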

Key principle: Always check Spark Advisor first — it's pre-computed and identifies most root causes without log parsing. Pipeline runs have the richest error data via queryActivityRuns (ename, evalue, traceback, cell/line) — see pipeline-diagnosis.md.

Data retention warning: Spark Monitoring API data (logs, stages, advisor) typically expires within minutes to hours of session end. Diagnose failures promptly. If the APIs return 404, jump to Step 1b in the reference.

Tier 2 escalation: For truncated data, HTTP 408/504, or DAG/SQL plan visualization, suggest the offline Spark History Server workflow.
