gcp-spark - SKILL.md Agent Skill

name: gcp-spark description: | Develops and executes Spark code on Dataproc Clusters and Serverless. Reads and writes data using BigLake Iceberg catalogs, BigQuery and Spanner. Debugs execution failures. Use when: - Writing Spark ETL pipelines on GCP. - Training or running inference with ML models with spark on GCP. - Managing Spark clusters, jobs, batches, and interactive sessions. Don't use when: - Writing generic Python scripts that don't use Spark. - Performing simple SQL queries that can be done directly in BigQuery. license: Apache-2.0 metadata: version: v2 publisher: google

[!IMPORTANT]

You MUST ALWAYS follow the Task Execution Workflow when writing spark code.

Understand schemas: ALWAYS use @skill:discovering-gcp-data-assets skill or references/schema_direct_inspection.md to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
Generate spark code:
- Output Format: ALWAYS generate code in Python Notebooks (.ipynb) format. Generate scripts (.py) only if explicitly requested.
- Read and Write data: ALWAYS Refer to references/read_write_data.md when reading or writing data.
- ML Tasks: Refer to @skill:ml-best-practices skill and references/ml_tasks.md when generating ML code.
- Spark Optimizations: ALWAYS refer to references/spark_optimizations.md when generating spark code and apply optimization whenever applicable.
Verify schema before write: ALWAYS verify that the dataframe and destination schema match, use df.printSchema() for dataframe schema and refer to @skill:discovering-gcp-data-assets skill or references/schema_direct_inspection.md to verify destination schema.
Compile code before executing: For notebooks convert them to python script using jupyter nbconvert --to script your-notebook.ipynb first, then compile code using python3 -m py_compile your-notebook.py.
Execute script: ONLY when generating a .py script refer to references/gcloud_dataproc.md on writing command to execute generated code on Dataproc. This DOES NOT apply when generating notebooks.

[!CAUTION]

Ensure you verify this checklist to avoid mistakes

Before submitting a job, verify:

All imports present (col, when, lit, etc. from pyspark.sql.functions)
vector_to_array from correct module use from pyspark.ml.functions import vector_to_array (NOT pyspark.sql.functions)
DataFrame schema matches target Iceberg table verify with df.printSchema() before writing
CSV files read with header and inferSchema without these, the header row becomes data and all columns are strings
Avoid toPandas() Converting a pyspark dataframe to pandas by calling toPandas() can lead to out of memory errors. Only acceptable for building visualizations in Spark 3.5

The Dataproc service account needs:

Refer to references/gcloud_dataproc.md for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.