name: gcp-spark
description: |
  Develops and executes Spark code on Dataproc Clusters and Serverless. Reads and
  writes data using BigLake Iceberg catalogs, BigQuery and Spanner. Debugs
  execution failures.
  Use when:
  - Writing Spark ETL pipelines on GCP.
  - Training or running inference with ML models with spark on GCP.
  - Managing Spark clusters, jobs, batches, and interactive sessions.
  Don't use when:
  - Writing generic Python scripts that don't use Spark.
  - Performing simple SQL queries that can be done directly in BigQuery.
license: Apache-2.0
metadata:
  version: v2
  publisher: google
Spark on Dataproc
[!IMPORTANT]
You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.
Task Execution Workflow
- Understand schemas: ALWAYS use the `@skill:discovering-gcp-data-assets` skill or `references/schema_direct_inspection.md` to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
- Generate Spark code:
  - Output Format: ALWAYS generate code in Python Notebooks (`.ipynb`) format. Generate scripts (`.py`) only if explicitly requested.
  - Read and Write data: ALWAYS refer to `references/read_write_data.md` when reading or writing data.
  - ML Tasks: Refer to the `@skill:ml-best-practices` skill and `references/ml_tasks.md` when generating ML code.
  - Spark Optimizations: ALWAYS refer to `references/spark_optimizations.md` when generating Spark code and apply optimizations whenever applicable.
- Verify schema before write: ALWAYS verify that the DataFrame schema and the destination schema match. Use `df.printSchema()` for the DataFrame schema and refer to the `@skill:discovering-gcp-data-assets` skill or `references/schema_direct_inspection.md` to verify the destination schema.
- Compile code before executing: For notebooks, first convert them to a Python script using `jupyter nbconvert --to script your-notebook.ipynb`, then compile the code using `python3 -m py_compile your-notebook.py`.
- Execute script: ONLY when generating a `.py` script, refer to `references/gcloud_dataproc.md` for how to write the command that executes the generated code on Dataproc. This DOES NOT apply when generating notebooks.
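
The "Compile code before executing" step can also be driven from Python. The following is a minimal sketch, not a prescribed implementation: the helper names `notebook_to_script` and `compiles_cleanly` are hypothetical, and it assumes `jupyter` is on the `PATH` and that the notebook uses a Python kernel, so that `nbconvert` writes a `.py` file next to the notebook.

```python
import py_compile
import subprocess
from pathlib import Path


def notebook_to_script(notebook_path: str) -> Path:
    """Convert a notebook with `jupyter nbconvert --to script`.

    Assumption: the notebook has a Python kernel, so the output file is
    the notebook path with a .py suffix.
    """
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "script", notebook_path],
        check=True,  # raise CalledProcessError if the conversion fails
    )
    return Path(notebook_path).with_suffix(".py")


def compiles_cleanly(script_path: str) -> bool:
    """Return True if the script byte-compiles, False on a syntax error."""
    try:
        # Same check as `python3 -m py_compile`, but raising on failure.
        py_compile.compile(script_path, doraise=True)
        return True
    except py_compile.PyCompileError as err:
        print(f"Compilation failed: {err.msg}")
        return False
```

Byte-compilation only catches syntax errors. Missing imports, wrong column names, and other runtime failures still surface only when the code runs, which is why the schema checks earlier in the workflow remain necessary.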
Common Mistakes Checklist
[!CAUTION]
Verify every item in this checklist to avoid common mistakes.
Before submitting a job, verify:
- All imports present (`col`, `when`, `lit`, etc. from `pyspark.sql.functions`)
- `vector_to_array` from correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`)
- DataFrame schema matches target Iceberg table: verify with `df.printSchema()` before writing
- CSV files read with `header` and `inferSchema`: without these, the header row becomes data and all columns are strings
- Avoid `toPandas()`: converting a PySpark DataFrame to pandas by calling `toPandas()` can lead to out-of-memory errors. Only acceptable for building visualizations in Spark 3.5
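
Two of the checklist items, reading CSV with `header` and `inferSchema` and checking the DataFrame schema against the target table, might be combined as in the sketch below. The function names `schema_mismatches` and `read_csv_checked`, and the `csv_path` and `target_table` parameters, are hypothetical. The sketch assumes the target table can be resolved with `spark.table(...)` in the configured catalog, and it compares columns by name and Spark type string only.

```python
def schema_mismatches(df_dtypes, target_dtypes):
    """Compare two lists of (column, type) pairs, as returned by DataFrame.dtypes.

    Returns human-readable problems; an empty list means the schemas agree
    on column names and type strings (nullability is not checked).
    """
    source = dict(df_dtypes)
    target = dict(target_dtypes)
    problems = []
    for name, target_type in target.items():
        if name not in source:
            problems.append(f"missing column: {name}")
        elif source[name] != target_type:
            problems.append(
                f"type mismatch for {name}: {source[name]} != {target_type}"
            )
    for name in source:
        if name not in target:
            problems.append(f"unexpected column: {name}")
    return problems


def read_csv_checked(spark, csv_path, target_table):
    """Read a CSV with header and inferred types, then check it against the target.

    `spark` is an existing SparkSession; `csv_path` and `target_table` are
    placeholders supplied by the caller.
    """
    df = (
        spark.read
        .option("header", True)       # first row holds column names, not data
        .option("inferSchema", True)  # otherwise every column is a string
        .csv(csv_path)
    )
    df.printSchema()  # inspect the inferred schema before writing
    problems = schema_mismatches(df.dtypes, spark.table(target_table).dtypes)
    if problems:
        raise ValueError("Schema check failed: " + "; ".join(problems))
    return df
```

For example, `schema_mismatches([("id", "int")], [("id", "bigint")])` reports a type mismatch for `id`, which would usually be fixed with an explicit cast before writing.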
IAM Requirements
The Dataproc service account needs:
- `roles/dataproc.worker`: Job execution
- `roles/biglake.admin`: Iceberg table management
- `roles/bigquery.jobUser`: Query materialization
- `roles/storage.objectUser`: Read/write GCS
- `roles/spanner.databaseUser`: Spanner writes
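
One way to check these roles before submitting a job is to inspect the project IAM policy, for example the JSON printed by `gcloud projects get-iam-policy PROJECT_ID --format=json`. The sketch below is illustrative only: the helper name `missing_roles` is hypothetical, and it looks at project-level bindings alone, so roles inherited from a folder or organization, or granted on individual resources such as a bucket, are not seen.

```python
REQUIRED_ROLES = {
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
}


def missing_roles(policy: dict, member: str) -> list:
    """Return the required roles that `member` lacks in a project IAM policy.

    `policy` is the parsed JSON policy (a dict with a "bindings" list);
    `member` is an IAM member string such as "serviceAccount:<email>".
    """
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return sorted(REQUIRED_ROLES - granted)
```

An empty result means every listed role is bound to the service account at the project level.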
Spark resource management
Refer to references/gcloud_dataproc.md for detailed guidelines on managing
Spark clusters, jobs, batches, and interactive sessions.