name: scala-data-engineering-on-jvm-runtimes
description: Guides agents through Scala-based data engineering on JVM runtimes. Use when building Spark, Flink, Kafka Streams, or other Scala data jobs that require explicit build, packaging, type, and runtime discipline.
Scala Data Engineering On JVM Runtimes
Overview
Use this skill when Scala is the implementation language for distributed or streaming data systems. It helps agents manage build compatibility, packaging, JVM dependency issues, typed data models, serialization behavior, and runtime assumptions that often cause production failures long after code compiles.
When to Use
- building or modifying Scala Spark jobs
- implementing Flink, Kafka Streams, or JVM-native data processors in Scala
- managing sbt builds, shaded JARs, or runtime compatibility
- choosing typed datasets, encoders, or JVM serialization strategies
- debugging classpath, version, or packaging problems in distributed runtimes
Do not assume compile success means distributed runtime safety.
Workflow
Define the runtime and compatibility surface. Include:
- engine and version
- Scala version
- JVM level
- cluster or container runtime
- connector and library compatibility
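
One way to make this surface explicit is to pin the Scala version, the engine version, and the connector version side by side in the build. The fragment below is a minimal sketch, assuming an sbt build for a Spark job with a Kafka connector; the project name and all version numbers are illustrative assumptions, and the right values depend on what the target cluster actually runs.

```scala
// build.sbt (sketch; project name and version numbers are illustrative assumptions)

// Must match the Scala binary version the cluster's engine was built for.
ThisBuild / scalaVersion := "2.12.18"

// One value for engine and connector, so the two cannot drift apart.
val sparkVersion = "3.5.1"

lazy val root = (project in file("."))
  .settings(
    name := "example-data-job", // hypothetical project name
    libraryDependencies ++= Seq(
      // "provided": the cluster supplies Spark at runtime, so it stays out of the job JAR.
      "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
      // Connectors are often absent from the cluster classpath and are packaged with the job.
      "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
    )
  )
```

The JVM level and the cluster or container runtime are not captured by this fragment and still need to be recorded, for example in the deployment configuration.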
Shape the data model intentionally. Choose:
- typed case classes or schemas
- serialization strategy
- encoder behavior
- where UDFs are truly needed
- how nulls and optional fields are represented
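
The sketch below illustrates one way to represent nulls and optional fields in a typed model, using only the Scala standard library. The `Order` record, its field names, and the raw `Map[String, Any]` input shape are hypothetical and stand in for whatever a connector delivers.

```scala
// Sketch: hypothetical record type; field names are illustrative assumptions.
object OrderModel {
  // Optional fields are Option, so "missing" is visible in the type instead of a stray null.
  final case class Order(id: String, amountCents: Long, coupon: Option[String])

  // Converts an untyped row into the typed model.
  // Returns Left with a reason rather than throwing, so bad records can be routed aside.
  def fromRaw(raw: Map[String, Any]): Either[String, Order] =
    for {
      id <- raw.get("id")
        .collect { case s: String if s.nonEmpty => s }
        .toRight("missing or empty id")
      amount <- raw.get("amount_cents")
        .collect {
          case n: Long => n
          case n: Int  => n.toLong
        }
        .toRight("missing or non-numeric amount_cents")
    } yield {
      // A type pattern never matches null, so a null value and a missing key both become None.
      val coupon = raw.get("coupon").collect { case s: String => s }
      Order(id, amount, coupon)
    }
}
```

Returning `Either` keeps the decision about rejected records (drop, count, or send to a dead-letter output) with the caller instead of hiding it in an exception.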
Package for the real deployment target. Decide:
- sbt or other build surface
- fat-jar or shaded-jar approach
- dependency conflict handling
- resource and config loading behavior
- how the job is launched and parameterized
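
As an illustration of explicit launch parameterization, the sketch below parses `--key=value` arguments into a typed configuration using only the standard library. The `JobConfig` fields and the flag names (`--input`, `--output`, `--checkpoint-dir`) are hypothetical; a real job may prefer an argument-parsing library or the engine's own configuration mechanism.

```scala
// Sketch: hypothetical job parameters; names and flags are illustrative assumptions.
object JobArgs {
  final case class JobConfig(inputPath: String, outputPath: String, checkpointDir: Option[String])

  // Parses "--key=value" arguments as they might be passed after the JAR at launch.
  // Arguments that do not have that shape are ignored in this sketch.
  def parse(args: Seq[String]): Either[String, JobConfig] = {
    val kv = args.collect {
      case a if a.startsWith("--") && a.contains("=") =>
        // Split at the first '=' only, so values may themselves contain '='.
        val (key, rest) = a.drop(2).span(_ != '=')
        key -> rest.drop(1)
    }.toMap

    for {
      in  <- kv.get("input").toRight("missing required --input=<path>")
      out <- kv.get("output").toRight("missing required --output=<path>")
    } yield JobConfig(in, out, kv.get("checkpoint-dir"))
  }
}
```

Failing fast on a missing required parameter in the entry point surfaces a bad launch command before any cluster resources are used.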
Design for distributed execution, not driver-local convenience. Check:
- partition behavior
- skew
- state growth
- checkpoint or savepoint needs
- accidental driver-side collection or closure capture
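
Accidental closure capture can be reproduced without a cluster. The sketch below uses plain Java serialization to imitate what an engine does when it ships a function to workers, assuming Scala 2.12 or later, where function literals are serializable. The `Scaler` class and its method names are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Sketch: class and method names are illustrative assumptions.
object ClosureCheck {
  // Deliberately not Serializable, like many driver-side helpers (clients, contexts).
  class Scaler(factor: Int) {
    // Referencing the field `factor` makes the lambda capture `this`.
    def capturesThis: Int => Int = x => x * factor

    // Copying to a local val first means only the Int is captured.
    def capturesValue: Int => Int = {
      val f = factor
      x => x * f
    }
  }

  // True if the function object survives Java serialization.
  def isSerializable(fn: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(fn) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

A check like `ClosureCheck.isSerializable(fn)` in a unit test is only an approximation, since engines may use their own closure cleaning and serializers, but it can catch the common case of a function dragging a non-serializable enclosing object along.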
Validate with runtime realism. Require:
- representative local or test-cluster execution
- packaging verification
- schema and compatibility checks
- observability and failure handling expectations
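
A schema and compatibility check can start as a small pure function that runs in tests before any cluster execution. The sketch below is one possible shape, assuming a deliberately simplified schema representation (field name mapped to a type name and a nullability flag); the `Field` type and the rules are hypothetical, and a real job would compare the engine's own schema objects.

```scala
// Sketch: simplified, hypothetical schema representation for illustration only.
object SchemaCheck {
  final case class Field(dataType: String, nullable: Boolean)

  // Lists the reasons `actual` cannot be read as `expected`; an empty list means compatible.
  def incompatibilities(expected: Map[String, Field], actual: Map[String, Field]): List[String] =
    expected.toList.sortBy(_._1).flatMap { case (name, exp) =>
      actual.get(name) match {
        case None if exp.nullable => Nil // an absent optional field is tolerated
        case None => List(s"$name: required field is missing")
        case Some(act) if act.dataType != exp.dataType =>
          List(s"$name: expected ${exp.dataType}, found ${act.dataType}")
        case Some(act) if act.nullable && !exp.nullable =>
          List(s"$name: may be null but is required")
        case Some(_) => Nil
      }
    }
}
```

Extra fields in `actual` are ignored here, which is a design choice of this sketch; a stricter job may want to flag them as well.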
Common Rationalizations
| Rationalization | Reality |
|---|---|
| "It compiles, so the job is fine." | Distributed classpath, serialization, and dependency issues often appear only at runtime. |
| "We can fix the JAR if deployment fails." | Packaging problems discovered at deploy time slow delivery and often hide deeper compatibility issues. |
| "A quick UDF is simpler." | Overusing UDFs can hide schema, optimizer, and performance problems in JVM data engines. |
Red Flags
- Scala, engine, and connector versions are not pinned together intentionally
- driver-local collections or closures leak into distributed execution
- shading or dependency conflicts are discovered only after deployment
- type or null behavior is unclear across schemas and case classes
- runtime config is embedded in code instead of explicit launch parameters
Verification
- Runtime, Scala, and library compatibility is explicit
- Packaging strategy matches the target execution environment
- Data model and serialization behavior are intentional
- Distributed execution risks are reviewed beyond compile success