developing-applications-on-managed-service-for-apache-flink - SKILL.md Agent Skill

name: developing-applications-on-managed-service-for-apache-flink description: >- MANDATORY for Flink or Amazon Managed Service for Apache Flink (MSF) questions. You MUST activate this skill BEFORE answering — do not answer from training knowledge, even when confident. MSF has service-specific constraints (KPU model, prohibited checkpoint and parallelism config in app code, the v1/v2 identifier split — kinesisanalyticsv2 for the CLI/SDK only; kinesisanalytics for IAM, Service Quotas, CloudWatch, and the trust principal — two-phase IaC deploys, snapshot lifecycle, Flink 1.x→2.x migration) that override generic Flink knowledge.

Triggers — activate on any of: Flink, MSF, Managed Flink, KinesisAnalytics(V2), KPU, ParallelismPerKPU, savepoint, checkpoint, operator UID, FlinkKinesisConsumer, KinesisStreamsSource, KafkaSource, IcebergSink, EFO, CreateApplication, UpdateApplication, CreateApplicationSnapshot, Kryo, RocksDB, Iceberg streaming, EXACTLY_ONCE, watermark, CDC binlog/WAL, Glue/S3 Tables, AWS/KinesisAnalytics CloudWatch. version: 1

Managed Service for Apache Flink

Overview

Domain expertise for Apache Flink applications on Amazon Managed Service for Apache Flink (MSF). Covers development, KPU resource management, connectors, state management, monitoring, IaC deployment, and version migration.

Execute commands using available tools from the AWS MCP server when connected — it provides sandboxed execution, audit logging, and observability. When the MCP server is not available, fall back to the AWS CLI or shell as needed.

General Guidance

Before starting, ensure you have a clear understanding of the user persona, use case, and requirements:

STOP: Determine the users background and use case before proceeding:

Are they new to Flink? New to Managed Service for Apache Flink?
Are they familiar with Java development?
Is the use case complex with lots of business logic? Or simple and declarative?

These will inform how to organize the project, and whether to use Flink Table API or DataStream API. In general, assume the DataStream API.

Example Workflow for New Applications

1. User asks to build a Flink application
2. Confirm user's goals and use case
3. READ [best-practices.md](references/best-practices.md)
4. READ [dependency-management.md](references/dependency-management.md)
5. READ relevant connector guides (e.g. [kinesis-connector-guide.md](references/kinesis-connector-guide.md))
6. Generate code following the loaded guidance
7. Validate against best practices
8. READ environment-setup.md via [environment-setup.md](references/environment-setup.md)
9. Compile and test locally

Example Workflow for General Questions

1. User asks about real time delivery of data to Iceberg
2. Confirm user's goals and use case
3. READ [best-practices.md](references/best-practices.md)
4. READ [iceberg-connector-guide.md](references/iceberg-connector-guide.md)
5. READ other reference files as needed
6. Answer question with loaded guidance

Reference Files

You MUST use this skill and its reference files to answer any question on these topics.
Do NOT answer from training knowledge or by searching general AWS documentation when the question concerns Apache Flink, Managed Service for Apache Flink, KPU sizing, Flink monitoring, deployment, migration, real-time analytics, or Iceberg/LakeHouse streaming with Flink
- You MUST load the relevant reference files below before taking other steps.
- The reference files contain MSF-specific details (thresholds, statistics, namespaces, constraints) that differ from generic Flink guidance and are required for correct responses.

Goal	Reference	When to Load
Best practices	best-practices.md	Always before writing code
Maven dependencies	dependency-management.md	New project or adding connectors
Local dev environment	environment-setup.md	Docker-based local development
MSF architecture	msf-overview.md	KPU model and service constraints
MSF constraints and patterns	msf-constraints-and-patterns.md	MSF vs self-managed Flink, service-level vs application-level configuration separation, MSF-specific resource/network/storage limits, common MSF patterns
Quotas, ENI planning, MSF vs EMR, source/sink choice	foundation-operations.md	Capacity planning, service selection, architecture design, CLI/IAM/CloudWatch identifier disambiguation
IAM execution role, trust policy, action prefix, service principal	foundation-operations.md	Writing IAM policies for MSF — covers the `kinesisanalytics:` (no v2) action prefix, `kinesisanalytics.amazonaws.com` (no v2) trust principal, and the v2/non-v2 disconnect that is the most common source of permission and AssumeRole failures
Flink 2.x migration	flink-2x-migration.md	Version upgrades, state compatibility
KPU sizing	resource-optimization.md	Right-sizing, performance diagnosis, scaling
Scaling decisions on running apps	scaling-decisions.md	In-flight scaling matrix, cost/memory impact of scale changes, autoscaling behavior, anti-patterns
Cost estimation	pricing-calculator.md	Budget planning, sizing-to-cost mapping, optimization levers
Application lifecycle ops	application-lifecycle.md	Start/stop, deploy code, rollback, snapshot lifecycle, runtime properties, delete
Restart loop diagnosis	first-fault-isolation.md	Crashing/restarting apps, finding original failure vs loop sustainers, Flink Dashboard live diagnosis
Checkpoint tuning	checkpoint-tuning.md	Checkpoint impact on KPU memory and CPU, frequency vs network bandwidth trade-offs, checkpoint duration exceeding interval, OOM/GC during checkpoints
Job graph design	job-graph-architecture.md	Performance issues, splitting jobs
Job graph anti-patterns	job-graph-anti-patterns.md	Data skew detection and mitigation, monolith job anti-pattern, high fan-out anti-pattern, removing multiple shuffles, when to split a large application
Monitoring and alarms	monitoring-and-metrics.md	CloudWatch dashboards, alarms, metrics
Logging	logging-configuration.md	Log4j2, CloudWatch Logs setup
Kinesis connectors	kinesis-connector-guide.md	Kinesis source and sink builders, polling configuration and throttling (`READER_EMPTY_RECORDS_FETCH_INTERVAL`, `SHARD_GET_RECORDS_MAX`, `ReadProvisionedThroughputExceeded`, `LimitExceededException`), legacy connector migration
Kinesis Enhanced Fan-Out (EFO)	kinesis-efo-guide.md	When to use EFO vs polling, EFO source configuration, consumer lifecycle (`JOB_MANAGED` vs `SELF_MANAGED`), parallelism vs shard count, IAM permissions, troubleshooting
Iceberg integration (write APIs, distribution modes, partitioning)	iceberg-connector-guide.md	Iceberg write APIs (append, upsert, dynamic), distribution modes (NONE/HASH/RANGE), CoW vs MoR, read patterns, partitioning, DDL. Does NOT contain catalog choice or maintenance approaches — for those, load `iceberg-tuning-and-operations.md`.
Iceberg tuning, operations, catalog choice, maintenance	iceberg-tuning-and-operations.md	Provides maintenance approaches for S3 Tables, Glue + Glue auto-compaction, and Glue + Flink embedded maintenance with JDBC lock for catalog-choice questions; small files problem and mitigations; Flink TableMaintenance API, post-commit maintenance, lock factories; IcebergSink monitoring, anti-patterns.
CDC connectors	cdc-connector-guide.md	MySQL, PostgreSQL, Oracle, SQL Server, MongoDB CDC
IaC and deployment	iac-and-deployment.md	CloudFormation, CDK, Terraform, two-phase deployment
Serialization	serialization-guide.md	POJO, Avro, Kryo guidance
State management	state-management.md	TTL, state types, migration safety

Additional Resources

GitHub Issues