---
name: data-schema-registry
description: >
  Use this skill when asked about Schema Registry, Avro, Protobuf, schema evolution, compatibility, Confluent Schema Registry, Apicurio, serialization, deserialization, or schema validation.
  This skill enforces: Schema Registry architecture and deployment, Avro/Protobuf/JSON Schema definition, compatibility modes (BACKWARD, FORWARD, FULL, NONE), schema evolution best practices, SerDe (serialization/deserialization) patterns, and CI/CD integration for schema governance.
  Do NOT use for: data contract enforcement, data catalog management, or database schema design.
version: "1.1.0"
author: "j4flmao"
license: "MIT"
compatibility:
  claude-code: true
  cursor: true
  codex: true
  windsurf: true
tags: [data, schema, streaming, phase-11]
---
Data Schema Registry
Purpose
Design and deploy a schema registry with Avro/Protobuf/JSON Schema, compatibility enforcement, SerDe patterns, CI/CD integration, and governance for streaming and batch data.
Agent Protocol
Trigger
Exact user phrases: "Schema Registry", "Avro", "Protobuf", "schema evolution", "compatibility", "Confluent Schema Registry", "Apicurio", "serialization", "deserialization", "schema validation", "subject", "SerDe".
Input Context
- Streaming platform (Kafka, Pulsar, Kinesis)
- Serialization format preference (Avro, Protobuf, JSON Schema)
- Number of producers and consumers, and the languages they are written in
- Schema governance maturity level
- CI/CD pipeline structure
- Existing schema management approach
Output Artifact
Schema registry architecture with format selection, compatibility strategy, SerDe configuration, deployment plan, and CI/CD governance integration.
Response Format
# Schema registry deployment
# Avro/Protobuf/JSON schema examples
# Compatibility rules per subject
# SerDe configuration (Kafka + batch)
# CI/CD schema enforcement pipeline
No preamble. No postamble. No explanations. No filler/hedging/transitions. Compress output — why use many token when few do trick.
Completion Criteria
- Schema format selected with rationale
- Schema registry deployed and configured
- Schemas defined for all streaming topics
- Compatibility mode set per subject
- SerDe configured for producers and consumers
- CI/CD pipeline validates schema changes
- Schema governance documented with owner and review process
Max Response Length
350 lines of configuration.
Workflow
Step 1: Select Schema Format
| Format | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Avro | Native Kafka/Confluent support, schema evolution, compact binary | Java-centric, limited JSON-like readability | Kafka streaming, Confluent ecosystem |
| Protobuf | Language-agnostic, fastest serialization, gRPC native | Stricter schema evolution rules, larger schemas | Cross-language, gRPC services |
| JSON Schema | Readable, widely understood, web-native | Larger payload, slower parsing | REST APIs, web frontends |
Default: Avro for Kafka/Confluent stacks. Protobuf for gRPC + streaming. JSON Schema for REST APIs only.
Step 2: Define Avro Schema
{
  "type": "record", "name": "Order", "namespace": "com.org.data.orders",
  "doc": "Order event schema",
  "fields": [
    {"name": "order_id", "type": "string", "doc": "Unique order identifier"},
    {"name": "customer_id", "type": "string"},
    {"name": "total_amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "status", "type": {"type": "enum", "name": "OrderStatus",
      "symbols": ["PENDING", "CONFIRMED", "SHIPPED", "DELIVERED", "CANCELLED"]}},
    {"name": "items", "type": {"type": "array", "items": {
      "type": "record", "name": "LineItem",
      "fields": [
        {"name": "product_id", "type": "string"},
        {"name": "quantity", "type": "int"},
        {"name": "unit_price", "type": "double"}
      ]
    }}},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
Step 3: Set Compatibility Modes
| Mode | Producer Change | Consumer Impact | Use Case |
|---|---|---|---|
| BACKWARD | Can delete fields, add optional fields | New consumers read old data | Default for most |
| FORWARD | Can add fields, delete optional fields | Old consumers read new data | Long-lived consumers |
| FULL | Add or delete optional fields only | Both directions compatible | Strict governance |
| NONE | Any change allowed | Consumers must sync | Dev/test only |
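The modes above can be illustrated with a minimal Python sketch. It assumes flat Avro record schemas and a deliberately simplified rule (a reader field must exist in the writer schema or carry a default); type promotion, unions and nested records are not inspected. The function names `can_read` and `check` are hypothetical and are not part of any registry client.

```python
def _fields(schema):
    """Map field name -> True if the field declares a default."""
    return {f["name"]: "default" in f for f in schema["fields"]}


def can_read(reader, writer):
    """True if `reader` can decode data written with `writer`.

    Simplified rule (assumption): every reader field must either exist in
    the writer schema or carry a default. Type changes are not checked.
    """
    writer_fields = _fields(writer)
    return all(
        name in writer_fields or has_default
        for name, has_default in _fields(reader).items()
    )


def check(mode, old, new):
    """Evaluate a proposed schema `new` against the registered schema `old`."""
    backward = can_read(reader=new, writer=old)  # new consumers, old data
    forward = can_read(reader=old, writer=new)   # old consumers, new data
    return {
        "BACKWARD": backward,
        "FORWARD": forward,
        "FULL": backward and forward,
        "NONE": True,
    }[mode]


v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
]}
# v2 adds a required field (no default).
v2 = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "channel", "type": "string"},
]}

for mode in ("BACKWARD", "FORWARD", "FULL", "NONE"):
    print(mode, check(mode, v1, v2))
```

In this sketch the added required field passes FORWARD (old consumers ignore it) but fails BACKWARD and FULL, because a consumer on v2 has no value for `channel` when reading v1 data.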
Step 4: Deploy Schema Registry
services:
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    ports: ["8081:8081"]
    environment:
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:9092
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      # Maps to the schema.compatibility.level registry property
      SCHEMA_REGISTRY_SCHEMA_COMPATIBILITY_LEVEL: BACKWARD
      SCHEMA_REGISTRY_LISTENERS: http://0.0.0.0:8081
      SCHEMA_REGISTRY_KAFKASTORE_TOPIC: _schemas
Step 5: Configure SerDe
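// Kafka producer: Avro SerDe with Schema Registry
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import io.confluent.kafka.serializers.KafkaAvroSerializerConfig;

Properties props = new Properties();
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://schema-registry:8081");
props.put(KafkaAvroSerializerConfig.AUTO_REGISTER_SCHEMAS, "false");
props.put(KafkaAvroSerializerConfig.USE_LATEST_VERSION, "true");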
Step 6: CI/CD Schema Validation
jobs:
  schema-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate and Check Compatibility
        run: |
          shopt -s globstar  # without this, ** does not match nested directories in bash
          for schema in schemas/**/*.avsc; do
            python scripts/validate_and_register.py \
              --subject "$(basename "$schema" .avsc)-value" \
              --schema "$schema" --mode BACKWARD
          done
      - name: Register Schema
        if: success() && github.ref == 'refs/heads/main'
        run: |
          shopt -s globstar
          for schema in schemas/**/*.avsc; do
            subject=$(basename "$schema" .avsc)-value
            python scripts/register_schema.py --subject "$subject" --schema "$schema"
          done
Step 7: Schema Evolution Best Practices
| Rule | Rationale |
|---|---|
| Always provide default for new fields | Ensures backward compatibility |
| Never remove a field without deprecation period | Avoids breaking consumers |
| Use enum types with care | Adding symbols is backward-safe, removing is not |
| Key subjects use FULL compatibility | Keys are referenced across topics |
| Auto-register schemas disabled in production | Prevents accidental schema registration |
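To show why the default matters, here is a small Python sketch of what a reader conceptually does with a record written under an older schema. It is a simplification of Avro schema resolution (flat records only, no type checks), and `project` is a hypothetical helper name, not an Avro library function.

```python
def project(record, reader_schema):
    """Shape a decoded record to the reader schema.

    Fields unknown to the reader are dropped; fields missing from the
    record are filled from the reader's default, or rejected if none exists.
    """
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


# Record produced before the "currency" field existed.
old_record = {"order_id": "o-1"}

with_default = {"fields": [
    {"name": "order_id", "type": "string"},
    {"name": "currency", "type": "string", "default": "USD"},
]}
without_default = {"fields": [
    {"name": "order_id", "type": "string"},
    {"name": "currency", "type": "string"},
]}

print(project(old_record, with_default))
try:
    project(old_record, without_default)
except ValueError as err:
    print("rejected:", err)
```

With the default in place, old data stays readable; without it, every record written before the change fails to deserialize under the new schema.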
Step 8: AsyncAPI and JSON Schema
AsyncAPI pairs with Schema Registry by documenting event-driven API message flow with topic names, publish/subscribe patterns, and payload schemas.
Step 9: Buf for Schema Management
Buf enforces Protobuf lint rules and breaking change detection in CI/CD. Running buf breaking --against .git compares the working tree with the latest commit in the local Git repository and reports breaking changes. Use for gRPC microservices requiring rigorous Protobuf governance.
Step 10: Schema Registry as Source of Truth
The schema registry is the authoritative source for all streaming schemas. Producers must register schemas before writing data. Consumers fetch schemas from registry to deserialize. No schema should be hardcoded in application code — always reference the registry.
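Consumers know which schema to fetch because Confluent's serializers prefix each message with a magic byte (0) followed by a 4-byte big-endian schema ID. The Python sketch below frames and unframes that header only; it assumes an Avro payload (the Protobuf format carries additional message indexes after the ID), and `frame` and `unframe` are hypothetical helper names.

```python
import struct

MAGIC_BYTE = 0
HEADER = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema ID


def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix an already-serialized payload with the registry header."""
    return HEADER.pack(MAGIC_BYTE, schema_id) + payload


def unframe(message: bytes):
    """Split a message into (schema_id, payload) so the schema can be looked up."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = HEADER.unpack(message[:HEADER.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[HEADER.size:]


message = frame(42, b"avro-bytes")
schema_id, payload = unframe(message)
print(schema_id, payload)
```

Because only the ID travels with the data, the application never embeds the schema itself; it resolves the ID against the registry (usually through a client-side cache).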
Step 11: Multi-Tenant Schema Registry
For organizations with multiple business units, use subject prefix isolation (commerce.orders-value, finance.invoices-value). Each tenant manages its own schemas within its prefix. The platform team manages registry infrastructure. Use Confluent Schema Registry's authorization mechanism with RBAC to isolate tenant access.
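As a rough illustration of prefix isolation, the Python sketch below parses a subject of the form tenant.topic-key or tenant.topic-value and checks that a caller only touches its own prefix. This is a hypothetical pre-flight check for CI or tooling, not the Confluent RBAC API; the names `parse_subject` and `may_write` are assumptions.

```python
from typing import NamedTuple


class Subject(NamedTuple):
    tenant: str
    topic: str
    kind: str  # "key" or "value"


def parse_subject(subject: str) -> Subject:
    """Split '<tenant>.<topic>-<key|value>' into its parts."""
    tenant, dot, rest = subject.partition(".")
    topic, dash, kind = rest.rpartition("-")
    if not (dot and dash and tenant and topic and kind in ("key", "value")):
        raise ValueError(
            f"subject {subject!r} does not match <tenant>.<topic>-<key|value>"
        )
    return Subject(tenant, topic, kind)


def may_write(caller_tenant: str, subject: str) -> bool:
    """Allow writes only to subjects under the caller's own prefix."""
    return parse_subject(subject).tenant == caller_tenant


print(parse_subject("commerce.orders-value"))
print(may_write("finance", "finance.invoices-value"))
print(may_write("finance", "commerce.orders-value"))
```

A check like this can run before registration to catch misnamed subjects early; actual access control still belongs in the registry's authorization layer.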
Step 12: Schema Migration Strategies
For breaking schema changes that cannot be avoided, follow a multi-step migration:
- Add new fields with defaults (compatible change) — deploy
- Allow old and new schemas to coexist (dual-schema period) — 30 days
- Deprecate old fields (mark as deprecated in schema doc) — notify consumers
- Remove old fields (breaking change, MAJOR version) — deploy
- Clean up old schema versions that are no longer referenced
Architecture / Decision Trees
Format Selection
New schema needed
├── Kafka/Confluent ecosystem? → Avro
├── gRPC services / multi-language? → Protobuf
├── REST APIs / web frontends? → JSON Schema
└── Need both streaming + REST? → Avro for streaming, JSON Schema for REST
Compatibility Mode Selection
Subject type:
├── Key → FULL (keys are critical, must never break)
├── Value, production → BACKWARD (safe default)
├── Value, strict governance → FULL
├── Value, CDC/stream consumers → FORWARD
└── Dev/test → NONE
Registry Deployment Topology
Deployment scale:
├── Single team, < 100 subjects → Single instance
├── Multi-team, < 1000 subjects → HA cluster (3 nodes)
├── Multi-region, < 10000 subjects → Per-region registry + replication
└── Enterprise, global → Multi-registry federation
Common Pitfalls
- Auto-register schemas in production: a single misconfigured producer can register an incorrect schema. Always set auto.register.schemas = false.
- Enum removal: removing symbols from an enum is a breaking change. Once data exists with a symbol, it cannot be removed.
- No default on new fields: adding a required field (no default) breaks backward compatibility. Always provide a default.
- Schema registry single point of failure: without HA, schema registry downtime blocks producers and consumers. Deploy with replication.
- Subject naming inconsistency: different teams use different naming conventions. Enforce a standard: <topic_name>-key and <topic_name>-value.
- Ignoring deleted schema versions: schemas can be deleted but referenced data persists. Use soft-delete or version archiving.
- Protobuf field number reuse: reusing field numbers breaks wire compatibility. Never reuse old field numbers.
- No validation before registration: registering invalid schemas pollutes the registry. Validate locally before submitting.
- Key schema treated same as value: key schemas need stricter compatibility (FULL) because keys are referenced across topics.
- Not using transitive compatibility: non-transitive only checks against latest version. Transitive checks against all versions.
- Missing schema evolution documentation: consumers don't know what changed. Maintain a schema changelog with migration notes.
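The difference between transitive and non-transitive checks from the list above can be shown with a short Python sketch. Schemas are reduced to a mapping of field name to "has a default", and the rule is the simplified one that a reader field must exist in the writer or have a default; `can_read` and `is_backward_compatible` are hypothetical names.

```python
def can_read(reader, writer):
    """True if every reader field exists in the writer or has a default.

    Schemas are simplified to {field_name: has_default}; types are ignored.
    """
    return all(
        name in writer or has_default for name, has_default in reader.items()
    )


def is_backward_compatible(new, history, transitive=False):
    """Check `new` against the latest version only, or against all versions."""
    targets = history if transitive else history[-1:]
    return all(can_read(reader=new, writer=old) for old in targets)


v1 = {"id": False}
v2 = {"id": False, "note": True}   # adds "note" with a default
v3 = {"id": False, "note": False}  # drops the default on "note"

latest_only = is_backward_compatible(v3, [v1, v2])
all_versions = is_backward_compatible(v3, [v1, v2], transitive=True)
print(latest_only, all_versions)
```

Here v3 passes the non-transitive check because v2 already writes `note`, yet it cannot read v1 data still retained in the topic, which only the transitive check detects.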
Best Practices
- Enforce schema compatibility checks in CI/CD before merging to main.
- Every schema change requires a review by the schema governance team.
- Schema registry is the source of truth for all streaming schemas.
- Monitor compatibility check failures and broken producers/consumers.
- Use subject prefixes for multi-tenant registries (e.g., commerce.orders-value).
- Archive schemas older than 2 years to reduce registry size.
- Automate schema evolution: add-only for minor versions, deprecation for major.
- Test schema changes with shadow consumers before deploying.
- Maintain a schema changelog for consumer awareness.
- Use transitive compatibility enforcement for critical subjects.
- Pin producer/consumer schema versions for canary deployments.
- Document field semantics with the doc attribute in the schema definition.
- Use schema references ($ref) for shared types across schemas.
- Set up monitoring for schema registration failures and compatibility check latency.
Compared With
| Feature | Confluent SR | Apicurio | Buf Schema Registry |
|---|---|---|---|
| Formats | Avro, Protobuf, JSON | Avro, Protobuf, JSON, GraphQL, OpenAPI | Protobuf only |
| Compatibility | BACKWARD, FORWARD, FULL, NONE, TRANSITIVE | Same + CUSTOM | Wire + Source compatibility |
| Deployment | Standalone, Confluent Cloud | Standalone, Red Hat | SaaS, self-hosted |
| Integrations | Kafka, KSQL, Connect | Kafka, Quarkus, Camel | gRPC, Connect |
| Governance | Client-side enforcement | Server-side rules | Lint + breaking change rules |
| Ecosystem | Largest Kafka ecosystem | Cloud-native Java | Protobuf-centric |
Schema registry vs data catalog: schema registry focuses on schema storage, compatibility checking, and providing schemas at serialization/deserialization time. Data catalog focuses on metadata management, discovery, and lineage. They are complementary: the schema registry feeds schemas to the catalog, and the catalog provides business context around schemas.
Performance
- Schema registry adds ~5-15ms latency to the first serialization/deserialization call for each schema; later calls are served from the client cache.
- Avro binary serialization: 50-200MB/s throughput per core, payload 60-80% smaller than JSON.
- Protobuf serialization: 100-400MB/s throughput per core, payload 70-85% smaller than JSON.
- JSON Schema: 20-50MB/s throughput, payload same size as JSON (or larger with $ref expansion).
- Schema fetch: first request fetches schema from registry (~10ms), subsequent requests use local cache.
- Avro schema resolution: compatible reader/writer schema resolution happens on deserialization, adding ~1-5us per record.
- Registry caching: client-side schema cache with LRU eviction. Configure cache size based on number of subjects.
- Registry throughput: Confluent SR handles 1000+ schema registrations/second on modest hardware.
- Network overhead: schema registry communication adds ~1-2ms per request in the same region, 10-50ms cross-region.
- Global cache invalidation: when schema changes, existing cached schemas remain valid. Only new versions trigger fresh fetch.
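The client-side caching described in the list above can be sketched in Python as a small LRU cache keyed by schema ID. This is an illustration of the idea, not the cache used by any particular registry client; `SchemaCache` and the `fetch` callback are hypothetical, with `fetch` standing in for the HTTP call to the registry.

```python
from collections import OrderedDict


class SchemaCache:
    """LRU cache of schemas keyed by schema ID.

    A registered schema ID always refers to the same schema, so entries
    never need invalidation, only eviction when the cache is full.
    """

    def __init__(self, fetch, max_size=1000):
        self._fetch = fetch          # callable: schema_id -> schema
        self._max_size = max_size
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, schema_id):
        if schema_id in self._entries:
            self._entries.move_to_end(schema_id)  # mark as recently used
            return self._entries[schema_id]
        self.misses += 1
        schema = self._fetch(schema_id)           # the only network round trip
        self._entries[schema_id] = schema
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)     # evict least recently used
        return schema


cache = SchemaCache(fetch=lambda schema_id: f"schema-{schema_id}", max_size=2)
for schema_id in (1, 2, 1, 3, 1):
    cache.get(schema_id)
print(cache.misses)
```

Sizing `max_size` above the number of schema IDs a client actually sees keeps the registry off the hot path after warm-up.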
Tooling
| Tool | Purpose |
|---|---|
| Confluent Schema Registry | Primary schema registry for Kafka |
| Apicurio | Alternative registry, multi-format support |
| Buf | Protobuf linting, breaking change detection, BSR |
| Avro Tools CLI | Schema validation, code generation |
| protoc | Protobuf compilation, code generation |
| AsyncAPI | Event-driven API documentation |
| Karapace | Open-source schema registry (Kafka-compatible API) |
| kcat | Kafka command-line tool with schema support |
| SchemaCrawler | Schema visualization and documentation |
Protobuf Schema Definition
syntax = "proto3";
package com.org.data.orders;
import "google/protobuf/timestamp.proto";
message Order {
  string order_id = 1;
  string customer_id = 2;
  repeated LineItem line_items = 3;
  double total_amount = 4;
  Currency currency = 5;
  OrderStatus status = 6;
  google.protobuf.Timestamp created_at = 7;
  google.protobuf.Timestamp updated_at = 8;
  map<string, string> metadata = 9;
  message LineItem {
    string product_id = 1;
    int32 quantity = 2;
    double unit_price = 3;
  }
  enum Currency {
    CURRENCY_UNSPECIFIED = 0;
    USD = 1;
    EUR = 2;
    GBP = 3;
  }
  enum OrderStatus {
    STATUS_UNSPECIFIED = 0;
    PENDING = 1;
    CONFIRMED = 2;
    SHIPPED = 3;
    DELIVERED = 4;
    CANCELLED = 5;
  }
}
JSON Schema Definition
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Order",
  "type": "object",
  "properties": {
    "order_id": { "type": "string", "description": "Unique order identifier" },
    "customer_id": { "type": "string" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product_id": { "type": "string" },
          "quantity": { "type": "integer", "minimum": 1 },
          "unit_price": { "type": "number", "minimum": 0 }
        },
        "required": ["product_id", "quantity", "unit_price"]
      }
    },
    "total_amount": { "type": "number", "minimum": 0 },
    "status": { "type": "string", "enum": ["pending", "confirmed", "shipped", "delivered", "cancelled"] },
    "created_at": { "type": "string", "format": "date-time" }
  },
  "required": ["order_id", "customer_id", "total_amount"]
}
Registry Deployment Patterns
# Confluent Schema Registry HA deployment
registry_cluster:
  nodes: 3 # Minimum 3 for quorum-based HA
  storage: "kafka" # Uses internal Kafka topic _schemas
  replication_factor: 3
  kafka_bootstrap_servers: "broker1:9092,broker2:9092,broker3:9092"
  # Multi-datacenter setup
  primary_region: "us-east-1"
  secondary_region: "us-west-2"
  replication:
    type: "active-standby" # Primary handles writes, standby serves reads
    sync_interval: "5s" # Schema replication delay
  # Security
  ssl:
    enabled: true
    mutual_auth: true # mTLS for producer/consumer auth
  rbac:
    enabled: true # Role-based access control
    admin_principal: "schema-admin"
# Apicurio Registry deployment (alternative, multi-format)
apicurio_registry:
  storage: "sql" # PostgreSQL, SQL Server, or Kafka
  formats: ["AVRO", "PROTOBUF", "JSON_SCHEMA", "ASYNCAPI", "OPENAPI"]
  rules:
    global: ["VALIDITY", "COMPATIBILITY"]
    per_artifact: true
  auth: "keycloak" # OIDC integration
  ui: true
SerDe Configuration
# Kafka Avro SerDe (Confluent)
kafka_producer_config:
  key.serializer: "org.apache.kafka.common.serialization.StringSerializer"
  value.serializer: "io.confluent.kafka.serializers.KafkaAvroSerializer"
  schema.registry.url: "https://schema-registry:8081"
  auto.register.schemas: false
  use.latest.version: true # Use latest compatible schema version
kafka_consumer_config:
  key.deserializer: "org.apache.kafka.common.serialization.StringDeserializer"
  value.deserializer: "io.confluent.kafka.serializers.KafkaAvroDeserializer"
  schema.registry.url: "https://schema-registry:8081"
  specific.avro.reader: true
# Kafka Protobuf SerDe
kafka_protobuf_producer:
  key.serializer: "org.apache.kafka.common.serialization.StringSerializer"
  value.serializer: "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer"
  schema.registry.url: "https://schema-registry:8081"
# Batch processing with Avro (Spark)
spark_avro_config:
  spark.sql.avro.compression.codec: "snappy"
  spark.sql.avro.schema.literal: "{\"type\":\"record\",\"name\":\"Order\",...}" # Or use schema registry
CI/CD Schema Governance
# GitHub Action: validate schema compatibility on PR
schema_validation:
  name: "Validate Schema Change"
  steps:
    - "Extract new schema version from PR diff"
    - "Register in schema registry with compatibility check only (dry-run)"
    - "Validate: compatibility_mode = BACKWARD"
    - "If compatible: merge PR, register new schema version"
    - "If breaking: reject PR, notify producer + consumer owners"
compatibility_rules:
  BACKWARD:
    - "New schema can read data written with old schema"
    - "Allowed: adding fields with defaults, removing fields"
    - "Breaking: adding fields without defaults, incompatible field type changes"
  FORWARD:
    - "Old schema can read data written with new schema"
    - "Allowed: adding fields, removing fields that have defaults"
    - "Breaking: removing fields without defaults, incompatible field type changes"
  FULL:
    - "Both backward and forward compatible"
    - "Most restrictive; use for shared datasets"
# Example: CI check script
# jq wraps the raw schema file as an escaped JSON string; curl reads the body from stdin
check_schema_compatibility:
  cli: "jq -Rs '{schema: .}' new_schema.avsc | curl -X POST https://registry:8081/compatibility/subjects/orders-value/versions \
    -H 'Content-Type: application/vnd.schemaregistry.v1+json' \
    -d @-"
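For CI scripts written in Python, the same compatibility request can be assembled without shell quoting pitfalls. The sketch below only builds the request and parses the response; it does not send anything. It targets the `versions/latest` variant of the compatibility endpoint and reads the `is_compatible` field from the response body. The function names and the example registry URL are assumptions.

```python
import json
from urllib.parse import quote

CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def build_compatibility_request(registry_url, subject, schema_text, version="latest"):
    """Return (url, headers, body) for a compatibility check.

    The schema is sent as a JSON string inside a JSON object, so json.dumps
    takes care of escaping quotes and newlines in the .avsc content.
    """
    url = (
        f"{registry_url.rstrip('/')}/compatibility/subjects/"
        f"{quote(subject, safe='')}/versions/{version}"
    )
    headers = {"Content-Type": CONTENT_TYPE}
    body = json.dumps({"schema": schema_text})
    return url, headers, body


def is_compatible(response_text):
    """Read the verdict from the registry's JSON response."""
    return bool(json.loads(response_text).get("is_compatible", False))


url, headers, body = build_compatibility_request(
    "https://registry:8081/", "orders-value", '{"type": "string"}'
)
print(url)
print(body)
print(is_compatible('{"is_compatible": true}'))
```

The returned tuple can be passed to any HTTP client; a CI job would fail the build when `is_compatible` returns False.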
Subject Naming Strategy
subject_naming:
  pattern: "<domain>.<dataset-name>-<key|value>"
  examples:
    - "orders.order-events-value"
    - "inventory.stock-updates-key"
    - "analytics.user-sessions-value"
  record_name_strategy:
    # Confluent default: topic-name-value
    # RecordNameStrategy: uses schema record name
    # TopicRecordNameStrategy: topic-name + record name
    use: "TopicRecordNameStrategy"
    rationale: "Allows multiple record types per topic, better for union types"
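The three Confluent subject name strategies mentioned in the comments above differ only in how the subject string is derived. A minimal Python sketch of that derivation follows; `subject_name` is a hypothetical helper, and the topic and record names are illustrative.

```python
def subject_name(strategy, topic, record_fullname, is_key=False):
    """Derive the registry subject for a message under a given strategy."""
    suffix = "key" if is_key else "value"
    if strategy == "TopicNameStrategy":
        # One schema lineage per topic side: <topic>-key / <topic>-value
        return f"{topic}-{suffix}"
    if strategy == "RecordNameStrategy":
        # Subject is the fully qualified record name, shared across topics
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        # Subject is scoped to both the topic and the record type
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy {strategy!r}")


for strategy in ("TopicNameStrategy", "RecordNameStrategy", "TopicRecordNameStrategy"):
    print(strategy, "->", subject_name(strategy, "orders", "com.org.data.orders.Order"))
```

Note that under the record-based strategies the subject no longer ends in -key or -value, which is worth accounting for when a naming pattern is enforced in CI.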
Schema Migration Workflow
migration_workflow:
  phase_1_propose:
    - "Create new schema version in development branch"
    - "Run compatibility check against latest production version"
    - "Document: what changed, why, expected impact"
    - "Tag with semver change type (MAJOR/MINOR/PATCH)"
  phase_2_review:
    - "Notify all topic consumers of upcoming change"
    - "Check consumer compatibility reports"
    - "If BACKWARD compatible: standard review"
    - "If breaking: consumer acceptance required, migration plan needed"
  phase_3_deploy:
    - "Register new schema version (not yet default)"
    - "Stage deployment: producers and consumers upgrade gradually"
    - "Phase A: all consumers upgraded to handle new schema (read-compatible)"
    - "Phase B: producers switch to new schema version"
    - "Phase C: old schema deprecated, retention period starts"
  phase_4_cleanup:
    - "Archive old schema versions after retention (typically 6 months)"
    - "Remove deprecated field documentation from registry"
    - "Update consumer documentation with new schema details"
Decision Trees
Compatibility Mode Selection
Consumer deployment flexibility?
├── Consumers can upgrade at any time (same team/org)
│ └── BACKWARD compatibility (default, practical)
├── Consumers on fixed release cycles (out of sync)
│ └── FULL compatibility (both directions)
├── Producers need flexibility to iterate quickly
│ └── FORWARD compatibility (producers upgrade first and can add fields)
├── Protobuf with wire compatibility
│ └── Use Protobuf built-in compatibility (field numbers never reused)
└── Experimental / dev topics
└── NONE (no compatibility enforcement, dev only)
Schema Format Selection
Primary use case?
├── Kafka streaming, Confluent ecosystem
│ └── Avro (best tooling, native Confluent support)
├── Cross-language services, gRPC
│ └── Protobuf (fastest, most language bindings)
├── REST APIs, web frontends, NoSQL
│ └── JSON Schema (readable, web-native)
├── Event-driven APIs, AsyncAPI
│ └── JSON Schema with AsyncAPI wrapper
└── Mixed ecosystem
└── Apicurio (multi-format registry)
Rules
- Every Kafka topic has a registered schema (key + value)
- Production topics enforce BACKWARD or FULL compatibility
- auto.register.schemas = false in all production producers
- Schema changes reviewed via PR before registration
- Deprecated fields documented with removal version
- Enum symbols never removed once data exists in topics
- Schema registry replicated across regions for HA
- No schema change without compatibility check in CI/CD
- Subject naming follows <domain>.<topic-name>-<key/value> convention
- Key subjects use FULL compatibility mode
- Use transitive compatibility for critical subjects
- Archive unused schema versions after 2 years
- Document field semantics in schema doc attribute
- Test schema changes with shadow consumers before production rollout
- Maintain a schema changelog visible to all consumers
- Use latest schema version in consumers for forward compatibility
- Never reuse field numbers in Protobuf schemas
- Validate schemas in CI/CD before merging PRs
References
- references/registry-setup.md — Schema Registry Setup
- references/schema-evolution.md — Schema Evolution
- references/schema-governance.md — Schema Governance
- references/schema-migration-strategies.md — Schema Migration Strategies
- references/schema-registry-operations.md — Schema Registry Operations Reference
- references/schema-registry-tools.md — Schema Registry Ecosystem Tools
- references/schema-registry-evolution.md — Schema Registry Evolution Deep Dive
- references/schema-registry-integration-patterns.md — Integration Patterns Reference
Handoff
data-data-platform for registry deployment. data-data-catalog for schema metadata. data-data-contracts for data contract schema integration. data-data-observability for schema drift monitoring.