replication-patterns


Use when designing how a database keeps multiple copies of its data in agreement across nodes for availability, read scaling, and disaster recovery: the three foundational topologies (single-leader / primary-replica, multi-leader / multi-primary, leaderless / quorum), synchronous vs asynchronous replication and the replication-lag trade-off, log shipping vs statement replication vs trigger-based replication, the read-after-write consistency problem and its mitigations (sticky session, read-from-leader, monotonic reads), the failover model and split-brain risk, and the relationship to the CAP/PACELC choices the topology realizes. Do NOT use for horizontal partitioning across nodes (use sharding-strategy), the CAP theoretical frame itself (use cap-theorem-tradeoffs), single-node transactional guarantees (use acid-fundamentals), or query tuning (use query-optimization).

By jacob-balslev. Updated 6/4/2026

name: replication-patterns
description: "Use when designing how a database keeps multiple copies of its data in agreement across nodes for availability, read scaling, and disaster recovery: the three foundational topologies (single-leader / primary-replica, multi-leader / multi-primary, leaderless / quorum), synchronous vs asynchronous replication and the replication-lag trade-off, log shipping vs statement replication vs trigger-based replication, the read-after-write consistency problem and its mitigations (sticky session, read-from-leader, monotonic reads), the failover model and split-brain risk, and the relationship to the CAP/PACELC choices the topology realizes. Do NOT use for horizontal partitioning across nodes (use sharding-strategy), the CAP theoretical frame itself (use cap-theorem-tradeoffs), single-node transactional guarantees (use acid-fundamentals), or query tuning (use query-optimization)."
license: MIT
allowed-tools: Read Grep
metadata:
  relations: '{"related":["acid-fundamentals","query-optimization","indexing-strategy","cap-theorem-tradeoffs","sharding-strategy","transaction-isolation"],"suppresses":["sharding-strategy","cap-theorem-tradeoffs"],"verify_with":["transaction-isolation","cap-theorem-tradeoffs","sharding-strategy"]}'
  subject: data-engineering
  scope: "Designing database replication topologies and operational guardrails for keeping multiple copies of the same data in agreement across nodes: single-leader, multi-leader, leaderless/quorum; synchronous, semi-synchronous, and asynchronous replication; log-shipping and statement/row/trigger/logical mechanisms; read-after-write mitigations; failover and split-brain prevention; conflict resolution; and backup-vs-replica boundaries. Portable across distributed database systems. Excludes horizontal partitioning across nodes (sharding-strategy), the abstract CAP/PACELC frame itself (cap-theorem-tradeoffs), single-node transaction guarantees (acid-fundamentals), isolation-level choice (transaction-isolation), and query/index tuning (query-optimization/indexing-strategy)."
  public: "true"
  taxonomy_domain: engineering/data
  stability: experimental
  keywords: '["replication","primary replica","multi-leader","leaderless","quorum","synchronous replication","asynchronous replication","replication lag","read-after-write","failover"]'
  triggers: '["single-leader vs multi-leader","synchronous vs async replication","what happens on failover","split brain","read-after-write consistency"]'
  examples: '["design replication topology for a service with one region writing and three regions reading","decide between synchronous and asynchronous replication given a target RPO","diagnose stale reads after a write (likely replication lag without read-after-write handling)","explain the split-brain risk in multi-leader replication"]'
  anti_examples: '["horizontally partition data across nodes (use sharding-strategy)","reason about the CAP theorem abstractly (use cap-theorem-tradeoffs)","explain ACID properties (use acid-fundamentals)"]'
  mental_model: "|"
  purpose: "|"
  concept_boundary: "|"
  analogy: "Replication is to a database what mirror copies of a master photograph are to a museum's record: single-leader async is the photographer keeping the negative and printing copies as requested; single-leader sync is the conservator requiring two darkroom signatures before any print leaves the building; multi-leader is multiple authorized photographers in different cities each accepting submissions and reconciling at intervals; leaderless quorum is asking three of five archivists to vote on whether this print matches the master, accepting their verdict. Failover is replacing the negative-keeper when they retire; split brain is what happens when the agency forgets to revoke the old keeper's keys."
  misconception: "|"
  skill_graph_source_repo: "https://github.com/jacob-balslev/skill-graph"
  skill_graph_project: Skill Graph
  skill_graph_canonical_skill: skills/data-engineering/replication-patterns/SKILL.md

Concept of the skill

What it is: Replication is the design discipline for keeping multiple copies of the same data on multiple nodes so a database can survive failures, scale reads, or place data near users.

Mental model: The core choices are topology, synchrony, replication mechanism, read-freshness policy, failover policy, and conflict handling. Single-leader systems serialize writes through one primary, multi-leader systems accept writes in more than one place and must merge conflicts, and leaderless systems use quorums so clients can read or write through multiple nodes.

Why it exists: A single database copy creates one point of failure and one read bottleneck. Replication adds resilience and scale, but it also creates lag, failover, stale-read, and split-brain risks that the application must deliberately handle.

What it is NOT: It is not sharding, which splits different data across nodes. It is not CAP theory itself, single-node ACID guarantees, isolation-level selection, query tuning, indexing, or backups.

Adjacent concepts: sharding-strategy partitions data; cap-theorem-tradeoffs names the consistency/availability frame; acid-fundamentals and transaction-isolation describe local transaction guarantees; backup and restore practice protects against replicated corruption or deletion.

One-line analogy: Replication is like keeping synchronized copies of a critical operations log in several control rooms: the system keeps working when one room fails, but everyone needs rules for who may write, how copies catch up, and who takes charge after an outage.

Common misconception: Turning replication on does not automatically create zero data loss, fresh reads, safe failover, or backups; each of those safety properties requires an explicit topology, synchrony, routing, fencing, monitoring, and recovery choice.

Replication Patterns

Coverage

This skill catalogs the replication topologies and the operational discipline that makes them work in production. It covers:

  • the three foundational topologies (single-leader / primary-replica, multi-leader / multi-primary, leaderless / quorum)
  • the synchrony spectrum (sync, semi-sync, async, quorum-sync)
  • the mechanism choices (statement-based, row-based, trigger-based, logical, physical)
  • the read-after-write consistency problem and its mitigations (sticky session, read-from-leader, monotonic reads, version tokens)
  • the failover model and quorum-based split-brain prevention
  • the conflict-resolution choices in multi-leader and leaderless systems (LWW, CRDTs, vector clocks, application merge)
  • the relationship to the CAP/PACELC choices the topology realizes

Philosophy of the skill

Replication is the discipline that gives a database fault tolerance, read scaling, and disaster recovery. The price is explicit consistency handling, conflict resolution, and added operational complexity.

The default starting point is single-leader with asynchronous replication: simple, well-understood, and sufficient for most read-mostly workloads. Each departure from this default (multi-leader, leaderless, synchronous, geo-distributed) addresses a specific requirement, such as multi-region writes, strong consistency under partition, or RPO=0, and adds proportional complexity.

The most common production failures are not in the replication topology itself but in the application's handling of its consequences: stale reads producing user-visible bugs, split brain producing silent data divergence, failover producing unrehearsed surprises. The discipline is treating replication as an architecture the application is co-designed with, not a database feature that handles itself.

Topology Selection

| Topology | Best for | Trade-offs |
| --- | --- | --- |
| Single-leader, async | Read-heavy workloads with tolerance for stale reads | Replication lag; possible data loss on leader failure |
| Single-leader, sync | Workloads requiring RPO=0 | Write latency; replica failure can block writes |
| Multi-leader | Multi-region writes; active-active disaster recovery | Conflict resolution complexity |
| Leaderless (quorum) | High availability with tunable consistency | Read latency set by the slowest of the R replicas contacted; quorum-sizing decisions |
| Synchronous via Raft/Paxos (Spanner, CockroachDB) | Strong consistency at scale | High write latency when replicas are far apart |

The starting point for most workloads is single-leader async; depart from this default only when the workload requires it.
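The "conflict resolution complexity" in the multi-leader row starts with detecting that two writes were concurrent at all. A minimal sketch of that detection step using vector clocks follows; the `compare` function and the node names are illustrative assumptions, not the API of any particular database.

```python
from typing import Dict

# A vector clock maps a node name to the number of writes seen from that node.
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return how clock `a` relates to clock `b`.

    'before'     : a happened before b (b can safely overwrite a)
    'after'      : a happened after b
    'equal'      : same version
    'concurrent' : neither saw the other, so a merge policy must decide
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two leaders accepted a write to the same row without seeing each other.
print(compare({"eu": 2, "us": 1}, {"eu": 1, "us": 2}))  # concurrent
```

Only the `concurrent` case needs a resolution policy (LWW, CRDT merge, or application merge); the other cases are ordinary overwrites.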

Synchronous vs Asynchronous Trade-off

| Property | Synchronous | Asynchronous |
| --- | --- | --- |
| Write latency | Higher (waits for replica) | Lower (acks immediately) |
| RPO (data loss on failure) | Zero for acknowledged writes | Up to lag window |
| Read consistency from replicas | Up to date, provided the ack waits for the replica to apply the write | Eventual |
| Replica failure impact | Can block writes | None on writes |
| Use case | High-stakes financial transactions, strong-RPO systems | Most production read-replicas |

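Semi-synchronous replication is the middle ground: the leader waits for one replica to acknowledge, and falls back to asynchronous mode if no acknowledgement arrives before a timeout. It is a common production choice, but RPO is zero only while the fallback has not triggered.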
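The timeout-to-async fallback can be sketched as a small decision function. This is a simplified model, assuming the leader knows each replica's acknowledgement delay; `commit_mode` and its parameters are hypothetical names, not a real database setting.

```python
from typing import Optional, Sequence


def commit_mode(ack_delays_ms: Sequence[Optional[float]], timeout_ms: float) -> str:
    """Classify how a write is acknowledged under semi-synchronous replication.

    ack_delays_ms: one entry per replica; None means the replica never answered.
    Returns 'semi-sync' if at least one replica acked within the timeout,
    otherwise 'async-fallback' (the write is acked with no replica copy yet).
    """
    answered = [d for d in ack_delays_ms if d is not None]
    if answered and min(answered) <= timeout_ms:
        return "semi-sync"
    return "async-fallback"


print(commit_mode([12.0, 80.0], timeout_ms=50.0))   # semi-sync
print(commit_mode([None, 900.0], timeout_ms=50.0))  # async-fallback
```

Every `async-fallback` result is a write that can be lost if the leader fails before a replica catches up, so it is worth counting and alerting on.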

Read-After-Write Mitigations

| Mitigation | How it works | Cost |
| --- | --- | --- |
| Read-from-leader for a window after write | Client routes back to leader for N seconds | Loses read scaling for write-heavy users |
| Sticky session | Client always reads from same replica | Replica failure invalidates session |
| Monotonic reads | Client tracks last-seen version; replica must be ≥ that fresh | Requires version-token plumbing |
| Wait-for-replica | Client waits until replica catches up | Adds latency to the read |
| Accept stale reads | Application tolerates staleness | Only viable where staleness has no user-visible effect |

Every read-mostly workload with async replicas must choose one of these deliberately. Defaulting to "read from any replica" produces stale-data bugs.
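The version-token mitigation from the table can be sketched as a routing decision. The sketch assumes the client carries the version of its last write (or last read) and that each replica reports the version it has applied; `pick_read_node` and the node names are hypothetical.

```python
from typing import Dict


def pick_read_node(
    replica_versions: Dict[str, int],
    min_version: int,
    leader: str = "leader",
) -> str:
    """Choose where to send a read that must see at least `min_version`.

    replica_versions: applied version per replica (for example a log position).
    Falls back to the leader when no replica is fresh enough.
    """
    fresh = [name for name, v in replica_versions.items() if v >= min_version]
    if fresh:
        # Prefer the most caught-up replica; sort first for a stable tie-break.
        return max(sorted(fresh), key=lambda name: replica_versions[name])
    return leader


# The client just wrote version 42.
print(pick_read_node({"r1": 41, "r2": 43}, min_version=42))  # r2
print(pick_read_node({"r1": 41, "r2": 40}, min_version=42))  # leader
```

Falling back to the leader keeps the read correct at the cost of read scaling, which is the same trade-off the read-from-leader row describes.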

Failover and Split-Brain Prevention

| Mechanism | Split-brain risk | Used by |
| --- | --- | --- |
| Manual operator failover | Low if the procedure fences the old leader before promotion | Small / legacy systems |
| Heuristic auto-failover (no quorum) | High under partition | Older Postgres failover tools that lack consensus |
| Quorum-based promotion (Raft / Paxos) | Eliminated by majority requirement | Patroni (backed by a consensus store such as etcd), CockroachDB (internal Raft) |
| STONITH fencing | Old leader killed; cannot continue writing | Pacemaker / Corosync HA |

A system that must not split-brain uses quorum-based promotion. The cost is running at least three voting nodes; an odd count is preferred because an even count adds a node without adding failure tolerance.
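The majority requirement and the preference for an odd voter count both follow from simple arithmetic. A minimal sketch, with `can_promote` and `tolerated_failures` as illustrative names:

```python
def can_promote(reachable_voters: int, total_voters: int) -> bool:
    """A side of a partition may promote a leader only with a strict majority."""
    if total_voters < 1 or not 0 <= reachable_voters <= total_voters:
        raise ValueError("reachable_voters must be between 0 and total_voters")
    return reachable_voters > total_voters // 2


def tolerated_failures(total_voters: int) -> int:
    """Number of voters that can fail while a majority still exists."""
    return (total_voters - 1) // 2


# A 5-voter cluster splits 3 / 2: only one side can promote.
print(can_promote(3, 5), can_promote(2, 5))  # True False

# An even split of 4 voters leaves both sides stuck; 4 tolerates no more than 3.
print(can_promote(2, 4))                              # False
print(tolerated_failures(3), tolerated_failures(4))   # 1 1
```

Because two disjoint groups cannot both hold a strict majority, at most one side of any partition can ever promote.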

Verification

After applying this skill, verify:

  • The replication topology is intentional and documented: single-leader / multi-leader / leaderless; sync / async / semi-sync; physical / logical / trigger-based.
  • Read-after-write consistency is handled explicitly. Sticky session, read-from-leader, monotonic reads, version tokens, or accept-stale is a named choice.
  • Replication lag is monitored. Lag thresholds trigger alerts before users notice.
  • Failover procedure is documented, tested, and rehearsed. An untested failover tends to fail the first time it is needed.
  • Split-brain prevention uses quorum-based promotion for any system requiring it. Heuristic failover is recognized as risky.
  • Backups exist separately from replication. Replica state and backup state are distinguished.
  • For multi-leader: the conflict resolution mechanism (LWW / CRDT / vector clocks / app merge) is defined and tested with concurrent-write scenarios.
  • For leaderless: W and R values are chosen for the workload's consistency-vs-latency target; W + R > N if every read must overlap the latest acknowledged write.
  • Cross-region replication latency is measured and accepted as part of the design budget.
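The leaderless check in the list above (W + R > N) can be verified mechanically when reviewing a configuration. A small sketch, with `quorums_overlap` as an illustrative name:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one node with every write quorum.

    n: number of replicas, w: write acks required, r: replicas read per query.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return w + r > n


print(quorums_overlap(n=3, w=2, r=2))  # True: any 2 readers include a node that took the write
print(quorums_overlap(n=3, w=1, r=1))  # False: fast, but a read can miss the latest write
```

Overlap guarantees that a read contacts at least one replica holding the latest acknowledged write; picking that value out of the responses still depends on versioning and conflict handling.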

Do NOT Use When

| Instead of this skill | Use | Why |
| --- | --- | --- |
| Horizontally partitioning data across nodes | sharding-strategy | sharding partitions data; replication copies it |
| Reasoning about the CAP/PACELC theoretical frame | cap-theorem-tradeoffs | CAP names the trade-off; this skill realizes it |
| Single-node transactional guarantees | acid-fundamentals | ACID is the single-system frame |
| Choosing isolation levels | transaction-isolation | transaction-isolation owns concurrency; this owns multi-node |
| Indexing | indexing-strategy | indexing is within-node retrieval |
| Tuning a slow query | query-optimization | query-optimization is per-query |
