name: consensus-and-strong-consistency description: > Apply strong consistency mechanisms in distributed systems - ACID transactions, two-phase commit (2PC) for cross-partition transactions and its blocking pathology, consensus algorithms (Paxos, Multi-Paxos, Raft) with terms / AppendEntries / heartbeats / quorum-based commit, linearizability vs serializability, TrueTime + commit-wait (Spanner), single-threaded per-partition execution (VoltDB), and deterministic transaction execution (Calvin / FaunaDB). Use when designing on Spanner / CockroachDB / VoltDB / etcd, picking a consensus algorithm, or diagnosing transactional behavior in a distributed DB. Triggers - "2PC", "Paxos vs Raft", "linearizability vs serializability", "TrueTime", "Spanner", "CockroachDB", "VoltDB", "etcd Raft", "ACID across partitions", "compensating action", "XA transaction". Produces a strong-consistency design with explicit algorithm choice and trade-offs.
Consensus and Strong Consistency
You design and operate distributed systems that require strong consistency across partitions or replicas — ACID transactions, replicated state machines, linearizable reads. Strong consistency at scale is genuinely expensive: 2PC blocks on coordinator failure, consensus algorithms add latency, linearizability across regions requires bounded clocks. This skill is the toolkit.
This skill captures Gorton's Foundations of Scalable Systems (Ch 12): ACID, 2PC, Paxos / Multi-Paxos / Raft, linearizability vs serializability, Spanner's TrueTime + commit-wait, VoltDB's single-threaded-per-partition approach, and Calvin-style deterministic transactions.
For eventually consistent mechanisms (RYOWs, quorums, version vectors,
CRDTs), see eventual-consistency-mechanics.
When to Use This Skill
- Designing on Spanner, CockroachDB, YugabyteDB, VoltDB, etcd
- Picking a consensus algorithm for replicated state
- Implementing distributed transactions across partitions or services
- Diagnosing long latency on distributed-transaction workloads
- Choosing between strong consistency and eventual for a workload
- Deciding when 2PC's blocking pathology is acceptable
ACID, Recapped for Distributed Systems
| Letter | Single-server meaning | Distributed meaning |
|---|---|---|
| Atomicity | All operations in a transaction commit or none do | Hard — requires 2PC across partitions |
| Consistency | Database invariants preserved | App-level; mostly the same |
| Isolation | Concurrent transactions don't interfere | Even harder distributed; requires distributed locks or MVCC + timestamps |
| Durability | Committed data survives crash | Replicated to multiple nodes / disks |
Two senses of "consistency":
- Transactional consistency — business rules (the C in ACID).
- Replica consistency — all replicas see the same state.
Strong consistency in distributed systems usually means both: ACID transactions + linearizable replica behavior.
Two-Phase Commit (2PC)
The classic algorithm for atomicity across partitions or services. Designed by Jim Gray in 1978; still in use everywhere.
Phase 1: PREPARE Phase 2: COMMIT
Coordinator Coordinator
│ │
├── prepare? ────► P1 ├── commit ─────► P1
│ │
├── prepare? ────► P2 ├── commit ─────► P2
│ │
├── prepare? ────► P3 ├── commit ─────► P3
│ │
│ ◄──── yes ─── P1 │ ◄──── ack ─── P1
│ ◄──── yes ─── P2 │ ◄──── ack ─── P2
│ ◄──── yes ─── P3 │ ◄──── ack ─── P3
│ │
│ (or "no" → abort) │ ── DONE ──
Roles:
- Coordinator: orchestrates the transaction.
- Participants: the partitions/services holding pieces of the transaction.
- Transaction context: identifier passed to all participants.
Phases:
- Prepare: coordinator asks each participant if it can commit. Each votes yes or no. Yes = "I have written this and I'm ready; I won't fail unilaterally." No = "Abort."
- Resolve: if all said yes, coordinator sends commit. Otherwise abort. Participants then make the change durable (or roll back) and ack.
The fundamental pathology: coordinator failure between prepare and commit.
┌───────────────────┐
│ Coordinator │
│ (failed) │
└─────────┬─────────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌────┐ ┌────┐ ┌────┐
│ P1 │ │ P2 │ │ P3 │ ← all said "yes" in prepare
│ │ │ │ │ │ ← waiting for commit/abort
│ ? │ │ ? │ │ ? │ ← cannot commit alone
│ │ │ │ │ │ ← cannot abort alone (others may have committed)
└────┘ └────┘ └────┘ ← BLOCKED until coordinator recovers
Participants cannot autonomously decide. They must wait for the coordinator to recover. Locks held during this time. Other transactions can pile up.
Mitigations:
- Highly available coordinator (replicated)
- Tight coordinator-failure detection
- Recovery from coordinator log on restart
- Write-ahead log on participants
- Three-phase commit (3PC) — adds an extra phase to bound participant blocking, but makes other failure modes worse; rarely used in practice
XA / JTA / JTS: the standardized 2PC across heterogeneous databases. Java EE's JTS implements XA. SQL Server, Oracle, IBM DB2 all support XA.
Compensating actions: an alternative to 2PC. Each step has a forward
operation and a compensating operation that semantically reverses it. Used
in Sagas (see distributed-system-patterns). No coordinator blocking; eventual
atomicity instead of immediate.
Consensus Algorithms
Where 2PC handles cross-partition transactions, consensus algorithms let multiple replicas agree on a sequence of state-machine commands. They power replicated state machines, leader elections, distributed locks, and configuration stores.
The Family Tree
Lamport's Paxos (1998) "Correct, hard to understand"
│
▼
Multi-Paxos "Optimization for repeated rounds"
│
▼
Raft (2013, Ongaro & Ousterhout) "An understandable consensus algorithm"
Now the dominant production choice
Raft: The Practical Choice
Replicas play one of three roles:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Leader │ ◄───────► │ Follower │ ◄───────► │ Follower │
│ │ │ │ │ │
│ Receives │ │ Replicates │ Replicates
│ writes │ │ leader's log │ leader's log
│ │ │ │
└──────────┘ └──────────┘ └──────────┘
Concepts:
Term Monotonically increasing logical clock per leadership period.
Leaders are elected per term.
AppendEntries The leader's RPC for replicating log entries to followers.
Heartbeat is AppendEntries with no entries.
Heartbeat Leader sends to followers every 300-500ms typical.
Missing heartbeat → follower may start an election.
Quorum-based commit
Leader marks an entry committed once a MAJORITY of followers
have replicated it.
Election On leader failure, a follower becomes candidate, requests
votes, becomes leader if it gets a majority.
Why Raft beat Paxos in practice:
- Reference implementation (etcd's raft library) is mature and reusable.
- Leader-based simplifies log replication.
- The Raft paper is readable in one sitting; Paxos requires multiple.
- Configuration changes (adding/removing nodes) have a clean specification.
Raft adopters (Ch 12 + 13):
- etcd (Kubernetes' config store)
- Consul, Nomad (HashiCorp)
- Hazelcast
- YugabyteDB, CockroachDB (per-tablet Raft)
- MongoDB (replica set elections)
- Neo4j (clusters)
- RabbitMQ (quorum queues)
Multi-Paxos / Paxos Adopters
- Spanner (per-Paxos-group within a split)
- Chubby (Google's lock service, Paxos-based)
- ZooKeeper uses Zab, a Paxos-flavored protocol
- Kafka used ZooKeeper historically (moving to KRaft, a Raft variant)
Default in 2026: Raft for new systems. Use Multi-Paxos when an existing codebase or organizational expertise demands it.
Linearizability vs Serializability
The two strongest consistency models, from different communities:
| Linearizability | Serializability | |
|---|---|---|
| Community | Distributed systems | Databases |
| Definition | Every operation appears to take effect at a single point in real time, between its invocation and response | Concurrent transactions appear to execute in some sequential order |
| Real-time ordering | Yes — if T1 finished before T2 started, T1 must precede T2 | No — order can be any serial order |
| Strength | Stronger (real-time + total order) | Weaker (just total order) |
| Achievability | Hard at scale (needs synchronized clocks) | Easier (snapshot isolation, MVCC, distributed locks) |
Strict serializability = linearizability + serializability. Spanner's flagship guarantee.
Practical guidance:
- Default to serializability for transactional workloads.
- Linearizability when external clients need real-time guarantees (financial transactions, leader-election outputs, distributed locks).
Spanner: TrueTime and Commit Wait
Spanner offers external consistency — a fancy name for strict serializability across the globe.
The challenge: without a synchronized clock, you can't agree on real- time order across regions.
TrueTime: Google's clock service backed by GPS receivers + atomic clocks in every datacenter, with bounded uncertainty (typically ε ≈ 7 ms).
TrueTime API:
TT.now() → [earliest, latest] ← interval, not point
Example: TT.now() = [12:00:00.043, 12:00:00.057]
→ "real time is somewhere in this 14ms window"
Commit wait: when a transaction commits, Spanner picks a commit timestamp and waits ε milliseconds before reporting success. This guarantees no other node will see a TrueTime that contradicts the order.
Transaction commits at TrueTime = [t-ε, t+ε]
Pick commit_ts = t+ε
Wait until TT.now().earliest > t+ε
Report success
Now any other node querying TrueTime will report > t+ε,
so they'll order this transaction strictly before their reads.
Cost: ~7 ms added to every commit. Worth it for global linearizability in financial / inventory / regulated workloads.
Implication: CockroachDB and YugabyteDB don't have TrueTime hardware. They approximate with HLC (Hybrid Logical Clocks) and NTP, accepting weaker guarantees (snapshot isolation, not external consistency by default). This is one reason Spanner is special.
VoltDB: Single-Threaded Per Partition
A different approach. Within a partition, run one transaction at a time on one thread. No locks, no isolation puzzles — just sequential execution in memory.
Partition #1 Partition #2 Partition #3
─────────── ─────────── ───────────
Single thread Single thread Single thread
txn 1 → done txn 4 → done txn 7 → done
txn 2 → done txn 5 → done txn 8 → done
txn 3 → done txn 6 → done txn 9 → done
... ... ...
Cross-partition transactions: 2PC across partitions.
| Property | Value |
|---|---|
| Single-partition transaction | Cheap (no locking, in-memory) |
| Multi-partition transaction | Expensive (2PC) |
| Storage | In-memory primarily; command log + snapshots for durability |
| Linearizability | Since v6.4 (Jepsen surfaced earlier issues) |
Design implication: model your data so transactions are partition-local. Multi-partition is an escape hatch, not the default.
Calvin / FaunaDB: Deterministic Transactions
Another approach: pre-order all transactions globally, then execute on each replica. Because all replicas see the same input order, they produce the same output. No coordination per transaction — coordination is in the ordering layer.
Client 1 ──┐ All replicas process the same
│ ordered stream:
Client 2 ──┼──► Sequencer ──► Stream ─────► Replica 1 → state
│ (consensus on ────► Replica 2 → state
Client 3 ──┘ transaction order) ────► Replica 3 → state
Each replica's output is deterministic given the input order.
Pros: No 2PC. Replicas independently produce identical state.
Cons: The sequencer is a bottleneck. Less common in production than Spanner / Raft-based systems.
Used in: Calvin (research), FaunaDB (commercial).
Strongly Consistent Reads
Even on systems that default to eventual reads, you can usually request strong reads explicitly:
| System | API |
|---|---|
| DynamoDB | ConsistentRead=true (extra RCU cost) |
| Cassandra | consistency_level=QUORUM or ALL |
| MongoDB | readPreference=primary |
| Cloud Spanner | Default; or read --strong |
| Bigtable | Default within a row |
Cost: an extra round-trip to the leader. Use only where stale reads cause real problems.
Principles
- Prefer transactional consistency where possible. Spanner team:
"It is better to have application programmers deal with performance problems due to overuse of transactions than coding around the lack of transactions."
- Single-partition transactions are cheap; multi-partition trigger 2PC and are slow. Design data models to keep transactions partition-local.
- Use Raft (or Multi-Paxos) for replica state. Raft's reference implementation makes it the practical choice.
- 2PC blocks on coordinator failure. Make the coordinator highly available; have a recovery path.
- Strong consistency at scale costs latency. Don't pay it where eventual is fine.
- Linearizability requires synchronized clocks for global guarantees. TrueTime is the production example; HLC is the open-source approximation.
- Verify consistency claims against Jepsen reports. Marketing consistency != actual consistency under partition.
- Compensating actions (Saga) are the alternative to 2PC for
microservices — eventual atomicity, no coordinator blocking. See
distributed-system-patterns.
Anti-Patterns
2PC Across Microservices Without a Hardened Coordinator
Looks like: Distributed transaction across 5 microservices using XA. Coordinator is the API gateway; if it crashes mid-commit, all 5 services hang waiting.
Why it fails: Coordinator failure blocks participants indefinitely.
The fix: Saga pattern with compensating actions. Or use a workflow engine (Temporal, AWS Step Functions) as the durable, replicated coordinator.
Multi-Partition Transactions as the Norm
Looks like: Schema designed without partition awareness. Every transaction crosses 2-3 partitions.
Why it fails: Each transaction triggers 2PC. Throughput collapses.
The fix: Model partition-local transactions. Co-locate joinable data via shared partition key.
Rolling Your Own Consensus
Looks like: "We'll just use a leader and have followers ack writes." Slowly turns into a half-built Raft.
Why it fails: The edge cases (split-brain, partial ack, leader during partition, log divergence) are exactly the hard ones.
The fix: Use etcd, Consul, or embed a Raft library (etcd-io/raft, hashicorp/raft).
Trusting NTP Clocks for Linearizability
Looks like: "We have NTP, so timestamps are accurate enough." Two events with timestamps 50ms apart actually happened in the opposite order.
Why it fails: NTP skew can be 100ms+. No bound on skew.
The fix: TrueTime (if you're Google) or HLC. Or accept snapshot isolation, not linearizability.
Strongly Consistent Reads Everywhere
Looks like: Every DynamoDB read sets ConsistentRead=true. RCU usage
doubles. Bill spikes.
Why it fails: Most reads are fine eventual.
The fix: Strong consistency only where stale reads break the user experience.
Ignoring Jepsen
Looks like: Vendor claims "linearizable consistency under all conditions" without a Jepsen report.
Why it fails: Jepsen has found bugs in nearly every distributed database. Untested claims are marketing.
The fix: Look for Jepsen reports at jepsen.io. Read them carefully — even systems Jepsen approves of usually have caveats.
Decision Rules
| Situation | Action |
|---|---|
| Replicated config / metadata store | etcd (Raft) |
| Distributed lock | etcd, Consul, or ZooKeeper |
| Replica state machine | Raft library (hashicorp/raft, etcd raft) |
| Cross-partition transactions in OLTP | Spanner / CockroachDB / VoltDB |
| Global linearizability | Spanner (only) |
| ACID open-source distributed | CockroachDB or YugabyteDB |
| Cross-microservice transaction | Saga with compensating actions, not 2PC |
| Transaction within a single shard | Native DB transaction |
| Need linearizable reads | Read from leader via consensus protocol |
| Need to verify consistency claim | Read the Jepsen report |
| Choosing Paxos vs Raft | Raft, unless you have specific Paxos expertise |
Worked Example: Replicated Configuration Store
Context: Microservices fleet needs a shared configuration store with
strong consistency. Updates rare (humans push config); reads frequent
(every service starts with read config); must survive single-node failure.
Choice: etcd (Raft-backed key-value store).
Topology: 5-node etcd cluster across 3 AZs. Quorum = 3.
Operations:
| Op | Mechanism |
|---|---|
| Read | Default: linearizable read via leader |
| Read (relaxed) | Serializable read from any node (slightly stale, fast) |
| Write | Raft consensus across quorum |
| Watch | Long-poll on key changes |
Failure modes:
| Failure | Behavior |
|---|---|
| Single node down | Quorum (3 of 4 remaining) maintained; service continues |
| 2 nodes down | Quorum (3 of 3) maintained; tight |
| 3 nodes down | No quorum; reads/writes fail until recovery |
| Network partition | Minority side can't accept writes |
Why etcd over a custom solution:
- Battle-tested (powers Kubernetes).
- Raft library reused.
- Watch / lease / compare-and-swap built in.
- Jepsen-tested; known guarantees.
What we deliberately didn't do:
- Write our own Raft.
- Use ZooKeeper (heavier, JVM, older operational model).
- Use a relational DB (no native watch; consensus would be bolted on).
Gotchas
- Spanner is not magic. TrueTime is hardware-dependent. Don't assume CockroachDB / YugabyteDB give you the same guarantees.
- 2PC participant blocking is rare in practice but devastating when it happens. Plan for coordinator recovery; monitor coordinator availability separately.
- Raft membership changes have edge cases. Use joint consensus or single-server changes; never multi-server changes at once.
- etcd is not a database. It's a small-key-value store for metadata, not for application data. Volumes: KBs to MBs total, not GBs.
- Cassandra's
LWT(lightweight transactions) use Paxos for compare- and-swap. Slow — don't overuse. - MongoDB transactions across shards trigger 2PC. Single-shard transactions are cheap; cross-shard are expensive.
- Snapshot isolation ≠ serializability. Snapshot isolation has write-skew anomalies. PostgreSQL serializable mode (SSI) closes them.
- Jepsen has surfaced consistency bugs in nearly every system. Read reports for the version you're running.
- HLC (Hybrid Logical Clocks) combine wall and logical clocks; used by CockroachDB and others to bound the clock-uncertainty window without TrueTime hardware.
- Workflow engines (Temporal, AWS Step Functions, Cadence) are effectively durable, replicated coordinators for long-running multi-service transactions. They sidestep 2PC by making the coordinator itself fault- tolerant.
Related Skills
eventual-consistency-mechanics— the cheaper alternativedistributed-system-patterns— Saga (compensating actions instead of 2PC)scalable-database-design-and-sharding— choosing the databasedistributed-systems-essentials— clocks, idempotency, partial failuremicroservices-resilience-patterns— patterns at the service layer
Source: Foundations of Scalable Systems by Ian Gorton, Chapter 12. Read the Raft paper (Ongaro & Ousterhout, 2014), the Spanner paper (Corbett et al., 2012), and the Jepsen reports for the system you're using.