distributed-consensus - SKILL.md Agent Skill

name: distributed-consensus
description: |
Distributed-systems architecture at the protocol level: consensus (Raft, Paxos, BFT), replication and quorums, consistency models, clock synchronization, and the CAP/PACELC trade-offs. Architect-level: how to make state agree and survive failures.

USE WHEN: designing replicated/consensus systems, "Raft", "Paxos", "BFT", "quorum", "leader election", "consistency model", "linearizability", "CAP", "PACELC", "split-brain", replication topology, distributed state machines.

DO NOT USE FOR: app microservice wiring (use web/enterprise patterns); message queues (use messaging skills); blockchain specifics (use bitcoin skills).
allowed-tools: Read, Grep, Glob

Distributed Consensus & Replication

Consensus protocol selection

Protocol	Fault model	Notes
Raft	Crash-fault (tolerates f failures with 2f+1 nodes)	Understandable, leader-based; used by etcd and Consul
(Multi-)Paxos	Crash-fault (tolerates f failures with 2f+1 nodes)	Foundational, subtle; powers the Chubby/Spanner lineage
BFT (PBFT, Tendermint)	Byzantine (tolerates f faulty nodes with 3f+1 nodes)	For untrusted/adversarial nodes; higher message cost
Viewstamped Replication	Crash-fault (tolerates f failures with 2f+1 nodes)	Raft-like, predates it

Use crash-fault consensus inside a trust boundary. Use BFT only when nodes can be malicious (cross-org, blockchain): it needs more nodes (3f+1 instead of 2f+1) and more messages per decision.
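The node counts in the table can be turned into a quick sizing check. The following is a minimal sketch, not a reference implementation; the helper names `cluster_size`, `max_faults`, and `quorum_size` are hypothetical and only encode the 2f+1 and 3f+1 bounds stated above.

```python
def cluster_size(f: int, byzantine: bool = False) -> int:
    """Minimum number of nodes needed to tolerate f faulty nodes."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1 if byzantine else 2 * f + 1


def max_faults(n: int, byzantine: bool = False) -> int:
    """Largest f an n-node cluster can tolerate."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3 if byzantine else (n - 1) // 2


def quorum_size(n: int, byzantine: bool = False) -> int:
    """Smallest quorum such that any two quorums intersect safely.

    Crash-fault: a simple majority, so two quorums share at least one node.
    Byzantine: ceil((n + f + 1) / 2), so two quorums share at least f + 1
    nodes, i.e. at least one honest node. This equals 2f + 1 when n = 3f + 1.
    """
    if not byzantine:
        return n // 2 + 1
    f = max_faults(n, byzantine=True)
    return (n + f) // 2 + 1


if __name__ == "__main__":
    for f in (1, 2):
        print(f"f={f}: crash-fault n={cluster_size(f)}, "
              f"BFT n={cluster_size(f, byzantine=True)}")
```

The practical reading: tolerating one failure costs three nodes under crash-fault assumptions but four under Byzantine assumptions, and the quorum grows accordingly.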

Replication & quorums

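Quorum: choose W + R > N so that every read quorum overlaps every write quorum, which means at least one replica in any read holds the latest acknowledged write; tune per workload (e.g. N=3, W=2, R=2).
Leader-based (strong consistency, simple, but the leader is a bottleneck) vs leaderless (Dynamo-style, highly available, needs read-repair/anti-entropy + conflict handling).
Sync vs async replication: sync waits for replica acknowledgement, so an acknowledged write survives failover (RPO = 0) at the cost of write latency; async acknowledges early for lower latency but can lose the most recent writes on failover (RPO > 0).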
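The overlap rule is easy to check by brute force. The sketch below assumes strict quorums (no sloppy quorums or hinted handoff) and a single integer version per value; `Versioned`, `quorums_overlap`, and `every_read_sees_latest` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Sequence


@dataclass(frozen=True)
class Versioned:
    version: int
    value: str


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when any read quorum must share a replica with any write quorum."""
    return w + r > n


def every_read_sees_latest(replicas: Sequence[Versioned], r: int, latest: int) -> bool:
    """Check every possible set of r replicas for one holding the latest version."""
    return all(
        max(v.version for v in subset) == latest
        for subset in combinations(replicas, r)
    )


if __name__ == "__main__":
    # N=3, W=2: version 2 reached two replicas, the third is still stale.
    replicas = [Versioned(2, "new"), Versioned(2, "new"), Versioned(1, "old")]
    print(quorums_overlap(3, 2, 2), every_read_sees_latest(replicas, 2, latest=2))
    print(quorums_overlap(3, 2, 1), every_read_sees_latest(replicas, 1, latest=2))
```

With R=2 every read includes a fresh replica; with R=1 a read can land on the stale replica alone, which is the case read-repair and anti-entropy exist to clean up.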

Consistency models (state the one you need)

From strongest to weakest: linearizable → sequential → causal → eventual. Stronger models need more coordination, which means higher latency and lower availability during partitions. Don't ask for linearizable if causal suffices.

CAP / PACELC

Under a Partition choose Consistency or Availability; Else (no partition) trade Latency vs Consistency. Real systems are points on a spectrum (Spanner: CP + TrueTime clocks; Dynamo: AP).

Clocks: avoid relying on wall-clock ordering across nodes, because clock skew can reorder events. Use logical / hybrid logical clocks, or bounded-uncertainty clocks (TrueTime) for external consistency.
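To make the hybrid logical clock option concrete, here is a minimal sketch of the usual HLC update rules: a wall-clock component plus a logical counter that breaks ties and absorbs skew. The class and method names are hypothetical, and the injectable `physical_now` callable is an assumption added so the behaviour can be tested deterministically.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True, order=True)
class Timestamp:
    wall: int     # physical component, milliseconds
    logical: int  # counter for events sharing the same wall value


class HybridLogicalClock:
    def __init__(self, physical_now: Optional[Callable[[], int]] = None):
        # Default to the system clock in milliseconds.
        self._now = physical_now or (lambda: time.time_ns() // 1_000_000)
        self._wall = 0
        self._logical = 0

    def tick(self) -> Timestamp:
        """Timestamp a local or send event."""
        pt = self._now()
        if pt > self._wall:
            self._wall, self._logical = pt, 0
        else:
            # Physical clock has not advanced past what we have seen.
            self._logical += 1
        return Timestamp(self._wall, self._logical)

    def update(self, remote: Timestamp) -> Timestamp:
        """Timestamp a receive event, merging the sender's timestamp."""
        pt = self._now()
        new_wall = max(self._wall, remote.wall, pt)
        if new_wall == self._wall and new_wall == remote.wall:
            self._logical = max(self._logical, remote.logical) + 1
        elif new_wall == self._wall:
            self._logical += 1
        elif new_wall == remote.wall:
            self._logical = remote.logical + 1
        else:
            self._logical = 0
        self._wall = new_wall
        return Timestamp(self._wall, self._logical)


if __name__ == "__main__":
    fast = HybridLogicalClock(lambda: 100)  # node whose clock reads 100 ms
    slow = HybridLogicalClock(lambda: 90)   # node whose clock lags by 10 ms
    sent = fast.tick()
    received = slow.update(sent)
    print(sent, received, sent < received)
```

Even though the receiving node's physical clock is behind, the receive timestamp still orders after the send, which is the property plain wall-clock timestamps cannot give.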

When to recommend what

Config/coordination, small cluster → Raft (etcd).
Global strong consistency → Paxos/Spanner-style + synchronized clocks.
High availability, geo, conflict-tolerant → leaderless Dynamo-style + CRDTs.
Adversarial/multi-party → BFT.
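For the leaderless + CRDT option in the list above, a grow-only counter (G-Counter) is one of the simplest examples of state that replicas can merge without coordination. This is a minimal sketch under the assumption that each replica has a unique id; the `GCounter` class is hypothetical and not taken from any particular library.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        # Each replica only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Max is commutative, associative, and idempotent, so replicas
        # converge regardless of merge order or repeated delivery.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    a, b = GCounter("a"), GCounter("b")
    a.increment(2)   # concurrent updates on different replicas
    b.increment(3)
    a.merge(b)
    b.merge(a)
    print(a.value(), b.value())
```

Both replicas accept writes independently and still converge after exchanging state, which is why this style suits geo-distributed, availability-first designs.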