name: raft-cluster-drill
description: Run a 3-node meta-service Raft cluster failover drill for RobustMQ. Use when the user wants to verify cluster startup, node removal/recovery, leader election, snapshot replication, or graceful shutdown, e.g. "演练 raft 集群" (run the raft cluster drill), "test cluster failover", "验证节点删除/恢复" (verify node removal/recovery), "run the raft drill".
RobustMQ Raft Cluster Drill
A reproducible drill to verify the meta-service Raft cluster behaves correctly on RobustMQ's openraft-based multi-runtime broker. It exercises bootstrap, join, leader election, node recovery, snapshot replication, and graceful shutdown.
When to Use
- Verify a fresh 3-node cluster starts and forms membership {1,2,3}
- Confirm killing a node does NOT change membership and (if it was leader) triggers re-election
- Confirm restarting a node recovers automatically (no re-join needed)
- Confirm Ctrl+C / kill -INT exits each node gracefully
- Verify state-machine data correctness under repeated kill/restart: data written through Raft survives restarts, all 3 nodes converge, last_applied is monotonic (write→restart→read-back; Drill F)
- Regression-check the openraft fixes: signal handling, snapshot send/receive, no leave-on-stop, total_order_seek prefix reads
Prerequisites
- Build the binary first. Drills must run the compiled binary, NOT cargo run: cargo run intercepts SIGINT via its parent process and masks real shutdown behavior. Build with: cargo build --package cmd --bin broker-server
- Cluster configs live in config/cluster/server-{1,2,3}.toml (broker_id 1/2/3, grpc 1228/2228/3228, admin http 58080/58082/58083, meta_addrs = all three). Data dirs: data/broker-{1,2,3}.
- Admin list endpoints (/api/cluster/*/list) default to limit=10. When a drill accumulates >10 rows, ALWAYS pass an explicit ?limit=100000, or the lexicographically-last rows are paginated out and look "missing". Trust total_count over the returned row count.
- Run drill scripts with bash, not zsh (zsh's 1-based arrays and lack of $var word-splitting silently break the kill order and port lookups).
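The pagination pitfall in the list above can be reproduced offline. This is a minimal sketch, assuming the list endpoint returns rows in lexicographic order and that the cluster already holds one tenant before the drill starts (the name `default` is hypothetical). `paginate` is an illustrative stand-in, not the server's `parse_limit`.

```python
def paginate(names, limit=10):
    """Stand-in for a list endpoint: sort lexicographically, return the first `limit` rows."""
    rows = sorted(names)
    return {"total_count": len(rows), "data": rows[:limit]}

# Hypothetical state after 10 Drill F rounds: one pre-existing tenant + drillf-r1..r10.
tenants = ["default"] + [f"drillf-r{i}" for i in range(1, 11)]

page = paginate(tenants)                # default limit=10
full = paginate(tenants, limit=100000)  # explicit large limit

# Rows that exist but were cut off by the default page size.
missing = sorted(set(full["data"]) - set(page["data"]))
print("total_count:", page["total_count"], "returned:", len(page["data"]))
print("paginated out:", missing)
```

Because "drillf-r10" sorts before "drillf-r2", the row that falls off the first page is "drillf-r9", while total_count still reports every row.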
Key Semantics (openraft) — what "correct" looks like
| Action | Expected behavior |
|---|---|
| First node, no peer reachable | single-node bootstrap → becomes Leader |
| Fresh node, peer reachable | join (add_learner + change_membership), snapshot streamed if lagging |
| Node restart (has persisted state) | recover — openraft re-establishes replication, no re-join, membership unchanged |
| Stop a node | membership unchanged {1,2,3} (stop ≠ scale-in); if it was leader, remaining quorum re-elects |
| Graceful stop | SIGINT/SIGTERM → optional drain → process exits (libc handler + watchdog) |
Quorum rule: 3 nodes → need ≥2 alive to elect/serve. Never kill down to <quorum
while a node is still starting, or that node hangs in check_meta_service_status.
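The quorum rule can be written down as a small check to run before each kill. This is a sketch only: `quorum` and `safe_to_kill` are illustrative helper names, not part of RobustMQ or openraft, and "healthy" is assumed to mean a node that has finished starting.

```python
def quorum(cluster_size: int) -> int:
    """Majority needed to elect a leader or commit a write."""
    return cluster_size // 2 + 1

def safe_to_kill(cluster_size: int, healthy: int) -> bool:
    """True if stopping one more healthy node still leaves a majority.

    `healthy` counts only nodes that are fully up; a node that is still
    starting does not count toward it.
    """
    return healthy - 1 >= quorum(cluster_size)

print(quorum(3))                   # majority for a 3-node cluster
print(safe_to_kill(3, healthy=3))  # all three up: one may be stopped
print(safe_to_kill(3, healthy=2))  # one node already down or still starting
```

With three nodes the majority is two, so a kill is only safe while all three are healthy. This is why the rolling drills restart each node and wait for it to rejoin before touching the next one.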
Drill Steps
0. Always clean state first (mandatory)
Every drill run starts from a clean slate — kill any leftover process and wipe all node data. Stale Raft state from a previous run will otherwise interfere (wrong membership, leftover leader, etc.):
pkill -9 -f "broker-server --conf" 2>/dev/null; sleep 1
rm -rf data/broker-1 data/broker-2 data/broker-3 data/logs
1. Start the 3-node cluster (staggered so node1 bootstraps first)
./target/debug/broker-server --conf config/cluster/server-1.toml > /tmp/n1.log 2>&1 &
sleep 7
./target/debug/broker-server --conf config/cluster/server-2.toml > /tmp/n2.log 2>&1 &
sleep 9
./target/debug/broker-server --conf config/cluster/server-3.toml > /tmp/n3.log 2>&1 &
sleep 12
Verify in logs:
- node1: No reachable peers, bootstrapping single-node cluster (node 1)
- node2/node3: Successfully joined cluster via peer 127.0.0.1:1228 (with NO snapshot not found message)
2. Query Raft state via admin API (/api/info)
Each admin port returns data.meta with the three state machines
(metadata_0, offset_0, data_0). Cross-check all nodes agree:
# Print one line per Raft group for each admin port
for p in 58080 58082 58083; do
  echo "--- port $p ---"
  curl -s http://localhost:$p/api/info | python3 -c "
import sys,json
d=json.load(sys.stdin).get('data',{})
for n,rs in sorted(d.get('meta',{}).items()):
    cfg=rs.get('membership_config',{}).get('membership',{}).get('configs',[])
    la=(rs.get('last_applied') or {}).get('index')
    print(f' [{n}] state={rs.get(\"state\")} leader={rs.get(\"current_leader\")} term={rs.get(\"current_term\")} last_log_index={rs.get(\"last_log_index\")} last_applied={la} members={cfg}')
"
done
Expect: all three nodes report the same leader/term per shard, members [[1,2,3]],
and identical last_applied / last_log_index per shard (consensus converged).
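The cross-check can also be automated rather than read by eye. The sketch below assumes each node's data.meta has already been fetched and parsed into a dict keyed by admin port, with the field names used in the query above. `check_converged` and `shard` are hypothetical helpers, and the sample values are made up for illustration.

```python
def check_converged(snapshots, expected_members=((1, 2, 3),)):
    """snapshots: {admin_port: data.meta dict}. Returns a list of problems (empty = converged)."""
    problems = []
    groups = sorted({g for meta in snapshots.values() for g in meta})
    for g in groups:
        views = set()
        for port, meta in snapshots.items():
            rs = meta.get(g)
            if rs is None:
                problems.append(f"{g}: missing on port {port}")
                continue
            configs = rs.get("membership_config", {}).get("membership", {}).get("configs", [])
            members = tuple(tuple(sorted(c)) for c in configs)
            if members != tuple(expected_members):
                problems.append(f"{g}: port {port} reports members {configs}")
            applied = (rs.get("last_applied") or {}).get("index")
            views.add((rs.get("current_leader"), rs.get("current_term"), applied))
        if len(views) > 1:
            problems.append(f"{g}: nodes disagree on (leader, term, last_applied)")
    return problems

def shard(leader, term, applied):
    """Build a made-up Raft group record shaped like the /api/info output."""
    return {"current_leader": leader, "current_term": term,
            "last_applied": {"index": applied},
            "membership_config": {"membership": {"configs": [[1, 2, 3]]}}}

good = {p: {"metadata_0": shard(1, 3, 42)} for p in (58080, 58082, 58083)}
lagging = dict(good)
lagging[58083] = {"metadata_0": shard(1, 3, 40)}  # one node behind on last_applied

print(check_converged(good))
print(check_converged(lagging))
```

An empty list means every node reported the same leader, term, and last_applied for every group, with membership [[1,2,3]]. A node that is briefly behind right after a restart shows up as a disagreement, so re-query after a few seconds before treating it as a failure.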
Drills A–F are historical names; run them in the section order below (A remove → B rejoin → D rolling → E sustained stress → C graceful shutdown → F state-machine correctness). Each is self-contained given a running cluster.
3. Drill A — remove a node (membership + leader check)
Kill a node, wait for graceful exit, re-query:
kill -INT $(pgrep -f "server-3.toml") # or server-1 to test leader failover
# wait until the process exits, then re-run step 2
Expect:
- Killed node exits gracefully in ~5–7s.
- Membership stays [[1,2,3]] (stop does not remove the node).
- If you killed the leader, remaining nodes re-elect (new current_leader, higher term).
- If you killed a follower, the leader is unchanged.
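These expectations can be expressed as a before/after comparison for a single Raft group. This is a sketch under assumptions: `check_after_kill` is a hypothetical helper, and the `members`, `current_leader`, and `current_term` keys are assumed to have been extracted from the step 2 output beforehand.

```python
def check_after_kill(before, after, killed_node):
    """Compare one Raft group's state before and after stopping `killed_node`."""
    problems = []
    # Stopping a node must never change membership.
    if after["members"] != before["members"]:
        problems.append("membership changed")
    if before["current_leader"] == killed_node:
        # Leader was killed: expect a different leader and a higher term.
        if after["current_leader"] in (None, killed_node):
            problems.append("no new leader elected")
        if after["current_term"] <= before["current_term"]:
            problems.append("term did not increase after leader loss")
    elif after["current_leader"] != before["current_leader"]:
        # Follower was killed: the leader should be untouched.
        problems.append("leader changed although a follower was killed")
    return problems

before = {"members": [[1, 2, 3]], "current_leader": 1, "current_term": 3}
follower_kill = {"members": [[1, 2, 3]], "current_leader": 1, "current_term": 3}
leader_kill = {"members": [[1, 2, 3]], "current_leader": 2, "current_term": 4}

print(check_after_kill(before, follower_kill, killed_node=3))
print(check_after_kill(before, leader_kill, killed_node=1))
```

Both sample transitions return an empty list. A result such as "membership changed" would indicate that a stop was treated as a scale-in.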
4. Drill B — rejoin the node (recovery)
Restart the killed node; it has persisted state so it recovers:
./target/debug/broker-server --conf config/cluster/server-3.toml > /tmp/n3.log 2>&1 &
sleep 12
Expect in log: Node 3 has persisted state, recovering existing cluster and
Meta Service cluster is ready. Re-run step 2 — all nodes back, consensus consistent.
5. Drill D — rolling kill + restart of every node (failover stress)
Kill each node in turn, observe leader failover, then restart it and observe it rejoin — going around all three nodes. Always keep ≥2 nodes alive (kill one, restart it before killing the next), otherwise quorum is lost.
for c in 1 2 3; do
  echo "===== rolling: node$c ====="
  # leader before kill (ask a node that stays up)
  ask=$([ "$c" = 1 ] && echo 58082 || echo 58080)
  echo "leader before: $(curl -s http://localhost:$ask/api/info | python3 -c "import sys,json;m=json.load(sys.stdin).get('data',{}).get('meta',{});print({n:rs.get('current_leader') for n,rs in m.items()})")"
  # kill node$c, wait for graceful exit
  kill -INT $(pgrep -f "server-$c.toml")
  while pgrep -f "server-$c.toml" >/dev/null; do sleep 1; done
  sleep 4
  echo "leader after kill: $(curl -s http://localhost:$ask/api/info | python3 -c "import sys,json;m=json.load(sys.stdin).get('data',{}).get('meta',{});print({n:rs.get('current_leader') for n,rs in m.items()})")"
  # restart node$c, wait to rejoin
  ./target/debug/broker-server --conf config/cluster/server-$c.toml > /tmp/n$c.log 2>&1 &
  sleep 14
  grep -oE "recovering existing cluster|Meta Service cluster is ready" /tmp/n$c.log | sort -u
  echo "members after rejoin: $(curl -s http://localhost:$ask/api/info | python3 -c "import sys,json;m=json.load(sys.stdin).get('data',{}).get('meta',{});print([rs.get('membership_config',{}).get('membership',{}).get('configs',[]) for rs in m.values()][:1])")"
done
Expect each iteration:
- Killing a leader → remaining quorum re-elects (leader changes, term +1). Killing a follower → leader unchanged.
- Membership stays [[1,2,3]] throughout (kill ≠ scale-in).
- Restarted node logs recovering existing cluster + Meta Service cluster is ready, rejoins automatically, all nodes converge on the same leader/term.
5b. Drill E — sustained rolling kill+restart stress (graceful-exit watch)
A longer stress variant of Drill D focused on graceful shutdown under churn:
across many rounds, kill+restart every node, plus one extra operation per
round (so each round performs 4 kill+restart cycles, e.g. nodes
1 → 2 → 3 → 1). The wrap-around (operating node 1 again) deliberately hits a
node that was just restarted in the same round, exercising
kill-during-recovery. Always keep ≥2 nodes alive — restart each node and
let it fully rejoin before killing the next.
For each kill, this drill asserts the node exits gracefully (prints
Termination signal received and exits on its own within the watchdog window —
NO kill -9 fallback, NO 20s force-exit). For each restart, it asserts the node
rejoins (recovering existing cluster + Meta Service cluster is ready).
GRACE_TIMEOUT=25 # seconds to wait for self-exit before declaring a hang

graceful_kill() { # $1 = node id
  local c=$1 t0
  kill -INT "$(pgrep -f "server-$c.toml")"
  t0=$(date +%s)
  while pgrep -f "server-$c.toml" >/dev/null; do
    sleep 1
    if [ $(($(date +%s) - t0)) -gt "$GRACE_TIMEOUT" ]; then
      echo " !!! node$c DID NOT EXIT in ${GRACE_TIMEOUT}s: HANG, forcing kill -9"
      kill -9 "$(pgrep -f "server-$c.toml")"; return 1
    fi
  done
  local secs=$(($(date +%s) - t0))
  # graceful = printed the shutdown banner AND did not hit the watchdog force-exit
  if grep -q "Termination signal received" /tmp/n$c.log && \
     ! grep -q "forcing exit" /tmp/n$c.log; then
    echo " node$c exited GRACEFULLY in ${secs}s"
  else
    echo " !!! node$c exit was NOT graceful (no banner or watchdog force-exit) in ${secs}s"; return 1
  fi
}

restart_node() { # $1 = node id
  local c=$1
  ./target/debug/broker-server --conf config/cluster/server-$c.toml > /tmp/n$c.log 2>&1 &
  sleep 14
  if grep -q "Meta Service cluster is ready" /tmp/n$c.log; then
    echo " node$c REJOINED ($(grep -oE 'recovering existing cluster' /tmp/n$c.log | head -1 | sed 's/.*/recovering/'))"
  else
    echo " !!! node$c did NOT report cluster ready after restart"
  fi
}

for round in $(seq 1 10); do
  echo "========== Drill E round $round =========="
  for c in 1 2 3 1; do # 3 nodes + 1 wrap-around = 4 ops/round
    echo "--- op: node$c (kill+restart) ---"
    graceful_kill "$c"
    restart_node "$c"
  done
  # consensus + error scan after each round
  ask=58080
  echo "members: $(curl -s http://localhost:$ask/api/info | python3 -c "import sys,json;m=json.load(sys.stdin).get('data',{}).get('meta',{});print([rs.get('membership_config',{}).get('membership',{}).get('configs',[]) for rs in m.values()][:1])" 2>/dev/null)"
  PANIC=$(( $(grep -c 'invalid state: expect' /tmp/n1.log)+$(grep -c 'invalid state: expect' /tmp/n2.log)+$(grep -c 'invalid state: expect' /tmp/n3.log) ))
  NONE=$(( $(grep -c 'last_log_id=None' /tmp/n1.log)+$(grep -c 'last_log_id=None' /tmp/n2.log)+$(grep -c 'last_log_id=None' /tmp/n3.log) ))
  ERR=$(( $(grep -c ' ERROR' /tmp/n1.log)+$(grep -c ' ERROR' /tmp/n2.log)+$(grep -c ' ERROR' /tmp/n3.log) ))
  FORCE=$(( $(grep -c 'forcing exit' /tmp/n1.log)+$(grep -c 'forcing exit' /tmp/n2.log)+$(grep -c 'forcing exit' /tmp/n3.log) ))
  echo "round $round: raft_panic=$PANIC last_log_id_None=$NONE ERROR=$ERR watchdog_force_exit=$FORCE"
done
Watch for (any of these = a problem to investigate):
- Hang: a node not exiting within GRACE_TIMEOUT on kill -INT (only kill -9 works).
- Non-graceful exit: watchdog forcing exit fired, or no Termination signal received banner.
- raft_panic / last_log_id_None / clean_hole > 0: the storage regression resurfaced.
- Membership drifted off [[1,2,3]], or a node failed to report Meta Service cluster is ready after restart.
- ERROR lines that are not the known-benign startup Connection refused (peer still starting).
6. Drill C — graceful shutdown
Kill nodes one by one, each waiting for the previous to exit (so quorum is never lost while a node is still starting):
for c in 3 2 1; do
  kill -INT $(pgrep -f "server-$c.toml")
  while pgrep -f "server-$c.toml" >/dev/null; do sleep 1; done
  echo "node$c exited"
done
Expect: each node prints Termination signal received, the service starts to stop
then exits. The leader/data-leader node may take longer (waits on replication to
downed peers) but still exits within the watchdog window.
7. Drill F — state-machine correctness under repeated kill/restart
This is the deepest drill: it verifies the state machine itself stays correct across repeated kill/restart, not just that nodes come back up. Each round writes data through Raft, kills+restarts every node (leader first, on purpose), then proves the data survived and the three nodes agree.
Verifies four properties:
- (1) Write→restart→read-back: data committed before a restart is still readable on every node afterward (the restarted node must rebuild its cache from the persisted state machine). Uses tenants: POST /api/cluster/tenant/create (Raft-committed, auto-forwarded to leader) and GET /api/cluster/tenant/list (served from each node's local cache, which Raft apply maintains). No auth.
- (2) Three-node consensus convergence: every node's data.meta.<group> reports the same last_applied.index / current_term / current_leader / members per Raft group.
- (3) last_applied monotonicity: last_applied.index per group never goes backward across rounds (apply progress is durably persisted).
- (4) Targeted leader kill: each round kills the current metadata leader first, so failover + graceful-exit-of-leader is exercised every round.
Run with bash, NOT zsh. Save as a script with #!/bin/bash and run bash drill_f.sh. zsh's 1-based arrays and lack of $var word-splitting silently break the port lookup and the kill order (the cluster never gets killed and writes go to an empty port), so the drill looks like it passes while doing nothing.
#!/bin/bash
set -u
GRACE_TIMEOUT=25
port_of() { case "$1" in 1) echo 58080;; 2) echo 58082;; 3) echo 58083;; esac; }
graceful_kill() { # $1 = node id; asserts self-exit + graceful, no kill -9
  local c=$1 t0 pid; pid=$(pgrep -f "server-$c.toml")
  [ -z "$pid" ] && { echo " !!! node$c not running before kill"; return 1; }
  kill -INT "$pid"; t0=$(date +%s)
  while pgrep -f "server-$c.toml" >/dev/null; do sleep 1
    [ $(($(date +%s)-t0)) -gt "$GRACE_TIMEOUT" ] && { echo " !!! node$c HANG -> kill -9"; kill -9 "$(pgrep -f "server-$c.toml")"; return 1; }
  done
  grep -q "Termination signal received" /tmp/n$c.log && ! grep -q "forcing exit" /tmp/n$c.log \
    && echo " node$c GRACEFUL ($(($(date +%s)-t0))s)" || { echo " !!! node$c NOT graceful"; return 1; }
}

restart_node() { ./target/debug/broker-server --conf config/cluster/server-$1.toml > /tmp/n$1.log 2>&1 & sleep 14; }

# find the metadata leader node id by asking any live node
meta_leader() {
  local p L
  for p in 58080 58082 58083; do
    L=$(curl -s "http://127.0.0.1:$p/api/info" | python3 -c "import sys,json;m=json.load(sys.stdin).get('data',{}).get('meta',{});print(m.get('metadata_0',{}).get('current_leader') or '')" 2>/dev/null)
    [ -n "$L" ] && { echo "$L"; return; }
  done
}
EXPECT=() # accumulated tenant names that must always read back
for round in $(seq 1 10); do
  echo "========== Drill F round $round =========="
  # (1) WRITE a unique tenant through Raft, then remember it
  TN="drillf-r${round}"
  RESP=$(curl -s -X POST "http://127.0.0.1:$(port_of 1)/api/cluster/tenant/create" \
    -H 'Content-Type: application/json' -d "{\"tenant_name\":\"$TN\",\"desc\":\"r$round\"}")
  CODE=$(echo "$RESP" | python3 -c "import sys,json;print(json.load(sys.stdin).get('code'))" 2>/dev/null)
  [ "$CODE" = "0" ] && { EXPECT+=("$TN"); echo " wrote tenant $TN (code=0)"; } || echo " !!! write $TN failed: $RESP"
  # (4) kill the LEADER first, then the other two; build a SPACE-separated order
  LEADER=$(meta_leader); echo " metadata leader = node$LEADER (killed first)"
  ORDER="$LEADER"
  for n in 1 2 3; do case " $ORDER " in *" $n "*) ;; *) ORDER="$ORDER $n";; esac; done
  for c in $ORDER; do
    echo "--- op: node$c ---"; graceful_kill "$c"; restart_node "$c"
  done
  # let caches/consensus settle after the last restart
  sleep 4
  # (2)+(3) consensus convergence + monotonicity, read from all 3 nodes
  # (the heredoc body and its PY terminator must stay at column 0)
  python3 - "$round" <<'PY'
import sys,json,urllib.request
rnd=sys.argv[1]; ports=[58080,58082,58083]; snaps={}
for p in ports:
    try: snaps[p]=json.load(urllib.request.urlopen(f"http://127.0.0.1:{p}/api/info",timeout=5)).get('data',{}).get('meta',{})
    except Exception as e: snaps[p]=None; print(f" !!! node@{p} /api/info failed: {e}")
groups=sorted({g for m in snaps.values() if m for g in m})
ok=True
for g in groups:
    rows={p:(m.get(g) if m else None) for p,m in snaps.items()}
    la={p:(r.get('last_applied') or {}).get('index') for p,r in rows.items() if r}
    tm={p:r.get('current_term') for p,r in rows.items() if r}
    ld={p:r.get('current_leader') for p,r in rows.items() if r}
    mb={p:r.get('membership_config',{}).get('membership',{}).get('configs') for p,r in rows.items() if r}
    conv = len(set(la.values()))<=1 and len(set(tm.values()))<=1 and len(set(ld.values()))<=1
    print(f" [{g}] last_applied={la} term={tm} leader={ld} members={list(mb.values())[0] if mb else None} converged={'YES' if conv else 'NO <<<'}")
    if not conv: ok=False
print(" CONSENSUS:", "CONVERGED" if ok else "DIVERGED <<<<<")
PY
  # (1) read-back: every accumulated tenant must be present on ALL 3 nodes.
  # NOTE: pass a large limit; list endpoints default to limit=10 (see
  # parse_limit in admin-server/src/tool/query.rs). Without it, the 11th+ row
  # (e.g. the lexicographically-last "drillf-r9" among r1..r10) is silently
  # paginated out and read-back falsely reports it MISSING.
  MISS=0
  for p in 58080 58082 58083; do
    HAVE=$(curl -s "http://127.0.0.1:$p/api/cluster/tenant/list?limit=100000" | python3 -c "import sys,json;print(' '.join(r['tenant_name'] for r in json.load(sys.stdin).get('data',{}).get('data',[])))" 2>/dev/null)
    # ${EXPECT[@]+...} keeps set -u from aborting on an empty array in bash < 4.4
    for t in ${EXPECT[@]+"${EXPECT[@]}"}; do case " $HAVE " in *" $t "*) ;; *) echo " !!! node@$p MISSING tenant $t"; MISS=$((MISS+1));; esac; done
  done
  [ "$MISS" = 0 ] && echo " READ-BACK: all ${#EXPECT[@]} tenants present on all 3 nodes" || echo " READ-BACK: $MISS MISSING <<<<<"
  # error/panic scan for the round
  PANIC=$(( $(grep -c 'invalid state: expect' /tmp/n1.log)+$(grep -c 'invalid state: expect' /tmp/n2.log)+$(grep -c 'invalid state: expect' /tmp/n3.log) ))
  NONE=$(( $(grep -c 'last_log_id=None' /tmp/n1.log)+$(grep -c 'last_log_id=None' /tmp/n2.log)+$(grep -c 'last_log_id=None' /tmp/n3.log) ))
  echo " >>> round $round: raft_panic=$PANIC last_log_id_None=$NONE"
done
Expect every round: CONSENSUS: CONVERGED, READ-BACK: all N tenants present,
last_applied non-decreasing per group, leader killed exits gracefully, and
new leader elected. Then run the WARN/ERROR analysis below.
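The Drill F script prints last_applied for each round, but each python3 invocation starts fresh, so comparing one round against the previous one is left to the reader. A small tracker can make that comparison explicit. This is a sketch only: `MonotonicTracker` is a hypothetical helper, the index values below are made up, and feeding it real per-round values (for example from a file the script appends to) is an assumed extension.

```python
class MonotonicTracker:
    """Remembers the highest last_applied.index seen per Raft group across rounds."""

    def __init__(self):
        self.high = {}  # group name -> highest index observed so far

    def observe(self, applied_by_group):
        """Record one round; return (group, previous_high, new_value) for every regression."""
        regressions = []
        for group, idx in applied_by_group.items():
            if idx is None:
                continue  # node did not report an index; nothing to compare
            prev = self.high.get(group)
            if prev is not None and idx < prev:
                regressions.append((group, prev, idx))
            else:
                self.high[group] = idx
        return regressions

tracker = MonotonicTracker()
print(tracker.observe({"metadata_0": 10, "offset_0": 4}))  # round 1
print(tracker.observe({"metadata_0": 12, "offset_0": 4}))  # round 2: forward or equal is fine
print(tracker.observe({"metadata_0": 11, "offset_0": 5}))  # round 3: metadata_0 went backward
```

Equal values across rounds are accepted, matching the "non-decreasing" expectation. Only a strictly lower index is reported.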
Log Analysis — run after every drill step
After each step, scan all node logs for ERROR/WARN and judge whether each is expected. This is how a real regression was caught (a busy-loop that left a node at 99% CPU while the cluster still looked "healthy").
# ERROR count per node — should be 0
for c in 1 2 3; do echo "node$c ERROR: $(grep -c ERROR /tmp/n$c.log 2>/dev/null)"; done
# WARN types per node
for c in 1 2 3; do
echo "--- node$c WARN ---"
grep WARN /tmp/n$c.log 2>/dev/null | sed 's/\x1b\[[0-9;]*m//g' | grep -oE "WARN [a-z_:]+" | sort | uniq -c | sort -rn
done
# broker_node_list must list all live nodes (a missing live node = heartbeat/registration bug)
curl -s http://localhost:58080/api/info | python3 -c "import sys,json;d=json.load(sys.stdin).get('data',{});print('node_ids:', sorted(n.get('node_id') for n in d.get('broker_node_list',[])))"
# CPU per node — a node stuck near 100% = busy loop (RN state), not normal
for c in 1 2 3; do P=$(pgrep -f "server-$c.toml"); [ -n "$P" ] && ps -p $P -o pid,stat,%cpu | tail -1; done
Which WARNs are benign vs. a real problem
| Log | Verdict |
|---|---|
| openraft::core::raft_core: membership_log_id changed ... ignore | Benign: concurrent membership change dedup during join |
| meta_service::raft::manager: Peer N not reachable: Connection refused (at startup) | Benign: peers not up yet; first node bootstraps |
| openraft::replication: ... Unreachable node ERROR/WARN (+ heartbeat error / error replication to target) clustered around a kill | Benign & transient: the leader briefly fails to replicate/heartbeat to the node you just killed; stops the moment that node restarts. Expected to appear only on the current leader, only at kill times, and NOT keep growing. A real problem only if it persists after the target is back up, or targets a node that is actually alive. |
| mqtt_broker::system_topic: Failed to write ... Connection refused that keeps growing | PROBLEM: a system-topic shard replica points at an unreachable/removed node. Check broker_node_list for a missing live node. |
| Any other ERROR | PROBLEM: investigate. |
| A node at ~99% CPU / RN state | PROBLEM: busy loop (e.g. retry_call spinning). |
| broker_node_list missing a live node | PROBLEM: that node's heartbeat/registration failed (often a knock-on effect of a busy loop). |
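The table can be turned into a first-pass triage filter for log lines. The sketch below is illustrative only: the matching substrings are taken from the table, the sample lines are invented around them (real log prefixes and timestamps will differ), and the verdict labels are hypothetical. It does not replace the "keeps growing" and "persists after restart" judgments, which need counts over time.

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # same colour-code stripping as the sed command above

def classify(line: str) -> str:
    """Map one log line to a rough verdict based on the substrings in the table."""
    text = ANSI.sub("", line)
    if "membership_log_id changed" in text:
        return "benign"
    if "not reachable" in text and "Connection refused" in text:
        return "benign-at-startup"
    if "Unreachable node" in text:
        return "transient-around-kill"
    if "Failed to write" in text and "Connection refused" in text:
        return "problem-if-growing"
    if "ERROR" in text:
        return "problem"
    return "review"

# Invented sample lines built from the table's fragments.
samples = [
    "\x1b[33mWARN\x1b[0m openraft::core::raft_core: membership_log_id changed ... ignore",
    "WARN meta_service::raft::manager: Peer 2 not reachable: Connection refused",
    "ERROR openraft::replication: ... Unreachable node",
    "WARN mqtt_broker::system_topic: Failed to write ... Connection refused",
    "ERROR something unexpected",
]
for s in samples:
    print(classify(s), "<-", ANSI.sub("", s))
```

The rule order matters: the known-benign patterns are checked before the generic ERROR rule, so an Unreachable node ERROR logged around a kill is not lumped in with unexplained errors.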
Cleanup
pkill -9 -f "broker-server --conf"
Troubleshooting
| Symptom | Cause |
|---|---|
| cargo run + Ctrl+C exits instantly with no shutdown logs | Expected: cargo kills the child. Use the compiled binary. |
| Node hangs in Waiting for Meta Service cluster to be ready | Quorum lost (too many nodes down/starting). Keep ≥2 nodes healthy. |
| snapshot not found on join | Regression in get_current_snapshot / begin_receiving_snapshot (state.rs). |
| Nodes fail to start after a few restart rounds; RaftCore exited ... invalid state: purge_upto > snapshot_last_log_id | Logs are read back empty on restart. Confirm by grepping the startup logs for get_initial_state ... last_log_id=None and Clean the hole. Root cause was get_log_state/try_get_log_entries iterating without total_order_seek while the DB has a 10-byte fixed-prefix extractor (log keys are 17 bytes), so the reverse seek finds nothing and openraft thinks the whole log was lost, then purges past the snapshot. Fixed in raft/store/log.rs by setting ReadOptions::set_total_order_seek(true) on both iterators (regression test: test_get_log_state_survives_reopen_with_prefix_extractor). |
| Drill F read-back reports a tenant MISSING but point-get / list?limit=big / total_count show it IS present | Test artifact, NOT a bug. List endpoints default to limit=10 (parse_limit, admin-server/src/tool/query.rs). With >10 rows the lexicographically-last name is paginated out. Always query lists with an explicit large ?limit=. Verify with /api/cluster/tenant/list?limit=100000 and the total_count field before suspecting data loss. |
| console subscriber server failed: Address already in use panic at startup | Benign for co-located multi-node drills: all three nodes bind the same tokio-console port. Does not affect Raft or shutdown. Do NOT count it as a Raft panic; grep invalid state: expect to isolate the real Raft panic. |
| Process won't exit on kill -INT, only kill -9 works | Regression in signal handling (libc handler in daemon.rs) or a leave_cluster re-introduced into shutdown. |