Why not Group Replication?

The single most common question about Bloodraven is: "Why run async replication with an external operator when MySQL ships both Group Replication (GR) and InnoDB Cluster?" This page is the architectural answer so readers can stop asking and pick the right tool for their situation.

TL;DR

Bloodraven is optimized for the two-site, geographically-separated, accept-non-zero-RPO deployment. Group Replication is optimized for the three-or-more-node, low-latency, zero-RPO deployment. They are different design points. Bloodraven solves problems Group Replication doesn't, most importantly staying writable when the cross-site link is slow, flappy, or down.

What Bloodraven trades away

  • RPO is not zero. A hard primary loss can lose every transaction that committed on the dying primary but hadn't yet shipped to the replica over async replication. Under healthy operation this window is typically sub-second, but it exists. If your data model cannot accept this, Bloodraven is the wrong tool — use Group Replication (or a higher-tier system like Spanner / CockroachDB).
  • No in-database conflict resolution. Because only one site writes at a time, Bloodraven cannot merge concurrent writes from two sites. Conflicts are resolved by the operator's "one primary at a time" invariant; anything that breaks that invariant (split brain) is surfaced as a condition for a human, not silently merged.
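As a rough way to reason about the first trade-off, the exposure on a hard primary loss can be approximated as write rate multiplied by replication lag. The sketch below is illustrative only: the function name and the example numbers are assumptions, not part of Bloodraven.

```python
def transactions_at_risk(write_rate_tps: float, replication_lag_s: float) -> float:
    """Approximate number of committed transactions not yet shipped to the replica.

    Hypothetical helper: assumes a steady write rate and that everything
    committed within the current lag window is lost on a hard primary failure.
    """
    if write_rate_tps < 0 or replication_lag_s < 0:
        raise ValueError("write rate and lag must be non-negative")
    return write_rate_tps * replication_lag_s


# Example with assumed numbers: 500 commits/s and half a second of lag.
exposure = transactions_at_risk(write_rate_tps=500, replication_lag_s=0.5)
```

If the result is a number your data model cannot accept losing, that is the signal to look at a zero-RPO system instead.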

What Bloodraven keeps

Zero added commit latency. A primary write is acknowledged as soon as it has been fsynced to the local binlog, the same latency profile as a standalone MySQL. No quorum round-trip, no cross-site ACK. For two sites separated by ≥ 20 ms of network latency, GR's certification + quorum on every commit typically adds 40–80 ms to p50 write latency. Bloodraven's async model adds zero.

Single-node write availability. The primary accepts writes even when the other site is unreachable. GR requires a majority of the group online to accept writes, so losing half of a two-node group means the remaining node is read-only; losing two of three means the whole cluster is read-only. Bloodraven will promote a surviving site to writable the moment it detects the primary is gone and the replica is healthy (see Failover sequence).

No quorum requirement. Two sites is a legitimate topology. GR with two members is a pathological configuration — any partition or node loss makes the group inquorate. Operators who want two-site HA with GR are forced to invent a "witness" third node somewhere, which introduces its own set of cross-region headaches (where does the witness live? What happens when the witness is isolated?). Bloodraven's sidecar self-fencing layer performs the "am I still authoritative?" check without a quorum.
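The quorum arithmetic behind the two paragraphs above is short enough to write down. This is a generic majority-quorum sketch, not GR or Bloodraven code; the function names are illustrative.

```python
def majority(group_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return group_size // 2 + 1


def tolerated_failures(group_size: int) -> int:
    """How many members can be lost while a majority still remains."""
    return group_size - majority(group_size)


# Two members need both online (0 failures tolerated);
# three members need two online (1 failure tolerated).
for n in (2, 3, 5):
    print(n, majority(n), tolerated_failures(n))
```

A two-member group tolerates zero failures, which is why a majority-based system needs a third member or witness before it can survive the loss of one site.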

Simpler mental model. There is always exactly one primary; the other site is a replica or is fenced. Topology never has to pick between N possible primaries, negotiate a view change, or resolve certification conflicts. The operator's state machine has four per-site states and a small cross-site truth table — a single developer can hold the whole thing in their head, which is load-bearing when you're debugging at 03:00.

Works across zones with real latency. Group Replication's published performance numbers assume sub-millisecond inter-node latency, because every commit serializes through the group's certification protocol. At 20–100 ms cross-region latency (the typical two-datacenter or two-region deployment), GR is functional but expensive on write throughput, and any network blip triggers a view change. Async replication + a supervisor is the standard answer at that latency tier for a reason.
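One way to see why per-commit latency is expensive on write throughput: a single session that commits one transaction after another cannot exceed one commit per commit latency. The sketch below is back-of-the-envelope arithmetic only. The 1 ms local fsync is an assumed figure, and the 40 ms added latency is the low end of the range quoted earlier on this page.

```python
def max_serial_commits_per_second(commit_latency_ms: float) -> float:
    """Upper bound on commits/s for ONE session committing serially."""
    if commit_latency_ms <= 0:
        raise ValueError("commit latency must be positive")
    return 1000.0 / commit_latency_ms


LOCAL_FSYNC_MS = 1.0      # assumption: local binlog fsync cost
ADDED_QUORUM_MS = 40.0    # low end of the 40-80 ms range quoted above

local_only = max_serial_commits_per_second(LOCAL_FSYNC_MS)
with_quorum = max_serial_commits_per_second(LOCAL_FSYNC_MS + ADDED_QUORUM_MS)
print(f"local fsync only: {local_only:.0f}/s, with quorum round-trip: {with_quorum:.1f}/s")
```

Concurrent sessions raise aggregate throughput in both models, so this bounds a single serial writer, not the whole server.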

Sidecar self-fencing. The sidecar on each MySQL pod refuses to accept writes if it can reach neither the operator nor the peer for spec.sidecar.leaseTimeout (default 20 s). This closes the split-brain window that async replication alone would leave open. See the Sidecar description.
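The fencing rule described above can be sketched as a small predicate. This is a minimal illustration assuming the sidecar tracks the time of its last successful contact with the operator and with the peer; the class, field, and function names are hypothetical and are not Bloodraven's actual implementation.

```python
from dataclasses import dataclass

# Mirrors the documented default of spec.sidecar.leaseTimeout (20 s).
DEFAULT_LEASE_TIMEOUT_S = 20.0


@dataclass
class ContactTimes:
    """Hypothetical bookkeeping: monotonic timestamps of last successful contact."""
    last_operator_ok: float
    last_peer_ok: float


def should_fence(contacts: ContactTimes, now: float,
                 lease_timeout_s: float = DEFAULT_LEASE_TIMEOUT_S) -> bool:
    """Fence only if NEITHER the operator nor the peer was reachable
    at any point during the last lease_timeout_s seconds."""
    most_recent_contact = max(contacts.last_operator_ok, contacts.last_peer_ok)
    return now - most_recent_contact >= lease_timeout_s


# Peer heard from 15 s ago: still inside the lease, keep accepting writes.
print(should_fence(ContactTimes(last_operator_ok=0.0, last_peer_ok=10.0), now=25.0))
# Nothing heard from either side for 25 s: refuse writes.
print(should_fence(ContactTimes(last_operator_ok=0.0, last_peer_ok=0.0), now=25.0))
```

Taking the most recent of the two contacts means a single reachable party is enough to keep the lease alive, which matches the "neither the operator nor the peer" wording.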

When Group Replication is actually the right answer

Bloodraven is not always the right answer — be honest about when it isn't:

  • Zero-RPO is a product requirement. Financial ledgers, inventory-as-source-of-truth, anything where "we lost a second of writes" is a customer-visible failure. Group Replication is what you want.
  • Three or more nodes are already on the table. If your topology already has three MySQL nodes for HA (and the write path is willing to pay quorum latency), GR turns that into synchronous-ish replication with no external operator. Bloodraven's two-site architecture doesn't fit.
  • Low inter-node latency. Single-AZ, single-DC, and single-rack deployments pay very little for GR's per-commit round-trip, because inter-node latency in that environment is small.
  • You cannot tolerate split-brain resolution by a human. Bloodraven's response to writable/writable is to alert and wait for an operator (or, opt-in, to fence a pre-configured loser; see spec.splitBrainPolicy). If your runbook requires the cluster to auto-pick a winner in every case without data reconciliation, use GR.
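The checklist above can be written as a small helper that collects the reasons pointing toward Group Replication. It is a sketch for discussion rather than a decision procedure: the function and parameter names are invented here, and the 5 ms threshold is borrowed from the "latency sweet spot" row of the comparison table on this page.

```python
from typing import List


def reasons_for_group_replication(
    zero_rpo_required: bool,
    node_count: int,
    inter_node_latency_ms: float,
    needs_automatic_split_brain_winner: bool,
    low_latency_threshold_ms: float = 5.0,  # assumption taken from the table's "< 5 ms"
) -> List[str]:
    """Return the checklist items that apply to a deployment (hypothetical helper)."""
    reasons = []
    if zero_rpo_required:
        reasons.append("zero RPO is a product requirement")
    if node_count >= 3:
        reasons.append("three or more nodes are already on the table")
    if inter_node_latency_ms < low_latency_threshold_ms:
        reasons.append("inter-node latency is low")
    if needs_automatic_split_brain_winner:
        reasons.append("split-brain must be resolved without a human")
    return reasons


# A two-site deployment 60 ms apart that accepts a small RPO: nothing points to GR.
print(reasons_for_group_replication(False, 2, 60.0, False))
```

An empty list suggests the deployment sits at Bloodraven's design point; any non-empty list is a prompt to evaluate Group Replication seriously.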

The honest tradeoff

Bloodraven and Group Replication solve the same top-level problem ("keep MySQL writable when bad things happen") from two different vantage points:

| Concern | Group Replication | Bloodraven |
|---|---|---|
| RPO on hard primary loss | 0 | secondsBehindSource of the replica at failure |
| Commit latency | 1 cross-node round-trip | 1 local fsync |
| Minimum nodes to tolerate 1 failure | 3 | 2 |
| Write availability during a partition | Majority side only | The reachable side (operator arbitrates) |
| Conflict resolution | Certification (may abort commits) | Single-writer invariant (no conflicts possible) |
| Operational complexity | View changes, certification, group membership | Primary/replica + one external operator |
| Typical inter-node latency sweet spot | < 5 ms | Doesn't care; tested at 20–100+ ms |
| Supervisor required for DNS/traffic steering | Yes (MySQL Router / InnoDB Cluster) | Yes (Bloodraven itself) |

Bloodraven picks the column on the right of every row. If the column on the left describes your situation better, run Group Replication.

  • Architecture — how the operator, sidecars, and Services fit together.
  • Failover — state machine, failover sequence, anti-flap cooldown, split-brain handling.
  • Getting started — stand up a two-site failover group end-to-end.