Why not Group Replication?
The single most common question about Bloodraven is: "Why run async replication with an external operator when MySQL ships both Group Replication (GR) and InnoDB Cluster?" This page is the architectural answer so readers can stop asking and pick the right tool for their situation.
TL;DR
Bloodraven is optimized for the two-site, geographically-separated, accept-non-zero-RPO deployment. Group Replication is optimized for the three-or-more-node, low-latency, zero-RPO deployment. They're different design points, and Bloodraven solves problems Group Replication doesn't — most importantly: staying writable when the cross-site link is slow, flappy, or down.
What Bloodraven trades away
- RPO is not zero. A hard primary loss can lose every transaction that committed on the dying primary but hadn't yet shipped to the replica over async replication. Under healthy operation this window is typically sub-second, but it exists. If your data model cannot accept this, Bloodraven is the wrong tool — use Group Replication (or a higher-tier system like Spanner / CockroachDB).
- No in-database conflict resolution. Because only one site writes at a time, Bloodraven cannot merge concurrent writes from two sites. Conflicts are resolved by the operator's "one primary at a time" invariant; anything that breaks that invariant (split brain) is surfaced as a condition for a human, not silently merged.
What Bloodraven keeps
Zero commit latency. A primary write acknowledges as soon as it has fsynced to the local binlog — the same latency profile as a standalone MySQL. No quorum round-trip, no cross-site ACK. For two sites separated by ≥ 20 ms of network latency, GR's certification
- quorum on every commit typically adds 40–80 ms to p50 write latency. Bloodraven's async model adds zero.
Single-node write availability. The primary accepts writes even when the other site is unreachable. GR requires a majority of the group online to accept writes, so losing half of a two-node group means the remaining node is read-only; losing two of three means the whole cluster is read-only. Bloodraven will promote a surviving site to writable the moment it detects the primary is gone and the replica is healthy (see Failover sequence).
No quorum requirement. Two sites is a legitimate topology. GR with two members is a pathological configuration — any partition or node loss makes the group inquorate. Operators who want two-site HA with GR are forced to invent a "witness" third node somewhere, which introduces its own set of cross-region headaches (where does the witness live? What happens when the witness is isolated?). Bloodraven's sidecar self-fencing layer performs the "am I still authoritative?" check without a quorum.
Simpler mental model. There is always exactly one primary; the other site is a replica or is fenced. Topology never has to pick between N possible primaries, negotiate a view change, or resolve certification conflicts. The operator's state machine has four per-site states and a small cross-site truth table — a single developer can hold the whole thing in their head, which is load-bearing when you're debugging at 03:00.
Works across zones with real latency. Group Replication's paper-published performance numbers assume sub-millisecond inter-node latency, because every commit serializes through the group's certification protocol. At 20-100 ms cross-region latency (the typical two-datacenter or two-region deployment), GR is functional but expensive on write throughput, and any network blip triggers a view change. Async replication + a supervisor is the standard answer at that latency tier for a reason.
Sidecar self-fencing. The sidecar on each MySQL pod refuses to
accept writes if it can reach neither the operator nor the peer for
spec.sidecar.leaseTimeout (default 20 s). This closes the split-brain
window that async replication alone would leave open. See the
Sidecar description.
When Group Replication is actually the right answer
Bloodraven is not always the right answer — be honest about when it isn't:
- Zero-RPO is a product requirement. Financial ledgers, inventory-as-source-of-truth, anything where "we lost a second of writes" is a customer-visible failure. Group Replication is what you want.
- Three or more nodes are already on the table. If your topology already has three MySQL nodes for HA (and the write path is willing to pay quorum latency), GR turns that into synchronous-ish replication with no external operator. Bloodraven's two-site architecture doesn't fit.
- Low inter-node latency. Single-AZ, single-DC, single-rack deployments don't pay GR's latency cost, because the cost is small in that environment.
- You cannot tolerate split-brain resolution by human. Bloodraven's
response to
writable/writableis to alert and wait for an operator (or, opt-in, to fence a pre-configured loser — seespec.splitBrainPolicy). If your runbook requires the cluster to auto-pick a winner in every case without data reconciliation, use GR.
The honest tradeoff
Bloodraven and Group Replication solve the same top-level problem ("keep MySQL writable when bad things happen") from two different vantage points:
| Concern | Group Replication | Bloodraven |
|---|---|---|
| RPO on hard primary loss | 0 | ≈ secondsBehindSource of the replica at failure |
| Commit latency | 1 cross-node round-trip | 1 local fsync |
| Minimum nodes to tolerate 1 failure | 3 | 2 |
| Write availability during a partition | Majority side only | The reachable side (operator arbitrates) |
| Conflict resolution | Certification (may abort commits) | Single-writer invariant (no conflicts possible) |
| Operational complexity | View changes, certification, group membership | Primary/replica + one external operator |
| Typical inter-node latency sweet spot | < 5 ms | Doesn't care; tested at 20–100+ ms |
| Supervisor required for DNS/traffic steering | Yes (MySQL Router / InnoDB Cluster) | Yes (Bloodraven itself) |
Bloodraven picks the column on the right of every row. If the column on the left describes your situation better, run Group Replication.
Related reading
- Architecture — how the operator, sidecars, and Services fit together.
- Failover — state machine, failover sequence, anti-flap cooldown, split-brain handling.
- Getting started — stand up a two-site failover group end-to-end.