Network partitions

This runbook expands the network rows in the failure-mode matrix. For each scenario it covers what the Bloodraven operator can observe, which action it takes, which metrics and events change, and what the on-call engineer should do after connectivity returns.

Timing assumes the defaults: pollInterval=2s, failureThreshold=3, sidecar.leaseTimeout=20s, and a 30-second relay-log drain timeout.
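
As a rough worked example of those defaults, the sketch below computes when each actor would be expected to react to a partition that starts at T+0. It assumes the operator's debounce is simply failureThreshold consecutive failed polls spaced pollInterval apart, and that each poll fails immediately; real poll timeouts add to these figures. The class and function names are illustrative and are not Bloodraven APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    poll_interval_s: float = 2.0         # pollInterval
    failure_threshold: int = 3           # failureThreshold
    lease_timeout_s: float = 20.0        # sidecar.leaseTimeout
    relay_drain_timeout_s: float = 30.0  # relay-log drain timeout


def partition_timeline(t: Timing) -> dict[str, float]:
    """Return approximate seconds after partition start for each milestone."""
    # Assumption: debounce = failureThreshold consecutive failed polls.
    debounce = t.poll_interval_s * t.failure_threshold
    return {
        "operator_marks_unreachable": debounce,
        "isolated_primary_self_fences": t.lease_timeout_s,
        # Upper bound if promotion waits for the full relay-log drain timeout.
        "promotion_latest": debounce + t.relay_drain_timeout_s,
    }


if __name__ == "__main__":
    for name, seconds in partition_timeline(Timing()).items():
        print(f"{name}: T+{seconds:g}s")
```

Under these assumptions the defaults give roughly T+6s for the unreachable transition, T+20s for the sidecar self-fence, and at most T+36s for promotion if the relay-log drain runs to its timeout.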

First principles

  • Bloodraven promotes only when the current primary is unreachable and a promotable primary-candidate replica is reachable.
  • A replica-side network problem does not trigger failover while the primary remains writable.
  • The sidecar self-fences a writable MySQL when it cannot reach the operator and cannot reach any peer for longer than leaseTimeout.
  • Cross-site replication lag is an alert, not a failover trigger, when the primary is otherwise healthy.
  • Once a partition heals, GTID comparison decides whether the returning site can rejoin automatically or must be recloned.
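
The first two sidecar and operator rules above can be read as two small predicates. The following is a minimal sketch of that reading, not Bloodraven's actual code: the Replica type, its promotable flag, and both function names are hypothetical, and the 20-second default mirrors sidecar.leaseTimeout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    reachable: bool
    promotable: bool  # hypothetical flag standing in for "primary-candidate"


def should_promote(primary_reachable: bool, replicas: list[Replica]) -> bool:
    """Promote only if the primary is unreachable AND a candidate is reachable."""
    has_candidate = any(r.reachable and r.promotable for r in replicas)
    return (not primary_reachable) and has_candidate


def should_self_fence(
    writable: bool,
    operator_reachable: bool,
    reachable_peers: int,
    isolated_for_s: float,
    lease_timeout_s: float = 20.0,
) -> bool:
    """Fence a writable MySQL that has lost the operator and every peer."""
    if not writable:
        return False  # read-only replicas are not self-fenced
    fully_isolated = (not operator_reachable) and reachable_peers == 0
    return fully_isolated and isolated_for_s > lease_timeout_s
```

Note that both conditions are conjunctions: a lagging or unreachable replica never satisfies should_promote on its own, and a single reachable peer is enough to keep should_self_fence false.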

Scenario A: operator cannot reach site A, site B reachable

Example: the active primary is iad; the operator and pdx are on the surviving side of a site partition; all operator polls to iad time out.

| Aspect | Expected behavior |
| --- | --- |
| Observable signal | iad transitions to unreachable; pdx remains read-only and replicating until the link fully breaks. |
| Operator action | After debounce, promotes pdx via the normal emergency failover path, updates Services/DNS, and taints old-site nodes. |
| Sidecar action | If iad is still running but isolated from both operator and peers, its sidecar sets super_read_only=ON at roughly T+20s. |
| Metrics | bloodraven_site_state{site="iad",state="unreachable"}=1, bloodraven_failovers_total{target_site="pdx"} increments on successful promotion, bloodraven_dns_flips_total{site="pdx"} increments when DNS is updated, bloodraven_taint_operations_total{site="iad",action="taint"} increments. |
| Events | Expect the same failover lifecycle events as an unreachable primary: failover started/completed and, if the old primary later returns diverged, DataLossDetected. |
| RPO | Bounded by what had replicated to pdx before the partition. Any writes accepted only on iad after the partition are divergent when iad returns. |

Recovery after heal:

  1. Watch status.sites[?(@.name=="iad")].recoveryState.
  2. If there is no divergence, the operator reconfigures iad as a replica automatically.
  3. If divergentGtid is set, review lost transactions and trigger the reclone flow in Operations.
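
To make the divergence check concrete, here is a small sketch of a GTID set comparison: the returning site has diverged if it executed any transaction the current primary never received. This is an illustration under assumptions, not the operator's implementation; the function names are hypothetical and the parser handles only the classic uuid:interval[:interval] form, not tagged GTIDs. Inside MySQL the same question can be answered with the built-in GTID_SUBTRACT() and GTID_SUBSET() functions.

```python
def parse_gtid_set(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: [(start, end), ...]}."""
    result: dict[str, list[tuple[int, int]]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid.lower(), [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
    return result


def subtract_intervals(a, b):
    """Return the parts of intervals `a` that are not covered by intervals `b`."""
    remaining = []
    for start, end in sorted(a):
        cursor = start
        for b_start, b_end in sorted(b):
            if b_end < cursor or b_start > end:
                continue  # no overlap with the uncovered remainder
            if b_start > cursor:
                remaining.append((cursor, b_start - 1))
            cursor = max(cursor, b_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            remaining.append((cursor, end))
    return remaining


def divergent_gtids(returning_executed: str, primary_executed: str):
    """Transactions executed on the returning site but absent on the primary."""
    returning = parse_gtid_set(returning_executed)
    primary = parse_gtid_set(primary_executed)
    extra = {}
    for uuid, intervals in returning.items():
        left = subtract_intervals(intervals, primary.get(uuid, []))
        if left:
            extra[uuid] = left
    return extra  # an empty dict means the site can rejoin as a replica
```

For example, if iad returns with aaaa:1-120 while pdx holds aaaa:1-100,bbbb:1-10, the result is {"aaaa": [(101, 120)]}: twenty transactions exist only on iad, which is the situation that calls for the reclone flow.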

Scenario B: replica site isolated, primary reachable

Example: active primary iad remains reachable by the operator; replica pdx cannot receive replication traffic or cannot be polled.

| Aspect | Expected behavior |
| --- | --- |
| Observable signal | pdx becomes unreachable, or replication IO/SQL stops and secondsBehindSource climbs. iad remains writable. |
| Operator action | No failover. The primary is healthy, so promoting away would reduce availability and risk data loss unnecessarily. |
| Sidecar action | Replica sidecar does not self-fence a read-only MySQL; replicas are already not accepting writes. |
| Metrics | bloodraven_site_state{site="pdx",state="unreachable"}=1 or bloodraven_replication_lag_seconds rises; bloodraven_replication_running{site="pdx",thread="io"} may become 0. |
| Events | Degraded/alert events for unreachable or lagging replication; no failover-complete event. |
| RPO | Emergency failover is not available until a primary-candidate replica is reachable and reasonably current. |

Recovery after heal:

  1. Confirm replication restarts and lag falls below spec.replication.maxLagSeconds.
  2. If replication does not restart, inspect the MySQL replica status and the operator logs for "old primary recovery failed" or replication errors.
  3. If the replica has been manually written to and now diverges, reclone it from the current primary.

Scenario C: cross-site replication link down, both sites reachable by the operator

Example: the operator can poll both iad and pdx, but pdx cannot pull binlog events from iad.

| Aspect | Expected behavior |
| --- | --- |
| Observable signal | Primary remains writable; replica remains read-only; replica IO thread stops or lag increases. |
| Operator action | No automatic failover. From the operator's perspective this looks like replication lag or IO pressure, not primary failure. |
| Metrics | bloodraven_replication_running{thread="io"}=0 and/or bloodraven_replication_lag_seconds exceeds the configured threshold. |
| Events | Replication-lagging / degraded events; no DNS flip and no taint change. |
| Human action | Decide whether the link outage is temporary. If the primary is healthy, keep serving writes there. If you need to move writes, use planned failover only after the replica catches up. |

This is the scenario most likely to page a human without automatic action. Bloodraven intentionally refuses to guess that lag means the primary should be abandoned.

Scenario D: asymmetric peer reachability

Example: operator reaches both sites, iad can reach pdx, but pdx cannot reach iad; or only one sidecar can reach its peer.

| Aspect | Expected behavior |
| --- | --- |
| Observable signal | Poll results can remain healthy while replication or peer checks show one-way failures. |
| Operator action | Follows the data-plane state it can poll: no failover while the active primary is reachable and writable. |
| Sidecar action | A writable primary self-fences only when both the operator and every peer are unreachable beyond leaseTimeout. One reachable peer is enough to avoid self-fencing. |
| Metrics | Usually replication lag/running metrics, plus possible state transitions if MySQL polling is affected. |
| Events | Degraded/replication events unless the asymmetry also makes the primary unreachable to the operator. |

Asymmetric partitions are exactly why GTID reconciliation after heal matters. If either side accepted writes that the final primary did not receive, the returning site is fenced and marked divergent.

Scenario E: both sites unreachable to the operator

| Aspect | Expected behavior |
| --- | --- |
| Observable signal | All sites become unreachable; Degraded=True with total-loss semantics. |
| Operator action | No promotion. There is no reachable candidate to promote. |
| Sidecar action | Any still-running writable site self-fences after leaseTimeout if it can reach neither the operator nor any peer. |
| Metrics | bloodraven_site_state{state="unreachable"}=1 for all sites; no failover counter increment. |
| Events | Total-loss / degraded events. |
| Human action | Restore at least one site or the operator network path. Once one site is reachable, the operator resumes normal reconciliation. |
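
The scenarios in this runbook can be summarized as a triage over what the operator is able to poll. The sketch below is one way to express that mapping, assuming any reachable replica is a promotable candidate; the function name and its inputs are hypothetical. It cannot tell a broken replication link from asymmetric peer reachability, because both look the same from the operator's polls.

```python
def triage(
    active_site: str,
    reachable: dict[str, bool],
    io_running: dict[str, bool],
) -> tuple[str, str]:
    """Map poll results to (scenario, expected operator action).

    reachable:  site name -> did the operator's poll succeed
    io_running: replica site -> is the replication IO thread running
    """
    if not any(reachable.values()):
        return ("E", "no promotion: no reachable candidate")
    if not reachable[active_site]:
        # Assumption: the reachable replica is a promotable candidate.
        return ("A", "emergency failover to a reachable candidate")
    replicas = [site for site in reachable if site != active_site]
    if any(not reachable[site] for site in replicas):
        return ("B", "no failover: alert on unreachable replica")
    if any(not io_running.get(site, True) for site in replicas):
        return ("C or D", "no failover: alert on replication")
    return ("healthy", "none")


if __name__ == "__main__":
    print(triage("iad", {"iad": False, "pdx": True}, {}))
    print(triage("iad", {"iad": True, "pdx": True}, {"pdx": False}))
```

Only the first branch after the total-loss check leads to promotion, which matches the rule that a healthy, reachable primary is never abandoned because of replica-side symptoms.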

Testing partitions

Use pod-level NetworkPolicy or CNI-native fault injection. Host-level iptables rules in a k3d node are not reliable for Kubernetes Service traffic because kube-proxy DNAT and pod networking happen in different paths.

Playground examples live in playground/chaos-scenarios.md and use NetworkPolicy-based partitions. Always run ./playground/chaos.sh recover or delete the NetworkPolicy manually after a test.
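
For orientation, a pod-level partition can be as small as one NetworkPolicy that selects a site's pods and allows nothing. The sketch below builds such a manifest as JSON; it is not taken from the playground scripts, and the policy name, namespace, and pod label are hypothetical placeholders to replace with whatever your pods actually carry. It also assumes a CNI that enforces NetworkPolicy.

```python
import json


def isolate_pods_policy(name: str, namespace: str, pod_labels: dict[str, str]) -> dict:
    """Build a NetworkPolicy that denies all ingress and egress for matching pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            # Listing both types with no ingress/egress rules denies all traffic.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    policy = isolate_pods_policy(
        name="partition-iad",
        namespace="default",
        pod_labels={"bloodraven.example/site": "iad"},  # hypothetical label
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be piped to kubectl apply -f - and removed again with kubectl delete networkpolicy. Because it blocks both the operator and peers, a policy like this models the fully isolated site in Scenario A rather than the one-way failures in Scenario D.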

On-call checklist

  1. Identify the current active site: kubectl get mysqlfailovergroup orders -o jsonpath='{.status.activeSite}'.
  2. Print per-site state: kubectl get mysqlfailovergroup orders -o jsonpath='{range .status.sites[*]}{.name}: {.state} lag={.secondsBehindSource} recovery={.recoveryState}{"\n"}{end}'.
  3. Check whether bloodraven_failovers_total increased. If not, the operator likely chose to alert rather than promote.
  4. If a site returns with divergentGtid, do not manually attach it as a replica. Follow the reclone flow.
  5. After recovery, verify replication lag is below your RPO threshold and that DNS/external-dns has converged if clients use external DNS.
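
Parts of this checklist can be scripted against the resource's status. The following is a minimal sketch that takes the parsed status object and lists what needs attention. The function name is hypothetical, and it assumes divergentGtid appears on each entry of status.sites; adjust the lookup if your version of the resource places it elsewhere.

```python
def checklist_findings(status: dict, rpo_threshold_s: float) -> list[str]:
    """Summarize a MySQLFailoverGroup status dict into on-call findings."""
    findings = [f"active site: {status.get('activeSite', 'unknown')}"]
    for site in status.get("sites", []):
        name = site.get("name", "?")
        if site.get("state") == "unreachable":
            findings.append(f"{name}: unreachable")
        # Assumption: divergentGtid is reported per site.
        if site.get("divergentGtid"):
            findings.append(f"{name}: divergent, follow the reclone flow")
        lag = site.get("secondsBehindSource")
        if lag is not None and lag > rpo_threshold_s:
            findings.append(f"{name}: lag {lag}s exceeds {rpo_threshold_s}s")
    return findings
```

One way to feed it is to save the output of kubectl get mysqlfailovergroup orders -o json, load it with json.load, and pass the "status" key along with your RPO threshold in seconds.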