Network partitions

This runbook expands the network rows in the failure-mode matrix. For each scenario it covers what the Bloodraven operator can observe, which action it takes, which metrics and events change, and what the on-call engineer should do after connectivity returns.

Timing assumes the defaults: pollInterval=2s, failureThreshold=3, sidecar.leaseTimeout=20s, and a 30-second relay-log drain timeout.
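As a rough worked example of how these defaults combine, the sketch below assumes that debounce is simply pollInterval × failureThreshold and that the relay-log drain can run to its full timeout. This runbook does not state either formula, so treat the function names and the arithmetic as illustrative. Per-poll timeouts and the promotion itself are ignored.

```python
from datetime import timedelta

# Defaults quoted in this runbook.
POLL_INTERVAL = timedelta(seconds=2)
FAILURE_THRESHOLD = 3
LEASE_TIMEOUT = timedelta(seconds=20)
RELAY_LOG_DRAIN_TIMEOUT = timedelta(seconds=30)


def debounce_window(poll_interval=POLL_INTERVAL, failure_threshold=FAILURE_THRESHOLD):
    """Assumed formula: N consecutive failed polls mark a site unreachable."""
    return poll_interval * failure_threshold


def latest_drain_end(debounce, drain_timeout=RELAY_LOG_DRAIN_TIMEOUT):
    """Assumed upper bound: debounce, then a relay-log drain that hits its timeout."""
    return debounce + drain_timeout


if __name__ == "__main__":
    debounce = debounce_window()
    print(f"failover can start at about T+{debounce.total_seconds():.0f}s")
    print(f"isolated primary self-fences at about T+{LEASE_TIMEOUT.total_seconds():.0f}s")
    print(f"relay-log drain ends by about T+{latest_drain_end(debounce).total_seconds():.0f}s")
```

Under these assumptions, failover can begin around T+6s while an isolated but still-running primary keeps accepting writes until roughly T+20s. Those are the writes that become divergent when the old site returns.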

First principles

Bloodraven promotes only when the current primary is unreachable and a promotable primary-candidate replica is reachable.
A replica-side network problem does not trigger failover while the primary remains writable.
The sidecar self-fences a writable MySQL when it cannot reach the operator and cannot reach any peer for longer than leaseTimeout.
Cross-site replication lag is an alert, not a failover trigger, when the primary is otherwise healthy.
Once a partition heals, GTID comparison decides whether the returning site can rejoin automatically or must be recloned.
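The promotion rules above can be sketched as a small decision function. This is not Bloodraven's code: `SiteView`, its fields, and `pick_promotion_target` are hypothetical names, and "promotable" is reduced to a single boolean for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteView:
    """What the operator currently believes about one site (hypothetical shape)."""
    name: str
    reachable: bool
    is_primary: bool = False
    promotable_candidate: bool = False


def pick_promotion_target(sites):
    """Return the name of the site to promote, or None when no failover should happen."""
    primary = next((s for s in sites if s.is_primary), None)
    if primary is not None and primary.reachable:
        # Replica trouble or replication lag alone never triggers failover.
        return None
    for site in sites:
        if not site.is_primary and site.reachable and site.promotable_candidate:
            return site.name
    # Primary unreachable but nothing promotable is reachable: alert only.
    return None


if __name__ == "__main__":
    scenario_a = [
        SiteView("iad", reachable=False, is_primary=True),
        SiteView("pdx", reachable=True, promotable_candidate=True),
    ]
    print(pick_promotion_target(scenario_a))
```

The function returns a target only when both conditions hold, which is why a replica-side partition or a total outage produces alerts rather than a promotion.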

Scenario A: operator cannot reach site A, site B reachable

Example: the active primary is iad; the operator and pdx are on the surviving side of a site partition; all operator polls to iad time out.

Aspect	Expected behavior
Observable signal	`iad` transitions to `unreachable`; `pdx` remains `read-only` and keeps replicating until the replication link from `iad` breaks.
Operator action	After debounce, promotes `pdx` via the normal emergency failover path, updates Services/DNS, and taints old-site nodes.
Sidecar action	If `iad` is still running but isolated from both operator and peers, its sidecar sets `super_read_only=ON` at roughly T+20s.
Metrics	`bloodraven_site_state{site="iad",state="unreachable"}=1`, `bloodraven_failovers_total{target_site="pdx"}` increments on successful promotion, `bloodraven_dns_flips_total{site="pdx"}` increments when DNS is updated, `bloodraven_taint_operations_total{site="iad",action="taint"}` increments.
Events	Expect the same failover lifecycle events as an unreachable primary: failover started/completed and, if the old primary later returns diverged, `DataLossDetected`.
RPO	Bounded by what had replicated to `pdx` before the partition. Any writes accepted only on `iad` after the partition are divergent when `iad` returns.

Recovery after heal:

Watch status.sites[?(@.name=="iad")].recoveryState.
If there is no divergence, the operator reconfigures iad as a replica automatically.
If divergentGtid is set, review lost transactions and trigger the reclone flow in Operations.
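To make the divergence check concrete, here is a minimal sketch of a GTID set comparison. This runbook does not describe how Bloodraven implements it, so the parser and `divergent_gtids` are illustrative helpers only. They handle the classic `uuid:1-5:7` form, not tagged GTIDs, and they expand ranges into Python sets, which is fine for an example but not for very large sets. On a live server, MySQL's `GTID_SUBTRACT()` function answers the same question.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        txns = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            txns.update(range(int(start), int(end or start) + 1))
    return result


def divergent_gtids(returning_executed, primary_executed):
    """Transactions the returning site executed that the current primary never saw."""
    returning = parse_gtid_set(returning_executed)
    primary = parse_gtid_set(primary_executed)
    extra = {}
    for uuid, txns in returning.items():
        missing = txns - primary.get(uuid, set())
        if missing:
            extra[uuid] = sorted(missing)
    return extra


if __name__ == "__main__":
    # 'aaa' and 'bbb' stand in for real server UUIDs.
    print(divergent_gtids("aaa:1-10", "aaa:1-8,bbb:1-3"))
```

An empty result corresponds to the automatic rejoin case. A non-empty result corresponds to the case where the extra transactions need review before a reclone.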

Scenario B: replica site isolated, primary reachable

Example: active primary iad remains reachable by the operator; replica pdx cannot receive replication traffic or cannot be polled.

Aspect	Expected behavior
Observable signal	`pdx` becomes `unreachable`, or replication IO/SQL stops and `secondsBehindSource` climbs. `iad` remains `writable`.
Operator action	No failover. The primary is healthy, so promoting away would reduce availability and risk data loss unnecessarily.
Sidecar action	Replica sidecar does not self-fence a read-only MySQL; replicas are already not accepting writes.
Metrics	`bloodraven_site_state{site="pdx",state="unreachable"}=1` or `bloodraven_replication_lag_seconds` rises; `bloodraven_replication_running{site="pdx",thread="io"}` may become `0`.
Events	Degraded/alert events for unreachable or lagging replication; no failover-complete event.
RPO	Emergency failover is not available until a primary-candidate replica is reachable and reasonably current.

Recovery after heal:

Confirm replication restarts and lag falls below spec.replication.maxLagSeconds.
If replication does not restart, inspect the MySQL replica status and the operator logs for "old primary recovery failed" or replication errors.
If the replica has been manually written to and now diverges, reclone it from the current primary.

Scenario C: MySQL-to-MySQL link broken, operator reaches both

Example: the operator can poll both iad and pdx, but pdx cannot pull binlog events from iad.

Aspect	Expected behavior
Observable signal	Primary remains `writable`; replica remains `read-only`; replica IO thread stops or lag increases.
Operator action	No automatic failover. From the operator's perspective this looks like replication lag or IO pressure, not primary failure.
Metrics	`bloodraven_replication_running{thread="io"}=0` and/or `bloodraven_replication_lag_seconds` exceeds the configured threshold.
Events	Replication-lagging / degraded events; no DNS flip and no taint change.
Human action	Decide whether the link outage is temporary. If the primary is healthy, keep serving writes there. If you need to move writes, use planned failover only after the replica catches up.

This is the scenario most likely to page a human without automatic action. Bloodraven intentionally refuses to guess that lag means the primary should be abandoned.

Scenario D: asymmetric peer reachability

Example: operator reaches both sites, iad can reach pdx, but pdx cannot reach iad; or only one sidecar can reach its peer.

Aspect	Expected behavior
Observable signal	Poll results can remain healthy while replication or peer checks show one-way failures.
Operator action	Follows the data-plane state it can poll: no failover while the active primary is reachable and writable.
Sidecar action	A writable primary self-fences only when both the operator and every peer are unreachable beyond `leaseTimeout`. One reachable peer is enough to avoid self-fencing.
Metrics	Usually replication lag/running metrics, plus possible state transitions if MySQL polling is affected.
Events	Degraded/replication events unless the asymmetry also makes the primary unreachable to the operator.

Asymmetric partitions are exactly why GTID reconciliation after heal matters. If either side accepted writes that the final primary did not receive, the returning site is fenced and marked divergent.
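The self-fencing rule for a writable primary can be sketched as follows. The function name and its inputs (seconds since the last successful contact with each party) are hypothetical; the real sidecar may track reachability differently.

```python
def should_self_fence(writable, operator_silence, peer_silences, lease_timeout=20.0):
    """Decide whether a sidecar should set super_read_only=ON.

    operator_silence: seconds since the operator was last reachable.
    peer_silences: seconds since each peer was last reachable.
    """
    if not writable:
        # Replicas are already read-only, so there is nothing to fence.
        return False
    # Time since this site last reached anyone at all.
    # Assumption: with no peers configured, only the operator counts.
    shortest_silence = min([operator_silence, *peer_silences])
    return shortest_silence > lease_timeout


if __name__ == "__main__":
    # Operator and the only peer both silent for more than 20s: fence.
    print(should_self_fence(True, 25.0, [30.0]))
    # One peer answered a second ago: stay writable.
    print(should_self_fence(True, 25.0, [30.0, 1.0]))
```

Taking the minimum silence encodes the "one reachable peer is enough" rule: a single recent contact with the operator or any peer keeps the primary writable.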

Scenario E: both sites unreachable to the operator

Aspect	Expected behavior
Observable signal	All sites become `unreachable`; `Degraded=True` with total-loss semantics.
Operator action	No promotion. There is no reachable candidate to promote.
Sidecar action	Any still-running writable site self-fences after `leaseTimeout` if it can reach neither the operator nor any peer.
Metrics	`bloodraven_site_state{state="unreachable"}=1` for all sites; no failover counter increment.
Events	Total-loss / degraded events.
Human action	Restore at least one site or the operator network path. Once one site is reachable, the operator resumes normal reconciliation.

Testing partitions

Use pod-level NetworkPolicy or CNI-native fault injection. Host-level iptables rules in a k3d node are not reliable for Kubernetes Service traffic because kube-proxy DNAT and pod networking happen in different paths.

Playground examples live in playground/chaos-scenarios.md and use NetworkPolicy-based partitions. Always run ./playground/chaos.sh recover or delete the NetworkPolicy manually after a test.
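For orientation, a pod-level partition can be expressed as a deny-all NetworkPolicy. The sketch below is not taken from the playground scripts: the policy name, namespace, and label are hypothetical placeholders, and it only has an effect on a CNI that enforces NetworkPolicy. It emits JSON, which `kubectl apply -f -` accepts.

```python
import json


def isolation_policy(name, namespace, pod_labels):
    """Build a NetworkPolicy that drops all ingress and egress for the selected pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": dict(pod_labels)},
            # Both policy types listed with no ingress/egress rules means deny-all.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # Hypothetical name, namespace, and label; use the ones your MySQL pods carry.
    policy = isolation_policy("partition-iad", "default", {"example.com/site": "iad"})
    print(json.dumps(policy, indent=2))
```

Because the policy blocks traffic in both directions, the selected pods lose operator polls, peer checks, and replication at once, which approximates Scenario A when the operator and the other site sit outside the selector. Deleting the policy heals the partition.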

On-call checklist

Identify the current active site: kubectl get mysqlfailovergroup orders -o jsonpath='{.status.activeSite}'.
Print per-site state: kubectl get mysqlfailovergroup orders -o jsonpath='{range .status.sites[*]}{.name}: {.state} lag={.secondsBehindSource} recovery={.recoveryState}{"\n"}{end}'.
Check whether bloodraven_failovers_total increased. If not, the operator likely chose to alert rather than promote.
If a site returns with divergentGtid, do not manually attach it as a replica. Follow the reclone flow.
After recovery, verify replication lag is below your RPO threshold and that DNS/external-dns has converged if clients use external DNS.
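The first checklist steps can also be scripted against the JSON form of the resource. The sketch below uses only the status fields named in this runbook; it assumes `divergentGtid` is reported per site next to `recoveryState`, and the sample object is invented for illustration. In practice you would load the output of `kubectl get mysqlfailovergroup orders -o json` instead of `SAMPLE`.

```python
import json

# Invented sample; replace with the parsed output of `kubectl get ... -o json`.
SAMPLE = json.loads("""
{"status": {"activeSite": "pdx", "sites": [
  {"name": "iad", "state": "unreachable", "secondsBehindSource": null,
   "recoveryState": "pending", "divergentGtid": "aaa:9-10"},
  {"name": "pdx", "state": "writable", "secondsBehindSource": 0,
   "recoveryState": null}
]}}
""")


def summarize(group):
    """Return (active_site, per-site lines, names of sites with divergentGtid set)."""
    status = group.get("status", {})
    lines, divergent = [], []
    for site in status.get("sites", []):
        lines.append(
            f"{site.get('name')}: {site.get('state')} "
            f"lag={site.get('secondsBehindSource')} recovery={site.get('recoveryState')}"
        )
        # Assumption: divergentGtid lives on the per-site status entry.
        if site.get("divergentGtid"):
            divergent.append(site.get("name"))
    return status.get("activeSite"), lines, divergent


if __name__ == "__main__":
    active, lines, divergent = summarize(SAMPLE)
    print(f"active site: {active}")
    print("\n".join(lines))
    for name in divergent:
        print(f"{name} is divergent: do not reattach manually, follow the reclone flow")
```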
