Failure-mode matrix

A single-table reference for "what happens when X breaks" — mapping common failure modes to the operator's detection signal, the action it takes, roughly when it acts, and what it will not do. Use this to cross-check an incident in progress against the operator's expected behavior, and to decide which alerts to carry.

Timing assumes the shipped defaults: spec.pollInterval=2s, spec.failureThreshold=3, spec.failoverCooldown=5m, and spec.sidecar.leaseTimeout=20s. Detection time is pollInterval × failureThreshold (6 s at the defaults); tighter or looser values shift the timings that derive from them proportionally.
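
The arithmetic behind the timings in the tables can be sketched as follows. This is a minimal illustration, not the operator's code: the `Timing` class and its field names are hypothetical, and the 30 s relay-drain budget is taken from the "Primary pod killed" row rather than from a documented spec field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Hypothetical container for the defaults quoted above (all in seconds)."""
    poll_interval: float = 2.0        # spec.pollInterval
    failure_threshold: int = 3        # spec.failureThreshold
    lease_timeout: float = 20.0       # spec.sidecar.leaseTimeout
    relay_drain_budget: float = 30.0  # assumption: upper bound quoted in the table

    def detection(self) -> float:
        # A site is treated as unreachable after failure_threshold failed polls.
        return self.poll_interval * self.failure_threshold

    def detection_plus_drain(self) -> float:
        # Detection plus worst-case relay drain. The table quotes about 37 s
        # for the full sequence, which also includes promotion and the DNS flip.
        return self.detection() + self.relay_drain_budget


defaults = Timing()
print(defaults.detection())             # 6.0
print(defaults.detection_plus_drain())  # 36.0
```

Halving `poll_interval` to 1 s halves detection to 3 s, while the relay-drain budget and `lease_timeout` are unaffected, so not every number in the table moves together.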

Quick reference

Failure	Observable signal	Operator action	Operator time-to-act	Operator limitations
Primary pod killed	MySQL poll fails on primary; pod `NotReady`.	Debounces to `unreachable` → promotes replica via failover sequence (fence old primary, drain relay, promote, flip DNS).	≈ 6 s (detection) + up to 30 s (relay drain) ≈ 37 s total.	If replica is lagging, the lag interval's worth of committed transactions is lost. If the primary pod respawns inside 6 s, no failover — the pod is back too fast to debounce.
Replica pod killed	Poll fails on replica; replica becomes `unreachable` and `Degraded=True` with reason `Alert`.	Marks replica `unreachable`; narrows the `-replicas` Service selector to exclude it. No failover — the primary is still writable.	≈ 6 s.	Read-only workloads hitting the `-replicas` Service lose endpoints until the replica returns. Backups fall back to the primary (controlled by `maxLagSecondsForSource`).
Primary node lost	Primary pod goes `NotReady` + no node heartbeat.	Same as primary pod killed — fails over to the replica. Kubernetes separately reschedules the pod onto a healthy node.	≈ 37 s for the failover. Pod reschedule is Kubernetes's job and depends on eviction policy (30 s – 5 min).	If the PVC is zone-bound (e.g. `topology.kubernetes.io/zone`-pinned EBS) and the lost node was the zone's only node, the pod cannot reschedule until capacity returns. The operator has already failed over; the zone-bound PVC problem is a Kubernetes concern.
Replica node lost	Poll fails on replica.	Marks replica `unreachable`; narrows `-replicas` selector. No failover.	≈ 6 s.	Same as replica-pod-killed. Kubernetes reschedules the pod to another node with access to the PVC.
Primary PVC lost	Primary pod restarts into an empty MySQL or stuck bootstrap. GTID-empty detection fires on the next successful poll.	Fails over (if the primary was active) → after recovery, auto-clones the empty site from the current primary via `CLONE INSTANCE`.	Failover: ≈ 37 s. Clone duration depends on dataset size (minutes to hours).	Committed-but-unreplicated transactions on the lost PVC are unrecoverable. Auto-clone requires `MYSQL_REPLICATION_USER`/`PASSWORD` in the operator Secret (see Failover → Prerequisites).
AZ / site partition (operator can still reach the surviving side)	All poll targets at the isolated side time out; one side's pods go `NotReady`.	Fails over to the reachable site. Sidecar on the partitioned primary self-fences at T + `leaseTimeout` = T + 20 s, closing the split-brain window.	≈ 37 s (failover) to the reachable side. Partitioned side is fenced at 20 s regardless of operator progress.	If the partitioned side had unreplicated transactions, they are "stranded" until the partition heals — at which point the operator detects `divergentGtid` and blocks recovery until an admin runs the reclone flow.
Cross-site partition (link between MySQL pods dies; operator still reaches both)	Replica's IO thread stops receiving events; `Seconds_Behind_Source` climbs. Eventually `Degraded=True ReplicationLagging`.	None automatic. The primary is still writable, the replica is still read-only. The operator alerts but does not fail over.	Alert at `maxLagSeconds` (default 300 s).	This mode is indistinguishable, from the operator's point of view, from "replica fell behind because of I/O pressure". Human judgement decides whether to force a failover.
Operator pod crash	`up{job="bloodraven"} == 0`. Liveness probe fails. CR status stops updating.	Kubernetes restarts the pod. On return, leader election resumes, polling resumes.	Pod restart ≈ 5–10 s.	No failover decisions happen during downtime. If the primary fails while the operator is down, writes are unavailable until the operator restart + a full failover sequence completes. Sidecars self-fence to preserve correctness.
Operator down + one site fails	Combined: operator liveness fails AND one MySQL site becomes `unreachable`.	Sidecars self-fence at T + `leaseTimeout`; no promotion happens until the operator returns, then it runs the normal failover sequence.	Operator-down duration + ≈ 37 s.	Write availability is lost for the operator-down window plus the post-restart failover. Correctness is preserved (no split-brain), but availability takes a direct hit. RPO is unchanged vs. the operator-up case.
Both sites unreachable (TotalLoss)	All polls to both sites fail. `Degraded=True TotalLoss`.	None. Emits `TotalLossDetected` Event and alerts. No automatic recovery is possible — nothing to fail over to.	Immediate after `failureThreshold` polls.	Requires human intervention. Once at least one site is back, the operator resumes; if both return, split-brain handling takes over. See Operations → Total loss recovery.
Split brain (`writable/writable`)	Both sites poll as `writable` simultaneously.	Depends on `spec.splitBrainPolicy`. Default (`Manual`): alert via `SplitBrainDetected` Event; require human action. Opt-in (`PreferSite`): fence the non-preferred site. Post-operator-initiated failover: fence the returning stale primary immediately.	One poll cycle (≈ 2 s).	No automatic data-merge. If the fenced side accepted writes that the winning side doesn't have, those transactions surface as `divergentGtid` and require a reclone. See Failover → Split-brain resolution.
Old primary returns diverged	After a failover, the returning site has committed transactions the new primary never received.	Fences the returning site, records `status.sites[].divergentGtid` + count, sets `RecoveryPending=True` with reason `DivergentTransactions`, emits `DataLossDetected` Event.	One poll cycle after the site comes back (≈ 2 s).	Cannot auto-rejoin — divergent GTIDs would corrupt replication. Admin must run the reclone flow; the operator enforces the confirmation interlock on the annotation. See Failover → Divergence detected.
DNS provider down / slow	`DNSEndpoint` CR updates succeed, but application DNS lookups still return the old A-record.	Operator writes the new `DNSEndpoint` on failover and moves on — DNS propagation is external-dns's job, not the operator's.	CR update happens within the failover sequence (a few seconds). External-dns reconcile time and your DNS provider's TTL decide the app-visible delay.	The operator cannot accelerate DNS propagation. Applications should use short TTLs (`spec.dns.ttl`, default 60 s) and not cache MySQL connection addresses longer than the TTL. Monitor external-dns independently; a stuck external-dns is an outage for writes even after the operator has "finished".
S3 (or backup storage) unreachable	Backup Jobs fail; `bloodraven_archiver_upload_failures` climbs; `bloodraven_archiver_backlog_files` > 0; PITR `lastUploadAt` stops advancing.	Backup reconciler retries per `spec.backup.retry`. PITR archiver retries on every scan (`archivePollInterval`, default 60 s).	Retry backoffs are bounded by the retry spec; next archival attempt is one poll away.	No data-plane impact — MySQL keeps serving writes. PITR RPO silently drifts as unarchived binlogs accumulate. Alert on `bloodraven_archiver_backlog_files > 0` sustained. See Monitoring.
Anti-flap cooldown blocks needed failover	Second failover condition arises within `failoverCooldown` of the previous one.	Operator logs `failover blocked by anti-flap cooldown` and does not fail over.	No automatic action until the cooldown expires (default 5 min).	The planned-failover API also honours this cooldown and will be rejected with `reason: CooldownActive`. If you genuinely need a second failover inside the cooldown window, perform a manual promotion — that path bypasses the cooldown explicitly.
Planned failover target fails to catch up	Admin annotated `bloodraven.shipstream.io/planned-failover=<site>`; target's `GTID_EXECUTED` does not cover source's fenced GTID within `maxLagWait`.	Operator unfences the source, stamps `status.plannedFailover.phase: Failed` with `reason: LagTimeout`, emits `PlannedFailoverFailed` Event. `role=primary` label is restored; no DNS flip occurred.	`maxLagWait` (default 5 min).	No data loss; cluster returns to pre-annotation state. Retry after replication has caught up, or use emergency failover if the source is truly gone. See Planned failover → Rollback.
Dragonfly master killed	`spec.dragonfly.enabled=true`; `status.dragonfly.activeSite` is unreachable, and one replica is reachable and not syncing/loading.	Promotes the replica with `REPLTAKEOVER` or fallback promotion, updates Dragonfly role/traffic labels, and leaves MySQL `status.activeSite` unchanged.	Next Dragonfly manager poll plus promotion budget.	Cache/session continuity is best-effort. If no healthy Dragonfly replica exists, the active Dragonfly Service may have no endpoints until recovery. MySQL availability is unaffected.
Dragonfly unavailable during MySQL emergency failover	MySQL primary is failed, and Dragonfly target/source cannot be reached or promoted.	Completes MySQL failover first. Dragonfly promotion is attempted best-effort and may record `bloodraven_dragonfly_promotions_total{result="failed"}`.	MySQL path remains ≈ 37 s; Dragonfly attempt is bounded separately.	Sessions/cache may be lost or unavailable. Dragonfly never blocks emergency MySQL recovery.
Planned Dragonfly sync timeout	Planned failover enters `WaitingForDragonflySync`, but the target Dragonfly replica does not reach the source offset within `maxSyncWait`.	With `onSyncTimeout=proceed` (default), promotes MySQL and stamps `sessionsPreserved=false`. With `onSyncTimeout=fail`, rolls back before MySQL promotion.	`spec.dragonfly.plannedFailover.maxSyncWait` (default 30 s).	`proceed` preserves MySQL availability at the cost of cache/session continuity. `fail` preserves the old active site and avoids claiming session preservation.
Replication credentials missing / wrong	Auto-clone fails; `Bootstrapping` condition stays in `Failed`.	Operator reports the failure via the `Bootstrapping` condition and events; does not keep retrying indefinitely.	Immediate (on the first bootstrap attempt).	Without `MYSQL_REPLICATION_USER` / `MYSQL_REPLICATION_PASSWORD`, old-primary recovery and auto-clone are both disabled. The rest of the operator (polling, failover, DNS) keeps working.
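
The debounce and anti-flap behavior described in the "Primary pod killed" and "Anti-flap cooldown" rows can be modeled roughly as below. This is a sketch under stated assumptions: `FailoverGate` and its methods are hypothetical names, and counting consecutive failed polls is an assumed reading of how `failureThreshold` is applied, not a description of the operator's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailoverGate:
    """Hypothetical model of the debounce and cooldown rules in the table."""
    failure_threshold: int = 3        # spec.failureThreshold
    failover_cooldown: float = 300.0  # spec.failoverCooldown, in seconds
    consecutive_failures: int = 0
    last_failover_at: Optional[float] = None

    def observe_poll(self, ok: bool) -> None:
        # A successful poll resets the counter, which is why a pod that
        # respawns before the threshold is reached causes no failover.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    def unreachable(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def decide(self, now: float) -> str:
        if not self.unreachable():
            return "wait"
        in_cooldown = (
            self.last_failover_at is not None
            and now - self.last_failover_at < self.failover_cooldown
        )
        if in_cooldown:
            return "blocked-by-cooldown"
        self.last_failover_at = now
        self.consecutive_failures = 0
        return "failover"


gate = FailoverGate()
for ok in (False, False, True, False, False, False):
    gate.observe_poll(ok)
print(gate.decide(now=100.0))  # failover

for ok in (False, False, False):
    gate.observe_poll(ok)
print(gate.decide(now=160.0))  # blocked-by-cooldown
print(gate.decide(now=401.0))  # failover
```

The second failure at t=160 s falls inside the 5-minute window and is blocked; the same condition at t=401 s is allowed because the cooldown has expired.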

Backup encryption failure modes

These rows apply only when a backup profile has spec.backup.profiles[].encryption set. See Backup encryption for the full threat model.

Failure	Detection path	Operator action	Typical time-to-act	Notes & limits
Passphrase Secret missing at restore time	Restore Job init container can't mount `spec.initFromBackup.decryption.passphraseSecret`; `kubectl describe pod` shows `CreateContainerConfigError`.	Reconciler records `DecryptionSecretMissing` on the restore CR; stops creating retry Jobs after `backoffLimit`.	Seconds (Job pod fails to start).	Recovery: recreate the Secret with the original passphrase value, or delete & re-apply the restore CR. Ciphertext is unrecoverable without the exact passphrase used at backup time — treat the Secret as critical recovery material.
Passphrase rotated between backup and restore	`decrypt-download` init container exits non-zero with a chunk-auth failure from `backupcrypto`.	Job pod's logs show `backupcrypto: chunk 0: cipher: message authentication failed`; reconciler reports `DecryptionFailed`.	Immediate on first read.	AES-GCM catches mismatched keys cleanly — there's no silent garbage load. Recovery requires producing the old passphrase; there is no re-wrap path today.
Ciphertext truncated by S3 multipart abort	`decrypt-download` returns `backupcrypto: ciphertext truncated before final chunk` (`ErrTruncated`).	Job pod fails fast; reconciler reports `BackupCiphertextTruncated`.	Immediate.	`ErrTruncated` distinguishes "tail missing" from a generic auth failure. Recover by re-running the source backup and restoring from the replacement artifact.
Object has no BRV1 magic (tamper / downgrade)	`decrypt-download` returns `sidecar: object missing BRV1 magic while encryption is required` (`ErrTamperedOrDowngrade`).	Job pod fails fast; reconciler reports `BackupDecryptionRejected`.	Immediate.	Defends against an attacker with bucket write access overwriting ciphertext with attacker-chosen plaintext. Legacy mixed-encryption deployments migrating to encryption can opt back into the old passthrough behavior via `BLOODRAVEN_ALLOW_PLAINTEXT_FALLBACK=1` (sidecar / decrypt-download init container) — time-bounded migration only.
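
When triaging a failed restore from pod logs, the error strings in the table map to reconciler conditions as sketched below. The helper `classify_decrypt_error` is hypothetical and is not part of the operator; the substrings and condition names are copied from the rows above. The missing-Secret case is not covered because it surfaces as `CreateContainerConfigError` on the pod rather than as a log line.

```python
from typing import Optional

# Substrings taken from the log messages in the table, paired with the
# condition the reconciler is described as reporting for each.
_SIGNATURES = (
    ("message authentication failed", "DecryptionFailed"),
    ("ciphertext truncated before final chunk", "BackupCiphertextTruncated"),
    ("missing BRV1 magic", "BackupDecryptionRejected"),
)


def classify_decrypt_error(log_line: str) -> Optional[str]:
    """Return the expected reconciler condition for a decrypt-download log line."""
    for needle, condition in _SIGNATURES:
        if needle in log_line:
            return condition
    return None  # not one of the documented encryption failures


line = "backupcrypto: chunk 0: cipher: message authentication failed"
print(classify_decrypt_error(line))  # DecryptionFailed
```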

How to read this table for your on-call

Two common uses:

"What's the operator doing right now?" Find the row matching the symptom, then read the Operator action and time-to-act cells to set your clock for when the situation should resolve on its own. If the expected time has passed and nothing has changed, check the Operator limitations column.
"Do I need to page a human?" Alert on the rows whose Operator action is "none automatic" (cross-site partition, total loss, split brain, DNS provider, S3). These are the cases where the operator has deliberately deferred to a human.

Related reading

Failover → State machine — the per-site and cross-site truth tables that drive every row's Operator action cell.
Failover → Failover sequence — the step-by-step sequence that produces the "≈ 37 s" number.
Operations — runbooks for the rows where the operator action is "none automatic".
Monitoring — the metrics and events each row's Observable signal column refers to.
Production hardening — which rows here the hardening checklist is specifically guarding against.
