Skip to main content

Failure-mode matrix

failure mode matrix infographic

A single-table reference for "what happens when X breaks" — mapping common failure modes to the operator's detection signal, the action it takes, roughly when it acts, and what it will not do. Use this to cross-check an incident in progress against the operator's expected behavior, and to decide which alerts to carry.

Timing assumes the shipped defaults: spec.pollInterval=2s, spec.failureThreshold=3, spec.failoverCooldown=5m, sidecar spec.sidecar.leaseTimeout=20s. Tighter / looser values shift everything proportionally.

Quick reference

FailureObservable signalOperator actionOperator time-to-actOperator limitations
Primary pod killedMySQL poll fails on primary; pod NotReady.Debounces to unreachable → promotes replica via failover sequence (fence old primary, drain relay, promote, flip DNS).≈ 6 s (detection) + up to 30 s (relay drain) ≈ 37 s total.If replica is lagging, the lag interval's worth of committed transactions is lost. If the primary pod respawns inside 6 s, no failover — the pod is back too fast to debounce.
Replica pod killedPoll fails on replica; replica becomes unreachable and Degraded=True with reason Alert.Marks replica unreachable; narrows the -replicas Service selector to exclude it. No failover — the primary is still writable.≈ 6 s.Read-only workloads hitting the -replicas Service lose endpoints until the replica returns. Backups fall back to the primary (controlled by maxLagSecondsForSource).
Primary node lostPrimary pod goes NotReady + no node heartbeat.Same as primary pod killed — fails over to the replica. Kubernetes separately reschedules the pod onto a healthy node.≈ 37 s for the failover. Pod reschedule is Kubernetes's job and depends on eviction policy (30 s – 5 min).If the PVC is zone-bound (e.g. topology.kubernetes.io/zone-pinned EBS) and the lost node was the zone's only node, the pod cannot reschedule until capacity returns. The operator has already failed over; the zone-bound PVC problem is a Kubernetes concern.
Replica node lostPoll fails on replica.Marks replica unreachable; narrows -replicas selector. No failover.≈ 6 s.Same as replica-pod-killed. Kubernetes reschedules the pod to another node with access to the PVC.
Primary PVC lostPrimary pod restarts into an empty MySQL or stuck bootstrap. GTID-empty detection fires on the next successful poll.Fails over (if the primary was active) → after recovery, auto-clones the empty site from the current primary via CLONE INSTANCE.Failover: ≈ 37 s. Clone duration depends on dataset size (minutes to hours).Committed-but-unreplicated transactions on the lost PVC are unrecoverable. Auto-clone requires MYSQL_REPLICATION_USER/PASSWORD in the operator Secret (see Failover → Prerequisites).
AZ / site partition (operator can still reach the surviving side)All poll targets at the isolated side time out; one side's pods go NotReady.Fails over to the reachable site. Sidecar on the partitioned primary self-fences at T + leaseTimeout = T + 20 s, closing the split-brain window.≈ 37 s (failover) to the reachable side. Partitioned side is fenced at 20 s regardless of operator progress.If the partitioned side had unreplicated transactions, they are "stranded" until the partition heals — at which point the operator detects divergentGtid and blocks recovery until an admin runs the reclone flow.
Cross-site partition (link between MySQL pods dies; operator still reaches both)Replica's IO thread stops receiving events; Seconds_Behind_Source climbs. Eventually Degraded=True ReplicationLagging.None automatic. The primary is still writable, the replica is still read-only. The operator alerts but does not fail over.Alert at maxLagSeconds (default 300 s).This mode is indistinguishable, from the operator's point of view, from "replica fell behind because of I/O pressure". Human judgement decides whether to force a failover.
Operator pod crashup{job="bloodraven"} == 0. Liveness probe fails. CR status stops updating.Kubernetes restarts the pod. On return, leader election resumes, polling resumes.Pod restart ≈ 5–10 s.No failover decisions happen during downtime. If the primary fails while the operator is down, writes are unavailable until the operator restart + a full failover sequence completes. Sidecars self-fence to preserve correctness.
Operator down + one site failsCombined: operator liveness fails AND one MySQL site becomes unreachable.Sidecars self-fence at T + leaseTimeout; no promotion happens until the operator returns, then it runs the normal failover sequence.Operator-down duration + ≈ 37 s.Write availability is lost for the operator-down window plus the post-restart failover. Correctness is preserved (no split-brain), but availability takes a direct hit. RPO is unchanged vs. the operator-up case.
Both sites unreachable (TotalLoss)All polls to both sites fail. Degraded=True TotalLoss.None. Emits TotalLossDetected Event and alerts. No automatic recovery is possible — nothing to fail over to.Immediate after failureThreshold polls.Requires human intervention. Once at least one site is back, the operator resumes; if both return, split-brain handling takes over. See Operations → Total loss recovery.
Split brain (writable/writable)Both sites poll as writable simultaneously.Depends on spec.splitBrainPolicy. Default (Manual): alert via SplitBrainDetected Event; require human action. Opt-in (PreferSite): fence the non-preferred site. Post-operator-initiated failover: fence the returning stale primary immediately.One poll cycle (≈ 2 s).No automatic data-merge. If the fenced side accepted writes that the winning side doesn't have, those transactions surface as divergentGtid and require a reclone. See Failover → Split-brain resolution.
Old primary returns divergedAfter a failover, the returning site has committed transactions the new primary never received.Fences the returning site, records status.sites[].divergentGtid + count, sets RecoveryPending=True with reason DivergentTransactions, emits DataLossDetected Event.One poll cycle after the site comes back (≈ 2 s).Cannot auto-rejoin — divergent GTIDs would corrupt replication. Admin must run the reclone flow; the operator enforces the confirmation interlock on the annotation. See Failover → Divergence detected.
DNS provider down / slowDNSEndpoint CR updates succeed, but application DNS lookups still return the old A-record.Operator writes the new DNSEndpoint on failover and moves on — DNS propagation is external-dns's job, not the operator's.CR update happens within the failover sequence (a few seconds). External-dns reconcile time and your DNS provider's TTL decide the app-visible delay.The operator cannot accelerate DNS propagation. Applications should use short TTLs (spec.dns.ttl, default 60 s) and not cache MySQL connection addresses longer than the TTL. Monitor external-dns independently; a stuck external-dns is an outage for writes even after the operator has "finished".
S3 (or backup storage) unreachableBackup Jobs fail; bloodraven_archiver_upload_failures climbs; bloodraven_archiver_backlog_files > 0; PITR lastUploadAt stops advancing.Backup reconciler retries per spec.backup.retry. PITR archiver retries on every scan (archivePollInterval, default 60 s).Retry backoffs are bounded by the retry spec; next archival attempt is one poll away.No data-plane impact — MySQL keeps serving writes. PITR RPO silently drifts as unarchived binlogs accumulate. Alert on bloodraven_archiver_backlog_files > 0 sustained. See Monitoring.
Anti-flap cooldown blocks needed failoverSecond failover condition arises within failoverCooldown of the previous one.Operator logs failover blocked by anti-flap cooldown and does not fail over.No automatic action until the cooldown expires (default 5 min).The planned-failover API also honours this cooldown and will be rejected with reason: CooldownActive. If you genuinely need a second failover inside the cooldown window, perform a manual promotion — that path bypasses the cooldown explicitly.
Planned failover target fails to catch upAdmin annotated bloodraven.shipstream.io/planned-failover=<site>; target's GTID_EXECUTED does not cover source's fenced GTID within maxLagWait.Operator unfences the source, stamps status.plannedFailover.phase: Failed with reason: LagTimeout, emits PlannedFailoverFailed Event. role=primary label is restored; no DNS flip occurred.maxLagWait (default 5 min).No data loss; cluster returns to pre-annotation state. Retry after replication has caught up, or use emergency failover if the source is truly gone. See Planned failover → Rollback.
Dragonfly master killedspec.dragonfly.enabled=true; status.dragonfly.activeSite is unreachable, and one replica is reachable and not syncing/loading.Promotes the replica with REPLTAKEOVER or fallback promotion, updates Dragonfly role/traffic labels, and leaves MySQL status.activeSite unchanged.Next Dragonfly manager poll plus promotion budget.Cache/session continuity is best-effort. If no healthy Dragonfly replica exists, the active Dragonfly Service may have no endpoints until recovery. MySQL availability is unaffected.
Dragonfly unavailable during MySQL emergency failoverMySQL primary is failed, and Dragonfly target/source cannot be reached or promoted.Completes MySQL failover first. Dragonfly promotion is attempted best-effort and may record bloodraven_dragonfly_promotions_total{result="failed"}.MySQL path remains ≈ 37 s; Dragonfly attempt is bounded separately.Sessions/cache may be lost or unavailable. Dragonfly never blocks emergency MySQL recovery.
Planned Dragonfly sync timeoutPlanned failover enters WaitingForDragonflySync, but the target Dragonfly replica does not reach the source offset within maxSyncWait.With onSyncTimeout=proceed (default), promotes MySQL and stamps sessionsPreserved=false. With onSyncTimeout=fail, rolls back before MySQL promotion.spec.dragonfly.plannedFailover.maxSyncWait (default 30 s).proceed preserves MySQL availability at the cost of cache/session continuity. fail preserves the old active site and avoids claiming session preservation.
Replication credentials missing / wrongAuto-clone fails; Bootstrapping condition stays in Failed.Operator reports the failure via the Bootstrapping condition and events; does not keep retrying indefinitely.Immediate (on the first bootstrap attempt).Without MYSQL_REPLICATION_USER / MYSQL_REPLICATION_PASSWORD, old-primary recovery and auto-clone are both disabled. The rest of the operator (polling, failover, DNS) keeps working.

Backup encryption failure modes

These rows apply only when a backup profile has spec.backup.profiles[].encryption set. See Backup encryption for the full threat model.

FailureDetection pathOperator actionTypical time-to-actNotes & limits
Passphrase Secret missing at restore timeRestore Job init container can't mount spec.initFromBackup.decryption.passphraseSecret; kubectl describe pod shows CreateContainerConfigError.Reconciler records DecryptionSecretMissing on the restore CR; stops creating retry Jobs after backoffLimit.Seconds (Job pod fails to start).Recovery: recreate the Secret with the original passphrase value, or delete & re-apply the restore CR. Ciphertext is unrecoverable without the exact passphrase used at backup time — treat the Secret as critical recovery material.
Passphrase rotated between backup and restoredecrypt-download init container exits non-zero with a chunk-auth failure from backupcrypto.Job pod's logs show backupcrypto: chunk 0: cipher: message authentication failed; reconciler reports DecryptionFailed.Immediate on first read.AES-GCM catches mismatched keys cleanly — there's no silent garbage load. Recovery requires producing the old passphrase; there is no re-wrap path today.
Ciphertext truncated by S3 multipart abortdecrypt-download returns backupcrypto: ciphertext truncated before final chunk (ErrTruncated).Job pod fails fast; reconciler reports BackupCiphertextTruncated.Immediate.ErrTruncated distinguishes "tail missing" from a generic auth failure. Recover by re-running the source backup and restoring from the replacement artifact.
Object has no BRV1 magic (tamper / downgrade)decrypt-download returns sidecar: object missing BRV1 magic while encryption is required (ErrTamperedOrDowngrade).Job pod fails fast; reconciler reports BackupDecryptionRejected.Immediate.Defends against an attacker with bucket write access overwriting ciphertext with attacker-chosen plaintext. Legacy mixed-encryption deployments migrating to encryption can opt back into the old passthrough behavior via BLOODRAVEN_ALLOW_PLAINTEXT_FALLBACK=1 (sidecar / decrypt-download init container) — time-bounded migration only.

How to read this table for your on-call

Two common uses:

  1. "What's the operator doing right now?" Find the row matching the symptom, read the Operator action and time-to-act cells to set your clock for when the situation should resolve on its own. If the expected time has passed and nothing has changed, check the Operator limitations column.
  2. "Do I need to page a human?" Alert on the rows whose Operator action is "none automatic" (cross-site partition, total loss, split brain, DNS provider, S3). These are the cases where the operator has deliberately deferred to a human.
  • Failover → State machine — the per-site and cross-site truth tables that drive every row's Operator action cell.
  • Failover → Failover sequence — the step-by-step sequence that produces the "≈ 37 s" number.
  • Operations — runbooks for the rows where the operator action is "none automatic".
  • Monitoring — the metrics and events each row's Observable signal column refers to.
  • Production hardening — which rows here the hardening checklist is specifically guarding against.