Multi-cluster disaster recovery
This page is the operational runbook for recovering a MysqlFailoverGroup
into a separate Kubernetes cluster when the source cluster is lost. It
covers every step from topology to DNS cutover using only the existing
MysqlFailoverGroup CRD surface (available today on main). No new
CRDs or controllers are required to execute this runbook.
Phase 1 of WISHLIST #7 introduces MysqlStandbyCluster as a passive
observability layer that surfaces source-archive freshness via the
BucketReadable and SourceConfigKnown conditions. Phase 2 adds a readiness
gate (Restorable); Phase 3 adds one-command activation. See
What MysqlStandbyCluster adds
at the bottom of this page for what ships in Phase 1 and what remains
for later phases.
Topology overview
Cross-cluster DR in Bloodraven v1 uses the shared object store (S3 or compatible) as the only channel between the source and DR clusters. There is no cross-cluster replication link, no operator-to-operator RPC, and no federation.
Key points:
- The source cluster writes: full dumps (operator-driven backup Jobs)
and sealed binlogs (sidecar archiver on the active primary, gated on
!@@read_only). The upload switches to a new primary within one archiver scan cycle after an intra-cluster failover.
- The DR cluster only reads during bootstrap. It does not run a binlog archiver of its own, because there is nothing to archive from a cluster that does not yet exist.
dr-only site roles (api/v1alpha1/types.go:280-283, Multi-site topology) are a separate concept: they are passive replicas inside the same cluster as the source MFG and are never auto-promoted. Cross-cluster DR is distinct from this.
Bucket and IAM layout
Bucket structure
Bloodraven writes to the bucket prefix configured in
spec.backup.profiles[].storage.s3.prefix. The layout under that prefix:
<prefix>/
  <mysqlbackup-name>/        # one directory per successful full dump
    @.json                   # dump sentinel: GTIDs, size, completion time
    <shard-files>.sql.gz     # mysqlsh dumpInstance artifacts
  binlogs/                   # PITR archive (when spec.backup.pitr.enabled=true)
    manifest-<site>.json     # per-site sealed-binlog manifest
    <site>/
      <binlog-file>.enc      # sealed binlogs (possibly encrypted)
  dr-cursors/                # Phase 2: reserved, not written in this release
The DR cluster only needs read access to <prefix>/ to execute this
runbook. Write access to dr-cursors/ is reserved for the
MysqlStandbyCluster verification/readiness feature (Phase 2 of WISHLIST #7).
Minimum IAM policy — AWS
Grant the DR cluster's service account or IAM role the following policy.
Replace shipstream-backups and orders/west with your actual bucket
name and prefix.
DR cluster read-only policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DRReadOnly",
"Effect": "Allow",
"Action": ["s3:ListBucket", "s3:GetObject"],
"Resource": [
"arn:aws:s3:::shipstream-backups",
"arn:aws:s3:::shipstream-backups/orders/west/*"
]
}
]
}
Scope the ListBucket resource to the bucket name (not a prefix — S3
requires bucket-level ListBucket), and the GetObject resource to the
prefix subtree. The DR restore Job uses the credentialsSecret you
reference in spec.initFromBackup.source.s3; that Secret must contain
AWS credentials (or the pod must have IRSA/Workload Identity access) that
the above policy covers.
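Substituting the bucket and prefix by hand is easy to get wrong, particularly the trailing `/*` on the object ARN. The following is a minimal sketch that renders the same policy from two arguments. The function name `render_dr_policy` is illustrative only and is not part of Bloodraven.

```shell
# render_dr_policy BUCKET PREFIX
# Prints the DR read-only policy shown above for the given bucket and prefix.
# Hypothetical helper: the name and interface are assumptions for illustration.
render_dr_policy() {
  local bucket="$1"
  local prefix="${2%/}"   # drop one trailing slash so the ARN ends in "<prefix>/*"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DRReadOnly",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::${bucket}",
        "arn:aws:s3:::${bucket}/${prefix}/*"
      ]
    }
  ]
}
EOF
}
```

One way to use it is `render_dr_policy shipstream-backups orders/west > dr-readonly.json`, then attach the file to the DR role with whatever IAM tooling you already use.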
:::note Phase 2 cursor writes
When MysqlStandbyCluster verification ships (Phase 2), the DR side needs
one additional write permission scoped to dr-cursors/* only. Do not add
it until that phase is in use.
:::
GCS / S3-compatible (MinIO, Ceph)
Bloodraven uses the AWS SDK v2 with a configurable endpoint. Set
spec.initFromBackup.source.s3.endpointURL (or
spec.backup.profiles[].storage.s3.endpointURL on the source) to your
endpoint. Minimum equivalent permissions:
| Cloud | Equivalent permissions |
|---|---|
| GCS (interop) | storage.objects.list + storage.objects.get on the prefix |
| MinIO | s3:ListBucket + s3:GetObject — same policy syntax |
| Ceph RGW | Same as MinIO; set endpointURL to your RGW endpoint |
Encryption passphrase distribution
When the source profile uses spec.backup.profiles[].encryption, every
dump artifact and archived binlog is wrapped in the Bloodraven BRV1
format (AES-256-GCM envelope). The DR restore Job must have the same
passphrase available.
The operator does not distribute passphrases across clusters. Follow these steps before triggering a recovery:
Step 1. Read the passphrase from the source cluster:
kubectl -n orders get secret orders-backup-passphrase \
-o jsonpath='{.data.passphrase}' | base64 -d
Step 2. Create the same Secret in the DR cluster's namespace:
kubectl -n orders create secret generic orders-backup-passphrase \
--from-literal=passphrase='<value from step 1>'
Step 3. Reference it in the DR restore manifest via
spec.initFromBackup.decryption.passphraseSecret.name. See
Backup encryption for full field semantics.
If the passphrase Secret is missing or wrong, the restore Job fails fast with:
Warning RestoreBuildFailed
initFromBackup source is encrypted but no passphrase is available;
set initFromBackup.decryption.passphraseSecret or restore the
profile's encryption.passphraseSecret
This is a clean, recoverable error — fix the Secret and delete the failed Job; the reconciler rebuilds it.
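If you have kubeconfig contexts for both clusters on one workstation, steps 1 and 2 can be combined so the passphrase never lands in your terminal scrollback or shell history. This is a sketch under those assumptions: the function name is hypothetical, the Secret name and key are the ones used in the steps above, and it relies on `kubectl create secret generic --from-file=<key>=<path>` reading from `/dev/stdin`.

```shell
# mirror_passphrase SRC_CONTEXT DR_CONTEXT
# Copies the "passphrase" key of orders-backup-passphrase from the source
# cluster to the DR cluster without printing it.
# Hypothetical helper; context names are whatever your kubeconfig defines.
mirror_passphrase() {
  local src_ctx="$1"
  local dr_ctx="$2"
  kubectl --context="$src_ctx" -n orders get secret orders-backup-passphrase \
      -o jsonpath='{.data.passphrase}' \
    | base64 -d \
    | kubectl --context="$dr_ctx" -n orders create secret generic \
        orders-backup-passphrase --from-file=passphrase=/dev/stdin
}
```

This only works while the source API server is still reachable, so run it as part of DR preparation rather than during an incident.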
Source fencing checklist
Before you trigger a recovery into the DR cluster, confirm the source is genuinely down. Promoting the DR copy while the source is still accepting writes produces two writable clusters. Bloodraven v1 does not automatically detect or resolve cross-cluster split-brain — divergent GTIDs accumulate on the source side and can only be audited post-hoc (see Durability and RPO).
Use at least two of these three signals before proceeding:
Signal 1 — Source operator /active-site endpoint returns 5xx:
curl -f 'http://<source-operator-aux-svc>:8082/active-site?namespace=orders&group=orders'
# Healthy: {"activeSite":"iad"}
# Down: curl: (22) The requested URL returned error: 503
The operator's auxiliary HTTP server (cmd/bloodraven/main.go:361) serves
/active-site on port 8082 and requires namespace plus group query
parameters. It returns 503 when it cannot reach any writable site.
Signal 2 — Source cluster API server is unreachable:
kubectl --context=<source-context> get nodes --request-timeout=10s
# Error from server: timeout waiting for server response
Signal 3 — Source MySQL endpoint is TCP-unreachable from a third vantage point:
# From a network location that is NOT one of the two clusters
nc -zv <source-mysql-lb-ip> 3306
# Connection refused / timeout
If any signal is ambiguous — for example, the API server is reachable but appears degraded — wait for a clearer signal rather than promoting. The cost of delaying a few minutes is far lower than the cost of a split-brain incident.
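The two-of-three rule can be written down as a small gate so the decision is not made from memory during an incident. The sketch below is not part of Bloodraven; the function name and the `down` / `up` / `ambiguous` vocabulary are assumptions chosen for illustration. It refuses to proceed if any reading is ambiguous, matching the guidance above.

```shell
# source_confirmed_down SIGNAL...
# Each argument is the operator's reading of one signal: "down" or "up".
# Any other value is treated as ambiguous and blocks promotion.
# Succeeds (exit 0) only when at least two signals read "down".
source_confirmed_down() {
  local down=0
  local s
  for s in "$@"; do
    case "$s" in
      down) down=$((down + 1)) ;;
      up) ;;
      *)
        echo "ambiguous signal '$s': wait for a clearer reading" >&2
        return 2
        ;;
    esac
  done
  [ "$down" -ge 2 ]
}
```

For example, `source_confirmed_down down down up && echo "proceed to recovery"` prints the message, while `source_confirmed_down down up up` does not.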
:::warning Split-brain risk
Bloodraven v1 does not fence the source before or after you promote the
DR cluster. Applying a manifest with spec.initFromBackup on the DR cluster
while the source is still alive produces two writable groups. Confirm at
least two of the three signals above before proceeding.
:::
Recovery procedure
This walkthrough assumes:
- The source MysqlFailoverGroup is named orders, in namespace orders, and had PITR enabled under a profile named nightly-s3 stored at s3://shipstream-backups/orders/west.
- The DR cluster has network access to the same S3 bucket.
- The passphrase Secret has been mirrored (see above).
- The DR cluster has a fresh namespace orders with Bloodraven installed (same operator version as the source; see Install production).
Step 1 — Identify the recovery point
Locate the most recent successful dump in the source bucket:
aws s3 ls s3://shipstream-backups/orders/west/ --recursive \
| grep '@\.json' | sort -k1,2 | tail -5
Each directory under orders/west/ that contains @.json is a
completed dump. The @.json holds the completion timestamp, GTID set,
and dump size. Pick the most recent @.json whose directory you want to
recover from. Note its prefix — for example,
orders/west/orders-nightly-20260520.
To target a specific point in time (e.g., 14:32 UTC before a bad
migration), note the desired stopDatetime. Binlogs replayed after the
dump will be bounded by the newest sealed binlog archived before the
event; the unarchived tail on the lost primary's PVC is not recoverable
(see PITR and the RPO window).
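To avoid picking the dump directory by eye, the listing can be reduced to a single prefix. The sketch below reads `aws s3 ls --recursive` output on stdin and prints the directory of the most recently uploaded `@.json`. It is an illustration, not an operator feature: the function name is hypothetical, it assumes object keys contain no spaces, and it orders by the object's upload timestamp in the listing rather than by the completion time recorded inside `@.json`.

```shell
# latest_dump_prefix
# stdin:  lines in "aws s3 ls --recursive" format: DATE TIME SIZE KEY
# stdout: the key prefix (directory) of the newest @.json sentinel
latest_dump_prefix() {
  awk '$4 ~ /\/@\.json$/ { print $1 "T" $2, $4 }' \
    | LC_ALL=C sort \
    | tail -n 1 \
    | awk '{ sub(/\/@\.json$/, "", $2); print $2 }'
}
```

Usage: `aws s3 ls s3://shipstream-backups/orders/west/ --recursive | latest_dump_prefix`. The printed value is what goes into spec.initFromBackup.source.s3.prefix in the next step.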
Step 2 — Apply the DR MysqlFailoverGroup
Create a MysqlFailoverGroup on the DR cluster configured for the DR
environment's nodes, IPs, and DNS. Set spec.initFromBackup to pull from
the source bucket.
Source group configuration (reference — source cluster):
apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
  name: orders
  namespace: orders
spec:
  sites:
    - name: iad
      role: primary-candidate
      zone: us-east-1a
      taintNodeSelector:
        shipstream.io/failover-group.orders: "true"
        shipstream.io/site.orders: iad
      lbIP: 10.0.10.11
      storage: { storageClassName: gp3, size: 500Gi }
    - name: iad-2
      role: primary-candidate
      zone: us-east-1b
      taintNodeSelector:
        shipstream.io/failover-group.orders: "true"
        shipstream.io/site.orders: iad-2
      lbIP: 10.0.10.12
      storage: { storageClassName: gp3, size: 500Gi }
  secretName: mysql-operator-creds
  dns:
    hostname: orders-east.example.com
    ttl: 60
  backup:
    profiles:
      - name: nightly-s3
        storage:
          type: S3
          s3:
            bucket: shipstream-backups
            prefix: orders/west
            region: us-west-2
            credentialsSecret: s3-backup-creds
        encryption:
          passphraseSecret:
            name: orders-backup-passphrase
    pitr:
      enabled: true
      profileName: nightly-s3
DR cluster recovery manifest:
apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
  name: orders
  namespace: orders
spec:
  # DR-local node topology — different nodes, IPs, and zones from source
  sites:
    - name: east-1
      role: primary-candidate
      zone: us-east-1a
      taintNodeSelector:
        shipstream.io/failover-group.orders: "true"
        shipstream.io/site.orders: east-1
      lbIP: 10.1.20.11 # DR cluster's LB IP
      storage: { storageClassName: gp3, size: 500Gi }
    - name: east-2
      role: primary-candidate
      zone: us-east-1b
      taintNodeSelector:
        shipstream.io/failover-group.orders: "true"
        shipstream.io/site.orders: east-2
      lbIP: 10.1.20.12
      storage: { storageClassName: gp3, size: 500Gi }
  secretName: mysql-operator-creds # must exist in DR namespace
  dns:
    hostname: orders-east.example.com
    ttl: 60
  initFromBackup:
    source:
      s3:
        bucket: shipstream-backups
        prefix: orders/west/orders-nightly-20260520 # exact dump directory
        region: us-west-2
        credentialsSecret: s3-dr-readonly-creds # DR read-only creds
    decryption:
      passphraseSecret:
        name: orders-backup-passphrase # mirrored from source
    pointInTime: # omit if recovering to dump GTID
      stopDatetime: "2026-05-20T14:32:00Z"
Apply to the DR cluster:
kubectl --context=<dr-context> apply -f dr-orders.yaml
Step 3 — Monitor the restore
kubectl --context=<dr-context> -n orders get mysqlfailovergroup orders -w
Watch status.restore.phase. The sequence:
| Phase | Meaning |
|---|---|
| Pending | Restore Job not yet created |
| Running | util.loadDump() is running (and optionally mysqlbinlog replay after) |
| Succeeded | Dump and any PITR replay completed; bootstrap continues |
| Failed | Inspect status.restore.message and Job logs |
Full status:
kubectl --context=<dr-context> -n orders describe mysqlfailovergroup orders
The initFromBackup path is one-shot and runs before the failover group
is considered ready (api/v1alpha1/backup_types.go:552-557). Once
status.restore.phase == Succeeded, bootstrap (clone, replication,
fencing) completes normally.
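If you would rather block a script on the restore than watch it interactively, the phase can be polled. The sketch below is an assumption-laden illustration: the function name, polling interval, and attempt limit are arbitrary choices, and it reads only the status.restore.phase and status.restore.message fields named on this page.

```shell
# wait_for_restore DR_CONTEXT [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Polls status.restore.phase of MysqlFailoverGroup "orders" in namespace
# "orders". Returns 0 on Succeeded, 1 on Failed, 3 when attempts run out.
wait_for_restore() {
  local ctx="$1"
  local interval="${2:-15}"
  local attempts="${3:-240}"
  local i=0
  local phase
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl --context="$ctx" -n orders get mysqlfailovergroup orders \
      -o jsonpath='{.status.restore.phase}')
    case "$phase" in
      Succeeded)
        echo "restore Succeeded"
        return 0
        ;;
      Failed)
        # Surface the operator's failure message on stderr.
        kubectl --context="$ctx" -n orders get mysqlfailovergroup orders \
          -o jsonpath='{.status.restore.message}' >&2
        return 1
        ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for restore" >&2
  return 3
}
```

With the defaults this waits up to an hour. Large dumps may need a higher attempt limit, so size it to your data volume rather than treating the default as a recommendation.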
Step 4 — Verify the DR group is healthy
kubectl --context=<dr-context> -n orders get mysqlfailovergroup orders -o yaml \
| grep -A5 'conditions:'
Wait for Ready=True. Check that at least one site is writable and
replication is up on the peer:
kubectl --context=<dr-context> -n orders get mysqlfailovergroup orders \
-o jsonpath='{.status.activeSite}'
Step 5 — DNS cutover
Bloodraven writes a DNSEndpoint CR per site using
external-dns/endpoint-controller.io/v1 (api/v1alpha1/types.go:371-384,
internal/platform/dns.go:23-31). The operator controls the
per-cluster DNS record for spec.dns.hostname in whatever cluster it
runs in. It does not control the global application-facing record.
Choose one of these cutover patterns:
Option A — Weighted CNAME with health checks (recommended for production)
Maintain a weighted CNAME (Route53, Google Cloud DNS, Akamai GTM) or a GSLB record that points at both clusters. On source loss:
- Set source weight to 0 (or let health checks drop it automatically if the source LB IP is unreachable).
- Set DR weight to 100.
- TTL propagation time depends on your DNS TTL. Pre-reducing TTL to ≤ 60 s before an expected maintenance window shortens this.
Option B — Manual flip (simplest)
# Point the application CNAME at the DR cluster's DNS name
aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "db.example.com",
"Type": "CNAME",
"TTL": 60,
"ResourceRecords": [{"Value": "orders-east.example.com"}]
}
}]
}'
Option C — GSLB / Akamai GTM
Mark the source origin as offline in the GTM policy. The GSLB health monitor will have already removed it from rotation if the TCP probe to MySQL is failing.
After DNS propagation, confirm applications are connecting to the DR
cluster (check status.activeSite and replication lag on the DR group).
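Whichever option you choose, it helps to confirm from a client vantage point that the application-facing name now resolves to the DR hostname. A minimal sketch, assuming the record is a CNAME as in Option B and that `dig` is installed; the function name is hypothetical.

```shell
# cname_points_at NAME EXPECTED_TARGET
# Succeeds when the first CNAME answer for NAME equals EXPECTED_TARGET.
# Trailing dots are ignored on both sides.
cname_points_at() {
  local name="$1"
  local expected="${2%.}"
  local actual
  actual=$(dig +short "$name" CNAME | head -n 1)
  [ "${actual%.}" = "$expected" ]
}
```

For example, `cname_points_at db.example.com orders-east.example.com && echo "cutover visible"`. A resolver may keep serving the old answer until the previous TTL expires, so a failure shortly after the change is not necessarily an error.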
Failback narrative
When the source cluster returns, you can reverse the recovery path. This is a manual, prose-only procedure in v1; Phase 4 of WISHLIST #7 will document and partially automate it.
The high-level steps:
- Confirm the DR cluster is steady-state primary. New writes are landing on the DR cluster; intra-cluster replication (if multi-site) is healthy.
- Wipe the source cluster's MySQL state. Delete the old MysqlFailoverGroup CR. The operator cascades to Deployments, Services, and DNSEndpoints; PVCs are not cascaded, so delete them explicitly:
  kubectl --context=<source-context> -n orders delete mysqlfailovergroup orders
  kubectl --context=<source-context> -n orders delete pvc \
    -l bloodraven.shipstream.io/failover-group=orders
- Ensure the DR group's backup prefix does not collide with the original source prefix. A directional prefix convention avoids ambiguity: for example, use orders/east/ for the DR group's dumps and orders/west/ for the original source. Update the DR group's spec.backup.profiles[].storage.s3.prefix accordingly before the DR group takes its first backup.
- Stand up a fresh MysqlFailoverGroup on the source cluster with spec.initFromBackup pointing at the DR cluster's new prefix (the backup that the now-primary DR cluster took). This is the same pattern as the original recovery, but in reverse. Wait for status.restore.phase=Succeeded.
- Cut DNS back if desired. This step is optional and human-driven. Bloodraven v1 does not auto-rebalance traffic across clusters.
Phase 4 (WISHLIST #7) introduces MysqlStandbyCluster symmetric failback:
a second standby CR on the returning source cluster pointing at the DR
cluster's new prefix, with the same continuous verification and
one-command activation path that Phase 3 provides for the forward
direction.
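The prefix-collision check in the failback steps is easy to automate before the DR group takes its first backup. The sketch below treats two prefixes as colliding when they are equal or when one is nested inside the other. Counting nesting as a collision is an assumption made here for safety, not documented operator behavior, and the function name is hypothetical.

```shell
# prefixes_collide PREFIX_A PREFIX_B
# Succeeds (exit 0) when the prefixes are equal or one contains the other,
# comparing whole path segments so "orders/west" and "orders/west2" differ.
prefixes_collide() {
  local a="${1%/}/"
  local b="${2%/}/"
  case "$a" in "$b"*) return 0 ;; esac
  case "$b" in "$a"*) return 0 ;; esac
  return 1
}
```

For example, `prefixes_collide orders/east orders/west || echo "safe to back up"` prints the message, while `prefixes_collide orders orders/west` reports a collision.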
What MysqlStandbyCluster (Phase 1) adds today
WISHLIST #7 Phase 1 introduces MysqlStandbyCluster — a new CRD that
lives on the DR cluster and continuously monitors the source bucket.
What Phase 1 ships:
- A new MysqlStandbyCluster CR (shipstream.io/v1alpha1, short name msc) that names the relationship between a source MFG's backup archive and this DR cluster.
- A passive reconciler (MysqlStandbyClusterReconciler) that runs a discovery loop on a configurable cadence (default 5 minutes):
  - Lists objects under the configured S3 prefix to find the most recent successful full dump and per-site binlog manifests.
  - Populates status.discovered with the dump location, completion time, GTID set, and binlog window timestamps.
  - Stamps BucketReadable=True/False based on whether the bucket scan succeeded.
  - Stamps SourceConfigKnown=True/False based on whether at least one dump and one binlog manifest were found.
- No mysqld is started, no dump is loaded, and no MysqlFailoverGroup is materialized. Phase 1 is purely observational.
What Phase 1 does NOT ship:
- Continuous restore verification (Restorable condition): Phase 2.
- One-command activation (dr-activate kubectl plugin + a future activate-request state machine): Phase 3.
- Failback tooling: Phase 4.
The spec fields that drive Phase 2 verification and Phase 3 activation are
not part of v1alpha1 yet. They will be added back (backward-compatibly)
when that code ships, rather than shipping inert fields that enforce nothing
today. The template field is the exception: it is required at create-time
so the full activated topology is validated while the cluster is calm, not
mid-incident during a promote.
An on-call engineer recovering from a cluster loss still executes this
runbook manually. Phase 1's value is operational visibility: you can see
at a glance whether the source bucket is readable and whether a dump +
binlog window exist, without having to run aws s3 ls yourself.
Sample MysqlStandbyCluster CR:
apiVersion: shipstream.io/v1alpha1
kind: MysqlStandbyCluster
metadata:
  name: orders-east-from-west
  namespace: orders
spec:
  transport: ObjectStore
  source:
    failoverGroupName: orders
    cluster: us-west-prod # informational
    namespace: orders
    profileName: nightly-s3
    storage:
      type: S3
      s3:
        bucket: shipstream-backups
        prefix: orders/west
        region: us-west-2
        credentialsSecret: s3-dr-readonly-creds
    decryption:
      passphraseSecret:
        name: orders-backup-passphrase
  # template declares the MysqlFailoverGroup to materialise on activate
  # (Phase 3). It is validated as a full MysqlFailoverGroupSpec at create
  # time — dns and at least two primary-candidate sites are required — so fill
  # it in now, not during an incident. See examples/standby-cluster.yaml for
  # the canonical reference.
  template:
    name: orders-east-dr
    spec:
      image: mysql:9.6
      credentials:
        operatorSecret: orders-mysql-operator
        appSecret: orders-mysql-app
      dns:
        hostname: orders-east.example.com
        ttl: 60
      sites:
        - name: east-az1
          role: primary-candidate
          zone: us-east-1a
          taintNodeSelector:
            shipstream.io/failover-group.orders: "true"
            shipstream.io/site.orders: east-az1
          lbIP: 10.0.1.1
          storage:
            storageClassName: fast-ssd
            size: 100Gi
        - name: east-az2
          role: primary-candidate
          zone: us-east-1b
          taintNodeSelector:
            shipstream.io/failover-group.orders: "true"
            shipstream.io/site.orders: east-az2
          lbIP: 10.0.2.1
          storage:
            storageClassName: fast-ssd
            size: 100Gi
  # freshness.discoveryInterval (default 5m) controls the scan cadence.
  # The Phase 2 verification/staleness knobs and the Phase 3 activate block
  # are not part of v1alpha1 yet — they will be added back
  # (backward-compatibly) when that code ships. Phase 1 reads source.storage
  # and source.decryption to drive the discovery loop.
Check status:
kubectl -n orders get mysqlstandbycluster orders-east-from-west -o yaml
Look for status.discovered.dumpLocation, status.discovered.dumpCompletionTime,
and the BucketReadable and SourceConfigKnown conditions.
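The two conditions can also be read individually, which is convenient in a pre-incident checklist or a cron probe. This sketch assumes the conditions are published under status.conditions in the usual Kubernetes shape (a list of objects with type and status fields); that layout and both function names are assumptions, not something this page confirms.

```shell
# standby_condition NAME TYPE
# Prints the status ("True", "False", ...) of one condition on a
# MysqlStandbyCluster in namespace "orders".
standby_condition() {
  local name="$1"
  local type="$2"
  kubectl -n orders get mysqlstandbycluster "$name" \
    -o jsonpath="{.status.conditions[?(@.type==\"${type}\")].status}"
}

# standby_archive_visible NAME
# Succeeds only when both Phase 1 conditions are True.
standby_archive_visible() {
  [ "$(standby_condition "$1" BucketReadable)" = "True" ] &&
    [ "$(standby_condition "$1" SourceConfigKnown)" = "True" ]
}
```

For example, `standby_archive_visible orders-east-from-west || echo "archive not visible from DR"`. Both conditions being True only says that a dump and a binlog manifest were found; it does not prove the dump restores, which is what Phase 2 is for.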
Planned follow-up phases (tracking issue: WISHLIST #7):
- Phase 2 (continuous DR readiness): the reconciler schedules periodic MysqlBackupVerification runs against the discovered dump + binlog window, publishing a Restorable condition and a bloodraven_dr_restorable_timestamp_seconds gauge. The source operator gains awareness of dr-cursors/ sentinel objects to prevent premature binlog pruning. Prerequisite: WISHLIST #43 (PITR E2E scenarios).
- Phase 3 (one-command activation): kubectl bloodraven dr-activate plus a future confirm-token gate materializes a writable MysqlFailoverGroup from the discovered dump. The activation state machine is phase-stamped and durable across operator restarts. The spec field that carries the confirm token will be added when this lands.
- Phase 4 (failback tooling): symmetric MysqlStandbyCluster on the returning source cluster + extended runbook.
Until Phase 3 ships, activation always requires the manual
spec.initFromBackup steps described in this runbook.