Skip to main content

Operations

operations infographic

Day-2 operations guide for managing Bloodraven failover groups, including manual intervention, recovery procedures, and maintenance tasks. Before rolling Bloodraven out to production, also walk through the Production hardening checklist and keep the Failure-mode matrix open during incidents. The Log schema contract lists the stable msg strings to filter on when correlating an incident in your log aggregator.

:::tip Incident entry point Use Operations Overview, Runbooks, and Alert To Runbook Map during active incidents. This page remains the detailed operations reference. :::

:::tip CLI shortcut Most day-2 commands on this page have a kubectl bloodraven ... equivalent in the kubectl plugin. The plugin validates inputs before posting, supports synchronous --wait, and keeps the annotation grammar honest. :::

Operations decision table

TaskUse whenDo not use whenVerify with
Planned failoverHealthy target and scheduled maintenanceSplit-brain or target laggingstatus.plannedFailover, DNS, app writes
Manual promotionOperator unavailable and old primary fencedOperator is healthyactive site, GTIDs, DNS
Split-brain recoveryMultiple writable sitesOnly DNS is stalewritable state and GTID comparison
Reclone old primaryDivergence accepted or no data to recoverData owner has not reviewed divergencereplication running and lag low
Backup restoreProduction recovery or drillSource artifact unverifiedMySQL smoke test and app validation

Planned failover (graceful switchover)

To promote a replica during planned maintenance, annotate the MysqlFailoverGroup with the target site name. The operator drains writes on the current primary, waits for the target to catch up, and promotes it atomically -- no kubectl exec required.

kubectl annotate mysqlfailovergroup orders \
bloodraven.shipstream.io/planned-failover=pdx

Optional per-request override (otherwise the spec.plannedFailover defaults apply):

kubectl annotate mysqlfailovergroup orders \
bloodraven.shipstream.io/planned-failover=pdx:maxLagWait=30s

Watch progress on the CR:

kubectl get mysqlfailovergroup orders \
-o jsonpath='{.status.plannedFailover}{"\n"}'

The state machine advances through Pending → Validating → Draining → WaitingForLag → Promoting → Resuming → Succeeded, emitting Kubernetes Events (PlannedFailoverStarted, PlannedFailoverDraining, PlannedFailoverLagOK, PlannedFailoverCompleted) at each transition. See Planned failover for the full lifecycle, rollback behaviour, and RPO guarantees.

Unlike the legacy kubectl exec dance below, this path:

  • Respects the anti-flap cooldown. A planned failover cannot run within spec.failoverCooldown of the last failover (planned or emergency).
  • Guarantees zero RPO on success. Promotion only proceeds after GTID_EXECUTED on the target ⊇ the fenced source.
  • Rolls back on timeout. If the target does not catch up within maxLagWait, the operator clears the fence on the source and reports status.plannedFailover.phase: Failed with reason: LagTimeout.
  • Leaves an audit trail. Every attempt is recorded on the CR and in the bloodraven_planned_failovers_total{result=...} counter.

Manual promotion (fallback, operator unreachable)

If the operator is unreachable and you must promote a site by hand, the legacy three-step dance still works. Use this only as a break-glass procedure; prefer the annotation API above whenever the operator is healthy.

Step 1: Fence the current primary

kubectl exec mysql-orders-iad-0 -c mysql -- \
mysql -u root -p -e "SET GLOBAL super_read_only=ON;"

Step 2: Stop and reset replication on the target

kubectl exec mysql-orders-pdx-0 -c mysql -- \
mysql -u root -p -e "STOP REPLICA; RESET REPLICA ALL;"

Step 3: Promote the target

kubectl exec mysql-orders-pdx-0 -c mysql -- \
mysql -u root -p -e "SET GLOBAL read_only=0;"

When the operator returns it will observe the new topology on its next poll cycle and update Services, DNS, and taints accordingly. No operator restart is required.

caution

Manual promotion bypasses the anti-flap cooldown and the zero-lag gate. Ensure you understand the current replication state before promoting a site -- data loss is possible if the target has not finished applying relay logs. The planned-failover API above is safer; reach for this only when the operator is genuinely unavailable.

Split-brain recovery

If both sites report writable (split brain), the operator will not take automatic action. You must resolve this manually:

  1. Identify the authoritative site -- determine which site has the most recent data by comparing GTID sets:

    kubectl exec mysql-orders-iad-0 -c mysql -- \
    mysql -u root -p -e "SELECT @@gtid_executed;"

    kubectl exec mysql-orders-pdx-0 -c mysql -- \
    mysql -u root -p -e "SELECT @@gtid_executed;"
  2. Fence the non-authoritative site:

    kubectl exec mysql-orders-pdx-0 -c mysql -- \
    mysql -u root -p -e "SET GLOBAL super_read_only=ON;"
  3. Re-establish replication from the authoritative site to the fenced site:

    kubectl exec mysql-orders-pdx-0 -c mysql -- \
    mysql -u root -p -e "
    CHANGE REPLICATION SOURCE TO
    SOURCE_HOST='mysql-orders-iad.default.svc.cluster.local',
    SOURCE_AUTO_POSITION=1;
    START REPLICA;
    "
  4. The operator will detect the corrected topology and resume normal operation.

danger

Split brain means both sites accepted writes independently. You may need to reconcile conflicting data manually before re-establishing replication. Replication will break if conflicting transactions exist on both sides.

Recovering a divergent old primary

After an emergency failover, the old primary may come back with transactions that never replicated. If GTID sets match, the operator sets RecoveryPending=True with reason RecoveryInProgress, reconfigures the old primary as a replica, and clears the condition after replication is healthy. If the old primary has divergent transactions, the operator fences the site and sets RecoveryPending=True with reason DivergentTransactions.

Check the divergence details:

kubectl get mysqlfailovergroup orders -o jsonpath='{range .status.sites[*]}{.name}: recoveryState={.recoveryState} divergentTxns={.divergentTransactionCount} divergentGtid={.divergentGtid}{"\n"}{end}'

To recover the site, investigate the lost transactions and then trigger a reclone:

Step 1: Investigate the lost transactions

Review the divergent GTID set to understand what data was lost. These transactions were committed on the old primary but never reached the replica before failover.

Step 2: Trigger a reclone

Use the reclone annotation to have the operator run CLONE INSTANCE on the divergent site, replacing all its data with a fresh copy from the current primary. The annotation value must include a prefix (8 or more characters) of the observed divergentGtid — this confirmation interlock prevents a fat-fingered site name from destroying the wrong replica:

# Read the divergent GTID first:
DG=$(kubectl get mysqlfailovergroup orders -o jsonpath='{.status.sites[?(@.name=="iad")].divergentGtid}')
echo "$DG"
# a1b2c3d4-0000-0000-0000-000000000000:11-15

# Annotate with <site>:<first-8+-chars-of-GTID>
kubectl annotate mysqlfailovergroup orders bloodraven.shipstream.io/reclone-site=iad:a1b2c3d4

If the prefix doesn't match the observed divergentGtid (wrong site, stale copy of the GTID, typo), the operator emits a RecloneRejected Warning Event and clears the annotation — no clone is started. Re-read status.sites[].divergentGtid and try again.

A RecloneRequested Normal Event marks the start. The site will temporarily appear unreachable during the MySQL restart that follows the clone. Progress is reported in the Bootstrapping condition.

PVC loss recovery runbook

Use this when one site's MySQL data PVC is irrecoverable: cloud volume deleted, local disk lost, filesystem corrupted beyond repair, or a failed clone left the data directory unusable.

:::danger Data loss boundary If the lost PVC belonged to a primary that accepted writes not yet replicated elsewhere, those transactions are unrecoverable from Bloodraven. PITR can only replay binlog files that survived long enough to be archived; a lost primary PVC takes its unarchived binlog tail with it. See Durability and RPO. :::

1. Identify the failed site and current primary

kubectl get mysqlfailovergroup orders \
-o jsonpath='{.status.activeSite}{"\n"}{range .status.sites[*]}{.name}: state={.state} recovery={.recoveryState} divergent={.divergentGtid}{"\n"}{end}'

Do not wipe the active primary unless you are deliberately restoring from backup. If the damaged site is the active primary, first let the operator fail over or run a planned failover to a healthy primary-candidate site.

2. Confirm the target has no recoverable divergent data

For normal old-primary divergence, use the GTID-prefix reclone flow above. For PVC loss, the target has no usable data to preserve, so the operator accepts the cold reclone form:

kubectl annotate mysqlfailovergroup orders \
bloodraven.shipstream.io/reclone-site=iad

The operator emits RecloneRequested and drives the same clone path used for automatic bootstrap. It will not reclone the current active primary.

3. If the PVC object itself is bad, delete it

If Kubernetes is still binding the pod to a corrupted claim, delete the site pod and PVC so the operator can recreate storage from the CR spec:

kubectl delete pod mysql-orders-iad-0
kubectl delete pvc data-mysql-orders-iad-0

If your cluster uses local-path storage, clear any db-readonly taint from the node before deleting the PVC; otherwise the provisioner's helper pod may be evicted and the replacement claim can stay Pending.

4. Watch bootstrap and clone progress

kubectl describe mysqlfailovergroup orders
kubectl get pods -l shipstream.io/failover-group=orders -w

Expected condition flow:

PhaseWhat you should see
RequestedRecloneRequested Event on the MysqlFailoverGroup.
BootstrappingBootstrapping=True; the target site pod may restart while CLONE INSTANCE replaces the data directory.
RecoveringThe cloned site comes back read-only and replication is configured from the active primary.
HealthyBootstrapping=False, RecoveryPending=False, target state=read-only, recoveryState empty, replication running, and divergentGtid empty.

Clone time is proportional to dataset size and storage throughput. Small playground datasets finish in minutes; production datasets can take much longer.

5. Verify the site rejoined safely

kubectl get mysqlfailovergroup orders \
-o jsonpath='{range .status.sites[*]}{.name}: state={.state} replicating={.replicating} lag={.secondsBehindSource} recovery={.recoveryState}{"\n"}{end}'

The recovered site should be read-only, replicating from the current primary, and below your spec.replication.maxLagSeconds threshold before you consider the incident closed.

caution

If the old primary came back with no divergence (GTID sets match), the operator automatically reconfigures it as a replica — no manual intervention is needed. The RecoveryPending condition only appears when divergent transactions exist.

Total loss recovery

If both sites are unreachable, the operator cannot take any action. Recovery steps:

  1. Investigate infrastructure (nodes, network, storage) at both sites
  2. Bring at least one MySQL instance back online
  3. The operator will detect the recovered site and begin normal operation
  4. Once the second site is back, re-establish replication if it does not resume automatically

Scaling storage

To resize PVCs, update the spec.sites[].storage.size field. This requires the StorageClass to support volume expansion (allowVolumeExpansion: true).

spec:
sites:
- name: iad
storage:
storageClassName: fast-ssd
size: 200Gi # increased from 100Gi

The operator will update the PVC. The underlying volume expansion behavior depends on your storage provider -- some require a pod restart.

Updating MySQL version

Change spec.image to the new version:

spec:
image: mysql:9.7

If updateStrategy: OrderedUpdate is set, the rollout follows the ordered update sequence with zero downtime. Otherwise, both sites are updated simultaneously.

See the Upgrade and version-skew policy for which MySQL tags are supported, how the 9.6 → 9.7 transition works, and the cross-site and sidecar ↔ operator skew contract.

Updating the sidecar

Change spec.sidecarImage:

spec:
sidecarImage: ghcr.io/shipstream/bloodraven-sidecar:v2.0.0

This follows the same update strategy as MySQL image changes.

Rotating MySQL credentials

Credentials mode (spec.credentials)

When using per-role credential secrets, rotation is fully automated:

  1. Update the password value in the relevant Kubernetes Secret (e.g. mysql-operator-creds)
  2. The operator detects the change, connects to the primary, and runs ALTER USER to update the MySQL password
  3. Pods are rolling-restarted to pick up the new credentials from the updated Secret

No manual MySQL commands or operator restarts are needed.

Legacy mode (spec.secretName)

  1. Update the password on both MySQL instances directly
  2. Update the Kubernetes Secret referenced by spec.secretName
  3. Restart the operator to pick up the new credentials (or wait for the Secret to be reloaded, depending on your Secret mounting configuration)

DNS provider configuration

DNS provider credentials and settings are managed entirely by external-dns, not by Bloodraven. To change your DNS provider or rotate provider credentials, update your external-dns deployment configuration. Bloodraven only manages the DNSEndpoint CR; external-dns handles syncing it to the actual DNS provider.

Maintenance mode

To prevent automatic failovers during maintenance (e.g., network changes), you can temporarily increase the failover cooldown:

kubectl patch mysqlfailovergroup orders --type merge -p '
{"spec": {"failoverCooldown": "24h"}}'

Restore it after maintenance:

kubectl patch mysqlfailovergroup orders --type merge -p '
{"spec": {"failoverCooldown": "30m"}}'

Deleting a failover group

The operator uses a finalizer to ensure graceful cleanup when a MysqlFailoverGroup is deleted:

  1. Node taints are removed from all sites
  2. The DNSEndpoint CR is automatically garbage-collected via its owner reference to the MysqlFailoverGroup
  3. MySQL pods, Services, PVCs, and the PodDisruptionBudget are deleted
  4. The finalizer is removed and the CR is deleted

To delete:

kubectl delete mysqlfailovergroup orders
caution

PVCs are deleted as part of cleanup. If you need to preserve data, back up the MySQL instances before deleting the failover group.