Skip to main content

Backup verification

backup verification infographic

An untested backup is a schrödinger backup — it is both good and bad until you try to restore it, which you usually do during the incident that made you reach for it. Bloodraven's verification feature closes that loop: it periodically restores the most recent successful backup into a throwaway MySQL instance to prove the backup is actually loadable, and exports a freshness gauge so alerting can fire on stale or broken verifications.

Phase 1 established the CRD and the dump-load contract. Phase 2 layers PITR binlog replay and configurable scalar sanity queries on top of the same CRD. See WISHLIST.md #8 for remaining roadmap items (kubectl plugin integration, Grafana panels).

Concepts

  • MysqlBackupVerification CRD — one CR per verification run. Created ad-hoc via kubectl create or by the scheduled CronJob that the operator renders from a profile's verification block. Tracks phase, start/completion times, resolved backup reference, and the names of the ephemeral Pod / Job / PVC used for the run.
  • Verification schedulespec.backup.profiles[].verification on the failover group. When enabled: true, the operator materializes a CronJob whose pod runs bloodraven trigger-verification to create a MysqlBackupVerification CR.
  • Ephemeral instance — each run provisions a dedicated PVC and a dedicated Job pod. The pod boots mysqld on the PVC with --bind-address=127.0.0.1, then delegates the actual util.loadDump() to the shared restore.py script against that local instance. No external Service is created; the verification instance is never reachable from outside the pod network namespace.

What counts as "verified"

The default verification contract is the backup dump loads into an empty mysqld without error. That covers the failure modes that cause most real disasters — corrupt object store, truncated dump, missing credential rotation, storage class regression, mysqlsh version incompatibility. When you also set spec.pointInTime or spec.sanityCheck the contract extends to "archived binlogs replay on top of the dump" and "a caller-supplied SELECT returns a result that matches an expectation," respectively.

The feature deliberately does not cover:

  • Logical equivalence with the live primary (a different tool).
  • Backup encryption decryption (wishlist #13).
  • Application-level rehearsal of writes / traffic cutover (wishlist #7, #11).

Scheduling a verification

Attach a verification block to any BackupProfile:

apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
name: orders
spec:
backup:
image: container-registry.oracle.com/mysql/community-server:9.6
profiles:
- name: nightly
storage:
type: S3
s3:
bucket: orders-backups
prefix: nightly
region: us-east-2
credentialsSecret: orders-backups-creds
retentionPolicy:
count: 14
minKeep: 3
verification:
enabled: true
schedule: "0 5 * * *" # after the 02:00 backup finishes
concurrencyPolicy: Forbid
storage:
storageClassName: fast-ssd # optional; defaults to cluster default
retentionPolicy:
keepSuccessful: 30 # keep last 30 Succeeded runs
keepFailures: 10 # always keep last 10 Failed runs

The operator reconciles this into a CronJob named mysql-<group>-verify-<profile> whose pod invokes bloodraven trigger-verification --group=<group> --profile=<profile>. That subcommand creates a MysqlBackupVerification CR, which the verification reconciler picks up.

Use concurrencyPolicy: Forbid (the default) unless you are confident two verifications of the same profile can run side by side; they cannot share a PVC and Allow just stacks failed runs.

Running a one-off verification

Any operator with permission to create MysqlBackupVerification CRs can trigger an ad-hoc run:

kubectl create -f - <<'EOF'
apiVersion: shipstream.io/v1alpha1
kind: MysqlBackupVerification
metadata:
generateName: orders-nightly-drill-
namespace: bloodraven
spec:
failoverGroupRef: { name: orders }
profileName: nightly
# backupRef omitted → verifies the latest Succeeded MysqlBackup
# for (group=orders, profile=nightly).
EOF

To verify a specific historical backup:

spec:
failoverGroupRef: { name: orders }
profileName: nightly
backupRef:
name: orders-nightly-20260413-0200

PITR binlog replay

When the failover group has spec.backup.pitr.enabled: true, a verification can confirm that the archived binlog stream replays cleanly on top of the dump. Add spec.pointInTime:

spec:
failoverGroupRef: { name: orders }
profileName: nightly
pointInTime:
mode: latest # one of: none | latest | timestamp
  • mode: none (default) — no replay; equivalent to leaving pointInTime unset.
  • mode: latest — replay through the newest archived event. The operator wires a bloodraven pitr-download init container that downloads manifests + binlog files into an emptyDir, then verify.sh streams them into the ephemeral mysqld via mysqlbinlog.
  • mode: timestamp — replay stops just before the first event after spec.pointInTime.timestamp (RFC3339).

On success, status.replayedThroughBinlog is populated with the last applied file, position, and server-clock timestamp. The bloodraven_backup_verification_replay_lag_seconds gauge is published as completionTime − replayedThroughBinlog.timestamp; alert on a threshold that matches your RPO target.

Sanity query

After the dump loads (and optional PITR replay succeeds), a single scalar-returning SELECT can be evaluated:

spec:
sanityCheck:
query: "SELECT COUNT(*) FROM orders.orders WHERE created_at > NOW() - INTERVAL 7 DAY"
expect:
minRows: 1 # optional; fails if the scalar is below this
maxDurationSeconds: 60

The verification records the scalar value in status.sanityCheck.resultRow on success, or an error string on failure. maxDurationSeconds is enforced with a client-side timeout; exceeding it fails the run with reason SanityCheckTimeout, and a query error or below-floor scalar fails with SanityCheckFailed.

Lifecycle

A verification CR passes through these phases:

PhaseMeaning
PendingAccepted; finalizer stamped; no ephemeral resources yet.
ProvisioningEphemeral PVC + derived credentials Secret created.
RestoringThe verification Job is running verify.sh + restore.py.
CheckingReserved for Phase 2 sanity queries; Phase 1 bypasses this phase.
CleaningReconciler is deleting ephemeral resources.
SucceededTerminal success. Gauge advances; ephemeral resources cleaned up.
FailedTerminal failure. See KeepOnFailure below.

On the happy path the CR goes Pending → Provisioning → Restoring → Succeeded. The ephemeral Pod, Job, Service-if-any, and PVC are deleted after an optional TTL.

Storage sizing

The ephemeral PVC is always dedicated per run and auto-sized from the referenced backup's status.sizeBytes:

  • Explicit spec.storage.size wins when set.
  • Otherwise: max(10 GiB, ceil(1.5 × backupSizeBytes / 10 GiB) × 10 GiB).
  • The 1.5x multiplier gives MySQL headroom for indexes, temp files, and the occasional tablespace fragmentation that arises during load.

Override spec.storage.storageClassName when you want a faster class than the cluster default; verification wall-clock time scales roughly linearly with datadir write throughput.

Failure handling and keepOnFailure

On Succeeded, the ephemeral Pod and PVC are always deleted after spec.ttlSecondsAfterFinished (default: immediately).

On Failed, the default (spec.keepOnFailure: true) leaves the Pod and PVC in place so operators can kubectl exec into the verification instance, tail its mysqld error log, or attach an interactive mysqlsh session to inspect whatever the load got as far as. Failed verifications are still GC'd by the retention sweep once they drop out of the keepFailures window.

Set keepOnFailure: false on a one-off CR to force full cleanup even on failure — useful for repeated-probe use cases where the failure signal alone is what you care about.

Concurrency

The reconciler refuses to run two verifications against the same (group, profile) pair simultaneously. Newer CRs land in Failed with the BlockedByActiveVerification condition reason. Scheduled runs use concurrencyPolicy: Forbid by default so this rejection is rare; it mostly protects against kubectl create being issued right as the nightly schedule fires.

Metrics

Verification metrics mirror the bloodraven_backup_* family and share the (group, profile) label set:

MetricTypeMeaning
bloodraven_backup_verified_timestamp_secondsGaugeUnix time of last Succeeded verification — the freshness gauge
bloodraven_backup_verification_last_attempt_timestamp_secondsGaugeUnix time of last terminal attempt, success or failure
bloodraven_backup_verification_runs_totalCounterTerminal attempts by result="success" or result="failure"
bloodraven_backup_verification_duration_secondsHistogramWall-clock duration from StartTime to CompletionTime
bloodraven_backup_verification_replay_lag_secondsGaugecompletionTime − replayedThroughBinlog.timestamp (Succeeded + PITR)

Alerts

The obvious alert is staleness on the freshness gauge:

# Verification hasn't Succeeded in more than 48h:
time() - bloodraven_backup_verified_timestamp_seconds > 48 * 3600

And a failure-rate alert for fast signal:

# More than 2 failed verifications in 24h:
increase(bloodraven_backup_verification_runs_total{result="failure"}[24h]) > 2

Inspecting results

kubectl get mysqlbackupverifications -A
# Group Profile Phase Started Completed Age
# orders nightly Succeeded 12m 2m 12m

kubectl describe mysqlbackupverification orders-nightly-20260420
# ... includes status.backupRef, status.durationSeconds, conditions

The CR's owner references point back at the MysqlFailoverGroup, so deleting the group cascades through verifications along with the rest of its managed state.

Security

  • The verification Pod runs with the same hardened pod- and container-level SecurityContext as backup / restore Jobs (RunAsNonRoot, ReadOnlyRootFilesystem, RuntimeDefault seccomp, capability drop ALL).
  • Credentials flow through files mounted under /run/bloodraven/mysql-creds, never environment variables.
  • The ephemeral mysqld binds 127.0.0.1 only — no Service is created, so the verification instance is not reachable from anywhere outside the Pod's network namespace.

Known limitations

  • No backup encryption support — will follow wishlist #13.
  • No kubectl bloodraven verify-backup subcommand yet — will follow wishlist #18.
  • The Checking phase value is defined in the CRD enum but the reconciler does not currently transition through it at runtime; the sanity query runs inside the same Job container as the restore and the phase goes Restoring → Succeeded|Failed directly. Sanity results land on status.sanityCheck either way.