Skip to main content

Production hardening checklist

production hardening infographic

Work through this before labeling a Bloodraven deployment "production". Each bullet links to the page with the setting's detail. Items are grouped; within a group, order doesn't matter.

Requirement levels:

LevelMeaning
RequiredMust be satisfied before production traffic.
RecommendedStrong default unless your platform has an equivalent control.
Environment-specificRequired only when that feature or environment applies.

Owner hints:

OwnerTypical responsibilities
PlatformOperator install, CRDs, RBAC, node labels, NetworkPolicy, monitoring
DatabaseMySQL sizing, backup, restore, PITR, failover drills
App teamConnection pools, DNS caching, write/read endpoint use, app alerts
SecuritySecrets, TLS, image provenance, network boundaries

Image supply chain

  • Pin spec.image, spec.sidecarImage, and spec.backup.image to immutable tags (or digests). No :latest, no MySQL :9, no :edge. A silently-drifting MySQL image breaks dump/restore and sometimes replication. See the Upgrade and version-skew policy for which MySQL tags are supported.
  • Mirror the operator, sidecar, and backup images to a private registry. A DockerHub rate-limit or Oracle registry outage should never be able to stop a pod from starting mid-incident.
  • Verify images are pulled using imagePullSecrets scoped to the namespace; service-account-wide pulls leak credentials across workloads.
  • Enable imagePullPolicy: IfNotPresent once tags are immutable.
  • If spec.dragonfly.enabled=true, pin spec.dragonfly.image to Dragonfly v1.38.0+, mirror it with the other runtime images, and do not use :latest.

Credentials

  • Use spec.credentials (per-role Secrets), not the legacy spec.secretName DSN mode. Per-role credentials are required for the backup-verification and PITR paths to hold least privilege. See CRD Reference → CredentialsSpec.
  • Populate operatorSecret, appSecret, readOnlySecret, monitorSecret, and backupSecret — five distinct users.
  • Store Secrets with a provider that supports rotation (External Secrets Operator, sealed-secrets, Vault). Rotate regularly.
  • Supply MYSQL_REPLICATION_USER / MYSQL_REPLICATION_PASSWORD explicitly. Without them, automatic old-primary recovery is skipped — see Failover → Prerequisites.

TLS

  • Set spec.tls with a cert-manager issuer. Bloodraven forces require-secure-transport when TLS is configured, so all client traffic is encrypted in-cluster.
  • Use a separate issuer for MySQL TLS from the one your mesh / ingress uses, so rotating the MySQL cert doesn't churn unrelated workloads.

Failover tuning

  • Set spec.failoverCooldown to ≥ 5 m (the default). Lower values increase the risk of cascade failovers during unstable infrastructure; higher values extend MTTR after a flapping primary comes back.
  • Set spec.replication.maxLagSeconds based on your RPO tolerance. This is an alerting threshold (Degraded=True with reason ReplicationLagging), not a promotion gate — treat it as a page-worthy signal, not a soft warning.
  • Use spec.updateStrategy: OrderedUpdate (the default). The alternative, Recreate, restarts both MySQL pods simultaneously on any spec change and creates a write outage. Ordered updates roll the replica first, fail over, then roll the former primary — zero write downtime for typical spec changes.
  • If spec.dragonfly.enabled=true, choose spec.dragonfly.plannedFailover.onSyncTimeout deliberately. The default proceed protects MySQL availability and may discard cache/session continuity; fail preserves the old active site when Dragonfly session preservation cannot be proven.

Storage

  • Use a network-attached, topology-aware StorageClass for the MySQL data PVCs (EBS + WaitForFirstConsumer, Persistent Disk, Ceph RBD, etc.). Never hostPath or emptyDir.
  • Confirm the StorageClass's reclaimPolicy is Retain for the MySQL data PVCs. Delete + a routine kubectl delete pvc turns a pod-restart case into a full reclone.
  • Confirm the StorageClass supports allowVolumeExpansion: true if you anticipate growing storage. Resizing without expansion support means a data migration.

PITR / backups

  • Enable PITR (spec.backup.pitr.enabled: true) with a maxBinlogSize sized for your RPO target. Default 100M balances rotation overhead against unarchived-tail size; smaller if your write rate is high.
  • Pick a backup profile storage that is in a different fault domain from the MySQL PVCs (different region / different provider). Backups stored on the same disk as the DB are not backups.
  • Configure at least one recurring schedule under spec.backup.schedules[] and pin its timeZone. See Backup and restore → TimeZone.
  • Set a retention policy (retentionPolicy.count, maxAgeDays, minKeep, maxFailedKeep) on every profile. An unbounded retention is a bill, not a backup strategy.
  • Run backup-verification externally (restore into a throwaway namespace periodically, assert a sanity query works). Unverified backups are Schrödinger backups.
  • Encrypt backups at rest. Set spec.backup.profiles[].encryption.passphraseSecret on every profile that writes to shared storage. Client-side envelope encryption (AES-256-GCM) keeps the passphrase in a Kubernetes-controlled Secret, so a compromised S3 credential does not expose backup contents. See Backup encryption. Back the passphrase Secret up out-of-band — losing it makes the backups unrecoverable.

Placement

  • Label nodes with a fault-domain zone topology label (e.g. topology.kubernetes.io/zone=zone-iad). Bloodraven's per-site spec.sites[].zone targets nodes via nodeAffinity — without correct node labels the sites will co-locate.
  • Set per-site resources.requests generously enough that MySQL has breathing room; resources.limits optional for memory only if you're also using innodb_buffer_pool_size tuning.
  • Honour the placement contract: apps must respect the shipstream.io/db-readonly-<group> node taint that Bloodraven applies to losing sites after a failover.

Observability

  • Scrape bloodraven on :8080/metrics and alert on:
    • bloodraven_state_transitions_total spikes
    • bloodraven_failovers_total increase (at least a warning)
    • bloodraven_replication_lag_seconds sustained > your threshold
    • bloodraven_divergent_transactions > 0 (critical — data loss)
    • up{job="bloodraven"} == 0 for > 2 × pollInterval (operator is down; see Monitoring for the full list and recommended rules)
    • bloodraven_dragonfly_site_up == 0 and bloodraven_dragonfly_promotions_total{result="failed"} when spec.dragonfly.enabled=true
  • Ship a PrometheusRule resource alongside your install with the above expressions baked in. The Helm chart installs a PodMonitor but not alerting rules; curate your own.
  • Forward Kubernetes Events for MysqlFailoverGroup and MysqlBackup CRs somewhere humans read. See Monitoring → Kubernetes Events.
  • Scrape your sidecars too: PITR archiver status is surfaced via Prometheus gauges the operator exports (labeled by site).
  • Publish / log dashboards — Grafana, otherwise the first post-incident question becomes "was this visible?".

Control-plane availability

  • Keep the operator's pod up. Use a tight liveness probe and restartPolicy: Always. Size the PodDisruptionBudget so the operator isn't evicted during routine node drains — the shipped chart does not ship a PDB for the operator; add one.
  • Leader election is on by default and stays on. Running multiple operator replicas is fine but doesn't speed up MTTR; the design tradeoffs are in the README's "Non-HA control plane" note.

Network

  • Configure NetworkPolicy so the operator Pod can reach both MySQL Services and the sidecar HTTP endpoint on :8080. A silently-blocking CNI is the #1 source of spurious unreachable states in the state machine.
  • If spec.dragonfly.enabled=true, allow the operator to reach the per-site Dragonfly client and admin ports (6379 and 9999 by default), and allow applications to reach only the active <group>-dragonfly Service.
  • Make sure external-dns (or whatever consumes the DNSEndpoint CR) is running. Failover updates the CR, but the actual DNS record only changes when the external-dns pod reconciles. Monitor its lag.
  • If you're using TLS with a private CA, mount the CA into the sidecar and operator pods so their MySQL connections trust it.

Opt-in Restricted PSS for MySQL and Dragonfly

The MySQL Deployments and Dragonfly StatefulSets render with no pod-level or container-level securityContext by default, preserving backward compat with existing PVCs whose files were created under the image's default uid. Existing clusters keep their current PodSpec unchanged on upgrade.

When you are ready to put the MySQL and Dragonfly pods under the Kubernetes Restricted Pod Security Standard, fill in the new opt-in fields on the MysqlFailoverGroup CR. The operator applies the value verbatim to the workload — it does not merge it with hardened defaults, because the right uid/gid depend on the image you run. The examples below target the upstream mysql:9.6 and Dragonfly images, both of which run as uid 999.

MySQL — spec.podSecurityContext / spec.containerSecurityContext

apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
name: orders
spec:
podSecurityContext:
runAsNonRoot: true
runAsUser: 999
runAsGroup: 999
fsGroup: 999
fsGroupChangePolicy: OnRootMismatch
seccompProfile:
type: RuntimeDefault
containerSecurityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
readOnlyRootFilesystem: false
capabilities:
drop:
- ALL
seccompProfile:
type: RuntimeDefault
# ...rest of spec unchanged

The container's readOnlyRootFilesystem is false because the upstream mysql image writes temp files outside the data PVC (/tmp, /var/run/mysqld, /etc/mysql/conf.d for entrypoint scratch). Leaving the root filesystem writable here is intentional; the PVC is what matters for durability. If your image supports a read-only root with explicit emptyDir mounts for those paths, set readOnlyRootFilesystem: true and add the mounts via your own pod template overlay.

The fsGroupChangePolicy: OnRootMismatch setting tells Kubernetes to skip the recursive chown on the data volume when the top-level directory is already owned by fsGroup. On a fresh PVC the chown runs once; on a PVC that was already migrated it is a no-op. See Upgrading existing clusters below for the migration path on a PVC that was created without an fsGroup.

spec.containerSecurityContext is applied to both the mysql and the sidecar containers. The two containers share a uid/gid because they share the data volume.

Dragonfly — spec.dragonfly.podSecurityContext / spec.dragonfly.containerSecurityContext

apiVersion: shipstream.io/v1alpha1
kind: MysqlFailoverGroup
metadata:
name: orders
spec:
dragonfly:
enabled: true
podSecurityContext:
runAsNonRoot: true
runAsUser: 999
runAsGroup: 999
fsGroup: 999
fsGroupChangePolicy: OnRootMismatch
seccompProfile:
type: RuntimeDefault
containerSecurityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
seccompProfile:
type: RuntimeDefault
# ...rest of dragonfly spec unchanged

Dragonfly does support a read-only root filesystem in this configuration: its only writable path is the data volume where it persists its RDB snapshot. Set readOnlyRootFilesystem: true.

Upgrading existing clusters

If you already run Bloodraven in production and your MySQL data PVCs were created before these fields existed, the files on disk are owned by whatever uid the image originally ran as (uid 999 for upstream mysql:9.6, something else for a fork). Naively applying fsGroup: 999 to a PVC created by a pod that ran as uid 0 will trigger a recursive chown on every pod start that can take many minutes on a large data volume — Kubernetes holds the pod in ContainerCreating for the duration, which on a primary is a write outage.

Two safe migration paths:

Option A — let Kubernetes handle it once with OnRootMismatch. If you are willing to take one extended pod-start window per pod, set fsGroupChangePolicy: OnRootMismatch (as in the example above) and apply the spec change. Kubernetes will run the recursive chown the first time each pod restarts, then notice the root directory is already correct and skip on subsequent starts. Roll the StatefulSet one pod at a time during a maintenance window; do not restart both sites at once.

Option B — chown the data offline first. For very large volumes (1 TB+) or operations teams that prefer to control the chown, run an offline Job against each MySQL PVC after scaling the StatefulSet down to zero, then apply the spec change. Example:

apiVersion: batch/v1
kind: Job
metadata:
name: orders-iad-chown
namespace: orders
spec:
template:
spec:
restartPolicy: Never
containers:
- name: chown
image: registry.k8s.io/build-image/debian-base:bookworm-v1.0.2
command:
- sh
- -c
- chown -R 999:999 /var/lib/mysql
volumeMounts:
- name: data
mountPath: /var/lib/mysql
securityContext:
runAsUser: 0
volumes:
- name: data
persistentVolumeClaim:
# Adjust the claim name to the MysqlFailoverGroup's per-site PVC.
claimName: data-orders-iad-0

Apply the Job per site, wait for it to complete, then apply the MysqlFailoverGroup spec change and scale the StatefulSet back up. Repeat on the second site after the first one is healthy.

Service account tokens on backup, restore, cleanup, and verification Jobs

Backup, restore, restore-in-place, cleanup, and verification execution Jobs now render with automountServiceAccountToken: false. None of these jobs call the Kubernetes API at runtime — they execute mysqlsh, cleanup.py, or verify.sh and communicate with MySQL over TCP — so the SA token has no purpose and is one less credential to leak from a compromised pod.

Schedule-trigger CronJob pods (the MysqlBackup / MysqlBackupVerification creator pods, not the execution Jobs they create) keep the SA token because they POST a backup or verification CR through the in-cluster API and need their ServiceAccount's create permission.

Disaster readiness

  • Rehearse failover on the playground (./playground/chaos.sh kill-site iad) and confirm your application behaves.
  • Rehearse PVC loss end-to-end at least once per quarter — delete a PVC, watch the operator auto-clone, confirm data is intact.
  • Write down the non-obvious bits that only you know (donor secrets, runbook phone numbers, which engineer holds which credential). A production system that only one person can recover is a liability.
  • Periodically check that the PITR archive has current manifests and that status.pitr.newestArchivedTime is close to "now". A stale archive is a silent RPO regression.

Change management

  • Validate CRD spec changes with kubectl apply --dry-run=server before rolling them to production. The CRD has several opinionated defaults that can silently re-align your config on apply.
  • When rotating the sidecar image independently of the MySQL image (spec.sidecarImage), the OrderedUpdate path handles it automatically. Don't kubectl delete pod to force the rollout — let the operator drive it.
  • Never force-promote a site by patching CR status directly. The operator reads status as truth from the data plane; tampering with it leads to split-brain. Use the manual-promotion procedure in Operations → Manual promotion if you truly need to override.

Go-live gate

Copy this checklist into the production release ticket:

  • CRDs installed and owned by the correct deployment system.
  • Operator image pinned.
  • Sidecar image pinned.
  • MySQL image pinned.
  • Backup image pinned.
  • Dragonfly image pinned if enabled.
  • Node labels correct for every site.
  • StorageClass tested for scheduling, expansion, and reclaim behavior.
  • external-dns tested with the production DNS provider.
  • TLS enabled for MySQL.
  • Per-role credentials configured.
  • Backups scheduled.
  • Restore verified.
  • PITR enabled if required by RPO.
  • Prometheus scraping enabled.
  • Grafana dashboards installed.
  • Minimum alerts installed and linked to runbooks.
  • NetworkPolicy installed.
  • Failover tested.
  • Application connection behavior tested.
  • Dragonfly active Service and Redis-client reconnect behavior tested if enabled.
  • On-call runbooks reviewed by responders.