
Playground


The playground deploys a fully working Bloodraven setup on your local machine so you can watch failovers happen in real time. It runs inside any multi-node Kubernetes cluster and includes a live dashboard, a counter app that proves data survives failovers, and a simulated DNS pipeline.

No cloud account, no DNS provider, no production infrastructure required.

The playground is the recommended learning path before production. It lets you see failover, DNS steering, taints, backup, restore, and dashboard state without touching real infrastructure.

Bloodraven Playground showing the dashboard and counter app

What you get

  • Two-site MySQL cluster (IAD + PDX) managed by the Bloodraven operator
  • Real-time dashboard showing site health, replication state, DNS records, and an event log
  • Counter app that writes to MySQL through the primary service, proving state persists across failovers
  • Simulated external-dns pipeline — the operator creates DNSEndpoint CRs, external-dns watches them and pushes records to an in-memory webhook provider, and the dashboard displays the results
  • Chaos tools for triggering failovers by killing pods, cordoning nodes, or simulating network partitions

Prerequisites

You need three tools installed:

  • docker or podman — for building container images. Docker is preferred because k3d's podman support is experimental and the image-load path is faster on docker. Set BLOODRAVEN_CONTAINER_RUNTIME=podman to force podman if both are installed.
  • kubectl — for talking to your cluster
  • helm — for deploying the operator

You also need a local Kubernetes cluster with at least 2 worker nodes. We recommend k3d because it's fast and lightweight, but kind and minikube work too.

# Install k3d if you haven't already
curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

# Create a cluster with 2 worker nodes
k3d cluster create bloodraven --agents 2

That's it. You now have a 3-node cluster (1 server + 2 agents) ready to go.

Using kind or minikube instead

kind:

cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
EOF

minikube:

minikube start --nodes=2 --cpus=2 --memory=2048 --driver=docker

Setup

From the root of the repository:

./playground/setup.sh

This single script handles everything:

  1. Verifies your cluster has at least 2 nodes
  2. Labels nodes to simulate two data center sites (IAD and PDX)
  3. Builds all container images locally
  4. Loads images into your cluster (auto-detects k3d, kind, or minikube)
  5. Installs the CRDs, namespace, RBAC, and operator via Helm
  6. Creates a MysqlFailoverGroup with two MySQL sites
  7. Seeds a DNS record so the external-dns pipeline works immediately
  8. Deploys the dashboard and counter app
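
If you want to run the node-count check from step 1 by hand before starting setup, it can be approximated with a small pre-flight helper. This is only a sketch: count_nodes and require_nodes are hypothetical names, and the real check inside setup.sh may work differently.

```shell
# Pre-flight sketch, NOT the code in setup.sh.
# count_nodes and require_nodes are hypothetical helper names.

# Count the nodes kubectl can see.
# `kubectl get nodes -o name` prints one "node/<name>" line per node.
count_nodes() {
  kubectl get nodes -o name | grep -c '^node/'
}

# Succeed only when the cluster has at least $1 nodes (default: 2).
require_nodes() {
  local want="${1:-2}" have
  have="$(count_nodes)"
  if [ "$have" -lt "$want" ]; then
    echo "need at least $want nodes, found $have" >&2
    return 1
  fi
  echo "found $have nodes"
}

# Usage:
#   require_nodes 2 && ./playground/setup.sh
```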

Setup takes about 2 minutes depending on your machine.

Expected timings

| Action | Typical time |
| --- | --- |
| Full setup | About 2 minutes |
| Pod-kill failover | About 30-45 seconds when relay-log drain waits on a dead primary |
| Operator rebuild | About 30-90 seconds after image build |
| Sidecar rebuild | Several minutes because MySQL pods restart |
| Local backup test | Depends on data size; small playground data is usually under 1 minute |
| Local restore test | Depends on data size; small playground data is usually a few minutes |

Script to production concept map

| Script | Production concept |
| --- | --- |
| ./playground/setup.sh | CRD install, Helm install, tenant failover group bootstrap |
| ./playground/rebuild.sh operator | Operator rollout after controller changes |
| ./playground/rebuild.sh sidecar | MySQL pod rolling update after sidecar changes |
| ./playground/chaos.sh kill-site iad | Primary crash and emergency failover |
| ./playground/chaos.sh cordon pdx | Site capacity or scheduling failure |
| ./playground/chaos.sh recover | Infrastructure recovery after an incident |
| ./playground/reset-mysql.sh | Destructive lab reset, not a production recovery command |

Reset when things go wrong

./playground/chaos.sh recover
./playground/reset-mysql.sh
./playground/rebuild.sh operator sidecar

First-run divergence (one-time reclone)

On a brand-new cluster, both MySQL pods come up writable before the operator picks an active site. The pod that loses the election may have already executed a few transactions under its own server UUID, which the operator surfaces as RecoveryPending with a non-zero divergentTransactionCount. Until that divergence is cleared, planned failover to the losing site is rejected with "site is read-only but not replicating".

To resolve it, copy the first segment of the divergent GTID from kubectl -n bloodraven-playground get mysqlfailovergroup playground -o yaml (look for divergentGtid) and use it to confirm the reclone:

# Inspect the divergence
kubectl -n bloodraven-playground get mysqlfailovergroup playground -o yaml \
| grep -A1 divergentGtid

# Annotate with <site>:<first-segment-of-divergent-gtid> as the confirmation token.
# Example: divergentGtid 61553741-443a-11f1-... → reclone-site=pdx:61553741
kubectl -n bloodraven-playground annotate mysqlfailovergroup playground \
bloodraven.shipstream.io/reclone-site=pdx:61553741 --overwrite

The reclone takes ~60s. Expected outcome: state=read-only, replicating=true, recoveryState cleared. After that the planned failover demo below will work.
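
The confirmation token is just the site name joined to the first hyphen-separated segment of the divergent GTID, so it can be derived with shell parameter expansion instead of being copied by hand. The helper below is a minimal sketch; reclone_token is a hypothetical name and is not part of Bloodraven.

```shell
# Build the "<site>:<first-segment>" confirmation token from a divergent GTID.
# reclone_token is a hypothetical helper, not a Bloodraven command.
reclone_token() {
  local site="$1" gtid="$2"
  # ${gtid%%-*} removes everything from the first hyphen onward,
  # leaving only the first segment of the server UUID.
  printf '%s:%s\n' "$site" "${gtid%%-*}"
}

# Usage (gtid holds the divergentGtid value read from the failover group):
#   kubectl -n bloodraven-playground annotate mysqlfailovergroup playground \
#     "bloodraven.shipstream.io/reclone-site=$(reclone_token pdx "$gtid")" --overwrite
```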

Access the apps

Use kubectl port-forward to access the dashboard and counter app:

# Dashboard (real-time cluster visualization)
kubectl -n bloodraven-playground port-forward svc/dashboard 8091:8091

# Counter app (write-through-failover demo)
kubectl -n bloodraven-playground port-forward svc/counter-app 8090:8090

Then open http://localhost:8091 for the dashboard and http://localhost:8090 for the counter app.

tip

Remote access: add --address 0.0.0.0 to the port-forward command if you want to reach the apps from another machine (e.g. over Tailscale).

Try a failover

The dashboard toolbar has buttons that copy kubectl commands to your clipboard. You can also run them directly:

# Kill the IAD MySQL pod — the operator will detect the outage and fail over to PDX
kubectl delete pod -n bloodraven-playground -l shipstream.io/site=iad

# Or cordon the IAD node to simulate a full site outage
kubectl cordon $(kubectl get nodes -l topology.kubernetes.io/zone=zone-iad -o name)

Watch the dashboard — you'll see the site state change, the health banner update, DNS records flip, and the counter app reconnect to the new primary.

To restore the cordoned node (this uncordons every node; nodes that were never cordoned are unaffected):

kubectl uncordon $(kubectl get nodes -o name | tr '\n' ' ')

Chaos script

For more advanced scenarios, use the chaos script:

./playground/chaos.sh kill-site iad # Kill MySQL+Dragonfly pods at a site
./playground/chaos.sh kill-site pdx
./playground/chaos.sh cordon iad # Cordon a site's node
./playground/chaos.sh network-partition iad # Simulate a network partition
./playground/chaos.sh kill-dragonfly iad # Kill only the Dragonfly pod at a site
./playground/chaos.sh dragonfly-status # Print Dragonfly roles, traffic labels, and active endpoints
./playground/chaos.sh recover # Undo all chaos

Dragonfly co-management

The playground MysqlFailoverGroup enables spec.dragonfly, so Bloodraven also creates one Dragonfly StatefulSet per site (playground-dragonfly-iad, playground-dragonfly-pdx), one Dragonfly PodDisruptionBudget per site, and a single app-facing playground-dragonfly Service whose endpoints follow the active site. The Service selector requires both shipstream.io/dragonfly-role=master and shipstream.io/dragonfly-traffic=enabled, so removing the traffic label sheds an endpoint atomically. This is how planned failover avoids a window in which both the old and new master match the selector during REPLTAKEOVER. To disable Dragonfly, remove the spec.dragonfly block.
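
To see which pods currently satisfy both selector labels, you can query with the same two labels joined by a comma, which kubectl treats as a logical AND. The sketch below assumes the labels live on the Dragonfly pods in the bloodraven-playground namespace; active_dragonfly_pods is a hypothetical helper name.

```shell
# List the Dragonfly pods that carry BOTH selector labels.
# active_dragonfly_pods is a hypothetical helper; the label keys come from the
# Service selector described in the text.
active_dragonfly_pods() {
  local ns="${1:-bloodraven-playground}"
  # A comma inside a kubectl label selector means logical AND.
  kubectl -n "$ns" get pods \
    -l 'shipstream.io/dragonfly-role=master,shipstream.io/dragonfly-traffic=enabled' \
    -o name
}

# Usage:
#   active_dragonfly_pods
```

In steady state this would be expected to print a single pod at the active site. A pod that has the master role but has lost the traffic label no longer matches.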

Bloodraven supports Dragonfly v1.38.0+ for managed deployments. The playground pins docker.dragonflydb.io/dragonflydb/dragonfly:v1.38.0; avoid older tags because several replication, snapshot, and expiry bugs were fixed before this baseline.

Bloodraven treats Dragonfly as cache/session state, not durable application data. The operator provisions ephemeral storage, does not manage Dragonfly backups or snapshot schedules, and does not support DFLY LOAD against a live master with attached replicas because loaded data bypasses the replication journal. Topology changes can force full replica resyncs and briefly increase master latency, so schedule planned failovers and Dragonfly image rollouts the same way you would schedule cache-impacting maintenance.

The baseline playground does not enable spec.dragonfly.snapshot. That keeps normal Dragonfly pods independent of the optional RustFS/S3 path; Dragonfly v1.38 exits during startup when an S3 snapshot directory is configured but the bucket or credentials are unavailable. The D6a snapshot-restore upgrade scenario provisions the RustFS bucket on demand, temporarily enables the snapshot config, and validates that Dragonfly pods restart with the S3 snapshot directory before it requests the upgrade.

The operator owns the safety-critical Dragonfly flags it emits (--port, --admin_port, --requirepass, --break_replication_on_master_restart, and related sizing knobs). Extra spec.dragonfly.args are an escape hatch for site-specific tuning; do not use them to enable tiered storage, ACL files, Lua compatibility relaxations, TLS replication, snapshot scheduling, or load/import workflows unless the operational tradeoff is documented for that deployment.

Verify the cache subsystem before exercising failover:

# Both per-site StatefulSets reach Ready
kubectl -n bloodraven-playground get statefulset -l app.kubernetes.io/name=dragonfly

# Operator's view of which site is master, plus pod role/traffic labels
# and the active Service endpoints
./playground/chaos.sh dragonfly-status

For ad-hoc Redis-protocol queries (replication state, GET/SET against a key), launch a one-shot pod with redis-cli:

kubectl -n bloodraven-playground run redis-cli --rm -it --restart=Never \
--image=redis:7-alpine -- redis-cli -h playground-dragonfly INFO replication

The counter app (http://localhost:8090 after port-forward) writes to both MySQL and Dragonfly on every increment. The MySQL counter is durable; the Dragonfly counter (shown as "Cache (Dragonfly)" below the main number) is the session/cache continuity signal. After a planned failover with sessionsPreserved=true, the Dragonfly counter survives. After an emergency failover, it usually resets to 0 because the new master may have been an unsynced replica or a freshly-promoted empty pod.

Exercise a planned failover that includes Dragonfly session preservation:

# Click "+ Increment" a few times in the counter UI so both counters are non-zero.

# Trigger a planned failover to pdx
kubectl -n bloodraven-playground annotate mysqlfailovergroup playground \
bloodraven.shipstream.io/planned-failover=pdx

# Watch the planned-failover status walk WaitingForLag → WaitingForDragonflySync
# → PromotingDragonfly → Promoting → Resuming → Succeeded.
kubectl -n bloodraven-playground get mysqlfailovergroup playground \
-o jsonpath='{.status.plannedFailover.phase}{"\n"}'

# After Succeeded, sessionsPreserved should be true on the success path.
kubectl -n bloodraven-playground get mysqlfailovergroup playground \
-o jsonpath='{.status.plannedFailover.dragonfly}{"\n"}'

# Reload the counter UI — both the MySQL and the Dragonfly counter should
# be unchanged. The active Service now resolves to the pdx Dragonfly pod.
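
The phase query above returns a single snapshot, so following the whole transition means re-running it. A small polling loop can do that instead. This is a sketch under assumptions: wait_for_planned_failover is a hypothetical helper, it only knows the Succeeded phase named above, and it gives up after a fixed number of polls instead of detecting any failure phase.

```shell
# Poll the planned-failover phase until it reaches Succeeded or the poll
# budget runs out. wait_for_planned_failover is a hypothetical helper; the
# jsonpath is the same one used in the commands above.
wait_for_planned_failover() {
  local attempts="${1:-60}" interval="${2:-2}" phase="" i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(kubectl -n bloodraven-playground get mysqlfailovergroup playground \
      -o jsonpath='{.status.plannedFailover.phase}')"
    echo "phase: ${phase:-<none>}"
    if [ "$phase" = "Succeeded" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "gave up after $attempts polls (last phase: ${phase:-<none>})" >&2
  return 1
}

# Usage: poll up to 60 times, 2 seconds apart.
#   wait_for_planned_failover 60 2
```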

Emergency Dragonfly behavior is best-effort. Killing the active MySQL pod while Dragonfly is healthy still completes the MySQL emergency failover; Dragonfly is promoted via REPLTAKEOVER (sessions preserved when reachable) or REPLICAOF NO ONE (sessions lost), and never blocks MySQL recovery past a 10-second budget.

The D6a snapshot-restore upgrade path can be exercised directly:

make chaos-run SCENARIO=29-dragonfly-snapshot-upgrade

The ordinary Dragonfly image rollout path is covered separately and patches the image to a cached digest reference so it does not depend on pulling a new external tag:

make chaos-run SCENARIO=27-dragonfly-rolling-image-update

Exercise backup verification

The playground's minio backup profile ships with scheduled verification enabled (*/30 * * * *). The operator materializes a CronJob that fires bloodraven trigger-verification on the schedule, which creates a MysqlBackupVerification CR against the latest Succeeded MysqlBackup and copies the profile's verification block (sanityCheck, etc.) onto the CR.

To drive the feature by hand without waiting for the schedule:

# Create a bare verification and wait for it to reach a terminal phase.
# This skips profile-level inheritance (no sanityCheck / pointInTime);
# see below for the scheduled-contract path.
./playground/verify-backup.sh run minio

# List all verifications with their phase
./playground/verify-backup.sh status

# Tail the Job pod log for the most recent verification
./playground/verify-backup.sh logs

# Remove completed runs
./playground/verify-backup.sh cleanup # Succeeded only
./playground/verify-backup.sh cleanup --failed # Failed only

# Inspect the scheduled CronJob the operator materialized
./playground/verify-backup.sh schedule-list

To run the full scheduled contract on demand — the CR built exactly the way the CronJob would build it, including the profile's sanityCheck — fire the operator's CronJob as a one-off Job:

kubectl -n bloodraven-playground create job verify-now \
--from=cronjob/mysql-playground-verify-minio

A successful run provisions an ephemeral PVC + mysqld, loads the dump, runs the sanity query (if inherited from the profile), and cleans up. On failure the Pod and PVC are retained so you can kubectl exec in and inspect why the load failed.
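
Kubernetes rejects a Job whose name already exists in the namespace, so re-running the create job command with the fixed name verify-now fails until the earlier Job is deleted. One way around this is to generate a unique name per run. The helpers below are a sketch; verify_job_name and run_verification_now are hypothetical names.

```shell
# Fire the verification CronJob as a one-off Job with a unique name.
# verify_job_name and run_verification_now are hypothetical helpers.

# Produce a name such as verify-1700000000 (seconds since the epoch).
verify_job_name() {
  printf 'verify-%s\n' "$(date +%s)"
}

# Create the Job from the operator-managed CronJob named in the text.
run_verification_now() {
  kubectl -n bloodraven-playground create job "$(verify_job_name)" \
    --from=cronjob/mysql-playground-verify-minio
}

# Usage:
#   run_verification_now
```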

note

The playground does not configure PITR binlog archival, so spec.pointInTime on a verification CR will be rejected by the reconciler until spec.backup.pitr.enabled=true (and a binlog archiver) is wired up on the failover group.

Dashboard features

The dashboard connects to the operator via WebSocket and polls DNS state every 3 seconds.

  • Health banner — shows overall cluster health at a glance: Healthy (green), Degraded (amber, e.g. replica not replicating), or critical states like Split Brain or No Primary (red, pulsing)
  • Site cards — show each site's state (writable, read-only, unreachable), GTID position, replication status, and lag
  • DNS records — show what external-dns has published, with a fallback to reading DNSEndpoint CRs directly from the Kubernetes API
  • Event log — real-time stream of state transitions, failovers, and DNS changes

Rebuilding after code changes

When you change Go code in the operator, sidecar, or any playground app, use the rebuild script to update your running cluster:

# Rebuild everything
./playground/rebuild.sh

# Rebuild only specific components
./playground/rebuild.sh dashboard # just the dashboard
./playground/rebuild.sh counter # just the counter app
./playground/rebuild.sh operator sidecar # operator + sidecar

The script builds the container images (using docker if available, otherwise podman; set BLOODRAVEN_CONTAINER_RUNTIME=podman to force podman), loads them into your cluster, restarts the affected deployments, and waits for rollout to complete.

Valid component names: operator, sidecar, counter, dashboard, dns-webhook.
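
If you wrap the rebuild script in your own tooling, the component list above can be checked up front so that a typo fails before any image build starts. This is a sketch only: check_components is a hypothetical helper, and rebuild.sh may perform its own validation differently.

```shell
# Validate component names before handing them to rebuild.sh.
# check_components is a hypothetical helper; the accepted names are the ones
# listed in the text.
check_components() {
  local c
  for c in "$@"; do
    case "$c" in
      operator|sidecar|counter|dashboard|dns-webhook) ;;
      *) echo "unknown component: $c" >&2; return 1 ;;
    esac
  done
}

# Usage:
#   check_components operator sidecar && ./playground/rebuild.sh operator sidecar
```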

Useful commands

# Check the failover group status
kubectl -n bloodraven-playground get mysqlfailovergroups

# List all pods
kubectl -n bloodraven-playground get pods

# Check DNS endpoint CRs
kubectl -n bloodraven-playground get dnsendpoints

# Follow operator logs
kubectl -n bloodraven-playground logs -l app.kubernetes.io/name=bloodraven -f

Teardown

To remove everything and start fresh:

./playground/teardown.sh

This deletes the namespace, Helm release, CRDs, and node labels. Your k3d/kind/minikube cluster itself is left intact so you can re-run setup.sh without recreating it.