Operations Overview
Use this page during operational work to choose the safest runbook quickly.
Decision table
| Situation | Use | Do not start with |
|---|---|---|
| Planned maintenance on active site | Planned Failover | Emergency promotion |
| Active site is unreachable | Failover, then Runbooks | Reclone before confirming the new primary |
| Both sites appear writable | Network Partitions and Runbooks | App restarts only |
| Old primary has divergent GTIDs | Runbooks | Auto-rejoin |
| Backup failed | Troubleshooting | Delete all old backups |
| Restore failed | Troubleshooting | Retrying without checking source and credentials |
| Operator unavailable | Operator Availability | Disabling sidecar fencing |
| DNS not moving | Troubleshooting | Manual DB promotion |
| Dragonfly degraded | Runbooks and Monitoring | MySQL emergency promotion if MySQL is healthy |
On-call path
- Check the alert in Alert to Runbook Map.
- Check MysqlFailoverGroup status and Kubernetes Events.
- Run the matching runbook header checklist before taking action.
- After remediation, verify the active site, DNS, replication, application writes, and, when enabled, Dragonfly status.
kubectl get mysqlfailovergroup orders -n orders -o wide
kubectl describe mysqlfailovergroup orders -n orders
kubectl get events -n orders --sort-by=.lastTimestamp
Related references
Test strategy and Operator SDK Scorecard
Bloodraven's test pyramid is unit tests (internal/**/*_test.go),
cross-package component tests with fakes (test/component/),
integration tests behind the Go integration build tag (run with
make test-integration), envtest controller tests against a real API
server (test/envtest/), and a Go-based chaos runner against a live k3d
cluster (cmd/playground-chaos, exposed via make chaos-list,
make chaos-run SCENARIO=<id>, and make chaos-run-all). CI runs lint,
build, unit, component, envtest, generate-check, and docs-build on every
PR (.github/workflows/ci.yml). Integration tests exist in the repo but
are not part of the default PR gate. The real-cluster end-to-end gate is
tracked separately as WISHLIST #32.
We evaluated the Operator SDK Scorecard
as an additional tier and declined to adopt it for now. Scorecard is a
containerized test runner that takes an OLM bundle as input and runs tests
as Pods against a Kubernetes cluster. All five tests in its built-in OLM
suite read a ClusterServiceVersion; its built-in basic suite consists of
the single basic-check-spec-test; and the custom and kuttl scorecard paths
both require a bundle plus a live-cluster gate.
Why we declined today
We chose to decline only after confirming that every condition below held against the repository at the time of the decision. If any condition becomes false, reopen WISHLIST #34 and re-run this rubric.
| ID | Condition | Today |
|---|---|---|
| R1 | No bundle.Dockerfile, no ClusterServiceVersion, no bundle/ directory, no config/scorecard/ kustomize templates exist. | True |
| R2 | No PROJECT file at the repo root (no operator-sdk init retrofit has been done). | True |
| R3 | No active plan to publish Bloodraven to OperatorHub or any OLM-distributed catalog (WISHLIST #30 remains open and explicitly conditional P3). | True |
| R4 | No external consumer of scapiv1alpha3.TestStatus JSON in the Bloodraven release pipeline, sibling repos, or platform tooling. | True |
| R5 | The existing pyramid (unit + component + envtest + the cmd/playground-chaos runner) already covers the failure modes Scorecard's basic and OLM suites would surface against a CSV-less, non-OLM operator, and emits richer forensic output (events, logs, raw /metrics) than scapiv1alpha3.TestStatus. | True |
The payoff from the basic suite's basic-check-spec-test is trivial: it would not surface
a real defect against any custom resource sample shipped under
examples/. The OLM tests cannot run because we ship no CSV. The custom
and kuttl paths require both a bundle stub and a live-cluster gate
(WISHLIST #32), and the resulting harness duplicates work the existing
Go-based suites already do with richer output.
When to reopen WISHLIST #34
Reopen if any of the following becomes true:
- T1. Bloodraven publishes or commits to publish an OLM bundle: bundle.Dockerfile, CSV, or config/scorecard/ kustomize templates land in the repo. (Falsifies R1.)
- T2. A PROJECT file is introduced or operator-sdk init is run on the repository. (Falsifies R2.)
- T3. Bloodraven adopts an external-distribution path that lists it on OperatorHub or an equivalent OLM catalog (e.g. WISHLIST #30 closes with that path chosen). (Falsifies R3.)
- T4. A downstream tool (CI, platform/, sibling repo, certification flow) starts consuming scapiv1alpha3.TestStatus JSON. (Falsifies R4.)
- T5. The real-cluster E2E gate (WISHLIST #32) ships and there is a documented argument that wrapping a subset of its assertions as scorecard custom tests is cheaper than maintaining them in Go. (R5 becomes worth re-evaluating.)
- T6. The Operator SDK project ships a meaningful basic-suite expansion (beyond basic-check-spec-test) that delivers signal a non-OLM operator could consume without a CSV.
When any trigger fires, the reopener should (a) flip the relevant rubric
row in the table above to False with a one-line citation, (b) reopen #34 in
WISHLIST.md, and (c) cite this section as the prior art.
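The relationship between triggers and rubric rows can be written down as a small lookup. This is an illustrative sketch only: the type and function names are hypothetical, and it encodes just what the trigger list states, namely that T1 through T4 falsify R1 through R4, that T5 makes R5 worth re-evaluating without falsifying it outright, and that T6 names no row.

```go
package main

import "fmt"

// effect records what the trigger list says about one trigger.
type effect struct {
	Row       string // rubric row affected; "" when the text names none
	Falsifies bool   // true only when the trigger falsifies the row outright
}

// triggerEffects mirrors the trigger list. T5 only prompts a re-evaluation
// of R5, and T6 names no rubric row, so neither flips a row by itself.
var triggerEffects = map[string]effect{
	"T1": {Row: "R1", Falsifies: true},
	"T2": {Row: "R2", Falsifies: true},
	"T3": {Row: "R3", Falsifies: true},
	"T4": {Row: "R4", Falsifies: true},
	"T5": {Row: "R5", Falsifies: false},
	"T6": {Row: "", Falsifies: false},
}

// rowsToFlip returns, in input order, the rubric rows to mark False for the
// fired triggers. Unknown triggers and non-falsifying triggers are skipped.
func rowsToFlip(fired []string) []string {
	var rows []string
	for _, t := range fired {
		if e, ok := triggerEffects[t]; ok && e.Falsifies {
			rows = append(rows, e.Row)
		}
	}
	return rows
}

func main() {
	// T2 falsifies R2; T5 still reopens #34 but flips no row automatically.
	fmt.Println(rowsToFlip([]string{"T2", "T5"}))
}
```

Every fired trigger still reopens WISHLIST #34; the lookup only answers which rows change to False in step (a).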