Alert To Runbook Map

Keep the Alerts table below synchronized with your PrometheusRule package. Each alert's annotations should link to the matching runbook page or heading. Use the Observability Change Checklist when adding, removing, or changing alerts, alert annotations, runbook links, dashboard links, or actionable Events.

Alerts

Alert	Primary runbook	First checks
`BloodravenOperatorDown`	Operator unavailable	Deployment, leader election, logs
`BloodravenNoWritableSite`	Emergency manual promotion or Total site loss	Active site, pod reachability, fencing
`BloodravenFailoverOccurred`	Failover	DNS, app writes, old primary state
`BloodravenDivergentTransactions`	Divergent old primary recovery	Events, GTIDs, old primary fenced
`BloodravenReplicationLagging`	Replication lag high	Lag metric, MySQL replica status
`BloodravenBackupStale`	Failed backup	Latest `MysqlBackup`, Job logs
`BloodravenBackupVerificationStale`	Backup Verification	Verification CRs, restore logs
`BloodravenPITRArchiveLagging`	Backup And Restore	Archiver metrics, object storage
`BloodravenDNSUpdateFailed`	DNS failover stuck	`DNSEndpoint`, external-dns logs
`BloodravenSplitBrainDetected`	Split-brain recovery	Writable sites, app traffic, GTIDs
`BloodravenDragonflySiteDown`	Dragonfly degraded	`status.dragonfly.sites[]`, Dragonfly pods, active Service endpoints
`BloodravenDragonflyPromotionFailed`	Dragonfly degraded	Planned-failover Dragonfly status, Events, Redis client impact
`BloodravenDragonflyManagerPanic`	Dragonfly degraded	Operator logs, panic counter, current `status.dragonfly`
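
One way to keep the table above synchronized with the PrometheusRule package is a small drift check in CI. The sketch below is illustrative only: `alert_names_from_rules` and `map_drift` are hypothetical helper names, and the rule document is assumed to be already parsed into a plain dict with the usual `spec.groups[].rules[]` shape of a PrometheusRule resource. How you load the manifest and extract alert names from this page is left to your tooling.

```python
# Hypothetical drift check between the runbook map and a PrometheusRule.
# Assumption: rule_doc is an already-parsed PrometheusRule manifest (a dict).


def alert_names_from_rules(rule_doc):
    """Collect alert names from spec.groups[].rules[].alert."""
    names = set()
    for group in rule_doc.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules use "record" instead of "alert"; skip them.
            if "alert" in rule:
                names.add(rule["alert"])
    return names


def map_drift(mapped_alerts, rule_doc):
    """Return (missing_from_map, missing_from_rules) as sorted lists."""
    defined = alert_names_from_rules(rule_doc)
    mapped = set(mapped_alerts)
    return sorted(defined - mapped), sorted(mapped - defined)
```

A non-empty first list means an alert exists without a row in the table; a non-empty second list means the table still names an alert that the rules no longer define.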

Kubernetes Events

Event category	Expected operator action	Runbook
Planned failover requested/started/completed	Drain writes, promote target, update DNS	Planned Failover
Emergency failover started/completed	Promote best candidate, taint old site, update DNS	Failover
Reclone started/completed/failed	Rebuild replica from active primary	Divergent old primary recovery
Backup started/completed/failed	Create/update `MysqlBackup` Job and status	Failed backup
Restore started/completed/failed	Gate bootstrap or restore workflow	Failed restore
Verification started/completed/failed	Restore backup into ephemeral MySQL	Backup Verification
DNS update created/failed	Write `DNSEndpoint` for external-dns	DNS failover stuck
Split-brain or recovery pending	Fence losers or wait for manual recovery	Split-brain recovery
Dragonfly promotion/sync/upgrade events	Preserve or restore cache/session continuity while MySQL remains authoritative	Dragonfly degraded

Minimum alert annotations

Each alert should include:

`summary` with the failing group and namespace.
`description` with immediate user impact.
`runbook_url` pointing to this docs site.
`dashboard_url` pointing to the relevant Grafana dashboard.

If an alert intentionally has no runbook or dashboard link, the change must include a specific operational rationale in the pull request. Do not use a bare N/A.
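
The annotation requirements above could be checked mechanically before review. This is a minimal sketch, not part of the operator: `annotation_problems`, `REQUIRED_ANNOTATIONS`, and `PLACEHOLDERS` are hypothetical names, the placeholder list is an assumption, and each rule is assumed to be a plain dict taken from a parsed PrometheusRule. A reported missing link is a prompt for the reviewer to confirm that the pull request carries a specific operational rationale, not an automatic rejection.

```python
# Hypothetical lint for the minimum alert annotations listed above.
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url", "dashboard_url")

# Assumed set of values treated as bare placeholders (compared case-insensitively).
PLACEHOLDERS = {"", "n/a", "na", "none", "tbd"}


def annotation_problems(rule):
    """Return human-readable problems for one alerting rule dict."""
    name = rule.get("alert", "<unnamed>")
    annotations = rule.get("annotations") or {}
    problems = []
    for key in REQUIRED_ANNOTATIONS:
        value = annotations.get(key)
        if value is None:
            problems.append(f"{name}: missing {key}")
        elif str(value).strip().lower() in PLACEHOLDERS:
            problems.append(f"{name}: {key} is a bare placeholder")
    return problems
```

For example, a rule whose `runbook_url` is `N/A` and which has no `dashboard_url` yields two problems, one per annotation, in the order the keys are listed in `REQUIRED_ANNOTATIONS`.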
