Alert To Runbook Map

[Figure: alert runbook map infographic]

Keep this table synchronized with your PrometheusRule package. Alert annotations should link to the matching page or heading. Use the Observability Change Checklist when adding, removing, or changing alerts, alert annotations, runbook links, dashboard links, or actionable Events.

Alerts

| Alert | Primary runbook | First checks |
| --- | --- | --- |
| BloodravenOperatorDown | Operator unavailable | Deployment, leader election, logs |
| BloodravenNoWritableSite | Emergency manual promotion or Total site loss | Active site, pod reachability, fencing |
| BloodravenFailoverOccurred | Failover | DNS, app writes, old primary state |
| BloodravenDivergentTransactions | Divergent old primary recovery | Events, GTIDs, old primary fenced |
| BloodravenReplicationLagging | Replication lag high | lag metric, MySQL replica status |
| BloodravenBackupStale | Failed backup | latest MysqlBackup, Job logs |
| BloodravenBackupVerificationStale | Backup Verification | verification CRs, restore logs |
| BloodravenPITRArchiveLagging | Backup And Restore | archiver metrics, object storage |
| BloodravenDNSUpdateFailed | DNS failover stuck | DNSEndpoint, external-dns logs |
| BloodravenSplitBrainDetected | Split-brain recovery | writable sites, app traffic, GTIDs |
| BloodravenDragonflySiteDown | Dragonfly degraded | status.dragonfly.sites[], Dragonfly pods, active Service endpoints |
| BloodravenDragonflyPromotionFailed | Dragonfly degraded | planned-failover Dragonfly status, Events, Redis client impact |
| BloodravenDragonflyManagerPanic | Dragonfly degraded | operator logs, panic counter, current status.dragonfly |
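
One way to keep the Alerts table and the PrometheusRule package from drifting apart is a small check in CI. The sketch below is illustrative only and is not part of the operator. It assumes the PrometheusRule manifest has already been parsed into a Python dict (for example by a YAML loader) and follows the Prometheus Operator layout `spec.groups[].rules[]`, where alerting rules carry an `alert` field and recording rules carry `record`. The function names and the alert `BloodravenExampleNewAlert` are hypothetical.

```python
# Alert names and primary runbooks copied from the Alerts table on this page.
DOCUMENTED_ALERTS = {
    "BloodravenOperatorDown": "Operator unavailable",
    "BloodravenNoWritableSite": "Emergency manual promotion or Total site loss",
    "BloodravenFailoverOccurred": "Failover",
    "BloodravenDivergentTransactions": "Divergent old primary recovery",
    "BloodravenReplicationLagging": "Replication lag high",
    "BloodravenBackupStale": "Failed backup",
    "BloodravenBackupVerificationStale": "Backup Verification",
    "BloodravenPITRArchiveLagging": "Backup And Restore",
    "BloodravenDNSUpdateFailed": "DNS failover stuck",
    "BloodravenSplitBrainDetected": "Split-brain recovery",
    "BloodravenDragonflySiteDown": "Dragonfly degraded",
    "BloodravenDragonflyPromotionFailed": "Dragonfly degraded",
    "BloodravenDragonflyManagerPanic": "Dragonfly degraded",
}


def alert_names(prometheus_rule: dict) -> list:
    """Collect alert names from a parsed PrometheusRule, skipping recording rules."""
    names = []
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if "alert" in rule:
                names.append(rule["alert"])
    return names


def diff_alerts(rule_alert_names):
    """Return (undocumented, stale).

    undocumented: alerts shipped in the rules but missing from the table.
    stale: alerts listed in the table but no longer shipped in the rules.
    """
    shipped = set(rule_alert_names)
    documented = set(DOCUMENTED_ALERTS)
    return sorted(shipped - documented), sorted(documented - shipped)


if __name__ == "__main__":
    # Hypothetical manifest: one documented alert, one recording rule, one new alert.
    manifest = {
        "spec": {
            "groups": [
                {
                    "name": "example",
                    "rules": [
                        {"alert": "BloodravenOperatorDown"},
                        {"record": "example:recording_rule"},
                        {"alert": "BloodravenExampleNewAlert"},
                    ],
                }
            ]
        }
    }
    undocumented, stale = diff_alerts(alert_names(manifest))
    print("undocumented:", undocumented)
    print("stale:", stale)
```

A non-empty result in either list is a prompt to update the table or the rules, following the Observability Change Checklist.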

Kubernetes Events

| Event category | Expected operator action | Runbook |
| --- | --- | --- |
| Planned failover requested/started/completed | Drain writes, promote target, update DNS | Planned Failover |
| Emergency failover started/completed | Promote best candidate, taint old site, update DNS | Failover |
| Reclone started/completed/failed | Rebuild replica from active primary | Divergent old primary recovery |
| Backup started/completed/failed | Create/update MysqlBackup Job and status | Failed backup |
| Restore started/completed/failed | Gate bootstrap or restore workflow | Failed restore |
| Verification started/completed/failed | Restore backup into ephemeral MySQL | Backup Verification |
| DNS update created/failed | Write DNSEndpoint for external-dns | DNS failover stuck |
| Split-brain or recovery pending | Fence losers or wait for manual recovery | Split-brain recovery |
| Dragonfly promotion/sync/upgrade events | Preserve or restore cache/session continuity while MySQL remains authoritative | Dragonfly degraded |

Minimum alert annotations

Each alert should include:

  • summary with the failing group and namespace.
  • description with immediate user impact.
  • runbook_url pointing to this docs site.
  • dashboard_url pointing to the relevant Grafana dashboard.

If an alert intentionally has no runbook or dashboard link, the change must include a specific operational rationale in the pull request. Do not use a bare N/A.
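
The annotation requirements above could be checked mechanically before review. The following is a minimal sketch, not an official tool. It assumes each alerting rule is available as a parsed dict with `alert` and `annotations` keys, as in a Prometheus rule file. The function name, the placeholder list, and the example URLs are hypothetical.

```python
# Annotation keys required by the "Minimum alert annotations" list.
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url", "dashboard_url")

# Values treated as "no real content". This set is an assumption; extend as needed.
PLACEHOLDERS = {"", "n/a", "na", "none", "-"}


def lint_rule(rule: dict) -> list:
    """Return a list of problems for one alerting rule; empty means it passes."""
    name = rule.get("alert", "<unnamed>")
    annotations = rule.get("annotations") or {}
    problems = []
    for key in REQUIRED_ANNOTATIONS:
        value = str(annotations.get(key, "")).strip()
        if value.lower() in PLACEHOLDERS:
            problems.append(
                f"{name}: annotation '{key}' is missing or a bare placeholder"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical rule: the URLs and text are illustrative, not real endpoints.
    rule = {
        "alert": "BloodravenBackupStale",
        "annotations": {
            "summary": "Backup is stale for group {{ $labels.group }} in {{ $labels.namespace }}",
            "description": "No recent successful backup; restore points are aging.",
            "runbook_url": "https://docs.example.com/runbooks/failed-backup",
            "dashboard_url": "N/A",
        },
    }
    for problem in lint_rule(rule):
        print(problem)
```

A check like this only flags missing or placeholder values. Whether an intentionally omitted link has an adequate operational rationale still has to be judged in the pull request.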