Troubleshooting

Use this page by symptom. If an alert fired, start with Alert To Runbook Map.

MysqlFailoverGroup not becoming Ready

Symptoms: Ready=False, Bootstrapping=True, MySQL pods exist but the group never settles.

Likely causes: node labels do not match taintNodeSelector, missing Secrets, StorageClass failure, failed clone, DNS CRD missing.

Inspect:

kubectl describe mysqlfailovergroup orders -n orders
kubectl get pods,pvc -n orders
kubectl get events -n orders --sort-by=.lastTimestamp
kubectl logs -n bloodraven deploy/bloodraven

Safe remediation: fix the missing dependency first, then let reconciliation continue. Do not delete PVCs unless this is a disposable bootstrap.
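
The Ready and Bootstrapping values in the symptoms read like standard Kubernetes conditions. As a minimal sketch, assuming the group publishes them under .status.conditions with the usual type, status, reason, and message fields (this page does not confirm that layout), a helper can print them one per line. The function name mfg_conditions is hypothetical.

```shell
# Hypothetical helper; assumes .status.conditions follows the usual
# Kubernetes condition layout (type, status, reason, message).
mfg_conditions() {
  # $1 = group name, $2 = namespace
  kubectl get mysqlfailovergroup "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

Usage: mfg_conditions orders orders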

MySQL pod stuck Pending

Symptoms: MySQL pod has no node, PVC unbound, or scheduler reports no matching nodes.

Likely causes: missing site labels, taints/tolerations mismatch, unavailable StorageClass, insufficient CPU or memory.

Inspect:

kubectl describe pod -n orders -l app.kubernetes.io/instance=orders
kubectl get nodes --show-labels
kubectl get pvc -n orders

Safe remediation: add required node labels, fix StorageClass, or lower resource requests after confirming capacity.
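
When a namespace holds many claims, it can help to list only the ones that are not Bound. The sketch below filters the kubectl get pvc listing on its STATUS column; the function name unbound_pvcs is hypothetical.

```shell
# Reads `kubectl get pvc --no-headers` output on stdin and prints every
# claim whose STATUS column (field 2) is not Bound.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 ": " $2 }'
}
```

Usage: kubectl get pvc -n orders --no-headers | unbound_pvcs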

See also: Placement Contract.

MySQL pod CrashLoopBackOff

Symptoms: pod restarts repeatedly; MySQL or sidecar logs show bootstrap, TLS, or credential errors.

Likely causes: invalid MySQL config, missing TLS files, wrong Secret keys, incompatible image, failed clone restart.

Inspect:

kubectl logs -n orders pod/<mysql-pod> -c mysql --previous
kubectl logs -n orders pod/<mysql-pod> -c bloodraven-sidecar --previous
kubectl describe pod -n orders <mysql-pod>

Safe remediation: restore the last known-good image or config, or fix the Secret or TLS reference. If the clone completed and MySQL exited so the container restarts, Kubernetes may recover the pod automatically.
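
To see which container is restarting and why it last exited, the pod status can be summarized in one line per container. This is a sketch built on the core Pod status fields (restartCount and lastState.terminated); the function name container_restarts is hypothetical.

```shell
# Prints name, restart count, and last termination reason and exit code
# for each container in a pod. Containers that never terminated show
# empty reason and exit code columns.
container_restarts() {
  # $1 = pod name, $2 = namespace
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Usage: container_restarts <mysql-pod> orders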

See also: Credentials And TLS, Failure Mode Matrix.

Site marked unreachable

Symptoms: status.sites[].reachable=false; failover may start once the failover threshold is reached.

Likely causes: MySQL down, network policy block, node outage, Service endpoint missing, TLS or credential failure.

Inspect:

kubectl get mysqlfailovergroup orders -n orders -o yaml
kubectl get endpoints -n orders
kubectl get networkpolicy -n orders

Safe remediation: restore connectivity or let failover complete. Avoid manual promotion until you know which site accepted the last writes.
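
Rather than reading the full YAML, the reachable flags can be reduced to a list of affected sites. The sketch below filters tab-separated "site, reachable" pairs. It assumes each status.sites[] entry also carries a name field, which this page does not confirm; both that field and the function name unreachable_sites are hypothetical.

```shell
# Reads "<site>\t<reachable>" lines on stdin and prints the sites whose
# reachable value is false.
unreachable_sites() {
  awk -F '\t' '$2 == "false" { print $1 }'
}

# Example pipeline (the .name field is an assumption):
# kubectl get mysqlfailovergroup orders -n orders \
#   -o jsonpath='{range .status.sites[*]}{.name}{"\t"}{.reachable}{"\n"}{end}' \
#   | unreachable_sites
```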

See also: Failover, Network Partitions.

DNS not changing after failover

Symptoms: status.activeSite changed but orders.az.example.com still resolves to the old IP.

Likely causes: DNSEndpoint CRD missing, external-dns not watching the namespace, provider credentials failure, TTL/cache delay.

Inspect:

kubectl get dnsendpoint -A | grep orders
kubectl logs -n external-dns deploy/external-dns
dig orders.az.example.com

Safe remediation: fix external-dns and wait for TTL expiry. If apps are writing to the old site, prioritize fencing and app routing.
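
While waiting for TTL expiry, a small check can tell you whether the record already points at the new site. This sketch compares dig +short output with an IP you supply; the function name dns_points_to is hypothetical, and the expected IP has to come from your own knowledge of the new active site.

```shell
# Succeeds when one of the addresses returned for the hostname equals the
# expected IP; otherwise prints a message and returns 1.
dns_points_to() {
  # $1 = hostname, $2 = IP expected for the new active site
  if dig +short "$1" | grep -Fxq "$2"; then
    echo "$1 resolves to $2"
  else
    echo "$1 does not resolve to $2 yet"
    return 1
  fi
}
```

Usage: dns_points_to orders.az.example.com <new-site-ip>

The result reflects only the resolver used by the machine running dig. Application pods may still hold a cached answer.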

See also: Runbooks, App Integration.

Dragonfly degraded or active Service empty

Symptoms: status.dragonfly.phase=Degraded, bloodraven_dragonfly_site_up==0, orders-dragonfly has no endpoints, or Events mention DragonflyPromotionFailed.

Likely causes: Dragonfly pod down, NetworkPolicy blocking port 6379 or admin port 9999, auth Secret mismatch, target replica syncing/loading, stale master kept out of the active Service, or snapshot restore in progress.

Inspect:

kubectl get mysqlfailovergroup orders -n orders \
  -o jsonpath='{.status.dragonfly}{"\n"}'
kubectl get pod -n orders -l app.kubernetes.io/name=dragonfly --show-labels
kubectl get endpoints orders-dragonfly -n orders
kubectl get events -n orders --sort-by=.lastTimestamp | grep -i dragonfly

Safe remediation: restore the failed Dragonfly pod or network path first. Do not force a stale master back into the active Service unless you have decided its cache/session contents are authoritative. MySQL remains authoritative for durable data.
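
An empty active Service can be confirmed without reading the whole Endpoints object. The sketch below prints the ready addresses behind a Service, using the standard Endpoints subsets layout; the function name service_endpoint_ips is hypothetical.

```shell
# Prints the ready endpoint IPs of a Service, or reports that there are
# none and returns 1.
service_endpoint_ips() {
  # $1 = Service name, $2 = namespace
  ips=$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$ips" ]; then
    echo "$1 has no ready endpoints"
    return 1
  fi
  echo "$ips"
}
```

Usage: service_endpoint_ips orders-dragonfly orders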

See also: Runbooks, App Integration, Monitoring.

App still writing to old site

Symptoms: application errors or writes appear on the old primary after failover.

Likely causes: DNS cache TTL too long, connection pool did not reconnect, app pinned to a site-specific Service, old site not fenced.

Inspect:

kubectl get mysqlfailovergroup orders -n orders -o jsonpath='{.status.activeSite}{"\n"}'
dig orders.az.example.com
kubectl get pods -n orders -o wide

Safe remediation: restart or drain affected app pods, lower DNS/client cache TTL, confirm old site is read-only.
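
One way to confirm the old site is read-only is to ask MySQL for its read_only and super_read_only variables. The sketch below checks the output of that query. Whether Bloodraven fences a site by setting these variables is an assumption, and the function name is_read_only is hypothetical.

```shell
# Reads the tab-separated result of
#   SELECT @@global.read_only, @@global.super_read_only
# on stdin and succeeds only when both values are 1.
is_read_only() {
  awk '$1 == 1 && $2 == 1 { ok = 1 } END { exit (ok ? 0 : 1) }'
}

# Example pipeline:
# kubectl exec -n orders <old-primary-pod> -c mysql -- \
#   mysql -N -B -e 'SELECT @@global.read_only, @@global.super_read_only' \
#   | is_read_only && echo "old site is read-only"
```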

See also: App Integration, Network Partitions.

Replication lag high

Symptoms: secondsBehindSource exceeds policy, alerts fire, replica may not be chosen for backups.

Likely causes: under-sized replica, slow network/storage, long transaction, backup load, SQL thread stopped.

Inspect:

kubectl get mysqlfailovergroup orders -n orders -o yaml
kubectl exec -n orders <replica-pod> -c mysql -- mysql -e 'SHOW REPLICA STATUS\G'

Safe remediation: reduce write load, investigate storage/network, restart replication only after recording the error.
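
To compare the replica's own view of lag with your policy, the Seconds_Behind_Source field can be pulled out of the SHOW REPLICA STATUS output. This is a sketch; the function name lag_exceeds is hypothetical and the 30-second default is a placeholder, not a Bloodraven setting.

```shell
# Reads `SHOW REPLICA STATUS\G` output on stdin.
# Exit status: 0 = within limit, 1 = above limit, 2 = lag unknown
# (MySQL reports NULL when replication is not running).
lag_exceeds() {
  # $1 = limit in seconds (placeholder default: 30)
  awk -v limit="${1:-30}" '
    $1 == "Seconds_Behind_Source:" { lag = $2 }
    END {
      if (lag == "" || lag == "NULL") { print "lag unknown"; exit 2 }
      if (lag + 0 > limit + 0) { print "lag " lag "s exceeds " limit "s"; exit 1 }
      print "lag " lag "s is within " limit "s"
    }'
}
```

Usage: kubectl exec -n orders <replica-pod> -c mysql -- mysql -e 'SHOW REPLICA STATUS\G' | lag_exceeds 30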

See also: Monitoring, Durability And RPO.

Replication stopped

Symptoms: replica IO or SQL thread down, replicating=false.

Likely causes: credential drift, network failure, missing binlog, divergent transaction, MySQL error.

Inspect:

kubectl exec -n orders <replica-pod> -c mysql -- mysql -e 'SHOW REPLICA STATUS\G'
kubectl describe mysqlfailovergroup orders -n orders

Safe remediation: fix the underlying error. If GTIDs diverged, use the divergent old primary recovery runbook rather than skipping transactions.
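
The full SHOW REPLICA STATUS output is long. A sketch like the following keeps only the thread states and the last recorded errors, which is usually enough to tell a credential or network problem (IO thread) from a divergent transaction (SQL thread). The function name replica_summary is hypothetical; the field names are the standard MySQL 8.0.22+ ones.

```shell
# Reads `SHOW REPLICA STATUS\G` output on stdin and prints only the
# thread states, the lag, and the last IO and SQL errors.
replica_summary() {
  grep -E '^ *(Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source|Last_IO_Error|Last_SQL_Error):' |
    sed -e 's/^ *//'
}
```

Usage: kubectl exec -n orders <replica-pod> -c mysql -- mysql -e 'SHOW REPLICA STATUS\G' | replica_summary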

See also: Runbooks.

Divergent transactions detected

Symptoms: old primary cannot rejoin; Events or logs mention divergent GTIDs.

Likely causes: old primary accepted writes that never reached the promoted site.

Inspect:

kubectl describe mysqlfailovergroup orders -n orders
kubectl get events -n orders --sort-by=.lastTimestamp | grep -i divergent

Safe remediation: preserve evidence, decide whether to extract missing data manually, then reclone the divergent site from the active primary.
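
Deciding what to extract is easier once you know which transactions exist only on the old primary. MySQL's GTID_SUBTRACT(set1, set2) returns the GTIDs in set1 that are not in set2. The sketch below only builds that query from two @@global.gtid_executed values you collected yourself; the function name errant_gtid_query is hypothetical.

```shell
# Prints a query that lists GTIDs executed on the old primary but not on
# the active primary. It does not contact any server.
errant_gtid_query() {
  # $1 = gtid_executed from the old primary
  # $2 = gtid_executed from the active primary
  printf "SELECT GTID_SUBTRACT('%s', '%s') AS errant_gtids;\n" "$1" "$2"
}
```

Run the printed statement with the mysql client on either server and save the result alongside the rest of the evidence before recloning.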

See also: Runbooks, Durability And RPO.

Backup Job failing

Symptoms: MysqlBackup.status.phase=Failed, backup Job failed.

Likely causes: S3 credentials, bucket policy, PVC full, MySQL credentials, staging volume eviction, source replica lag.

Inspect:

kubectl describe mysqlbackup -n orders <backup-name>
kubectl logs -n orders job/<backup-job-name>
kubectl get events -n orders --sort-by=.lastTimestamp

Safe remediation: fix credentials/storage first, then create a new manual backup. Do not lower retention until at least one newer backup succeeds.
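
Before digging through logs, the Job's own Failed condition often says whether it ran out of retries or hit a deadline. This sketch reads the standard batch/v1 Job status; the function name job_failure_reason is hypothetical.

```shell
# Prints the reason and message of a Job's Failed condition, if any.
job_failure_reason() {
  # $1 = Job name, $2 = namespace
  kubectl get job "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}{"\t"}{.status.conditions[?(@.type=="Failed")].message}{"\n"}'
}
```

Usage: job_failure_reason <backup-job-name> orders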

See also: S3 Backups, PVC Backups.

Restore Job failing

Symptoms: restore status is failed, restore Job exits non-zero.

Likely causes: wrong artifact path, missing decryption passphrase, S3/PVC access, incompatible MySQL shell/server version.

Inspect:

kubectl describe mysqlfailovergroup orders-restore -n orders
kubectl logs -n orders job/<restore-job-name>
kubectl get pods,pvc -n orders

Safe remediation: verify source artifact and credentials, then recreate the recovery group if bootstrap state is partial. Preserve failed restore logs.
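
Job pods and their logs disappear when the Job is deleted, so it is worth copying the logs to a file before recreating anything. A minimal sketch, with a hypothetical function name save_job_logs and an output directory of your choosing:

```shell
# Saves a Job's logs to a UTC-timestamped file and prints the file path.
save_job_logs() {
  # $1 = namespace, $2 = Job name, $3 = output directory (default: .)
  out="${3:-.}/$2-$(date -u +%Y%m%dT%H%M%SZ).log"
  kubectl logs -n "$1" "job/$2" > "$out" && echo "$out"
}
```

Usage: save_job_logs orders <restore-job-name> ./evidence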

See also: Backup And Restore, Backup Encryption.

Grafana dashboards show no data

Symptoms: dashboards import but panels are empty.

Likely causes: Prometheus is not scraping Bloodraven, wrong datasource variable, dashboard ConfigMaps in an unwatched namespace.

Inspect:

kubectl get servicemonitor -n bloodraven
kubectl port-forward -n bloodraven deploy/bloodraven 8080:8080
curl http://localhost:8080/metrics | grep '^bloodraven_'

Safe remediation: fix scraping first, then dashboard datasource selection.
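
If the metrics endpoint responds, listing the distinct Bloodraven metric names makes it easy to compare what the operator exports with what a dashboard panel queries. This is a sketch over the Prometheus text format; the function name bloodraven_metric_names is hypothetical.

```shell
# Reads Prometheus text-format metrics on stdin and prints the unique
# names of metrics that start with bloodraven_, dropping labels and values.
bloodraven_metric_names() {
  grep '^bloodraven_' | sed 's/[{ ].*$//' | sort -u
}
```

Usage: curl -s http://localhost:8080/metrics | bloodraven_metric_names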

See also: Prometheus Setup, Grafana Dashboards.

Operator logs show permission denied

Symptoms: reconcile errors mention forbidden Kubernetes API operations.

Likely causes: RBAC drift, namespace-scoped install missing permissions, Helm chart RBAC not updated with generated RBAC.

Inspect:

kubectl auth can-i list mysqlfailovergroups.shipstream.io --as=system:serviceaccount:bloodraven:bloodraven
kubectl describe clusterrole bloodraven
kubectl logs -n bloodraven deploy/bloodraven

Safe remediation: restore chart RBAC from the current release. Do not grant wildcard permissions unless your platform policy explicitly accepts it.
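
Checking one verb at a time is slow when several permissions may have drifted. The sketch below loops kubectl auth can-i over a list of verbs for one resource. The function name rbac_check is hypothetical, and the verbs in the usage line are placeholders; take the real list from the chart's RBAC templates.

```shell
# Prints "<verb> <resource>: yes|no" for each verb, impersonating the
# given ServiceAccount. kubectl exits non-zero on "no", so the status is
# ignored and only the printed answer is used.
rbac_check() {
  # $1 = resource, $2 = ServiceAccount as <namespace>:<name>, rest = verbs
  resource=$1
  sa=$2
  shift 2
  for verb in "$@"; do
    answer=$(kubectl auth can-i "$verb" "$resource" \
      --as="system:serviceaccount:$sa" 2>/dev/null) || true
    printf '%s %s: %s\n' "$verb" "$resource" "${answer:-error}"
  done
}
```

Usage: rbac_check mysqlfailovergroups.shipstream.io bloodraven:bloodraven get list watch update patch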

See also: Security Model, Production Install.
