Skip to main content

Troubleshooting

troubleshooting infographic

Use this page by symptom. If an alert fired, start with Alert To Runbook Map.

MysqlFailoverGroup not becoming Ready

Symptoms: Ready=False, Bootstrapping=True, MySQL pods exist but the group never settles.

Likely causes: node labels do not match taintNodeSelector, missing Secrets, StorageClass failure, failed clone, DNS CRD missing.

Inspect:

kubectl describe mysqlfailovergroup orders -n orders
kubectl get pods,pvc,events -n orders --sort-by=.lastTimestamp
kubectl logs -n bloodraven deploy/bloodraven

Safe remediation: fix the missing dependency first, then let reconciliation continue. Do not delete PVCs unless this is a disposable bootstrap.

See also: Getting Started, Placement Contract, CRD Reference.

MySQL pod stuck Pending

Symptoms: MySQL pod has no node, PVC unbound, or scheduler reports no matching nodes.

Likely causes: missing site labels, taints/tolerations mismatch, unavailable StorageClass, insufficient CPU or memory.

Inspect:

kubectl describe pod -n orders -l app.kubernetes.io/instance=orders
kubectl get nodes --show-labels
kubectl get pvc -n orders

Safe remediation: add required node labels, fix StorageClass, or lower resource requests after confirming capacity.

See also: Placement Contract.

MySQL pod CrashLoopBackOff

Symptoms: pod restarts repeatedly; MySQL or sidecar logs show bootstrap, TLS, or credential errors.

Likely causes: invalid MySQL config, missing TLS files, wrong Secret keys, incompatible image, failed clone restart.

Inspect:

kubectl logs -n orders pod/<mysql-pod> -c mysql --previous
kubectl logs -n orders pod/<mysql-pod> -c bloodraven-sidecar --previous
kubectl describe pod -n orders <mysql-pod>

Safe remediation: restore the last known-good image/config or fix the Secret/TLS reference. If clone completed and MySQL exited for a container restart, Kubernetes may recover it automatically.

See also: Credentials And TLS, Failure Mode Matrix.

Site marked unreachable

Symptoms: status.sites[].reachable=false, failover may start after threshold.

Likely causes: MySQL down, network policy block, node outage, Service endpoint missing, TLS or credential failure.

Inspect:

kubectl get mysqlfailovergroup orders -n orders -o yaml
kubectl get endpoints -n orders
kubectl get networkpolicy -n orders

Safe remediation: restore connectivity or let failover complete. Avoid manual promotion until you know which site accepted the last writes.

See also: Failover, Network Partitions.

DNS not changing after failover

Symptoms: status.activeSite changed but orders.az.example.com still resolves to the old IP.

Likely causes: DNSEndpoint CRD missing, external-dns not watching the namespace, provider credentials failure, TTL/cache delay.

Inspect:

kubectl get dnsendpoint -A | grep orders
kubectl logs -n external-dns deploy/external-dns
dig orders.az.example.com

Safe remediation: fix external-dns and wait for TTL expiry. If apps are writing to the old site, prioritize fencing and app routing.

See also: Runbooks, App Integration.

Dragonfly degraded or active Service empty

Symptoms: status.dragonfly.phase=Degraded, bloodraven_dragonfly_site_up==0, orders-dragonfly has no endpoints, or Events mention DragonflyPromotionFailed.

Likely causes: Dragonfly pod down, NetworkPolicy blocking port 6379 or admin port 9999, auth Secret mismatch, target replica syncing/loading, stale master kept out of the active Service, or snapshot restore in progress.

Inspect:

kubectl get mysqlfailovergroup orders -n orders \
-o jsonpath='{.status.dragonfly}{"\n"}'
kubectl get pod -n orders -l app.kubernetes.io/name=dragonfly --show-labels
kubectl get endpoints orders-dragonfly -n orders
kubectl get events -n orders --sort-by=.lastTimestamp | grep -i dragonfly

Safe remediation: restore the failed Dragonfly pod or network path first. Do not force a stale master back into the active Service unless you have decided its cache/session contents are authoritative. MySQL remains authoritative for durable data.

See also: Runbooks, App Integration, Monitoring.

App still writing to old site

Symptoms: application errors or writes appear on the old primary after failover.

Likely causes: DNS cache too long, connection pool did not reconnect, app pinned a site Service, old site not fenced.

Inspect:

kubectl get mysqlfailovergroup orders -n orders -o jsonpath='{.status.activeSite}{"\n"}'
dig orders.az.example.com
kubectl get pods -n orders -o wide

Safe remediation: restart or drain affected app pods, lower DNS/client cache TTL, confirm old site is read-only.

See also: App Integration, Network Partitions.

Replication lag high

Symptoms: secondsBehindSource exceeds policy, alerts fire, replica may not be chosen for backups.

Likely causes: under-sized replica, slow network/storage, long transaction, backup load, SQL thread stopped.

Inspect:

kubectl get mysqlfailovergroup orders -n orders -o yaml
kubectl exec -n orders <replica-pod> -c mysql -- mysql -e 'SHOW REPLICA STATUS\G'

Safe remediation: reduce write load, investigate storage/network, restart replication only after recording the error.

See also: Monitoring, Durability And RPO.

Replication stopped

Symptoms: replica IO or SQL thread down, replicating=false.

Likely causes: credential drift, network failure, missing binlog, divergent transaction, MySQL error.

Inspect:

kubectl exec -n orders <replica-pod> -c mysql -- mysql -e 'SHOW REPLICA STATUS\G'
kubectl describe mysqlfailovergroup orders -n orders

Safe remediation: fix the underlying error. If GTIDs diverged, use the divergent old primary recovery runbook rather than skipping transactions.

See also: Runbooks.

Divergent transactions detected

Symptoms: old primary cannot rejoin; Events or logs mention divergent GTIDs.

Likely causes: old primary accepted writes that never reached the promoted site.

Inspect:

kubectl describe mysqlfailovergroup orders -n orders
kubectl get events -n orders --sort-by=.lastTimestamp | grep -i divergent

Safe remediation: preserve evidence, decide whether to extract missing data manually, then reclone the divergent site from the active primary.

See also: Runbooks, Durability And RPO.

Backup Job failing

Symptoms: MysqlBackup.status.phase=Failed, backup Job failed.

Likely causes: S3 credentials, bucket policy, PVC full, MySQL credentials, staging volume eviction, source replica lag.

Inspect:

kubectl describe mysqlbackup -n orders <backup-name>
kubectl logs -n orders job/<backup-job-name>
kubectl get events -n orders --sort-by=.lastTimestamp

Safe remediation: fix credentials/storage first, then create a new manual backup. Do not lower retention until at least one newer backup succeeds.

See also: S3 Backups, PVC Backups.

Restore Job failing

Symptoms: restore status is failed, restore Job exits non-zero.

Likely causes: wrong artifact path, missing decryption passphrase, S3/PVC access, incompatible MySQL shell/server version.

Inspect:

kubectl describe mysqlfailovergroup orders-restore -n orders
kubectl logs -n orders job/<restore-job-name>
kubectl get pods,pvc -n orders

Safe remediation: verify source artifact and credentials, then recreate the recovery group if bootstrap state is partial. Preserve failed restore logs.

See also: Backup And Restore, Backup Encryption.

Grafana dashboards show no data

Symptoms: dashboards import but panels are empty.

Likely causes: Prometheus is not scraping Bloodraven, wrong datasource variable, dashboard ConfigMaps in an unwatched namespace.

Inspect:

kubectl get servicemonitor -n bloodraven
kubectl port-forward -n bloodraven deploy/bloodraven 8080:8080
curl http://localhost:8080/metrics | grep '^bloodraven_'

Safe remediation: fix scraping first, then dashboard datasource selection.

See also: Prometheus Setup, Grafana Dashboards.

Operator logs show permission denied

Symptoms: reconcile errors mention forbidden Kubernetes API operations.

Likely causes: RBAC drift, namespace-scoped install missing permissions, Helm chart RBAC not updated with generated RBAC.

Inspect:

kubectl auth can-i list mysqlfailovergroups.shipstream.io --as=system:serviceaccount:bloodraven:bloodraven
kubectl describe clusterrole bloodraven
kubectl logs -n bloodraven deploy/bloodraven

Safe remediation: restore chart RBAC from the current release. Do not grant wildcard permissions unless your platform policy explicitly accepts it.

See also: Security Model, Production Install.