Database Failover Runbook
Overview
This document describes the procedure for failing over from the primary PostgreSQL database to the read replica in case of primary failure.
Architecture
┌─────────────┐
│ Clients │
└──────┬──────┘
│
┌──────▼──────┐
│ PgBouncer │
└──────┬──────┘
│
┌────────────┴────────────┐
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Primary │◄──────────│ Replica │
│ (writer) │ streaming │ (standby) │
└─────────────┘ └──────────────┘
Prerequisites
- PostgreSQL streaming replication configured between primary and replica
scripts/failover_replica.shavailable on the serverscripts/monitor_replication_lag.pyrunning as a cron job or systemd service- Alerts configured in Prometheus (
prometheus_alerts.yml)
Monitoring
Check Replication Status
# Check from primary
docker exec db psql -U buywhere -d catalog -c "SELECT * FROM pg_stat_replication;"
# Check from replica
docker exec db_replica psql -U buywhere -d catalog -c "SELECT pg_is_in_recovery();"
# Check lag
docker exec db_replica psql -U buywhere -d catalog -c "SELECT now() - pg_last_xact_replay_timestamp();"
Run Monitoring Script
# One-time check
python scripts/monitor_replication_lag.py --once --replica-host db_replica
# Continuous monitoring
python scripts/monitor_replication_lag.py --replica-host db_replica --interval 30
Failover Procedure
Step 1: Verify Primary Failure
Confirm primary is truly unavailable:
# Try to connect to primary
pg_isready -h db -p 5432 -U buywhere
# Check API health
curl https://api.buywhere.ai/health
Step 2: Check Replica Health
Before promoting, verify replica is healthy and caught up:
# Check replication lag (should be < 5 seconds)
./scripts/failover_replica.sh --check-only
# Or manually
docker exec db_replica psql -U buywhere -d catalog -c \
"SELECT now() - pg_last_xact_replay_timestamp() AS lag;"
If lag is too high (> 30 seconds), investigate before proceeding.
Step 3: Promote Replica
Execute the failover:
# Promote replica to primary
./scripts/failover_replica.sh --promote
This script will:
- Verify replica health
- Stop writes to old primary (if reachable)
- Promote replica to primary
- Update DATABASE_URL configuration
- Restart API services
Step 4: Verify Failover
Check that services are working:
# Verify API is up
curl https://api.buywhere.ai/health
# Check logs
docker-compose -f docker-compose.prod.yml logs api --tail=50
# Verify database is accepting writes
docker exec db_replica psql -U buywhere -d catalog -c \
"CREATE TABLE IF NOT EXISTS failover_test (id SERIAL);" && \
docker exec db_replica psql -U buywhere -d catalog -c \
"DROP TABLE failover_test;"
Step 5: Notify
- Update status page if applicable
- Notify team in Slack/Teams
- Create incident report
Post-Failover Actions
Rebuild Old Primary as New Replica
Once the old primary is available again, rebuild it as a replica:
# On the old primary, reset PostgreSQL
docker-compose -f docker-compose.prod.yml stop db
docker volume rm buywhere-api_postgres_data
docker-compose -f docker-compose.prod.yml up -d db
# Wait for initialization, then configure as replica
# (The setup_replica.sh script will handle this automatically)
Update DNS/Configuration
Ensure all services are using the correct DATABASE_URL:
# Check current primary
docker exec db_replica psql -U buywhere -d catalog -c \
"SELECT pg_is_in_recovery();"
# Verify environment variables
docker-compose -f docker-compose.prod.yml exec api env | grep DATABASE
Rollback Procedure
If failover fails or issues are discovered:
- Do not attempt to restore old primary as primary without verifying replication
- Check API error logs:
docker-compose logs api - Verify replica promotion:
docker exec db_replica psql -c "SELECT pg_is_in_recovery();" - If needed, contact on-call DBA
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| Replication lag | > 2s | > 5s |
| Replica connection | down | down |
| Disk usage (replica) | > 80% | > 90% |
Contacts
- Primary on-call: See PagerDuty schedule
- DBA team: dba@buywhere.ai
- Infrastructure: infra@buywhere.ai