Database Failover Runbook

Overview

This document describes the procedure for failing over from the primary PostgreSQL database to the read replica in case of primary failure.

Architecture

                    ┌─────────────┐
                    │   Clients   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  PgBouncer  │
                    └──────┬──────┘
                           │
              ┌────────────┴────────────┐
              │                         │
       ┌──────▼──────┐           ┌──────▼──────┐
       │   Primary   │◄──────────│   Replica    │
       │  (writer)    │ streaming │  (standby)   │
       └─────────────┘           └──────────────┘

Prerequisites

  • PostgreSQL streaming replication configured between primary and replica
  • scripts/failover_replica.sh available on the server
  • scripts/monitor_replication_lag.py running as a cron job or systemd service
  • Alerts configured in Prometheus (prometheus_alerts.yml)

Monitoring

Check Replication Status

# Check from primary
docker exec db psql -U buywhere -d catalog -c "SELECT * FROM pg_stat_replication;"

# Check from replica
docker exec db_replica psql -U buywhere -d catalog -c "SELECT pg_is_in_recovery();"

# Check lag
docker exec db_replica psql -U buywhere -d catalog -c "SELECT now() - pg_last_xact_replay_timestamp();"

Run Monitoring Script

# One-time check
python scripts/monitor_replication_lag.py --once --replica-host db_replica

# Continuous monitoring
python scripts/monitor_replication_lag.py --replica-host db_replica --interval 30

Failover Procedure

Step 1: Verify Primary Failure

Confirm primary is truly unavailable:

# Try to connect to primary
pg_isready -h db -p 5432 -U buywhere

# Check API health
curl https://api.buywhere.ai/health

Step 2: Check Replica Health

Before promoting, verify replica is healthy and caught up:

# Check replication lag (should be < 5 seconds)
./scripts/failover_replica.sh --check-only

# Or manually
docker exec db_replica psql -U buywhere -d catalog -c \
  "SELECT now() - pg_last_xact_replay_timestamp() AS lag;"

If lag is too high (> 30 seconds), investigate before proceeding.

Step 3: Promote Replica

Execute the failover:

# Promote replica to primary
./scripts/failover_replica.sh --promote

This script will:

  1. Verify replica health
  2. Stop writes to old primary (if reachable)
  3. Promote replica to primary
  4. Update DATABASE_URL configuration
  5. Restart API services

Step 4: Verify Failover

Check that services are working:

# Verify API is up
curl https://api.buywhere.ai/health

# Check logs
docker-compose -f docker-compose.prod.yml logs api --tail=50

# Verify database is accepting writes
docker exec db_replica psql -U buywhere -d catalog -c \
  "CREATE TABLE IF NOT EXISTS failover_test (id SERIAL);" && \
  docker exec db_replica psql -U buywhere -d catalog -c \
  "DROP TABLE failover_test;"

Step 5: Notify

  • Update status page if applicable
  • Notify team in Slack/Teams
  • Create incident report

Post-Failover Actions

Rebuild Old Primary as New Replica

Once the old primary is available again, rebuild it as a replica:

# On the old primary, reset PostgreSQL
docker-compose -f docker-compose.prod.yml stop db
docker volume rm buywhere-api_postgres_data
docker-compose -f docker-compose.prod.yml up -d db

# Wait for initialization, then configure as replica
# (The setup_replica.sh script will handle this automatically)

Update DNS/Configuration

Ensure all services are using the correct DATABASE_URL:

# Check current primary
docker exec db_replica psql -U buywhere -d catalog -c \
  "SELECT pg_is_in_recovery();"

# Verify environment variables
docker-compose -f docker-compose.prod.yml exec api env | grep DATABASE

Rollback Procedure

If failover fails or issues are discovered:

  1. Do not attempt to restore old primary as primary without verifying replication
  2. Check API error logs: docker-compose logs api
  3. Verify replica promotion: docker exec db_replica psql -c "SELECT pg_is_in_recovery();"
  4. If needed, contact on-call DBA

Alert Thresholds

MetricWarningCritical
Replication lag> 2s> 5s
Replica connectiondowndown
Disk usage (replica)> 80%> 90%

Contacts