
BuyWhere Production Disaster Recovery Runbook

  • Document Version: 1.0
  • Last Updated: 2026-04-18
  • Owner: Ops Team
  • Classification: Internal - Confidential


Quick Reference

| Scenario | RTO | RPO | Primary Action |
| --- | --- | --- | --- |
| DB Primary Failure | < 5 min | < 1 hour | Promote replica via failover_replica.sh --promote |
| DB Replica Failure | < 30 min | N/A | Check replication, recreate replica if needed |
| Complete DC Loss | < 4 hours | < 24 hours | Restore from latest backup, redeploy infrastructure |
| API Service Down | < 15 min | N/A | Restart container, scale up replicas |
| Redis Failure | < 10 min | < 1 hour | Restart Redis, warm cache from scrapers |
| Data Corruption | < 2 hours | < 1 hour | Restore from backup, replay WAL if available |

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Load Balancer                             │
│                    (api.buywhere.ai)                             │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐        ┌─────▼─────┐       ┌─────▼─────┐
    │  API    │        │    API    │       │   API     │
    │ Instance│        │  Instance │       │  Instance │
    └────┬────┘        └─────┬─────┘       └─────┬─────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐        ┌─────▼─────┐       ┌─────▼─────┐
    │PgBouncer│        │   Redis   │       │   MCP     │
    │(pooler) │        │  (cache)  │       │ Service   │
    └────┬────┘        └───────────┘       └───────────┘
         │
    ┌────▼────┐
    │ Primary │
    │   DB    │◄─────── Streaming Replication ───────┐
    └────┬────┘                                      │
         │                                      ┌────▼────┐
         │                                      │ Replica │
         │                                      │   DB    │
         │                                      └─────────┘
         │
    ┌────▼────────────────────────────────────────────────────┐
    │              Backup Volume (hourly/daily/weekly)        │
    └─────────────────────────────────────────────────────────┘

Components

| Component | Version | Purpose | Failover |
| --- | --- | --- | --- |
| API Service | FastAPI | Core application | Restart / Scale |
| PostgreSQL | 16-alpine | Primary database | Promote replica |
| PgBouncer | latest | Connection pooling | Restart |
| Redis | 7-alpine | Caching layer | Restart, warm cache |
| MCP Service | latest | Model Context Protocol | Restart |

Backup Strategy

Current Implementation

Backup Schedule:
├── Hourly (24 retained)   - /var/backups/buywhere/hourly/
├── Daily (7 retained)     - /var/backups/buywhere/daily/
├── Weekly (4 retained)    - /var/backups/buywhere/weekly/
└── WAL Archive            - /var/backups/buywhere/wal/ (if configured)
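The retention counts above can be enforced with a small sweep. This is an illustrative sketch only (the real policy lives in backup.sh); the paths, file pattern, and keep-counts are assumptions:

```shell
# Illustrative retention sweep: keep the N newest backups in a tier
# directory, delete the rest. Not the real backup.sh logic.
prune_backups() {  # prune_backups <dir> <keep_n>
  ls -t "$1"/*.backup.gz 2>/dev/null | tail -n +"$(( $2 + 1 ))" | xargs -r rm -f
}

# Example: the hourly tier keeps 24 backups
# prune_backups /var/backups/buywhere/hourly 24
```

Note that piping ls into xargs breaks on filenames containing whitespace; a production script should use find with -print0 instead.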

Verification

  • Automated Verification: Every 7 days via verify_backup.sh
  • Prometheus Metrics: /var/lib/node_exporter/textfile_collector/backup_verification.prom
  • Alerting: BackupVerificationFailed, BackupMissing, BackupTooOld
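As a sketch of the textfile-collector pattern used here, a verification run might publish its result like this. The metric name matches the .prom file above, but the gzip-only check is illustrative; the real checks live in verify_backup.sh:

```shell
# Sketch: check that a gzipped dump is structurally readable and publish
# the result as a Prometheus textfile metric. Not verify_backup.sh itself.
report_verification() {  # report_verification <backup.gz> <prom_file>
  status=0
  gunzip -t "$1" 2>/dev/null && status=1   # archive readable at all?

  tmp="$(mktemp)"                          # write atomically so node_exporter
  {                                        # never scrapes a partial file
    echo "# HELP backup_verification_success 1 if the last backup verified OK"
    echo "# TYPE backup_verification_success gauge"
    echo "backup_verification_success $status"
  } > "$tmp"
  mv "$tmp" "$2"
}
```

A fuller check would also run pg_restore --list against the dump, not just test the compression layer.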

Backup Scripts

| Script | Location | Purpose |
| --- | --- | --- |
| backup.sh | /app/scripts/backup.sh | Create pg_dump backups |
| verify_backup.sh | /app/scripts/verify_backup.sh | Verify backup integrity |
| monitor_replication_lag.py | /app/scripts/monitor_replication_lag.py | Monitor DB replication |

Monitoring & Alerting

Prometheus Alerts

| Alert | Severity | Action |
| --- | --- | --- |
| APIDown | Critical | Check API container health |
| HighErrorRate | Critical | Investigate error logs |
| DatabasePoolExhausted | Warning | Check PgBouncer stats |
| ReplicationLagCritical | Critical | Run failover if primary dead |
| BackupVerificationFailed | Critical | Check backup service |
| MCPServerDown | Critical | Restart MCP container |

Log Aggregation

  • Loki: k8s/staging/loki-configmap.yaml
  • Promtail: k8s/staging/promtail-configmap.yaml
  • Grafana Dashboards: grafana/provisioning/dashboards/

Recovery Procedures

1. Database Primary Failure

Symptoms:

  • ReplicationDisconnected alert fires
  • pg_replication_lag_seconds shows -1 or replica disconnected
  • API returns database connection errors

Immediate Actions:

# 1. Verify primary is truly down
pg_isready -h db -p 5432 -U buywhere

# 2. Check replica health
./scripts/failover_replica.sh --check-only

# 3. If replica is healthy, promote it
./scripts/failover_replica.sh --promote

# 4. Verify promotion
pg_isready -h db_replica -p 5432 -U buywhere

Post-Failover:

# 1. Update connection strings in environment
# Set DATABASE_URL to point to new primary (db_replica)

# 2. Restart API services
docker-compose -f docker-compose.prod.yml restart api scraper-scheduler

# 3. Verify replication is configured for the new primary
# (A new replica will need to be provisioned)

# 4. Monitor for 30 minutes (pg_replication_lag_seconds is a Prometheus
#    metric, not a shell command — watch it in Grafana, or poll health):
watch -n5 'pg_isready -h db_replica -p 5432 -U buywhere'

Escalation:

  • Page: @ops-primary-oncall
  • Slack: #incidents
  • Duration target: < 5 minutes

2. Database Replica Failure

Symptoms:

  • ReplicationLagCritical alert
  • Replica not responding to pg_isready

Recovery:

# 1. Check replica status
docker-compose -f docker-compose.prod.yml logs db_replica

# 2. If replica container is running but unhealthy, restart it
docker-compose -f docker-compose.prod.yml restart db_replica

# 3. If the container is corrupted, remove it and recreate from the primary
#    (the replica's data volume may also need wiping so it re-seeds)
docker-compose -f docker-compose.prod.yml rm -sf db_replica
docker-compose -f docker-compose.prod.yml up -d db_replica

# 4. Wait for replication to catch up (query the replica:
#    pg_last_xact_replay_timestamp() returns NULL on a primary)
watch -n10 'PGPASSWORD=buywhere psql -h db_replica -U buywhere -d catalog -c "SELECT now() - pg_last_xact_replay_timestamp();"'
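The catch-up wait can also be automated so the operator is freed up. A hedged sketch, assuming psql access to the replica; the 60-second threshold and 10-second polling interval are arbitrary choices:

```shell
# Sketch: poll the replica until replay lag drops below a threshold.
lag_ok() {  # lag_ok <lag_seconds> <threshold_seconds>; ignores the fraction
  [ "${1%.*}" -lt "$2" ]
}

wait_for_catchup() {
  while :; do
    lag=$(PGPASSWORD=buywhere psql -h db_replica -U buywhere -d catalog -Atc \
      "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);")
    lag_ok "$lag" 60 && break
    echo "replication lag: ${lag}s"
    sleep 10
  done
}
# wait_for_catchup   # run once the replica container is back up
```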

Escalation:

  • Duration target: < 30 minutes

3. Complete Data Center Loss

Prerequisites:

  • Backups available in off-site storage
  • Infrastructure as Code (k8s manifests in k8s/)

Recovery Steps:

# Phase 1: Infrastructure Provisioning (< 1 hour)
# 1. Spin up new hosts or use pre-provisioned DR environment

# 2. Clone repository
git clone https://github.com/buywhere/buywhere-api.git
cd buywhere-api

# 3. Restore secrets
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<user> \
  --docker-password=<token> \
  -n production

# Phase 2: Database Restore (< 2 hours)
# 4. Identify latest good backup
BACKUP=$(ls -t /var/backups/buywhere/daily/*.backup.gz | head -1)

# 5. Restore database
./scripts/backup.sh restore "$BACKUP" catalog

# 6. Verify restore
psql -h localhost -U buywhere -d catalog -c "SELECT COUNT(*) FROM products;"

# Phase 3: Service Deployment (< 1 hour)
# 7. Deploy using Docker Compose
docker-compose -f docker-compose.prod.yml up -d

# 8. Verify all services healthy
docker-compose -f docker-compose.prod.yml ps

# 9. Run smoke tests
curl -f https://api.buywhere.ai/health

RTO Target: < 4 hours
RPO Target: < 24 hours (latest daily backup)
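The single curl smoke test above can be wrapped in a retrying check so transient startup errors do not fail the drill. A sketch — the retry budget and timeout are assumptions:

```shell
# Sketch: retry a health endpoint before declaring the smoke test failed.
check_endpoint() {  # check_endpoint <url> [attempts]; 0 once a 2xx is seen
  attempts="${2:-5}"
  i=1
  while :; do
    curl -fsS --max-time 5 "$1" > /dev/null 2>&1 && return 0
    [ "$i" -ge "$attempts" ] && return 1   # budget exhausted
    i=$((i + 1))
    sleep 3
  done
}
# check_endpoint https://api.buywhere.ai/health || echo "smoke test FAILED"
```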


4. API Service Failure

Symptoms:

  • APIDown or ExternalAPIDown alert
  • HTTP 5xx errors from health endpoint

Recovery:

# 1. Check container status
docker-compose -f docker-compose.prod.yml ps api

# 2. View logs
docker-compose -f docker-compose.prod.yml logs api --tail=100

# 3. Restart API container
docker-compose -f docker-compose.prod.yml restart api

# 4. If restart doesn't help, check resource usage
docker stats

# 5. Scale up if needed
docker-compose -f docker-compose.prod.yml up -d --scale api=3

# 6. Verify health
curl -f https://api.buywhere.ai/health

Alternative (Kubernetes):

kubectl rollout restart deployment/buywhere-api -n production
kubectl get pods -n production -l app=buywhere-api

5. Redis Failure

Symptoms:

  • Cache-related errors in application logs
  • REDIS_URL connection failures

Recovery:

# 1. Check Redis status
docker-compose -f docker-compose.prod.yml ps redis
redis-cli -h localhost ping

# 2. Restart Redis
docker-compose -f docker-compose.prod.yml restart redis

# 3. Warm cache (trigger scrapers to repopulate)
docker-compose -f docker-compose.prod.yml restart scraper-scheduler

# 4. Monitor cache hit rate
# Check Grafana dashboard for cache hit rate metrics

Data Loss: Up to 1 hour of cached data; with Redis persistence enabled, loss is bounded by the last snapshot, and the scrapers repopulate evicted entries.
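Before restarting the scrapers to warm the cache, it is worth confirming Redis is actually answering again. A tiny sketch; the host name is an assumption:

```shell
# Sketch: returns 0 only if Redis answers PING with PONG.
redis_up() {  # redis_up <host>
  [ "$(redis-cli -h "$1" ping 2>/dev/null)" = "PONG" ]
}
# redis_up redis && echo "cache is back" || echo "still down"
```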


6. PgBouncer Failure

Symptoms:

  • DatabasePoolExhausted alert
  • API returns connection timeout errors

Recovery:

# 1. Check PgBouncer status
docker-compose -f docker-compose.prod.yml logs pgbouncer --tail=50

# 2. Restart PgBouncer
docker-compose -f docker-compose.prod.yml restart pgbouncer

# 3. Verify connections
psql -h localhost -p 5435 -U buywhere -d catalog -c "SELECT 1;"

# 4. If still failing, inspect the pools via PgBouncer's admin console
#    (PgBouncer speaks the PostgreSQL protocol, so use psql, not nc)
docker-compose -f docker-compose.prod.yml exec pgbouncer psql -p 5432 -U buywhere pgbouncer -c "SHOW POOLS;"

7. Backup Failure / Corruption

Symptoms:

  • BackupVerificationFailed alert
  • BackupMissing alert

Investigation:

# 1. Check backup service logs
docker-compose -f docker-compose.prod.yml logs backup-cron --tail=100
docker-compose -f docker-compose.prod.yml logs backup-verify-cron --tail=100

# 2. Check disk space
df -h /var/backups

# 3. Manually run backup
docker-compose -f docker-compose.prod.yml exec backup-cron /app/scripts/backup.sh backup hourly

# 4. Verify backup
docker-compose -f docker-compose.prod.yml exec backup-verify-cron /app/scripts/verify_backup.sh verify hourly

Recovery:

  • If disk space issue: Expand volume, clean up old backups
  • If backup script failure: Check PostgreSQL connectivity
  • If corruption: Restore from last known good backup
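For the disk-space case, a pre-flight guard can stop the backup cron from filling the volume in the first place. A sketch — the 90% threshold and the path are assumptions:

```shell
# Sketch: percentage of the filesystem holding <path> that is in use.
disk_pct_used() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example guard before running a backup:
# [ "$(disk_pct_used /var/backups)" -lt 90 ] || { echo "volume nearly full"; exit 1; }
```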

8. Security Incident (Compromised Credentials)

Symptoms:

  • Unauthorized access detected
  • Suspicious queries in logs
  • Unexpected data modifications

Immediate Actions:

# 1. Rotate all credentials IMMEDIATELY
# Trigger secrets rotation workflow
gh workflow run secrets-rotation.yml

# 2. Revoke and regenerate all database passwords
# Connect to primary directly (bypass PgBouncer)
psql -h db -p 5432 -U buywhere -d catalog

# In psql:
# SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename='buywhere';
# ALTER USER buywhere WITH PASSWORD 'new_secure_password';

# 3. Update all services with new credentials
# Update DATABASE_URL, JWT_SECRET_KEY, etc. in all services

# 4. Restart all services to pick up new credentials
docker-compose -f docker-compose.prod.yml restart

# 5. Check for unauthorized access in logs
grep -E "ERROR|UNAUTHORIZED|FAILED" /app/logs/*.log

# 6. Notify security team
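For step 2, the replacement password itself can be generated locally rather than typed by hand. A sketch — the 32-character length and alphanumeric charset are an assumed policy, not an established standard here:

```shell
# Sketch: generate a high-entropy replacement password for the ALTER USER
# step (alphanumeric only, so it is safe inside connection URLs).
NEW_PW="$(head -c 48 /dev/urandom | base64 | tr -d '\n=/+' | cut -c1-32)"
echo "generated ${#NEW_PW}-char password"
# Apply it on the primary (avoid echoing the password into shell history):
# psql -h db -p 5432 -U buywhere -d catalog \
#   -c "ALTER USER buywhere WITH PASSWORD '$NEW_PW';"
```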

Escalation:

  • Immediate: @security-oncall
  • Document incident in #security-incidents

Post-Incident Procedures

1. Incident Documentation

Within 24 hours of incident resolution:

# Incident Report: [TITLE]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** Critical / High / Medium / Low
**Root Cause:** [Brief description]

## Timeline
- HH:MM - Event
- HH:MM - Action taken
- HH:MM - Resolution

## Impact
- Users affected: [number]
- Data loss: [yes/no, amount]
- Downtime: [duration]

## Action Items
- [ ] Specific actionable item 1
- [ ] Specific actionable item 2

## Lessons Learned
Document specific improvements to systems, processes, or documentation based on the incident

2. Follow-up Actions

  • Update runbook with lessons learned
  • Implement any identified improvements
  • Schedule post-incident review meeting
  • Update monitoring/alerting if gaps found

Contact Information

| Role | Contact | Escalation Path |
| --- | --- | --- |
| Primary On-Call | @ops-primary-oncall | PagerDuty |
| Secondary On-Call | @ops-secondary-oncall | PagerDuty |
| Engineering Manager | bolt@buywhere.ai | Slack DM |
| Security Team | security@buywhere.ai | #security-incidents |
| Database Expert | dba@buywhere.ai | Email |

Appendix

A. Important Commands Reference

# Database
pg_isready -h db -p 5432 -U buywhere
pg_replication_lag_seconds  # Prometheus metric
docker-compose -f docker-compose.prod.yml logs db
docker-compose -f docker-compose.prod.yml logs db_replica

# Backups
./scripts/backup.sh list hourly
./scripts/backup.sh restore <backup_file>
./scripts/verify_backup.sh verify all

# Containers
docker-compose -f docker-compose.prod.yml ps
docker-compose -f docker-compose.prod.yml restart <service>
docker stats

# Health Checks
curl -f http://localhost:8000/health
curl -f http://localhost:8080/health  # MCP

B. Runbook Maintenance

  • Review Frequency: Quarterly
  • Last Reviewed: 2026-04-18
  • Next Review: 2026-07-18
  • Version Control: Git (this file is in docs/disaster_recovery_runbook.md)

C. Testing Schedule

TestFrequencyMethod
Backup RestorationMonthlyRestore to test environment
Failover DrillQuarterlyPromote replica in staging
Full DR ExerciseAnnuallySimulate complete DC loss

End of Document