# BuyWhere Production Disaster Recovery Runbook

**Document Version:** 1.0 | **Last Updated:** 2026-04-18 | **Owner:** Ops Team | **Classification:** Internal - Confidential
## Quick Reference
| Scenario | RTO | RPO | Primary Action |
|---|---|---|---|
| DB Primary Failure | < 5 min | < 1 hour | Promote replica via failover_replica.sh --promote |
| DB Replica Failure | < 30 min | N/A | Check replication, recreate replica if needed |
| Complete DC Loss | < 4 hours | < 24 hours | Restore from latest backup, redeploy infrastructure |
| API Service Down | < 15 min | N/A | Restart container, scale up replicas |
| Redis Failure | < 10 min | < 1 hour | Restart Redis, warm cache from scrapers |
| Data Corruption | < 2 hours | < 1 hour | Restore from backup, replay WAL if available |
## System Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                        (api.buywhere.ai)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐        ┌─────▼─────┐       ┌─────▼─────┐
    │   API   │        │    API    │       │    API    │
    │ Instance│        │ Instance  │       │ Instance  │
    └────┬────┘        └─────┬─────┘       └─────┬─────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐        ┌─────▼─────┐       ┌─────▼─────┐
    │PgBouncer│        │   Redis   │       │    MCP    │
    │(pooler) │        │  (cache)  │       │  Service  │
    └────┬────┘        └───────────┘       └───────────┘
         │
    ┌────▼────┐
    │ Primary │
    │   DB    │◄────────── Streaming Replication ──────────┐
    └────┬────┘                                            │
         │                                            ┌────▼────┐
         │                                            │ Replica │
         │                                            │   DB    │
         │                                            └─────────┘
         │
    ┌────▼─────────────────────────────────────────────────┐
    │        Backup Volume (hourly/daily/weekly)           │
    └──────────────────────────────────────────────────────┘
```
### Components
| Component | Version | Purpose | Failover |
|---|---|---|---|
| API Service | FastAPI | Core application | Restart / Scale |
| PostgreSQL | 16-alpine | Primary database | Promote replica |
| PgBouncer | latest | Connection pooling | Restart |
| Redis | 7-alpine | Caching layer | Restart, warm cache |
| MCP Service | latest | Model Context Protocol | Restart |
## Backup Strategy

### Current Implementation

```text
Backup Schedule:
├── Hourly (24 retained) - /var/backups/buywhere/hourly/
├── Daily  (7 retained)  - /var/backups/buywhere/daily/
├── Weekly (4 retained)  - /var/backups/buywhere/weekly/
└── WAL Archive          - /var/backups/buywhere/wal/ (if configured)
```
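The retention counts above have to be enforced somewhere. If `backup.sh` does not already prune old archives, a minimal sketch could look like the following (the `prune` helper is hypothetical, not part of the repo's scripts; the paths and retention counts come from the schedule above):

```bash
#!/bin/sh
# Sketch of retention enforcement matching the schedule above.
# NOTE: this deletes files -- try it against a test directory first.
BACKUP_ROOT="${BACKUP_ROOT:-/var/backups/buywhere}"

prune() {
    dir="$1"; keep="$2"
    # List newest-first, skip the first $keep entries, delete the rest.
    ls -1t "$dir"/*.backup.gz 2>/dev/null | tail -n +"$((keep + 1))" |
        while IFS= read -r f; do rm -f -- "$f"; done
}

prune "$BACKUP_ROOT/hourly" 24
prune "$BACKUP_ROOT/daily" 7
prune "$BACKUP_ROOT/weekly" 4
```

Running this from the same cron container that creates the backups keeps creation and pruning in one place.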
### Verification

- **Automated Verification:** every 7 days via `verify_backup.sh`
- **Prometheus Metrics:** `/var/lib/node_exporter/textfile_collector/backup_verification.prom`
- **Alerting:** `BackupVerificationFailed`, `BackupMissing`, `BackupTooOld`
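For reference, publishing a verification result through node_exporter's textfile collector is just an atomic write of a small text file. The `.prom` path is the one listed above; the metric and label names in this sketch are illustrative assumptions, not necessarily what `verify_backup.sh` emits:

```bash
#!/bin/sh
# Sketch: publish a backup-verification result for the textfile collector.
# The file path is from this runbook; metric names are illustrative.
PROM_FILE="${PROM_FILE:-/var/lib/node_exporter/textfile_collector/backup_verification.prom}"

write_metric() {
    status="$1"   # 1 = last verification passed, 0 = failed
    tmp="${PROM_FILE}.tmp"
    {
        echo "# HELP backup_verification_success Last backup verification result (1 = ok)."
        echo "# TYPE backup_verification_success gauge"
        echo "backup_verification_success $status"
        echo "backup_verification_timestamp_seconds $(date +%s)"
    } > "$tmp"
    mv "$tmp" "$PROM_FILE"   # atomic rename so the collector never reads a partial file
}
```

The write-then-rename pattern matters: node_exporter scrapes the directory on its own schedule, and a half-written file would produce parse errors.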
### Backup Scripts

| Script | Location | Purpose |
|---|---|---|
| `backup.sh` | `/app/scripts/backup.sh` | Create pg_dump backups |
| `verify_backup.sh` | `/app/scripts/verify_backup.sh` | Verify backup integrity |
| `monitor_replication_lag.py` | `/app/scripts/monitor_replication_lag.py` | Monitor DB replication |
## Monitoring & Alerting

### Prometheus Alerts

| Alert | Severity | Action |
|---|---|---|
| APIDown | Critical | Check API container health |
| HighErrorRate | Critical | Investigate error logs |
| DatabasePoolExhausted | Warning | Check PgBouncer stats |
| ReplicationLagCritical | Critical | Run failover if primary is dead |
| BackupVerificationFailed | Critical | Check backup service |
| MCPServerDown | Critical | Restart MCP container |
### Log Aggregation

- **Loki:** `k8s/staging/loki-configmap.yaml`
- **Promtail:** `k8s/staging/promtail-configmap.yaml`
- **Grafana Dashboards:** `grafana/provisioning/dashboards/`
## Recovery Procedures

### 1. Database Primary Failure

**Symptoms:**

- `ReplicationDisconnected` alert fires
- `pg_replication_lag_seconds` shows -1 or the replica is disconnected
- API returns database connection errors
**Immediate Actions:**

```bash
# 1. Verify the primary is truly down
pg_isready -h db -p 5432 -U buywhere

# 2. Check replica health
./scripts/failover_replica.sh --check-only

# 3. If the replica is healthy, promote it
./scripts/failover_replica.sh --promote

# 4. Verify the promotion
pg_isready -h db_replica -p 5432 -U buywhere
```
**Post-Failover:**

```bash
# 1. Update connection strings in the environment
#    Set DATABASE_URL to point to the new primary (db_replica)

# 2. Restart API services
docker-compose -f docker-compose.prod.yml restart api scraper-scheduler

# 3. Verify replication is configured for the new primary
#    (a new replica will need to be provisioned)

# 4. Monitor for 30 minutes: watch the pg_replication_lag_seconds metric
#    in Grafana/Prometheus (it is a metric, not a shell command), or run
#    /app/scripts/monitor_replication_lag.py
```
**Escalation:**

- Page: @ops-primary-oncall
- Slack: #incidents
- Duration target: < 5 minutes
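During the 30-minute watch it helps to have an unambiguous pass/fail rule for a lag reading. The helper below is my own sketch (not part of the repo's scripts), and the 60-second threshold is an illustrative assumption, not a documented SLO:

```bash
#!/bin/sh
# Hypothetical helper: classify a replication lag reading (in seconds).
# A negative reading (e.g. -1 for "no replay timestamp") is treated as bad.
lag_status() {
    lag="$1"
    max="${2:-60}"   # illustrative threshold, not a documented SLO
    # awk handles the fractional seconds a psql interval query can return
    if awk -v l="$lag" -v m="$max" 'BEGIN { exit !(l >= 0 && l <= m) }'; then
        echo "ok"
    else
        echo "ALERT: lag ${lag}s exceeds ${max}s"
    fi
}

# Example wiring (run against the replica; connection details as elsewhere in this runbook):
# LAG=$(PGPASSWORD=buywhere psql -h db_replica -U buywhere -d catalog -Atc \
#   "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), -1);")
# lag_status "$LAG"
```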
### 2. Database Replica Failure

**Symptoms:**

- `ReplicationLagCritical` alert
- Replica not responding to `pg_isready`
**Recovery:**

```bash
# 1. Check replica status
docker-compose -f docker-compose.prod.yml logs db_replica

# 2. If the replica container is running but unhealthy, restart it
docker-compose -f docker-compose.prod.yml restart db_replica

# 3. If the container is corrupted, recreate it from the primary
#    (the replica's data volume may need to be cleared first so it re-seeds)
docker-compose -f docker-compose.prod.yml up -d db_replica

# 4. Wait for replication to catch up -- run the query on the replica,
#    since pg_last_xact_replay_timestamp() is only meaningful on a standby
watch -n10 'PGPASSWORD=buywhere psql -h db_replica -U buywhere -d catalog -c "SELECT now() - pg_last_xact_replay_timestamp();"'
```
**Escalation:**

- Duration target: < 30 minutes
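If recreating the container does not re-seed the data (that depends on how the `db_replica` entrypoint is set up), a manual re-base from the primary with `pg_basebackup` is the standard approach. The sketch below only *prints* the commands so they can be reviewed first; the `replicator` user, container names, and data directory are assumptions — check `docker-compose.prod.yml` before running anything:

```bash
#!/bin/sh
# Sketch of a replica re-base (PostgreSQL streaming replication).
# The replication user, service names, and data dir are assumptions.
build_rebuild_cmds() {
    primary_host="${1:-db}"
    data_dir="${2:-/var/lib/postgresql/data}"
    cat <<EOF
docker-compose -f docker-compose.prod.yml stop db_replica
docker-compose -f docker-compose.prod.yml run --rm db_replica sh -c 'rm -rf ${data_dir}/* && pg_basebackup -h ${primary_host} -U replicator -D ${data_dir} -R -X stream'
docker-compose -f docker-compose.prod.yml up -d db_replica
EOF
}

build_rebuild_cmds    # review the printed commands, then run them by hand
```

`-R` writes the standby configuration automatically and `-X stream` ships WAL during the base backup, so the replica can start replaying immediately.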
### 3. Complete Data Center Loss

**Prerequisites:**

- Backups available in off-site storage
- Infrastructure as Code (k8s manifests in `k8s/`)
**Recovery Steps:**

```bash
# Phase 1: Infrastructure Provisioning (< 1 hour)

# 1. Spin up new hosts or use the pre-provisioned DR environment

# 2. Clone the repository
git clone https://github.com/buywhere/buywhere-api.git
cd buywhere-api

# 3. Restore secrets
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<user> \
  --docker-password=<token> \
  -n production

# Phase 2: Database Restore (< 2 hours)

# 4. Identify the latest good backup
BACKUP=$(ls -t /var/backups/buywhere/daily/*.backup.gz | head -1)

# 5. Restore the database
./scripts/backup.sh restore "$BACKUP" catalog

# 6. Verify the restore
psql -h localhost -U buywhere -d catalog -c "SELECT COUNT(*) FROM products;"

# Phase 3: Service Deployment (< 1 hour)

# 7. Deploy using Docker Compose
docker-compose -f docker-compose.prod.yml up -d

# 8. Verify all services are healthy
docker-compose -f docker-compose.prod.yml ps

# 9. Run smoke tests
curl -f https://api.buywhere.ai/health
```
**RTO Target:** < 4 hours | **RPO:** < 24 hours (latest daily backup)
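A single `curl` smoke test can report a false failure while dependencies are still starting after a restore. A small retry wrapper (the helper and its parameters are my own sketch, not part of the repo) makes the check tolerant of startup flapping:

```bash
#!/bin/sh
# Hypothetical retry wrapper for post-restore smoke tests.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while :; do
        if "$@"; then return 0; fi           # command succeeded
        [ "$i" -ge "$attempts" ] && return 1 # out of attempts
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example (same endpoint as step 9 above):
# retry 10 15 curl -fsS https://api.buywhere.ai/health
```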
### 4. API Service Failure

**Symptoms:**

- `APIDown` or `ExternalAPIDown` alert
- HTTP 5xx errors from the health endpoint
**Recovery:**

```bash
# 1. Check container status
docker-compose -f docker-compose.prod.yml ps api

# 2. View logs
docker-compose -f docker-compose.prod.yml logs api --tail=100

# 3. Restart the API container
docker-compose -f docker-compose.prod.yml restart api

# 4. If the restart doesn't help, check resource usage
docker stats

# 5. Scale up if needed
docker-compose -f docker-compose.prod.yml up -d --scale api=3

# 6. Verify health
curl -f https://api.buywhere.ai/health
```
**Alternative (Kubernetes):**

```bash
kubectl rollout restart deployment/buywhere-api -n production
kubectl get pods -n production -l app=buywhere-api
```
### 5. Redis Failure

**Symptoms:**

- Cache-related errors in application logs
- `REDIS_URL` connection failures
**Recovery:**

```bash
# 1. Check Redis status
docker-compose -f docker-compose.prod.yml ps redis
redis-cli -h localhost ping

# 2. Restart Redis
docker-compose -f docker-compose.prod.yml restart redis

# 3. Warm the cache (trigger scrapers to repopulate)
docker-compose -f docker-compose.prod.yml restart scraper-scheduler

# 4. Monitor the cache hit rate on the Grafana dashboard
```
**Data Loss:** up to 1 hour of cache (Redis persistence enabled)
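The "up to 1 hour" figure only holds if Redis persistence is actually on. A quick check can parse `redis-cli INFO persistence`; the parsing helper below is my own sketch, but `aof_enabled` and `rdb_last_bgsave_status` are real fields of the INFO output:

```bash
#!/bin/sh
# Check that Redis persistence is on before trusting the <=1h cache RPO.
persistence_ok() {
    info="$1"   # raw "INFO persistence" text (lines may be CRLF-terminated)
    aof=$(printf '%s\n' "$info" | tr -d '\r' | awk -F: '$1 == "aof_enabled" { print $2 }')
    rdb=$(printf '%s\n' "$info" | tr -d '\r' | awk -F: '$1 == "rdb_last_bgsave_status" { print $2 }')
    # Pass if AOF is enabled or the last RDB background save succeeded
    [ "$aof" = "1" ] || [ "$rdb" = "ok" ]
}

# Example wiring:
# persistence_ok "$(redis-cli -h localhost INFO persistence)" && echo "persistence OK"
```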
### 6. PgBouncer Failure

**Symptoms:**

- `DatabasePoolExhausted` alert
- API returns connection timeout errors
**Recovery:**

```bash
# 1. Check PgBouncer status
docker-compose -f docker-compose.prod.yml logs pgbouncer --tail=50

# 2. Restart PgBouncer
docker-compose -f docker-compose.prod.yml restart pgbouncer

# 3. Verify connections
psql -h localhost -p 5435 -U buywhere -d catalog -c "SELECT 1;"

# 4. If still failing, inspect the pools via the PgBouncer admin console.
#    The console speaks the PostgreSQL protocol, so use psql (not nc);
#    the user must be listed in admin_users or stats_users in pgbouncer.ini.
psql -h localhost -p 5435 -U buywhere -d pgbouncer -c "SHOW POOLS;"
```
### 7. Backup Failure / Corruption

**Symptoms:**

- `BackupVerificationFailed` alert
- `BackupMissing` alert
**Investigation:**

```bash
# 1. Check backup service logs
docker-compose -f docker-compose.prod.yml logs backup-cron --tail=100
docker-compose -f docker-compose.prod.yml logs backup-verify-cron --tail=100

# 2. Check disk space
df -h /var/backups

# 3. Manually run a backup
docker-compose -f docker-compose.prod.yml exec backup-cron /app/scripts/backup.sh backup hourly

# 4. Verify the backup
docker-compose -f docker-compose.prod.yml exec backup-verify-cron /app/scripts/verify_backup.sh verify hourly
```
**Recovery:**

- If disk space issue: expand the volume, clean up old backups
- If backup script failure: check PostgreSQL connectivity
- If corruption: restore from the last known good backup
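Before escalating a suspected corruption, a cheap structural check is `gzip -t` on the newest archive: it catches truncation and byte-level damage, though full logical verification remains `verify_backup.sh`'s job. The helper below is a sketch of my own, using the daily path from the backup schedule:

```bash
#!/bin/sh
# Quick structural check of the newest backup archive.
# gzip -t detects truncation/corruption of the compressed stream only;
# it says nothing about the logical validity of the dump inside.
latest_backup_ok() {
    dir="${1:-/var/backups/buywhere/daily}"
    latest=$(ls -1t "$dir"/*.backup.gz 2>/dev/null | head -1)
    [ -n "$latest" ] || { echo "no backups found in $dir"; return 2; }
    if gzip -t "$latest" 2>/dev/null; then
        echo "ok: $latest"
    else
        echo "CORRUPT: $latest"
        return 1
    fi
}
```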
### 8. Security Incident (Compromised Credentials)

**Symptoms:**

- Unauthorized access detected
- Suspicious queries in logs
- Unexpected data modifications
**Immediate Actions:**

```bash
# 1. Rotate all credentials IMMEDIATELY
#    Trigger the secrets rotation workflow
gh workflow run secrets-rotation.yml

# 2. Revoke and regenerate all database passwords
#    Connect to the primary directly (bypass PgBouncer)
psql -h db -p 5432 -U buywhere -d catalog
# In psql:
#   SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'buywhere';
#   ALTER USER buywhere WITH PASSWORD 'new_secure_password';

# 3. Update all services with the new credentials
#    (DATABASE_URL, JWT_SECRET_KEY, etc.)

# 4. Restart all services to pick up the new credentials
docker-compose -f docker-compose.prod.yml restart

# 5. Check for unauthorized access in logs
grep -E "ERROR|UNAUTHORIZED|FAILED" /app/logs/*.log

# 6. Notify the security team
```
**Escalation:**

- Immediate: @security-oncall
- Document the incident in #security-incidents
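Step 2's `ALTER USER` needs a strong replacement for `'new_secure_password'`. A sketch for generating one safely (the helpers are mine, not part of the repo; keeping the charset alphanumeric avoids quoting problems when the password is embedded in the SQL statement):

```bash
#!/bin/sh
# Sketch: generate a strong replacement password for step 2 above.
# Alphanumeric-only so it can be embedded in ALTER USER without escaping.
new_password() {
    length="${1:-32}"
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
}

rotation_sql() {
    pw="$1"
    printf "ALTER USER buywhere WITH PASSWORD '%s';" "$pw"
}

# Example (apply directly on the primary, bypassing PgBouncer):
# PW=$(new_password 32)
# psql -h db -p 5432 -U buywhere -d catalog -c "$(rotation_sql "$PW")"
# ...then update DATABASE_URL everywhere and restart services (steps 3-4).
```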
## Post-Incident Procedures

### 1. Incident Documentation

Within 24 hours of incident resolution, file a report using this template:
```markdown
# Incident Report: [TITLE]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** Critical / High / Medium / Low
**Root Cause:** [Brief description]

## Timeline

- HH:MM - Event
- HH:MM - Action taken
- HH:MM - Resolution

## Impact

- Users affected: [number]
- Data loss: [yes/no, amount]
- Downtime: [duration]

## Action Items

- [ ] Specific actionable item 1
- [ ] Specific actionable item 2

## Lessons Learned

Document specific improvements to systems, processes, or documentation based on the incident.
```
### 2. Follow-up Actions

- Update the runbook with lessons learned
- Implement any identified improvements
- Schedule a post-incident review meeting
- Update monitoring/alerting if gaps were found
## Contact Information
| Role | Contact | Escalation Path |
|---|---|---|
| Primary On-Call | @ops-primary-oncall | PagerDuty |
| Secondary On-Call | @ops-secondary-oncall | PagerDuty |
| Engineering Manager | bolt@buywhere.ai | Slack DM |
| Security Team | security@buywhere.ai | #security-incidents |
| Database Expert | dba@buywhere.ai | |
## Appendix

### A. Important Commands Reference
```bash
# Database
pg_isready -h db -p 5432 -U buywhere
# pg_replication_lag_seconds is a Prometheus metric (check Grafana), not a command
docker-compose -f docker-compose.prod.yml logs db
docker-compose -f docker-compose.prod.yml logs db_replica

# Backups
./scripts/backup.sh list hourly
./scripts/backup.sh restore <backup_file>
./scripts/verify_backup.sh verify all

# Containers
docker-compose -f docker-compose.prod.yml ps
docker-compose -f docker-compose.prod.yml restart <service>
docker stats

# Health Checks
curl -f http://localhost:8000/health
curl -f http://localhost:8080/health  # MCP
```
### B. Runbook Maintenance

- **Review Frequency:** Quarterly
- **Last Reviewed:** 2026-04-18
- **Next Review:** 2026-07-18
- **Version Control:** Git (this file is `docs/disaster_recovery_runbook.md`)
### C. Testing Schedule
| Test | Frequency | Method |
|---|---|---|
| Backup Restoration | Monthly | Restore to test environment |
| Failover Drill | Quarterly | Promote replica in staging |
| Full DR Exercise | Annually | Simulate complete DC loss |
End of Document