# BuyWhere Production Disaster Recovery Runbook

**Document Version:** 1.0 | **Last Updated:** 2026-04-18 | **Owner:** Ops Team | **Classification:** Internal - Confidential
## Quick Reference
| Scenario | RTO | RPO | Primary Action |
|---|---|---|---|
| DB Primary Failure | < 5 min | < 1 hour | Promote replica via failover_replica.sh --promote |
| DB Replica Failure | < 30 min | N/A | Check replication, recreate replica if needed |
| Complete DC Loss | < 4 hours | < 24 hours | Restore from latest backup, redeploy infrastructure |
| API Service Down | < 15 min | N/A | Restart container, scale up replicas |
| Redis Failure | < 10 min | < 1 hour | Restart Redis, warm cache from scrapers |
| Data Corruption | < 2 hours | < 1 hour | Restore from backup, replay WAL if available |
## System Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
│                        (api.buywhere.ai)                        │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐        ┌─────▼─────┐       ┌─────▼─────┐
    │   API   │        │    API    │       │    API    │
    │ Instance│        │ Instance  │       │ Instance  │
    └────┬────┘        └─────┬─────┘       └─────┬─────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐        ┌─────▼─────┐       ┌─────▼─────┐
    │PgBouncer│        │   Redis   │       │    MCP    │
    │(pooler) │        │  (cache)  │       │  Service  │
    └────┬────┘        └───────────┘       └───────────┘
         │
    ┌────▼────┐
    │ Primary │
    │   DB    │◄────────── Streaming Replication ──────────┐
    └────┬────┘                                            │
         │                                            ┌────▼────┐
         │                                            │ Replica │
         │                                            │   DB    │
         │                                            └─────────┘
         │
    ┌────▼─────────────────────────────────────────────────┐
    │        Backup Volume (hourly/daily/weekly)           │
    └──────────────────────────────────────────────────────┘
```
### Components
| Component | Version | Purpose | Failover |
|---|---|---|---|
| API Service | FastAPI | Core application | Restart / Scale |
| PostgreSQL | 16-alpine | Primary database | Promote replica |
| PgBouncer | latest | Connection pooling | Restart |
| Redis | 7-alpine | Caching layer | Restart, warm cache |
| MCP Service | latest | Model Context Protocol | Restart |
## Backup Strategy

### Current Implementation

```text
Backup Schedule:
├── Hourly (24 retained) - /var/backups/buywhere/hourly/
├── Daily  (7 retained)  - /var/backups/buywhere/daily/
├── Weekly (4 retained)  - /var/backups/buywhere/weekly/
└── WAL Archive          - /var/backups/buywhere/wal/ (if configured)
```
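The retention counts above have to be enforced somewhere. If `backup.sh` does not already prune old archives, a minimal sketch could look like the following (the `prune` helper is hypothetical, not part of the repo's scripts; the paths and retention counts come from the schedule above):

```bash
#!/bin/sh
# Sketch of retention enforcement matching the schedule above.
# NOTE: this deletes files -- try it against a test directory first.
BACKUP_ROOT="${BACKUP_ROOT:-/var/backups/buywhere}"

prune() {
    dir="$1"; keep="$2"
    # List newest-first, skip the first $keep entries, delete the rest.
    ls -1t "$dir"/*.backup.gz 2>/dev/null | tail -n +"$((keep + 1))" |
        while IFS= read -r f; do rm -f -- "$f"; done
}

prune "$BACKUP_ROOT/hourly" 24
prune "$BACKUP_ROOT/daily" 7
prune "$BACKUP_ROOT/weekly" 4
```

Running this from the same cron container that creates the backups keeps creation and pruning in one place.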
### Verification

- **Automated Verification:** every 7 days via `verify_backup.sh`
- **Prometheus Metrics:** `/var/lib/node_exporter/textfile_collector/backup_verification.prom`
- **Alerting:** `BackupVerificationFailed`, `BackupMissing`, `BackupTooOld`
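For reference, publishing a verification result through node_exporter's textfile collector is just an atomic write of a small text file. The `.prom` path is the one listed above; the metric and label names in this sketch are illustrative assumptions, not necessarily what `verify_backup.sh` emits:

```bash
#!/bin/sh
# Sketch: publish a backup-verification result for the textfile collector.
# The file path is from this runbook; metric names are illustrative.
PROM_FILE="${PROM_FILE:-/var/lib/node_exporter/textfile_collector/backup_verification.prom}"

write_metric() {
    status="$1"   # 1 = last verification passed, 0 = failed
    tmp="${PROM_FILE}.tmp"
    {
        echo "# HELP backup_verification_success Last backup verification result (1 = ok)."
        echo "# TYPE backup_verification_success gauge"
        echo "backup_verification_success $status"
        echo "backup_verification_timestamp_seconds $(date +%s)"
    } > "$tmp"
    mv "$tmp" "$PROM_FILE"   # atomic rename so the collector never reads a partial file
}
```

The write-then-rename pattern matters: node_exporter scrapes the directory on its own schedule, and a half-written file would produce parse errors.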
### Backup Scripts

| Script | Location | Purpose |
|---|---|---|
| `backup.sh` | `/app/scripts/backup.sh` | Create pg_dump backups |
| `verify_backup.sh` | `/app/scripts/verify_backup.sh` | Verify backup integrity |
| `monitor_replication_lag.py` | `/app/scripts/monitor_replication_lag.py` | Monitor DB replication |
## Monitoring & Alerting

### Prometheus Alerts

| Alert | Severity | Action |
|---|---|---|
| APIDown | Critical | Check API container health |
| HighErrorRate | Critical | Investigate error logs |
| DatabasePoolExhausted | Warning | Check PgBouncer stats |
| ReplicationLagCritical | Critical | Run failover if primary is dead |
| BackupVerificationFailed | Critical | Check backup service |
| MCPServerDown | Critical | Restart MCP container |
### Log Aggregation

- **Loki:** `k8s/staging/loki-configmap.yaml`
- **Promtail:** `k8s/staging/promtail-configmap.yaml`
- **Grafana Dashboards:** `grafana/provisioning/dashboards/`
## Recovery Procedures

### 1. Database Primary Failure

**Symptoms:**

- `ReplicationDisconnected` alert fires
- `pg_replication_lag_seconds` shows -1 or the replica is disconnected
- API returns database connection errors
**Immediate Actions:**

```bash
# 1. Verify the primary is truly down
pg_isready -h db -p 5432 -U buywhere

# 2. Check replica health
./scripts/failover_replica.sh --check-only

# 3. If the replica is healthy, promote it
./scripts/failover_replica.sh --promote

# 4. Verify the promotion
pg_isready -h db_replica -p 5432 -U buywhere
```
**Post-Failover:**

```bash
# 1. Update connection strings in the environment
#    Set DATABASE_URL to point to the new primary (db_replica)

# 2. Restart API services
docker-compose -f docker-compose.prod.yml restart api scraper-scheduler

# 3. Verify replication is configured for the new primary
#    (a new replica will need to be provisioned)

# 4. Monitor for 30 minutes: watch the pg_replication_lag_seconds metric
#    in Grafana/Prometheus (it is a metric, not a shell command), or run
#    /app/scripts/monitor_replication_lag.py
```
**Escalation:**

- Page: @ops-primary-oncall
- Slack: #incidents
- Duration target: < 5 minutes
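During the 30-minute watch it helps to have an unambiguous pass/fail rule for a lag reading. The helper below is my own sketch (not part of the repo's scripts), and the 60-second threshold is an illustrative assumption, not a documented SLO:

```bash
#!/bin/sh
# Hypothetical helper: classify a replication lag reading (in seconds).
# A negative reading (e.g. -1 for "no replay timestamp") is treated as bad.
lag_status() {
    lag="$1"
    max="${2:-60}"   # illustrative threshold, not a documented SLO
    # awk handles the fractional seconds a psql interval query can return
    if awk -v l="$lag" -v m="$max" 'BEGIN { exit !(l >= 0 && l <= m) }'; then
        echo "ok"
    else
        echo "ALERT: lag ${lag}s exceeds ${max}s"
    fi
}

# Example wiring (run against the replica; connection details as elsewhere in this runbook):
# LAG=$(PGPASSWORD=buywhere psql -h db_replica -U buywhere -d catalog -Atc \
#   "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), -1);")
# lag_status "$LAG"
```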
### 2. Database Replica Failure

**Symptoms:**

- `ReplicationLagCritical` alert
- Replica not responding to `pg_isready`
**Recovery:**

```bash
# 1. Check replica status
docker-compose -f docker-compose.prod.yml logs db_replica

# 2. If the replica container is running but unhealthy, restart it
docker-compose -f docker-compose.prod.yml restart db_replica

# 3. If the container is corrupted, recreate it from the primary
#    (the replica's data volume may need to be cleared first so it re-seeds)
docker-compose -f docker-compose.prod.yml up -d db_replica

# 4. Wait for replication to catch up -- run the query on the replica,
#    since pg_last_xact_replay_timestamp() is only meaningful on a standby
watch -n10 'PGPASSWORD=buywhere psql -h db_replica -U buywhere -d catalog -c "SELECT now() - pg_last_xact_replay_timestamp();"'
```
**Escalation:**

- Duration target: < 30 minutes
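If recreating the container does not re-seed the data (that depends on how the `db_replica` entrypoint is set up), a manual re-base from the primary with `pg_basebackup` is the standard approach. The sketch below only *prints* the commands so they can be reviewed first; the `replicator` user, container names, and data directory are assumptions — check `docker-compose.prod.yml` before running anything:

```bash
#!/bin/sh
# Sketch of a replica re-base (PostgreSQL streaming replication).
# The replication user, service names, and data dir are assumptions.
build_rebuild_cmds() {
    primary_host="${1:-db}"
    data_dir="${2:-/var/lib/postgresql/data}"
    cat <<EOF
docker-compose -f docker-compose.prod.yml stop db_replica
docker-compose -f docker-compose.prod.yml run --rm db_replica sh -c 'rm -rf ${data_dir}/* && pg_basebackup -h ${primary_host} -U replicator -D ${data_dir} -R -X stream'
docker-compose -f docker-compose.prod.yml up -d db_replica
EOF
}

build_rebuild_cmds    # review the printed commands, then run them by hand
```

`-R` writes the standby configuration automatically and `-X stream` ships WAL during the base backup, so the replica can start replaying immediately.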
### 3. Complete Data Center Loss

**Prerequisites:**

- Backups available in off-site storage
- Infrastructure as Code (k8s manifests in `k8s/`)
**Recovery Steps:**

```bash
# Phase 1: Infrastructure Provisioning (< 1 hour)

# 1. Spin up new hosts or use the pre-provisioned DR environment

# 2. Clone the repository
git clone https://github.com/buywhere/buywhere-api.git
cd buywhere-api

# 3. Restore secrets
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<user> \
  --docker-password=<token> \
  -n production

# Phase 2: Database Restore (< 2 hours)

# 4. Identify the latest good backup
BACKUP=$(ls -t /var/backups/buywhere/daily/*.backup.gz | head -1)

# 5. Restore the database
./scripts/backup.sh restore "$BACKUP" catalog

# 6. Verify the restore
psql -h localhost -U buywhere -d catalog -c "SELECT COUNT(*) FROM products;"

# Phase 3: Service Deployment (< 1 hour)

# 7. Deploy using Docker Compose
docker-compose -f docker-compose.prod.yml up -d

# 8. Verify all services are healthy
docker-compose -f docker-compose.prod.yml ps

# 9. Run smoke tests
curl -f https://api.buywhere.ai/health
```
**RTO Target:** < 4 hours | **RPO:** < 24 hours (latest daily backup)
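A single `curl` smoke test can report a false failure while dependencies are still starting after a restore. A small retry wrapper (the helper and its parameters are my own sketch, not part of the repo) makes the check tolerant of startup flapping:

```bash
#!/bin/sh
# Hypothetical retry wrapper for post-restore smoke tests.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while :; do
        if "$@"; then return 0; fi           # command succeeded
        [ "$i" -ge "$attempts" ] && return 1 # out of attempts
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example (same endpoint as step 9 above):
# retry 10 15 curl -fsS https://api.buywhere.ai/health
```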
### 4. API Service Failure

**Symptoms:**

- `APIDown` or `ExternalAPIDown` alert
- HTTP 5xx errors from the health endpoint
**Recovery:**

```bash
# 1. Check container status
docker-compose -f docker-compose.prod.yml ps api

# 2. View logs
docker-compose -f docker-compose.prod.yml logs api --tail=100

# 3. Restart the API container
docker-compose -f docker-compose.prod.yml restart api

# 4. If the restart doesn't help, check resource usage
docker stats

# 5. Scale up if needed
docker-compose -f docker-compose.prod.yml up -d --scale api=3

# 6. Verify health
curl -f https://api.buywhere.ai/health
```
**Alternative (Kubernetes):**

```bash
kubectl rollout restart deployment/buywhere-api -n production
kubectl get pods -n production -l app=buywhere-api
```
### 5. Redis Failure

**Symptoms:**

- Cache-related errors in application logs
- `REDIS_URL` connection failures
**Recovery:**

```bash
# 1. Check Redis status
docker-compose -f docker-compose.prod.yml ps redis
redis-cli -h localhost ping

# 2. Restart Redis
docker-compose -f docker-compose.prod.yml restart redis

# 3. Warm the cache (trigger scrapers to repopulate)
docker-compose -f docker-compose.prod.yml restart scraper-scheduler

# 4. Monitor the cache hit rate on the Grafana dashboard
```
**Data Loss:** up to 1 hour of cache (Redis persistence enabled)
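The "up to 1 hour" figure only holds if Redis persistence is actually on. A quick check can parse `redis-cli INFO persistence`; the parsing helper below is my own sketch, but `aof_enabled` and `rdb_last_bgsave_status` are real fields of the INFO output:

```bash
#!/bin/sh
# Check that Redis persistence is on before trusting the <=1h cache RPO.
persistence_ok() {
    info="$1"   # raw "INFO persistence" text (lines may be CRLF-terminated)
    aof=$(printf '%s\n' "$info" | tr -d '\r' | awk -F: '$1 == "aof_enabled" { print $2 }')
    rdb=$(printf '%s\n' "$info" | tr -d '\r' | awk -F: '$1 == "rdb_last_bgsave_status" { print $2 }')
    # Pass if AOF is enabled or the last RDB background save succeeded
    [ "$aof" = "1" ] || [ "$rdb" = "ok" ]
}

# Example wiring:
# persistence_ok "$(redis-cli -h localhost INFO persistence)" && echo "persistence OK"
```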
### 6. PgBouncer Failure

**Symptoms:**

- `DatabasePoolExhausted` alert
- API returns connection timeout errors
**Recovery:**

```bash
# 1. Check PgBouncer status
docker-compose -f docker-compose.prod.yml logs pgbouncer --tail=50

# 2. Restart PgBouncer
docker-compose -f docker-compose.prod.yml restart pgbouncer

# 3. Verify connections
psql -h localhost -p 5435 -U buywhere -d catalog -c "SELECT 1;"

# 4. If still failing, inspect the pools via the PgBouncer admin console.
#    The console speaks the PostgreSQL protocol, so use psql (not nc);
#    the user must be listed in admin_users or stats_users in pgbouncer.ini.
psql -h localhost -p 5435 -U buywhere -d pgbouncer -c "SHOW POOLS;"
```
### 7. Backup Failure / Corruption

**Symptoms:**

- `BackupVerificationFailed` alert
- `BackupMissing` alert
**Investigation:**

```bash
# 1. Check backup service logs
docker-compose -f docker-compose.prod.yml logs backup-cron --tail=100
docker-compose -f docker-compose.prod.yml logs backup-verify-cron --tail=100

# 2. Check disk space
df -h /var/backups

# 3. Manually run a backup
docker-compose -f docker-compose.prod.yml exec backup-cron /app/scripts/backup.sh backup hourly

# 4. Verify the backup
docker-compose -f docker-compose.prod.yml exec backup-verify-cron /app/scripts/verify_backup.sh verify hourly
```
**Recovery:**

- If disk space issue: expand the volume, clean up old backups
- If backup script failure: check PostgreSQL connectivity
- If corruption: restore from the last known good backup
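Before escalating a suspected corruption, a cheap structural check is `gzip -t` on the newest archive: it catches truncation and byte-level damage, though full logical verification remains `verify_backup.sh`'s job. The helper below is a sketch of my own, using the daily path from the backup schedule:

```bash
#!/bin/sh
# Quick structural check of the newest backup archive.
# gzip -t detects truncation/corruption of the compressed stream only;
# it says nothing about the logical validity of the dump inside.
latest_backup_ok() {
    dir="${1:-/var/backups/buywhere/daily}"
    latest=$(ls -1t "$dir"/*.backup.gz 2>/dev/null | head -1)
    [ -n "$latest" ] || { echo "no backups found in $dir"; return 2; }
    if gzip -t "$latest" 2>/dev/null; then
        echo "ok: $latest"
    else
        echo "CORRUPT: $latest"
        return 1
    fi
}
```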
### 8. Security Incident (Compromised Credentials)

**Symptoms:**

- Unauthorized access detected
- Suspicious queries in logs
- Unexpected data modifications
**Immediate Actions:**

```bash
# 1. Rotate all credentials IMMEDIATELY
#    Trigger the secrets rotation workflow
gh workflow run secrets-rotation.yml

# 2. Revoke and regenerate all database passwords
#    Connect to the primary directly (bypass PgBouncer)
psql -h db -p 5432 -U buywhere -d catalog
# In psql:
#   SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'buywhere';
#   ALTER USER buywhere WITH PASSWORD 'new_secure_password';

# 3. Update all services with the new credentials
#    (DATABASE_URL, JWT_SECRET_KEY, etc.)

# 4. Restart all services to pick up the new credentials
docker-compose -f docker-compose.prod.yml restart

# 5. Check for unauthorized access in logs
grep -E "ERROR|UNAUTHORIZED|FAILED" /app/logs/*.log

# 6. Notify the security team
```
**Escalation:**

- Immediate: @security-oncall
- Document the incident in #security-incidents
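Step 2's `ALTER USER` needs a strong replacement for `'new_secure_password'`. A sketch for generating one safely (the helpers are mine, not part of the repo; keeping the charset alphanumeric avoids quoting problems when the password is embedded in the SQL statement):

```bash
#!/bin/sh
# Sketch: generate a strong replacement password for step 2 above.
# Alphanumeric-only so it can be embedded in ALTER USER without escaping.
new_password() {
    length="${1:-32}"
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
}

rotation_sql() {
    pw="$1"
    printf "ALTER USER buywhere WITH PASSWORD '%s';" "$pw"
}

# Example (apply directly on the primary, bypassing PgBouncer):
# PW=$(new_password 32)
# psql -h db -p 5432 -U buywhere -d catalog -c "$(rotation_sql "$PW")"
# ...then update DATABASE_URL everywhere and restart services (steps 3-4).
```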
## Post-Incident Procedures

### 1. Incident Documentation

Within 24 hours of incident resolution, file a report using this template:
```markdown
# Incident Report: [TITLE]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** Critical / High / Medium / Low
**Root Cause:** [Brief description]

## Timeline

- HH:MM - Event
- HH:MM - Action taken
- HH:MM - Resolution

## Impact

- Users affected: [number]
- Data loss: [yes/no, amount]
- Downtime: [duration]

## Action Items

- [ ] Specific actionable item 1
- [ ] Specific actionable item 2

## Lessons Learned

Document specific improvements to systems, processes, or documentation based on the incident.
```
### 2. Follow-up Actions

- Update the runbook with lessons learned
- Implement any identified improvements
- Schedule a post-incident review meeting
- Update monitoring/alerting if gaps were found
## Contact Information
| Role | Contact | Escalation Path |
|---|---|---|
| Primary On-Call | @ops-primary-oncall | PagerDuty |
| Secondary On-Call | @ops-secondary-oncall | PagerDuty |
| Engineering Manager | bolt@buywhere.ai | Slack DM |
| Security Team | security@buywhere.ai | #security-incidents |
| Database Expert | dba@buywhere.ai | |
## Appendix

### A. Important Commands Reference
```bash
# Database
pg_isready -h db -p 5432 -U buywhere
# pg_replication_lag_seconds is a Prometheus metric (check Grafana), not a command
docker-compose -f docker-compose.prod.yml logs db
docker-compose -f docker-compose.prod.yml logs db_replica

# Backups
./scripts/backup.sh list hourly
./scripts/backup.sh restore <backup_file>
./scripts/verify_backup.sh verify all

# Containers
docker-compose -f docker-compose.prod.yml ps
docker-compose -f docker-compose.prod.yml restart <service>
docker stats

# Health Checks
curl -f http://localhost:8000/health
curl -f http://localhost:8080/health  # MCP
```
### B. Runbook Maintenance

- **Review Frequency:** Quarterly
- **Last Reviewed:** 2026-04-18
- **Next Review:** 2026-07-18
- **Version Control:** Git (this file is `docs/disaster_recovery_runbook.md`)
### C. Testing Schedule
| Test | Frequency | Method |
|---|---|---|
| Backup Restoration | Monthly | Restore to test environment |
| Failover Drill | Quarterly | Promote replica in staging |
| Full DR Exercise | Annually | Simulate complete DC loss |
End of Document