Database Backup Restore Runbook
Overview
This document describes procedures for verifying and restoring PostgreSQL database backups. Automated backup verification runs weekly via the backup-verify-cron service.
Backup Architecture
/var/backups/buywhere/
├── hourly/ # Last 24 hourly backups
├── daily/ # Last 7 daily backups
├── weekly/ # Last 4 weekly backups
└── wal/ # Write-Ahead Log for point-in-time recovery
Backup Schedule
| Type | Frequency | Retention | Container |
|---|---|---|---|
| Hourly | Every hour | 24 backups | backup-cron |
| Daily | Every day at midnight | 7 backups | backup-cron |
| Weekly | Every Sunday | 4 backups | backup-cron |
| Verification | Weekly | N/A | backup-verify-cron |
Verification
Automated Verification
The backup-verify-cron service runs weekly verification that:
- Checks backup file integrity (gzip test + pg_restore list)
- Verifies backup age is within acceptable threshold (168 hours / 7 days)
- Logs results to
/var/log/backup_verify.log
Manual Verification
# Verify all backup types
docker-compose -f docker-compose.prod.yml exec backup-verify-cron \
sh -c "/app/scripts/verify_backup.sh verify all"
# Verify specific backup type
docker-compose -f docker-compose.prod.yml exec backup-verify-cron \
sh -c "/app/scripts/verify_backup.sh verify daily"
# Check a specific backup file
docker-compose -f docker-compose.prod.yml exec backup-verify-cron \
sh -c "/app/scripts/verify_backup.sh check-integrity /var/backups/buywhere/daily/catalog_daily_20240418_000000.backup.gz"
View Verification Logs
# View recent verification logs
docker-compose -f docker-compose.prod.yml exec backup-verify-cron \
cat /var/log/backup_verify.log
# Follow logs in real-time
docker-compose -f docker-compose.prod.yml exec -f backup-verify-cron \
tail -f /var/log/backup_verify.log
Restore Procedure
Pre-Restore Checklist
- Notify team - Database restore causes downtime
- Verify backup integrity - Run verification before restore
- Check disk space - Ensure target has sufficient space
- Stop API services - Prevent writes during restore
Step 1: Stop API Services
# Scale down API to single instance
docker-compose -f docker-compose.prod.yml scale api=1
# Stop scraper scheduler to prevent writes
docker-compose -f docker-compose.prod.yml stop scraper-scheduler
Step 2: Identify Backup
List available backups:
# List all backups by type
docker-compose -f docker-compose.prod.yml exec backup-cron \
sh -c "/app/scripts/backup.sh list hourly"
docker-compose -f docker-compose.prod.yml exec backup-cron \
sh -c "/app/scripts/backup.sh list daily"
docker-compose -f docker-compose.prod.yml exec backup-cron \
sh -c "/app/scripts/backup.sh list weekly"
# Or directly
ls -lh /var/backups/buywhere/hourly/
ls -lh /var/backups/buywhere/daily/
ls -lh /var/backups/buywhere/weekly/
Step 3: Verify Backup
Before restoring, always verify backup integrity:
docker-compose -f docker-compose.prod.yml exec backup-cron \
sh -c "/app/scripts/backup.sh verify /var/backups/buywhere/daily/catalog_daily_YYYYMMDD_HHMMSS.backup.gz"
Step 4: Perform Restore
# Full restore from backup
docker-compose -f docker-compose.prod.yml exec backup-cron \
sh -c "/app/scripts/backup.sh restore /var/backups/buywhere/daily/catalog_daily_YYYYMMDD_HHMMSS.backup.gz catalog"
The restore script will:
- Create a temporary database
- Restore to the temporary database
- Verify the restore was successful
- Provide instructions for switching databases
Step 5: Complete Database Switch
After successful restore to temp database:
# Terminate existing connections to target database
PGPASSWORD=<password> psql -h db -p 5432 -U buywhere -d catalog -c \
"SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'catalog' AND pid <> pg_backend_pid();"
# Drop and rename
PGPASSWORD=<password> psql -h db -p 5432 -U buywhere -c "DROP DATABASE catalog;"
PGPASSWORD=<password> psql -h db -p 5432 -U buywhere -c "ALTER DATABASE catalog_restore_<pid> RENAME TO catalog;"
Step 6: Restart Services
# Restart API services
docker-compose -f docker-compose.prod.yml up -d api
# Restart scraper scheduler
docker-compose -f docker-compose.prod.yml up -d scraper-scheduler
# Scale API back to normal
docker-compose -f docker-compose.prod.yml scale api=2
Step 7: Verify Restore
# Check API health
curl https://api.buywhere.ai/health
# Check database connection
docker-compose -f docker-compose.prod.yml exec api \
psql -U buywhere -d catalog -c "SELECT COUNT(*) FROM products LIMIT 1;"
# Verify replication is working
docker-compose -f docker-compose.prod.yml exec db_replica \
psql -U buywhere -d catalog -c "SELECT pg_is_in_recovery();"
Point-in-Time Recovery (PITR)
For WAL-based point-in-time recovery:
- Identify the target recovery time
- Find the nearest base backup before target time
- Create recovery.conf with target time
- Start PostgreSQL in recovery mode
# Example: Recover to specific timestamp
docker-compose -f docker-compose.prod.yml exec db \
psql -U buywhere -d catalog -c "SELECT pg_create_restore_point('before_oops');"
# Then restore using WAL files up to that point
Alert Thresholds
| Alert | Condition | Severity |
|---|---|---|
| BackupVerificationFailed | pg_restore or gzip test fails | Critical |
| BackupMissing | No backups found for > 1 hour | Critical |
| BackupTooOld | Latest backup > 7 days old | Warning |
Common Issues
"Backup file is corrupted"
# Check if file is valid gzip
gzip -t /var/backups/buywhere/daily/catalog_daily_YYYYMMDD_HHMMSS.backup.gz
# Check if pg_restore can list contents
pg_restore --quiet --list /var/backups/buywhere/daily/catalog_daily_YYYYMMDD_HHMMSS.backup.gz
"Restore fails with permission error"
Ensure PostgreSQL user has permissions to create/drop databases.
"Disk space exhausted during restore"
Check available disk space:
df -h /var/backups/buywhere
Emergency Contacts
- Primary on-call: See PagerDuty schedule
- DBA team: dba@buywhere.ai
- Infrastructure: infra@buywhere.ai