Staging Deployment Hardening and Rollback Checklist
Issue: BUY-2474
Environment: Staging (Kubernetes)
Purpose: Ensure staging deployments are hardened with proper rollback procedures
Last Updated: 2026-04-19
Status: ACTIVE
Pre-Deployment Requirements
[ ] 1. Pre-Deployment Backup
- Run pre-deployment backup script:
./scripts/pre-deploy-backup.sh backup - Verify backup was created successfully
- Confirm backup file exists in
./backups/directory - Optional: Verify backup integrity with
./scripts/pre-deploy-backup.sh verify
[ ] 2. Rollback State Capture
- Confirm
.rollback_statefile is created/updated during backup process - Verify rollback state contains:
- DEPLOYMENT_ID
- DEPLOY_TIME
- ENVIRONMENT=staging
- AWS_CLUSTER (if applicable)
- PREV_TASK (ECS task definition, if applicable)
- PREV_SHA (git SHA)
- BACKUP_FILE path
[ ] 3. Health Check Baseline
- Run enhanced health check:
./scripts/enhanced-health-check.sh - Run smoke tests:
./scripts/pre-deploy-smoke-test.sh - All checks must pass before proceeding
- Document current health status
[ ] 4. Staging-Specific Verification
- Confirm kubectl context is set to staging:
kubectl config current-context - Verify staging namespace exists:
kubectl get namespace staging - Check current deployment revision:
kubectl rollout status deployment/buywhere-api -n staging - Verify resource requests/limits are appropriate for staging
[ ] 5. Notification Preparation
- Ensure deployment-notify.sh is configured for staging
- Verify Slack/webhook notifications are enabled for staging deployments
- Notify team of impending deployment (if required by policy)
Deployment Process
[ ] 6. Deployment Execution
- For automated deployments: Verify GitHub Action triggered on push to main/master
- For manual deployments: Follow manual deployment steps from runbook
- Monitor deployment progress:
kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
[ ] 7. Deployment Validation
- Verify all pods are running:
kubectl get pods -n staging -l app=buywhere-api - Check rollout history:
kubectl rollout history deployment/buywhere-api -n staging - Confirm new image is deployed:
kubectl describe deployment buywhere-api -n staging | grep Image
[ ] 8. Immediate Post-Deployment Health Check
- Run enhanced health check against staging endpoint:
API_URL=https://api-staging.buywhere.io ./scripts/enhanced-health-check.sh - Run smoke tests against staging:
./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io - Verify response times are within acceptable thresholds
- Check for any error logs in the first 5 minutes post-deployment
Post-Deployment Validation (0-30 minutes)
[ ] 9. Monitoring Validation
- Verify deployment monitoring is active:
./scripts/deployment-monitor.sh status - Check circuit breaker is closed:
./scripts/deployment-monitor.sh circuit-breaker - Confirm Grafana dashboards are receiving data from staging
- Verify Loki is collecting logs from staging pods
[ ] 10. API Functionality Testing
- Test core API endpoints:
/health,/ready,/metrics - Test key business endpoints (products, search, etc.)
- Verify database connectivity through API
- Verify Redis connectivity through API
- Test any recently changed features
[ ] 11. Performance Baseline
- Check API response times (p50, p95, p99) are within staging norms
- Verify error rate is near 0%
- Check resource utilization (CPU, memory) is within expected ranges
- Confirm no regression in key performance indicators
[ ] 12. Log Validation
- Check for new error patterns in staging logs
- Verify log rotation is functioning
- Confirm structured logging is intact
- Check for any deprecated API usage warnings
Rollback Procedures Verification
[ ] 13. Rollback Documentation Review
- Review rollback procedures in deployment runbook
- Confirm team members know how to execute rollback
- Verify rollback runbooks are up-to-date
[ ] 14. Rollback Testing (Optional but Recommended)
- In a controlled window, test rollback to previous version:
- Note current deployment revision
- Rollback to previous revision:
kubectl rollout undo deployment/buywhere-api -n staging - Wait for rollback completion
- Verify system is functioning on previous version
- Roll forward again to test version:
kubectl rollout undo deployment/buywhere-api -n staging(to re-apply)
- Document rollback test results
[ ] 15. Automatic Rollback Configuration
- Verify GitHub Actions automatic rollback triggers are configured:
- Health check failure after deployment
- Canary analysis error rate > 10% (staging threshold)
- Latency P99 > 2000ms (staging threshold)
- Check deployment monitor circuit breaker settings for staging
[ ] 16. Database Rollback Verification
- Verify pre-deployment backup can be restored (test in isolation if possible)
- Confirm database migration history is intact
- Verify no pending migrations that would complicate rollback
- Document procedure for database rollback if needed
Post-Deployment (30+ minutes)
[ ] 17. Extended Monitoring
- Continue monitoring for 30+ minutes post-deployment
- Watch for delayed issues or performance degradation
- Verify monitoring alerts are functioning correctly
- Check for any cascading failures
[ ] 18. Deployment Completion Notification
- Send deployment completion notification via deployment-notify.sh
- Update any deployment tracking systems
- Document any observations or issues encountered
- Hand off to on-call engineer if applicable
[ ] 19. Cleanup
- Verify temporary resources are cleaned up
- Check that no unnecessary pods or services are running
- Confirm logging levels are appropriate
- Ensure debugging features are disabled
[ ] 20. Checklist Completion
- Mark all checklist items as complete
- File any issues or blockers discovered during deployment
- Update checklist based on lessons learned
- Sign off on deployment readiness
Emergency Procedures
[ ] 21. Emergency Contact Verification
- Verify on-call engineer is notified and available
- Confirm PagerDuty/escalation paths are functional
- Have emergency contacts readily available
[ ] 22. Emergency Rollback Access
- Ensure all team members have necessary access for emergency rollback
- Verify kubectl access to staging cluster
- Confirm AWS access (if ECS fallback needed) is available
- Document emergency procedures in easily accessible location
[ ] 23. Communication Plan
- Verify stakeholder notification procedures
- Confirm status page update process
- Prepare communication templates for incidents
- Establish war room procedures if needed
Sign-Off
Deployment Ready: ☐ Yes ☐ No
Engineer: _________________________
Timestamp: _______________________
Notes: ________________________________________________________
Issues/Blockers: ________________________________________________