Staging Deployment Rollback Checklist
Issue: BUY-2474
Environment: Staging (Kubernetes)
Purpose: Ensure staging deployments can be rolled back quickly and safely
Last Updated: 2026-04-19
Status: ACTIVE
Pre-Rollback Assessment
[ ] 1. Verify Rollback Is Necessary
- Confirm deployment health check failure
- Check error rates exceed staging threshold (>10%)
- Verify latency P99 exceeds staging threshold (>2000ms)
- Confirm pods are not ready/crashing
- Document specific failure symptoms
[ ] 2. Gather Deployment Information
- Check current deployment status:
kubectl get pods -n staging -l app=buywhere-api - Get rollout history:
kubectl rollout history deployment/buywhere-api -n staging - Identify current revision and previous known-good revision
- Note deployment timestamp and image tag
[ ] 3. Verify Rollback State
- Confirm
.rollback_statefile exists and is readable - Check rollback state contains required fields:
- DEPLOYMENT_ID
- PREV_TASK (for ECS, if applicable)
- PREV_SHA (git SHA of previous version)
- BACKUP_FILE (if database backup was taken)
- Validate that PREV_SHA corresponds to a known-good version
[ ] 4. Pre-Rollback Backup (Optional but Recommended)
- Consider taking a backup before rolling back in case forward recovery is needed
- Run:
./scripts/pre-deploy-backup.sh backup(with different deployment ID) - Verify backup completes successfully
[ ] 5. Stakeholder Notification
- Notify team that rollback is being initiated
- Update status page or internal communications
- Ensure on-call engineer is aware
- Document rollback reason for post-mortem
Rollback Execution
[ ] 6. Kubernetes Rollback Procedure
- Execute rollback to previous revision:
kubectl rollout undo deployment/buywhere-api -n staging - Alternative: Rollback to specific revision if known:
kubectl rollout undo deployment/buywhere-api -n staging --to-revision=<revision-number> - Monitor rollout status:
kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
[ ] 7. ECS Rollback Procedure (If Applicable)
- Source rollback state:
source .rollback_state - Execute AWS ECS rollback:
aws ecs update-service \ --cluster "$AWS_CLUSTER" \ --service buywhere-api \ --task-definition "$PREV_TASK" \ --force-new-deployment - Wait for service stability:
aws ecs wait services-stable \ --cluster "$AWS_CLUSTER" \ --services buywhere-api
[ ] 8. Post-Rollback Validation
- Verify all pods are running:
kubectl get pods -n staging -l app=buywhere-api - Confirm rollout completed successfully
- Check that previous version is now deployed:
kubectl describe deployment buywhere-api -n staging | grep Image - Run immediate health check:
API_URL=https://api-staging.buywhere.io ./scripts/enhanced-health-check.sh - Run smoke tests against staging endpoint:
./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io
Post-Rollback Validation (0-30 minutes)
[ ] 9. Health and Performance Monitoring
- Verify API responds with healthy status
- Check error rates return to acceptable levels (<1%)
- Confirm latency metrics return to normal ranges
- Verify resource utilization stabilizes
- Monitor for any cascading issues from rollback
[ ] 10. Data Integrity Verification
- Confirm database connectivity is intact
- Verify no data corruption occurred during rollback
- Check that recent migrations (if any) are handled appropriately
- Verify read/write operations function correctly
[ ] 11. Log Analysis
- Check for error logs immediately after rollback
- Verify logging is functioning normally
- Look for any warning signs of incomplete rollback
- Confirm audit trails are intact
[ ] 12. Functional Testing
- Test core API endpoints:
/health,/ready,/metrics - Test key business functionality
- Verify any recently disabled features are properly off
- Check that dependent services (scrapers, MCP, etc.) are unaffected
Verification of Rollback Success
[ ] 13. Success Criteria Confirmation
- API health endpoint returns 200 OK
- Error rate < 1% for 5+ minutes
- Latency P99 < 2000ms for 5+ minutes
- All required pods are running (replica count matches)
- No crash loops or restart loops
- Manual testing confirms core functionality works
[ ] 14. Monitoring System Validation
- Verify deployment monitoring detects healthy state
- Confirm circuit breaker resets to closed state (if applicable)
- Check that alerts resolve appropriately
- Verify dashboards show normal operating metrics
[ ] 15. Notification and Documentation
- Send rollback completion notification via deployment-notify.sh
- Update incident/ticket with rollback details
- Document time-to-rollback and impact assessment
- Notify stakeholders of service restoration
Post-Rollback Activities
[ ] 16. Root Cause Analysis Preparation
- Preserve logs and metrics from failed deployment
- Save rollout history for analysis:
kubectl rollout history deployment/buywhere-api -n staging - Document exact failure symptoms and rollback timing
- Preserve .rollback_state file for investigation
[ ] 17. Follow-Up Actions
- Create ticket to investigate root cause of failure
- Schedule blameless post-mortem if significant impact
- Update runbooks or checklists based on lessons learned
- Consider if additional safeguards or monitoring are needed
[ ] 18. Re-deployment Planning
- Determine if and when to attempt re-deployment
- Identify any blockers that need resolution first
- Plan any necessary changes before re-deployment attempt
- Schedule re-deployment for appropriate time window
[ ] 19. Checklist Completion
- Mark all rollback checklist items as complete
- File any issues discovered during rollback process
- Update rollback procedures based on experience
- Sign off on rollback completion
Emergency Rollback Procedures
[ ] 20. Fast Rollback (When Time Is Critical)
- Use quick rollback script if available:
./scripts/rollback.sh quick staging - Or use kubectl rollout undo with minimal timeout
- Prioritize service restoration over detailed validation initially
- Follow up with comprehensive validation after service is stable
[ ] 21. Manual Intervention Procedures
- If automated rollback fails, prepare manual steps:
- Scale down new deployment to 0 replicas
- Scale up previous deployment (if still available)
- Update image tags manually if needed
- Verify traffic routing is correct
- Have kubectl patch/edit commands ready as last resort
[ ] 22. Communication Escalation
- If rollback fails, immediately escalate to on-call lead
- Prepare war room for extended troubleshooting
- Have executive/stakeholder communication ready
- Consider customer-facing notification if user impact
Sign-Off
Rollback Successful: ☐ Yes ☐ No
Engineer: _________________________
Timestamp: _______________________
Time to Rollback: ________________
Post-Rollback Validation Period: ______ minutes
Notes: ________________________________________________________
Issues Encountered: ___________________________________________