Staging Deployment Rollback Checklist

Issue: BUY-2474
Environment: Staging (Kubernetes)
Purpose: Ensure staging deployments can be rolled back quickly and safely
Last Updated: 2026-04-19
Status: ACTIVE

Pre-Rollback Assessment

[ ] 1. Verify Rollback Is Necessary

Confirm that the deployment health check is failing
Check whether the error rate exceeds the staging threshold (>10%)
Verify whether P99 latency exceeds the staging threshold (>2000ms)
Confirm whether pods are not ready or are crash-looping
Document specific failure symptoms
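
The two numeric thresholds above can be expressed as a small shell check. This is a sketch under stated assumptions, not part of the existing scripts: the function name needs_rollback is hypothetical, the error rate and P99 latency are assumed to have been read from your monitoring system already, and breaching either threshold is assumed to be enough to justify a rollback.

```shell
# Hypothetical helper: decide whether metrics breach the staging thresholds
# listed above (error rate > 10%, P99 latency > 2000 ms).
# Assumption: breaching EITHER threshold counts as a rollback signal.
needs_rollback() {
  # $1 = error rate in percent (e.g. 12.5), $2 = P99 latency in ms (e.g. 2300)
  awk -v err="$1" -v p99="$2" 'BEGIN {
    # awk handles the decimal comparison that plain [ ] cannot
    exit !((err + 0) > 10 || (p99 + 0) > 2000)
  }'
}
```

For example, needs_rollback 12.5 800 succeeds (error rate too high), while needs_rollback 0.4 350 fails, which suggests the deployment should be investigated rather than rolled back.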

[ ] 2. Gather Deployment Information

Check current deployment status: kubectl get pods -n staging -l app=buywhere-api
Get rollout history: kubectl rollout history deployment/buywhere-api -n staging
Identify current revision and previous known-good revision
Note deployment timestamp and image tag
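
To find the revision that kubectl rollout undo would target by default, the history output can be parsed. A minimal sketch: previous_revision is a hypothetical name, and it assumes the usual tabular output in which each data row starts with a revision number and rows are listed oldest first. The revision it prints is only the previous one; whether it is known-good still has to be confirmed by hand.

```shell
# Hypothetical helper: read `kubectl rollout history` output on stdin and
# print the revision just before the newest one.
previous_revision() {
  awk '
    $1 ~ /^[0-9]+$/ { prev = cur; cur = $1 }   # data rows start with a revision number
    END { if (prev == "") exit 1; print prev } # fail if fewer than two revisions exist
  '
}

# Example (requires cluster access):
#   kubectl rollout history deployment/buywhere-api -n staging | previous_revision
```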

[ ] 3. Verify Rollback State

Confirm .rollback_state file exists and is readable
Check rollback state contains required fields:
- DEPLOYMENT_ID
- PREV_TASK (for ECS, if applicable)
- PREV_SHA (git SHA of previous version)
- BACKUP_FILE (if database backup was taken)
Validate that PREV_SHA corresponds to a known-good version
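
The field check above can be automated. The sketch below assumes the state file uses shell assignment syntax (KEY=value), which seems likely because the ECS step sources it; validate_rollback_state is a hypothetical name. It only checks that each field is present, not that PREV_SHA is actually known-good.

```shell
# Hypothetical helper: check that a rollback state file is readable and has a
# KEY=value line (with something after the equals sign) for each required key.
# Assumption: the file uses shell assignment syntax, since the ECS step sources it.
validate_rollback_state() {
  state_file="${1:-.rollback_state}"
  if [ "$#" -gt 0 ]; then shift; fi
  [ -r "$state_file" ] || { echo "cannot read $state_file" >&2; return 1; }

  missing=0
  # DEPLOYMENT_ID and PREV_SHA are always checked; pass PREV_TASK or
  # BACKUP_FILE as extra arguments when they apply to this deployment.
  for key in DEPLOYMENT_ID PREV_SHA "$@"; do
    if ! grep -Eq "^(export[[:space:]]+)?${key}=.+" "$state_file"; then
      echo "missing or empty field: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: validate_rollback_state .rollback_state for a Kubernetes-only rollback, or validate_rollback_state .rollback_state PREV_TASK BACKUP_FILE when the ECS task and a database backup are also expected.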

[ ] 4. Pre-Rollback Backup (Optional but Recommended)

Consider taking a backup before rolling back in case forward recovery is needed
Run: ./scripts/pre-deploy-backup.sh backup (use a deployment ID different from the original deployment's, so the two backups stay distinct)
Verify backup completes successfully

[ ] 5. Stakeholder Notification

Notify team that rollback is being initiated
Update status page or internal communications
Ensure on-call engineer is aware
Document rollback reason for post-mortem

Rollback Execution

[ ] 6. Kubernetes Rollback Procedure

Execute rollback to previous revision:

kubectl rollout undo deployment/buywhere-api -n staging

Alternative: roll back to a specific revision, if known:

kubectl rollout undo deployment/buywhere-api -n staging --to-revision=<revision-number>

Monitor rollout status:

kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
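
The three commands above can be combined so the rollout status check always follows the undo. This is a sketch only; rollback_staging is a hypothetical wrapper, not an existing script in this repository.

```shell
# Hypothetical wrapper around the three kubectl commands above.
# $1 (optional) = revision number to roll back to; omit it to use the previous revision.
rollback_staging() {
  target_revision="${1:-}"
  if [ -n "$target_revision" ]; then
    kubectl rollout undo deployment/buywhere-api -n staging \
      --to-revision="$target_revision" || return 1
  else
    kubectl rollout undo deployment/buywhere-api -n staging || return 1
  fi
  # Blocks until the rollout finishes; exits non-zero if 300s pass first.
  kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
}
```

Stopping at the first failed command means a rejected undo is reported immediately instead of being hidden behind a status check on the unchanged deployment.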

[ ] 7. ECS Rollback Procedure (If Applicable)

Source rollback state: source .rollback_state

Execute AWS ECS rollback:

aws ecs update-service \
  --cluster "$AWS_CLUSTER" \
  --service buywhere-api \
  --task-definition "$PREV_TASK" \
  --force-new-deployment

Wait for service stability:

aws ecs wait services-stable \
  --cluster "$AWS_CLUSTER" \
  --services buywhere-api
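
Because the ECS commands depend on shell variables, it may be worth failing early when one is empty. A minimal sketch with a hypothetical require_vars helper follows. Note that AWS_CLUSTER is not among the .rollback_state fields listed in step 3, so it is assumed here to come from the environment.

```shell
# Hypothetical guard: fail early if any named variable is unset or empty,
# so `aws ecs update-service` is never called with a blank cluster or task.
require_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "required variable $name is not set" >&2
      return 1
    fi
  done
}
```

One way to use it is to run require_vars AWS_CLUSTER PREV_TASK right after sourcing .rollback_state and before the aws ecs update-service call.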

[ ] 8. Immediate Post-Rollback Checks

Verify all pods are running: kubectl get pods -n staging -l app=buywhere-api
Confirm rollout completed successfully

Check that previous version is now deployed:

kubectl describe deployment buywhere-api -n staging | grep Image

Run immediate health check:

API_URL=https://api-staging.buywhere.io ./scripts/enhanced-health-check.sh

Run smoke tests against staging endpoint:

./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io
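
Reading the pod list by eye is error-prone, so the "all pods are running" check can also be scripted against the deployment's replica counts. The function names below are hypothetical, and the sketch treats a deployment with zero desired replicas as not ready.

```shell
# Hypothetical helper: succeed only when "ready/desired" shows every desired
# replica ready, e.g. "3/3". An empty ready count (as in "/3") is treated as 0,
# because Kubernetes omits readyReplicas when no pod is ready.
replicas_match() {
  ready="${1%%/*}"
  desired="${1##*/}"
  [ -n "$desired" ] && [ "${ready:-0}" -eq "$desired" ] && [ "$desired" -gt 0 ]
}

# Reads the counts from the live deployment (requires cluster access).
staging_replicas_ready() {
  replicas_match "$(kubectl get deployment buywhere-api -n staging \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}')"
}
```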

Post-Rollback Validation (0-30 minutes)

[ ] 9. Health and Performance Monitoring

Verify API responds with healthy status
Check that error rates have returned to acceptable levels (<1%)
Confirm that latency has returned to its normal range (P99 < 2000ms, per the success criteria)
Verify resource utilization stabilizes
Monitor for any cascading issues from rollback

[ ] 10. Data Integrity Verification

Confirm database connectivity is intact
Verify no data corruption occurred during rollback
Check that any database migrations applied by the failed deployment are compatible with the rolled-back version, or have been reverted
Verify read/write operations function correctly

[ ] 11. Log Analysis

Check for error logs immediately after rollback
Verify logging is functioning normally
Look for any warning signs of incomplete rollback
Confirm audit trails are intact

[ ] 12. Functional Testing

Test core API endpoints: /health, /ready, /metrics
Test key business functionality
Verify that any recently disabled features remain off
Check that dependent services (scrapers, MCP, etc.) are unaffected
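
The core endpoint test can be sketched with curl. Assumptions to check before relying on it: each of the three endpoints answers HTTP 200 when healthy and needs no authentication, and check_core_endpoints is a hypothetical name rather than an existing script.

```shell
# Hypothetical helper: request each core endpoint and report any that does
# not answer with HTTP 200. $1 = base URL, e.g. https://api-staging.buywhere.io
check_core_endpoints() {
  base_url="${1%/}"
  failed=0
  for path in /health /ready /metrics; do
    # -s: quiet, -o /dev/null: discard body, -w: print only the status code
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base_url$path")
    if [ "$status" != "200" ]; then
      echo "FAIL $path (HTTP ${status:-none})" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

The loop keeps going after a failure so that a single run lists every broken endpoint.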

Verification of Rollback Success

[ ] 13. Success Criteria Confirmation

API health endpoint returns 200 OK
Error rate < 1% for 5+ minutes
Latency P99 < 2000ms for 5+ minutes
All required pods are running (replica count matches)
No crash loops or restart loops
Manual testing confirms core functionality works
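
The "for 5+ minutes" criteria mean every reading in the window must pass, not just the latest one. A small sketch, assuming one sample per minute collected from your monitoring system; all_below is a hypothetical helper.

```shell
# Hypothetical helper: succeed only if every sample is strictly below a limit.
# $1 = limit, remaining arguments = samples (e.g. one reading per minute).
all_below() {
  limit="$1"
  shift
  [ "$#" -gt 0 ] || return 1          # no samples means nothing was verified
  for sample in "$@"; do
    awk -v s="$sample" -v l="$limit" 'BEGIN { exit !((s + 0) < (l + 0)) }' || return 1
  done
}
```

For example, all_below 1 0.4 0.3 0.2 0.5 0.3 checks five error-rate samples against the 1% limit, and all_below 2000 850 910 1200 990 1010 does the same for P99 latency in milliseconds.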

[ ] 14. Monitoring System Validation

Verify deployment monitoring detects healthy state
Confirm circuit breaker resets to closed state (if applicable)
Check that alerts resolve appropriately
Verify dashboards show normal operating metrics

[ ] 15. Notification and Documentation

Send rollback completion notification via deployment-notify.sh
Update incident/ticket with rollback details
Document time-to-rollback and impact assessment
Notify stakeholders of service restoration

Post-Rollback Activities

[ ] 16. Root Cause Analysis Preparation

Preserve logs and metrics from failed deployment
Save rollout history for analysis: kubectl rollout history deployment/buywhere-api -n staging
Document exact failure symptoms and rollback timing
Preserve .rollback_state file for investigation
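
Evidence is easier to keep if it is collected in one place straight after the rollback. The sketch below is illustrative: the function name and the ./rollback-artifacts default are assumptions, and capturing namespace events is an addition that is not listed in the steps above.

```shell
# Hypothetical helper: copy rollback evidence into a timestamped directory.
# $1 (optional) = parent directory; defaults to ./rollback-artifacts (an assumed location).
preserve_rollback_artifacts() {
  dest="${1:-./rollback-artifacts}/$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dest" || return 1

  # Rollout history as it looks right after the rollback
  kubectl rollout history deployment/buywhere-api -n staging > "$dest/rollout-history.txt"

  # Recent events often explain why pods failed to start
  kubectl get events -n staging --sort-by=.lastTimestamp > "$dest/events.txt"

  # Keep a copy of the rollback state file if one exists in the current directory
  if [ -f .rollback_state ]; then
    cp .rollback_state "$dest/rollback_state"
  fi

  # Print the directory so it can be attached to the incident ticket
  echo "$dest"
}
```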

[ ] 17. Follow-Up Actions

Create ticket to investigate root cause of failure
Schedule a blameless post-mortem if the impact was significant
Update runbooks or checklists based on lessons learned
Consider whether additional safeguards or monitoring are needed

[ ] 18. Re-deployment Planning

Determine if and when to attempt re-deployment
Identify any blockers that need resolution first
Plan any necessary changes before re-deployment attempt
Schedule re-deployment for appropriate time window

[ ] 19. Checklist Completion

Mark all rollback checklist items as complete
File any issues discovered during rollback process
Update rollback procedures based on experience
Sign off on rollback completion

Emergency Rollback Procedures

[ ] 20. Fast Rollback (When Time Is Critical)

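Use the quick rollback script if available: ./scripts/rollback.sh quick staging
Or run kubectl rollout undo directly and watch it with a short kubectl rollout status --timeout (the --timeout flag belongs to rollout status, not rollout undo)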
Prioritize service restoration over detailed validation initially
Follow up with comprehensive validation after service is stable

[ ] 21. Manual Intervention Procedures

If automated rollback fails, prepare manual steps:
1. Scale down new deployment to 0 replicas
2. Scale up previous deployment (if still available)
3. Update image tags manually if needed
4. Verify traffic routing is correct
Have kubectl patch/edit commands ready as last resort
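
For step 3 above, one last-resort option is to set the image directly on the deployment. This is a sketch, not an established procedure here: force_image is a hypothetical name, and the "*=" form overwrites the image of every container in the pod template, so it assumes the pod runs a single application container with no sidecars.

```shell
# Hypothetical last-resort helper: point every container in the deployment
# at a known-good image, then wait for the rollout.
# $1 = full image reference (registry/repository:tag)
force_image() {
  image="$1"
  [ -n "$image" ] || { echo "usage: force_image <image>" >&2; return 1; }
  # "*=" applies the image to all containers in the pod template.
  kubectl set image deployment/buywhere-api "*=$image" -n staging || return 1
  kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
}
```

If the pod template does have more than one container, replace "*" with the name of the application container.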

[ ] 22. Communication Escalation

If rollback fails, immediately escalate to on-call lead
Prepare war room for extended troubleshooting
Have executive/stakeholder communication ready
Consider a customer-facing notification if users are impacted

Sign-Off

Rollback Successful: ☐ Yes ☐ No
Engineer: _________________________
Timestamp: _______________________
Time to Rollback: ________________
Post-Rollback Validation Period: ______ minutes
Notes: ________________________________________________________
Issues Encountered: ___________________________________________