Staging Deployment Rollback Checklist

Issue: BUY-2474
Environment: Staging (Kubernetes)
Purpose: Ensure staging deployments can be rolled back quickly and safely
Last Updated: 2026-04-19
Status: ACTIVE


Pre-Rollback Assessment

[ ] 1. Verify Rollback Is Necessary

  • Confirm the deployment health check is failing
  • Check whether the error rate exceeds the staging threshold (>10%)
  • Check whether P99 latency exceeds the staging threshold (>2000ms)
  • Confirm whether pods are failing readiness checks or crash-looping
  • Document specific failure symptoms
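
The numeric thresholds above can be expressed as a small gate. This is a minimal sketch, not part of the existing tooling: should_rollback is a hypothetical helper, and fetching the two metrics is left to whatever monitoring system staging uses. It covers only the error-rate and latency checks, not pod readiness.

```shell
# should_rollback (hypothetical): exit status 0 means "thresholds exceeded".
# Thresholds are the ones in this checklist: error rate > 10% or P99 > 2000 ms.
should_rollback() {
    error_rate_pct=$1   # e.g. 12.5
    p99_ms=$2           # e.g. 1800
    # awk does the decimal comparison; a true condition maps to exit status 0.
    awk -v e="$error_rate_pct" -v p="$p99_ms" \
        'BEGIN { exit !((e + 0 > 10) || (p + 0 > 2000)) }'
}

# Example: should_rollback 12.5 1800 && echo "rollback justified"
```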

[ ] 2. Gather Deployment Information

  • Check current deployment status: kubectl get pods -n staging -l app=buywhere-api
  • Get rollout history: kubectl rollout history deployment/buywhere-api -n staging
  • Identify current revision and previous known-good revision
  • Note deployment timestamp and image tag
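
To pick a candidate for --to-revision, the rollout history table can be parsed. A minimal sketch, assuming the usual layout with the revision number in the first column; previous_revision is a hypothetical helper. It only finds the revision before the current one, so whether that revision is actually known-good still has to be confirmed by a person.

```shell
# previous_revision (hypothetical): reads `kubectl rollout history` output on
# stdin and prints the second-highest revision number.
# Fails (non-zero exit, no output) if fewer than two revisions are listed.
previous_revision() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' \
        | sort -n \
        | awk '{ prev = last; last = $1; n++ }
               END { if (n < 2) exit 1; print prev }'
}

# Example:
#   kubectl rollout history deployment/buywhere-api -n staging | previous_revision
```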

[ ] 3. Verify Rollback State

  • Confirm .rollback_state file exists and is readable
  • Check rollback state contains required fields:
    • DEPLOYMENT_ID
    • PREV_TASK (for ECS, if applicable)
    • PREV_SHA (git SHA of previous version)
    • BACKUP_FILE (if database backup was taken)
  • Validate that PREV_SHA corresponds to a known-good version
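
The field check can be scripted. This sketch assumes .rollback_state contains plain KEY=value lines, which is consistent with it being sourced in step 7 but is not confirmed here. check_rollback_state is a hypothetical helper, and treating PREV_TASK and BACKUP_FILE as optional is an interpretation of the "if applicable" notes above. It does not verify that PREV_SHA is known-good.

```shell
# check_rollback_state (hypothetical): verifies the file is readable and that
# DEPLOYMENT_ID and PREV_SHA have something after the "=" sign.
check_rollback_state() {
    state_file=${1:-.rollback_state}
    if [ ! -r "$state_file" ]; then
        echo "missing or unreadable: $state_file" >&2
        return 1
    fi
    rc=0
    for key in DEPLOYMENT_ID PREV_SHA; do
        if ! grep -Eq "^(export +)?${key}=.+" "$state_file"; then
            echo "missing required field: $key" >&2
            rc=1
        fi
    done
    # Conditional fields: report them, but do not fail the check.
    for key in PREV_TASK BACKUP_FILE; do
        grep -Eq "^(export +)?${key}=.+" "$state_file" \
            || echo "note: $key not set" >&2
    done
    return "$rc"
}

# Example: check_rollback_state .rollback_state || echo "fix state before rollback"
```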

[ ] 4. Pre-Rollback Backup (Optional but Recommended)

  • Consider taking a backup before rolling back in case forward recovery is needed
  • Run: ./scripts/pre-deploy-backup.sh backup (use a deployment ID different from the original deployment's)
  • Verify backup completes successfully

[ ] 5. Stakeholder Notification

  • Notify team that rollback is being initiated
  • Update status page or internal communications
  • Ensure on-call engineer is aware
  • Document rollback reason for post-mortem

Rollback Execution

[ ] 6. Kubernetes Rollback Procedure

  • Execute rollback to previous revision:
    kubectl rollout undo deployment/buywhere-api -n staging
    
  • Alternative: roll back to a specific revision, if known:
    kubectl rollout undo deployment/buywhere-api -n staging --to-revision=<revision-number>
    
  • Monitor rollout status:
    kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
    

[ ] 7. ECS Rollback Procedure (If Applicable)

  • Source rollback state: source .rollback_state
  • Execute AWS ECS rollback:
    aws ecs update-service \
      --cluster "$AWS_CLUSTER" \
      --service buywhere-api \
      --task-definition "$PREV_TASK" \
      --force-new-deployment
    
  • Wait for service stability:
    aws ecs wait services-stable \
      --cluster "$AWS_CLUSTER" \
      --services buywhere-api
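
The ECS commands above expand "$AWS_CLUSTER" and "$PREV_TASK"; if either is empty, update-service is called with a blank argument. A guard like the following can run first. It is a minimal sketch: require_vars is a hypothetical helper, and since AWS_CLUSTER is not among the fields listed in step 3, it is assumed to come from the environment.

```shell
# require_vars (hypothetical): fails if any named variable is unset or empty.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name (POSIX sh).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "required variable not set: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   . ./.rollback_state
#   require_vars AWS_CLUSTER PREV_TASK || exit 1
```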
    

[ ] 8. Immediate Post-Rollback Checks

  • Verify all pods are running: kubectl get pods -n staging -l app=buywhere-api
  • Confirm rollout completed successfully
  • Check that previous version is now deployed:
    kubectl describe deployment buywhere-api -n staging | grep Image
    
  • Run immediate health check:
    API_URL=https://api-staging.buywhere.io ./scripts/enhanced-health-check.sh
    
  • Run smoke tests against staging endpoint:
    ./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io
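
Reading the grep Image output by eye is error-prone, so the tag can be extracted and compared. This is a sketch under two assumptions: images are referenced as name:tag (not by digest), and the tag equals PREV_SHA, which this checklist does not state. deployed_image_tags is a hypothetical helper.

```shell
# deployed_image_tags (hypothetical): reads `kubectl describe deployment`
# output on stdin and prints the text after the last ":" of each Image line.
deployed_image_tags() {
    awk '$1 == "Image:" {
        n = split($2, parts, ":")
        if (n > 1) print parts[n]
    }'
}

# Example:
#   kubectl describe deployment buywhere-api -n staging \
#       | deployed_image_tags | grep -qx "$PREV_SHA" && echo "previous version deployed"
```

The same information is available without text parsing via kubectl get deployment buywhere-api -n staging -o jsonpath='{.spec.template.spec.containers[*].image}'.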
    

Post-Rollback Validation (0-30 minutes)

[ ] 9. Health and Performance Monitoring

  • Verify API responds with healthy status
  • Check error rates return to acceptable levels (<1%)
  • Confirm latency metrics return to normal ranges
  • Verify resource utilization stabilizes
  • Monitor for any cascading issues from rollback

[ ] 10. Data Integrity Verification

  • Confirm database connectivity is intact
  • Verify no data corruption occurred during rollback
  • Check that any migrations applied by the failed deployment are compatible with the rolled-back version, or have been reverted
  • Verify read/write operations function correctly

[ ] 11. Log Analysis

  • Check for error logs immediately after rollback
  • Verify logging is functioning normally
  • Look for any warning signs of incomplete rollback
  • Confirm audit trails are intact

[ ] 12. Functional Testing

  • Test core API endpoints: /health, /ready, /metrics
  • Test key business functionality
  • Verify that any recently disabled features remain off
  • Check that dependent services (scrapers, MCP, etc.) are unaffected
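
The core endpoint check can be looped with curl. A minimal sketch: check_endpoints is a hypothetical helper, and expecting HTTP 200 from /ready and /metrics is an assumption (this checklist only states 200 for the health endpoint).

```shell
# check_endpoints (hypothetical): requests /health, /ready and /metrics and
# fails if any of them returns something other than HTTP 200.
check_endpoints() {
    base_url=$1
    failed=0
    for path in /health /ready /metrics; do
        # -s: no progress output, -o /dev/null: drop the body,
        # -w '%{http_code}': print only the status code (000 on connection failure).
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "${base_url}${path}")
        if [ "$code" = "200" ]; then
            echo "ok   $path"
        else
            echo "FAIL $path (HTTP $code)"
            failed=1
        fi
    done
    return "$failed"
}

# Example: check_endpoints https://api-staging.buywhere.io
```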

Verification of Rollback Success

[ ] 13. Success Criteria Confirmation

  • API health endpoint returns 200 OK
  • Error rate < 1% for 5+ minutes
  • Latency P99 < 2000ms for 5+ minutes
  • All required pods are running and ready (ready replica count matches the desired count)
  • No crash loops or restart loops
  • Manual testing confirms core functionality works
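
The "for 5+ minutes" criteria mean every sample in the window must pass, not just the latest one. A sketch of that check, assuming one metrics sample per minute written as "<error_rate_pct> <p99_ms>" per line; criteria_met and the input format are hypothetical.

```shell
# criteria_met (hypothetical): reads samples on stdin and succeeds only if
# there are at least min_samples of them and every sample has
# error rate < 1% and P99 latency < 2000 ms.
criteria_met() {
    min_samples=${1:-5}
    awk -v need="$min_samples" '
        NF >= 2 { n++; if ($1 + 0 >= 1 || $2 + 0 >= 2000) bad++ }
        END     { exit !(n >= need && bad == 0) }
    '
}

# Example (five one-minute samples):
#   printf "0.2 850\n0.1 900\n0.4 1200\n0.0 700\n0.3 950\n" | criteria_met 5
```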

[ ] 14. Monitoring System Validation

  • Verify deployment monitoring detects healthy state
  • Confirm circuit breaker resets to closed state (if applicable)
  • Check that alerts resolve appropriately
  • Verify dashboards show normal operating metrics

[ ] 15. Notification and Documentation

  • Send rollback completion notification via deployment-notify.sh
  • Update incident/ticket with rollback details
  • Document time-to-rollback and impact assessment
  • Notify stakeholders of service restoration
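
Time-to-rollback is easiest to report if a start timestamp is captured when the rollback begins. A small sketch; format_duration is a hypothetical helper and the timestamps are epoch seconds.

```shell
# format_duration (hypothetical): prints the time between two epoch
# timestamps (in seconds) as "<minutes>m <seconds>s".
format_duration() {
    start=$1
    end=$2
    elapsed=$((end - start))
    printf '%dm %ds\n' "$((elapsed / 60))" "$((elapsed % 60))"
}

# Example:
#   start=$(date +%s)            # taken just before the rollback command
#   ...rollback and validation...
#   format_duration "$start" "$(date +%s)"
```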

Post-Rollback Activities

[ ] 16. Root Cause Analysis Preparation

  • Preserve logs and metrics from failed deployment
  • Save rollout history for analysis (redirect the output to a file): kubectl rollout history deployment/buywhere-api -n staging
  • Document exact failure symptoms and rollback timing
  • Preserve .rollback_state file for investigation
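
The preservation steps can be bundled so nothing is forgotten under pressure. A minimal sketch: preserve_rollback_evidence, the directory name, and the file names are all hypothetical, and capturing namespace events is an addition not listed above. Logs from pods that have already been replaced may no longer be retrievable from the cluster, so those need to come from the logging system.

```shell
# preserve_rollback_evidence (hypothetical): saves rollout history, recent
# namespace events and .rollback_state into one directory and prints its name.
preserve_rollback_evidence() {
    out_dir=${1:-rollback-evidence-$(date -u +%Y%m%dT%H%M%SZ)}
    mkdir -p "$out_dir" || return 1
    kubectl rollout history deployment/buywhere-api -n staging \
        > "$out_dir/rollout-history.txt" || return 1
    kubectl get events -n staging --sort-by=.lastTimestamp \
        > "$out_dir/events.txt" || return 1
    # Copy the state file if it exists in the current directory.
    if [ -f .rollback_state ]; then
        cp .rollback_state "$out_dir/rollback_state" || return 1
    fi
    echo "$out_dir"
}

# Example: preserve_rollback_evidence
```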

[ ] 17. Follow-Up Actions

  • Create ticket to investigate root cause of failure
  • Schedule a blameless post-mortem if the impact was significant
  • Update runbooks or checklists based on lessons learned
  • Consider whether additional safeguards or monitoring are needed

[ ] 18. Re-deployment Planning

  • Determine if and when to attempt re-deployment
  • Identify any blockers that need resolution first
  • Plan any necessary changes before re-deployment attempt
  • Schedule re-deployment for appropriate time window

[ ] 19. Checklist Completion

  • Mark all rollback checklist items as complete
  • File any issues discovered during rollback process
  • Update rollback procedures based on experience
  • Sign off on rollback completion

Emergency Rollback Procedures

[ ] 20. Fast Rollback (When Time Is Critical)

  • Use quick rollback script if available: ./scripts/rollback.sh quick staging
  • Or run kubectl rollout undo directly, then watch it with kubectl rollout status and a short --timeout (rollout undo itself has no timeout option)
  • Prioritize service restoration over detailed validation initially
  • Follow up with comprehensive validation after service is stable
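
If ./scripts/rollback.sh is not available, the kubectl path can be wrapped as follows. A minimal sketch: fast_rollback is a hypothetical helper and the 60s default is an assumed value, chosen only to be shorter than the 300s used in step 6.

```shell
# fast_rollback (hypothetical): undoes the latest rollout and waits briefly
# for it to settle. Detailed validation is deferred until the service is stable.
fast_rollback() {
    wait_timeout=${1:-60s}   # assumed default, not a documented value
    kubectl rollout undo deployment/buywhere-api -n staging || return 1
    kubectl rollout status deployment/buywhere-api -n staging --timeout="$wait_timeout"
}

# Example: fast_rollback 45s || echo "rollback did not settle, escalate"
```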

[ ] 21. Manual Intervention Procedures

  • If automated rollback fails, prepare manual steps:
    1. Scale down new deployment to 0 replicas
    2. Scale up previous deployment (if still available)
    3. Update image tags manually if needed
    4. Verify traffic routing is correct
  • Have kubectl patch/edit commands ready as last resort
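
One way to have the manual commands ready is to print them for review before running anything. This sketch assumes the new and previous versions run as separate Deployments (for example blue/green). If both are revisions of a single Deployment, its controller manages the ReplicaSets and manual scaling of them will be reverted. print_manual_rollback_plan and all names passed to it are hypothetical placeholders.

```shell
# print_manual_rollback_plan (hypothetical): prints, but does not run, the
# kubectl commands for manual steps 1-4.
print_manual_rollback_plan() {
    new_deploy=$1    # Deployment running the failed version
    old_deploy=$2    # Deployment running the previous version
    replicas=$3      # replica count to restore
    container=$4     # container name inside the pod template
    image=$5         # full image reference of the previous version
    cat <<EOF
kubectl scale deployment/$new_deploy -n staging --replicas=0
kubectl scale deployment/$old_deploy -n staging --replicas=$replicas
kubectl set image deployment/$old_deploy -n staging $container=$image
kubectl get endpoints -n staging
EOF
}

# Example:
#   print_manual_rollback_plan <new-deployment> <old-deployment> 3 <container> <image>
```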

[ ] 22. Communication Escalation

  • If rollback fails, immediately escalate to on-call lead
  • Prepare war room for extended troubleshooting
  • Have executive/stakeholder communication ready
  • Consider a customer-facing notification if users are impacted

Sign-Off

Rollback Successful: ☐ Yes ☐ No
Engineer: _________________________
Timestamp: _______________________
Time to Rollback: ________________
Post-Rollback Validation Period: ______ minutes
Notes: ________________________________________________________
Issues Encountered: ___________________________________________


References