Staging Deployment Hardening and Rollback Checklist

Issue: BUY-2474
Environment: Staging (Kubernetes)
Purpose: Ensure staging deployments are hardened with proper rollback procedures
Last Updated: 2026-04-19
Status: ACTIVE


Pre-Deployment Requirements

[ ] 1. Pre-Deployment Backup

  • Run pre-deployment backup script: ./scripts/pre-deploy-backup.sh backup
  • Verify backup was created successfully
  • Confirm backup file exists in ./backups/ directory
  • Optional: Verify backup integrity with ./scripts/pre-deploy-backup.sh verify

[ ] 2. Rollback State Capture

  • Confirm .rollback_state file is created/updated during backup process
  • Verify rollback state contains:
    • DEPLOYMENT_ID
    • DEPLOY_TIME
    • ENVIRONMENT=staging
    • AWS_CLUSTER (if applicable)
    • PREV_TASK (ECS task definition, if applicable)
    • PREV_SHA (git SHA)
    • BACKUP_FILE path
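
The field check above could be scripted along these lines. This is a minimal sketch that assumes .rollback_state holds one KEY=VALUE pair per line; the function name check_rollback_state is hypothetical, and the real file format written by the backup script may differ.

```shell
# Check that a rollback state file defines the keys this checklist requires.
# Assumption: the file contains one KEY=VALUE pair per line.
check_rollback_state() {
  state_file="${1:-.rollback_state}"
  problems=0
  if [ ! -f "$state_file" ]; then
    echo "missing file: $state_file" >&2
    return 1
  fi
  # AWS_CLUSTER and PREV_TASK are "if applicable", so they are not required here.
  for key in DEPLOYMENT_ID DEPLOY_TIME ENVIRONMENT PREV_SHA BACKUP_FILE; do
    # Require a non-empty value after the equals sign.
    if ! grep -Eq "^${key}=.+" "$state_file"; then
      echo "missing or empty: $key" >&2
      problems=1
    fi
  done
  # The environment must be staging (optionally wrapped in double quotes).
  if ! grep -Eq '^ENVIRONMENT="?staging"?$' "$state_file"; then
    echo "ENVIRONMENT is not staging" >&2
    problems=1
  fi
  return "$problems"
}
```

Example use after the backup step: check_rollback_state .rollback_state && echo "rollback state looks complete"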

[ ] 3. Health Check Baseline

  • Run enhanced health check: ./scripts/enhanced-health-check.sh
  • Run smoke tests: ./scripts/pre-deploy-smoke-test.sh
  • All checks must pass before proceeding
  • Document current health status

[ ] 4. Staging-Specific Verification

  • Confirm kubectl context is set to staging: kubectl config current-context
  • Verify staging namespace exists: kubectl get namespace staging
  • Check current deployment revision: kubectl rollout history deployment/buywhere-api -n staging
  • Verify resource requests/limits are appropriate for staging
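
A small guard can make the context and namespace checks hard to skip. The sketch below assumes the staging context is literally named "staging"; EXPECTED_CONTEXT and require_staging_context are hypothetical names, so substitute the real context name for your cluster.

```shell
# Refuse to continue unless kubectl points at staging and the namespace exists.
# Assumption: the staging context is named "staging" unless EXPECTED_CONTEXT says otherwise.
require_staging_context() {
  expected="${EXPECTED_CONTEXT:-staging}"
  current="$(kubectl config current-context)" || return 1
  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: context is '$current', expected '$expected'" >&2
    return 1
  fi
  # Fails with a non-zero status if the namespace does not exist.
  kubectl get namespace staging >/dev/null || return 1
}
```

Example use at the top of a deployment script: require_staging_context || exit 1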

[ ] 5. Notification Preparation

  • Ensure deployment-notify.sh is configured for staging
  • Verify Slack/webhook notifications are enabled for staging deployments
  • Notify team of impending deployment (if required by policy)

Deployment Process

[ ] 6. Deployment Execution

  • For automated deployments: Verify the GitHub Actions workflow was triggered by the push to main/master
  • For manual deployments: Follow manual deployment steps from runbook
  • Monitor deployment progress: kubectl rollout status deployment/buywhere-api -n staging --timeout=300s

[ ] 7. Deployment Validation

  • Verify all pods are running: kubectl get pods -n staging -l app=buywhere-api
  • Check rollout history: kubectl rollout history deployment/buywhere-api -n staging
  • Confirm new image is deployed: kubectl describe deployment buywhere-api -n staging | grep Image
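
Reading pod status by eye gets error-prone with many replicas. As a rough sketch, the helper below counts pods that are not fully ready from the default kubectl get pods columns (NAME READY STATUS RESTARTS AGE); count_unready_pods is a hypothetical name.

```shell
# Count pods that are not fully ready.
# Reads the output of `kubectl get pods --no-headers` on stdin, where the
# second column looks like "1/1" and the third column is the pod status.
count_unready_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) bad++
  } END { print bad + 0 }'
}
```

Example use: kubectl get pods -n staging -l app=buywhere-api --no-headers | count_unready_pods

A result of 0 means every listed pod is Running with all containers ready.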

[ ] 8. Immediate Post-Deployment Health Check

  • Run enhanced health check against staging endpoint: API_URL=https://api-staging.buywhere.io ./scripts/enhanced-health-check.sh
  • Run smoke tests against staging: ./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io
  • Verify response times are within acceptable thresholds
  • Check for any error logs in the first 5 minutes post-deployment
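
The response-time check can be sketched with curl's --write-out timing variables. The 1.0 second default limit below is an assumed placeholder, not a documented staging threshold, and probe_endpoint and within_threshold are hypothetical helper names.

```shell
# Succeed when the HTTP status is 200 and the elapsed time is within the limit.
# Arguments: status code, elapsed seconds, limit in seconds.
within_threshold() {
  awk -v code="$1" -v t="$2" -v max="$3" \
    'BEGIN { exit !(code == 200 && t + 0 <= max + 0) }'
}

# Probe one URL with curl and apply the check above.
# Assumption: a 1.0 second limit unless a second argument is given.
probe_endpoint() {
  url="$1"
  limit="${2:-1.0}"
  result="$(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{time_total}' "$url")"
  # Split "200 0.123" into status and elapsed time.
  set -- $result
  within_threshold "${1:-000}" "${2:-999}" "$limit"
}
```

Example use: probe_endpoint https://api-staging.buywhere.io/health 0.5 || echo "health endpoint slow or failing"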

Post-Deployment Validation (0-30 minutes)

[ ] 9. Monitoring Validation

  • Verify deployment monitoring is active: ./scripts/deployment-monitor.sh status
  • Check circuit breaker is closed: ./scripts/deployment-monitor.sh circuit-breaker
  • Confirm Grafana dashboards are receiving data from staging
  • Verify Loki is collecting logs from staging pods

[ ] 10. API Functionality Testing

  • Test core API endpoints: /health, /ready, /metrics
  • Test key business endpoints (products, search, etc.)
  • Verify database connectivity through API
  • Verify Redis connectivity through API
  • Test any recently changed features

[ ] 11. Performance Baseline

  • Check API response times (p50, p95, p99) are within staging norms
  • Verify error rate is near 0%
  • Check resource utilization (CPU, memory) is within expected ranges
  • Confirm no regression in key performance indicators
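
If latency samples are available as plain numbers (for example, exported from logs, one value in milliseconds per line), the percentiles can be approximated locally. This is a sketch using the nearest-rank method; the input format is an assumption, percentile is a hypothetical helper, and the figures Grafana reports may be computed differently.

```shell
# Print the p-th percentile (nearest-rank method) of the numbers on stdin,
# one value per line. p must be an integer between 1 and 100.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

Example use, given a hypothetical file latencies_ms.txt: percentile 95 < latencies_ms.txt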

[ ] 12. Log Validation

  • Check for new error patterns in staging logs
  • Verify log rotation is functioning
  • Confirm structured logging is intact
  • Check for any deprecated API usage warnings

Rollback Procedures Verification

[ ] 13. Rollback Documentation Review

  • Review rollback procedures in deployment runbook
  • Confirm team members know how to execute rollback
  • Verify rollback runbooks are up-to-date

[ ] 14. Rollback Testing (Optional but Recommended)

  • In a controlled window, test rollback to previous version:
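    1. Note the current deployment revision: kubectl rollout history deployment/buywhere-api -n staging
    2. Roll back to the previous revision: kubectl rollout undo deployment/buywhere-api -n staging
    3. Wait for the rollback to complete: kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
    4. Verify the system is functioning on the previous version
    5. Roll forward to the version under test: kubectl rollout undo deployment/buywhere-api -n staging --to-revision=<revision noted in step 1>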
  • Document rollback test results
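
The rollback test can be wrapped in one function so the steps always run in the same order. The sketch below is one possible shape, not the runbook's procedure: rollback_drill is a hypothetical name, and it reads the current revision from the Deployment's deployment.kubernetes.io/revision annotation so that the roll-forward targets an explicit revision.

```shell
# Rollback drill for staging: roll back, verify, then roll forward again.
# The optional first argument is the verification command; it defaults to
# the health check script named in this checklist.
rollback_drill() {
  verify_cmd="${1:-./scripts/enhanced-health-check.sh}"
  deploy="deployment/buywhere-api"
  ns="staging"

  # Step 1: record the current revision from the Deployment annotation.
  start_rev="$(kubectl get "$deploy" -n "$ns" \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')" || return 1
  [ -n "$start_rev" ] || { echo "could not read current revision" >&2; return 1; }
  echo "starting revision: $start_rev"

  # Steps 2 and 3: roll back one revision and wait for completion.
  kubectl rollout undo "$deploy" -n "$ns" || return 1
  kubectl rollout status "$deploy" -n "$ns" --timeout=300s || return 1

  # Step 4: verify the previous version and remember the result.
  verify_status=0
  API_URL=https://api-staging.buywhere.io "$verify_cmd" || verify_status=$?

  # Step 5: return to the revision recorded in step 1.
  kubectl rollout undo "$deploy" -n "$ns" --to-revision="$start_rev" || return 1
  kubectl rollout status "$deploy" -n "$ns" --timeout=300s || return 1
  return "$verify_status"
}
```

The function still rolls forward when verification fails, then returns the verification status so the failure is visible in the test results.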

[ ] 15. Automatic Rollback Configuration

  • Verify GitHub Actions automatic rollback triggers are configured:
    • Health check failure after deployment
    • Canary analysis error rate > 10% (staging threshold)
    • Latency P99 > 2000ms (staging threshold)
  • Check deployment monitor circuit breaker settings for staging
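
To sanity-check the trigger conditions, the thresholds listed above can be expressed as a small decision function. This is only an illustration of the logic; the actual triggers live in the GitHub Actions configuration, and should_rollback and its argument order are hypothetical.

```shell
# Decide whether staging metrics breach the automatic rollback thresholds.
# Arguments: health status ("ok" or anything else), error rate in percent,
# P99 latency in milliseconds. Returns 0 when a rollback should trigger.
should_rollback() {
  health="$1"
  error_rate="$2"
  p99_ms="$3"
  # A failed health check triggers a rollback regardless of the metrics.
  [ "$health" = "ok" ] || return 0
  # Staging thresholds from this checklist: error rate > 10%, P99 > 2000ms.
  awk -v e="$error_rate" -v l="$p99_ms" \
    'BEGIN { exit !(e + 0 > 10 || l + 0 > 2000) }'
}
```

Example use: should_rollback ok 12.5 800 && echo "rollback would trigger"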

[ ] 16. Database Rollback Verification

  • Verify pre-deployment backup can be restored (test in isolation if possible)
  • Confirm database migration history is intact
  • Verify no pending migrations that would complicate rollback
  • Document procedure for database rollback if needed

Post-Deployment (30+ minutes)

[ ] 17. Extended Monitoring

  • Continue monitoring for 30+ minutes post-deployment
  • Watch for delayed issues or performance degradation
  • Verify monitoring alerts are functioning correctly
  • Check for any cascading failures

[ ] 18. Deployment Completion Notification

  • Send deployment completion notification via deployment-notify.sh
  • Update any deployment tracking systems
  • Document any observations or issues encountered
  • Hand off to on-call engineer if applicable

[ ] 19. Cleanup

  • Verify temporary resources are cleaned up
  • Check that no unnecessary pods or services are running
  • Confirm logging levels are appropriate
  • Ensure debugging features are disabled

[ ] 20. Checklist Completion

  • Mark all checklist items as complete
  • File any issues or blockers discovered during deployment
  • Update checklist based on lessons learned
  • Sign off on deployment readiness

Emergency Procedures

[ ] 21. Emergency Contact Verification

  • Verify on-call engineer is notified and available
  • Confirm PagerDuty/escalation paths are functional
  • Have emergency contacts readily available

[ ] 22. Emergency Rollback Access

  • Ensure all team members have necessary access for emergency rollback
  • Verify kubectl access to staging cluster
  • Confirm AWS access is available (if an ECS fallback is needed)
  • Document emergency procedures in easily accessible location

[ ] 23. Communication Plan

  • Verify stakeholder notification procedures
  • Confirm status page update process
  • Prepare communication templates for incidents
  • Establish war room procedures if needed

Sign-Off

Deployment Ready: ☐ Yes ☐ No
Engineer: _________________________
Timestamp: _______________________
Notes: ________________________________________________________
Issues/Blockers: ________________________________________________