Staging Deployment Hardening and Rollback Checklist

Issue: BUY-2474
Environment: Staging (Kubernetes)
Purpose: Ensure staging deployments are hardened with proper rollback procedures
Last Updated: 2026-04-19
Status: ACTIVE

Pre-Deployment Requirements

[ ] 1. Pre-Deployment Backup

Run pre-deployment backup script: ./scripts/pre-deploy-backup.sh backup
Verify backup was created successfully
Confirm backup file exists in ./backups/ directory
Optional: Verify backup integrity with ./scripts/pre-deploy-backup.sh verify
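
The "confirm backup file exists" step can be scripted. The following is a minimal sketch, not part of pre-deploy-backup.sh: latest_backup is a hypothetical helper, and it assumes backups are regular files stored directly under the backup directory.

```shell
# Sketch: print the newest regular file in a backup directory,
# failing if there is none or if the newest one is empty.
# Assumption: backups are plain files directly under the directory.
latest_backup() (
  dir="${1:-./backups}"
  newest=""
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # -nt compares modification times; keep the most recent file seen so far.
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  [ -n "$newest" ] || { echo "no backup found in $dir" >&2; exit 1; }
  [ -s "$newest" ] || { echo "backup is empty: $newest" >&2; exit 1; }
  printf '%s\n' "$newest"
)
```

Usage: latest_backup ./backups prints the path of the newest backup, or returns non-zero with a message on stderr. A non-empty file is only a basic signal; the verify subcommand remains the integrity check.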

[ ] 2. Rollback State Capture

Confirm .rollback_state file is created/updated during backup process
Verify rollback state contains:
- DEPLOYMENT_ID
- DEPLOY_TIME
- ENVIRONMENT=staging
- AWS_CLUSTER (if applicable)
- PREV_TASK (ECS task definition, if applicable)
- PREV_SHA (git SHA)
- BACKUP_FILE path
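
The field list above can be checked mechanically. This is a sketch under one assumption: that .rollback_state stores one KEY=value pair per line, which the checklist does not confirm. check_rollback_state is a hypothetical helper; AWS_CLUSTER and PREV_TASK are skipped because they are marked "if applicable".

```shell
# Sketch: verify that a rollback state file carries the required keys.
# Assumption: the file holds one KEY=value pair per line.
check_rollback_state() (
  file="${1:-.rollback_state}"
  [ -f "$file" ] || { echo "missing: $file" >&2; exit 1; }
  status=0
  for key in DEPLOYMENT_ID DEPLOY_TIME ENVIRONMENT PREV_SHA BACKUP_FILE; do
    # Require a line "KEY=" followed by at least one character.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing or empty: $key" >&2
      status=1
    fi
  done
  # This checklist covers staging only, so any other environment is an error.
  if ! grep -Eq '^ENVIRONMENT="?staging"?$' "$file"; then
    echo "ENVIRONMENT is not staging" >&2
    status=1
  fi
  exit "$status"
)
```

Usage: check_rollback_state .rollback_state returns 0 when every required key is present and lists the missing ones on stderr otherwise.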

[ ] 3. Health Check Baseline

Run enhanced health check: ./scripts/enhanced-health-check.sh
Run smoke tests: ./scripts/pre-deploy-smoke-test.sh
All checks must pass before proceeding
Record the current health status as the pre-deployment baseline

[ ] 4. Staging-Specific Verification

Confirm kubectl context is set to staging: kubectl config current-context
Verify staging namespace exists: kubectl get namespace staging
Check current deployment revision: kubectl rollout history deployment/buywhere-api -n staging (the highest revision number is the current one)
Verify resource requests/limits are appropriate for staging
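
The context check is easy to skip under time pressure, so a guard that fails loudly may help. This sketch assumes the staging context name contains the string "staging", which is a naming convention the checklist does not confirm; require_staging_context is a hypothetical helper.

```shell
# Sketch: refuse to continue unless the active kubectl context looks like staging.
# Assumption: the staging context name contains the substring "staging".
require_staging_context() (
  ctx="$(kubectl config current-context)" || exit 1
  case "$ctx" in
    *staging*) printf 'context ok: %s\n' "$ctx" ;;
    *) echo "refusing to continue: context is '$ctx'" >&2; exit 1 ;;
  esac
)
```

Usage: require_staging_context && kubectl get namespace staging. Placing the guard first in any wrapper script keeps later commands from running against the wrong cluster.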

[ ] 5. Notification Preparation

Ensure deployment-notify.sh is configured for staging
Verify Slack/webhook notifications are enabled for staging deployments
Notify team of impending deployment (if required by policy)

Deployment Process

[ ] 6. Deployment Execution

For automated deployments: Verify the GitHub Actions workflow was triggered by the push to main/master
For manual deployments: Follow manual deployment steps from runbook
Monitor deployment progress: kubectl rollout status deployment/buywhere-api -n staging --timeout=300s

[ ] 7. Deployment Validation

Verify all pods are running: kubectl get pods -n staging -l app=buywhere-api
Check rollout history: kubectl rollout history deployment/buywhere-api -n staging
Confirm new image is deployed: kubectl describe deployment buywhere-api -n staging | grep Image
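
As an alternative to grepping the describe output, the image can be read directly with a jsonpath query and compared with the expected tag. This is a sketch; deployed_images and image_matches are hypothetical helpers, and the image reference passed in is whatever the deployment was supposed to ship.

```shell
# Sketch: read the container images from the deployment's pod template.
deployed_images() {
  kubectl get deployment "${1:-buywhere-api}" -n "${2:-staging}" \
    -o jsonpath='{.spec.template.spec.containers[*].image}'
}

# Succeeds only if one of the deployed images equals the expected reference.
image_matches() (
  expected="$1"
  for img in $(deployed_images); do
    [ "$img" = "$expected" ] && exit 0
  done
  exit 1
)
```

Usage: image_matches "<expected image reference>" || echo "unexpected image deployed". An exact string comparison avoids false positives from tags that share a prefix.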

[ ] 8. Immediate Post-Deployment Health Check

Run enhanced health check against staging endpoint: API_URL=https://api-staging.buywhere.io ./scripts/enhanced-health-check.sh
Run smoke tests against staging: ./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io
Verify response times are within acceptable thresholds (at minimum, P99 below the 2000ms staging rollback threshold in item 15)
Check for any error logs in the first 5 minutes post-deployment
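
The five-minute log check can be approximated with kubectl logs. This sketch assumes error lines contain the word "error" in any letter case, which depends on the log format; recent_error_count is a hypothetical helper. When a label selector is used, kubectl logs returns only the last 10 lines per container by default, so --tail=-1 is passed to read the whole window.

```shell
# Sketch: count log lines mentioning "error" across the API pods for a recent window.
# Assumption: error-level lines contain the word "error" (any case).
recent_error_count() {
  # grep -c exits non-zero on a zero count; "|| true" keeps that from being a failure.
  kubectl logs -n staging -l app=buywhere-api --since="${1:-5m}" --tail=-1 --all-containers \
    | grep -ci 'error' || true
}
```

Usage: recent_error_count 5m prints a single number. A non-zero count is a prompt to read the matching lines, not proof of a regression.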

Post-Deployment Validation (0-30 minutes)

[ ] 9. Monitoring Validation

Verify deployment monitoring is active: ./scripts/deployment-monitor.sh status
Check circuit breaker is closed: ./scripts/deployment-monitor.sh circuit-breaker
Confirm Grafana dashboards are receiving data from staging
Verify Loki is collecting logs from staging pods

[ ] 10. API Functionality Testing

Test core API endpoints: /health, /ready, /metrics
Test key business endpoints (products, search, etc.)
Verify database connectivity through API
Verify Redis connectivity through API
Test any recently changed features
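
The core endpoint checks can be scripted with curl. This is a sketch that assumes each endpoint answers an unauthenticated GET with HTTP 200; if /metrics is protected in staging, adjust accordingly. probe and check_endpoints are hypothetical helpers.

```shell
# Sketch: probe the core endpoints and report status code and total time.
# Assumption: each endpoint answers an unauthenticated GET with HTTP 200.
probe() {
  # -w prints "<status> <seconds>"; the response body is discarded.
  curl -s -o /dev/null --max-time 10 -w '%{http_code} %{time_total}' "$1"
}

check_endpoints() (
  base="${1:-https://api-staging.buywhere.io}"
  failed=0
  for path in /health /ready /metrics; do
    # A transport failure is reported as status 000.
    result="$(probe "$base$path")" || result="000 0"
    code="${result%% *}"
    secs="${result#* }"
    printf '%s %s %ss\n' "$path" "$code" "$secs"
    [ "$code" = "200" ] || failed=1
  done
  exit "$failed"
)
```

Usage: check_endpoints https://api-staging.buywhere.io prints one line per endpoint and returns non-zero if any endpoint did not return 200.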

[ ] 11. Performance Baseline

Check API response times (p50, p95, p99) are within staging norms
Verify error rate is near 0% and well below the 10% staging rollback threshold (item 15)
Check resource utilization (CPU, memory) is within expected ranges
Confirm no regression in key performance indicators
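
If latency samples are available as raw numbers rather than on a dashboard, the percentiles can be computed directly. This sketch uses the nearest-rank method and assumes a file with one latency value in milliseconds per line; the percentile helper and the file layout are hypothetical. Monitoring systems may interpolate and report slightly different values.

```shell
# Sketch: nearest-rank percentile of numeric samples, one value per line.
# Assumption: the file contains latency samples in milliseconds.
percentile() (
  p="$1"; file="$2"
  sort -n "$file" | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest rank: the smallest position r with r >= p/100 * N.
      r = p * NR / 100
      idx = int(r); if (idx < r) idx++
      if (idx < 1) idx = 1
      print v[idx]
    }'
)
```

Usage: percentile 95 latencies.txt prints the sample at the 95th percentile rank, and fails if the file is empty.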

[ ] 12. Log Validation

Check for new error patterns in staging logs
Verify log rotation is functioning
Confirm structured logging is intact
Check for any deprecated API usage warnings

Rollback Procedures Verification

[ ] 13. Rollback Documentation Review

Review rollback procedures in deployment runbook
Confirm team members know how to execute rollback
Verify rollback runbooks are up to date

[ ] 14. Rollback Testing (Optional but Recommended)

In a controlled window, test rollback to previous version:
1. Note current deployment revision
2. Rollback to previous revision: kubectl rollout undo deployment/buywhere-api -n staging
3. Wait for rollback completion
4. Verify system is functioning on previous version
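5. Roll forward to the version under test: kubectl rollout undo deployment/buywhere-api -n staging --to-revision=<revision noted in step 1> (a plain kubectl rollout undo also works here, because the revision just left becomes the previous one)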
Document rollback test results
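
The drill above can be sketched as two functions, so that the manual verification in step 4 happens between them. The function names are hypothetical. The revision is read from the deployment.kubernetes.io/revision annotation, and kubectl's own output is sent to stderr so that drill_rollback prints only the revision number.

```shell
# Sketch: rollback drill split around the manual verification step.
current_revision() {
  kubectl get deployment buywhere-api -n staging \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}'
}

# Steps 1-3: note the revision, roll back, wait. Prints the noted revision.
drill_rollback() (
  before="$(current_revision)" || exit 1
  kubectl rollout undo deployment/buywhere-api -n staging >&2 || exit 1
  kubectl rollout status deployment/buywhere-api -n staging --timeout=300s >&2 || exit 1
  printf '%s\n' "$before"
)

# Step 5: return to the revision noted before the drill.
drill_roll_forward() (
  rev="$1"
  [ -n "$rev" ] || { echo "usage: drill_roll_forward <revision>" >&2; exit 1; }
  kubectl rollout undo deployment/buywhere-api -n staging --to-revision="$rev" >&2 || exit 1
  kubectl rollout status deployment/buywhere-api -n staging --timeout=300s >&2
)
```

Usage: rev="$(drill_rollback)", verify the previous version by hand, then drill_roll_forward "$rev". Pinning the roll-forward to an explicit revision avoids landing on the wrong version if another rollout happened during the drill.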

[ ] 15. Automatic Rollback Configuration

Verify GitHub Actions automatic rollback triggers are configured:
- Health check failure after deployment
- Canary analysis error rate > 10% (staging threshold)
- Latency P99 > 2000ms (staging threshold)
Check deployment monitor circuit breaker settings for staging
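
The two numeric staging thresholds above can be expressed as a small decision function, which is useful for checking by hand what the automation should do. This is a sketch only; should_rollback is a hypothetical helper and is not the GitHub Actions implementation. Both comparisons are strict, matching the ">" in the list.

```shell
# Sketch: return success (0) when either staging threshold is exceeded.
# Arguments: error rate in percent, P99 latency in milliseconds.
should_rollback() (
  error_rate_pct="$1"; p99_ms="$2"
  # awk handles decimal values; "+ 0" forces a numeric comparison.
  awk -v e="$error_rate_pct" -v l="$p99_ms" \
    'BEGIN { exit !((e + 0) > 10 || (l + 0) > 2000) }'
)
```

Usage: should_rollback 12.5 800 && echo "rollback expected". Values exactly at a threshold (10 or 2000) do not trigger a rollback in this sketch.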

[ ] 16. Database Rollback Verification

Verify pre-deployment backup can be restored (test in isolation if possible)
Confirm database migration history is intact
Verify no pending migrations that would complicate rollback
Document procedure for database rollback if needed

Post-Deployment (30+ minutes)

[ ] 17. Extended Monitoring

Continue monitoring for 30+ minutes post-deployment
Watch for delayed issues or performance degradation
Verify monitoring alerts are functioning correctly
Check for any cascading failures

[ ] 18. Deployment Completion Notification

Send deployment completion notification via deployment-notify.sh
Update any deployment tracking systems
Document any observations or issues encountered
Hand off to on-call engineer if applicable

[ ] 19. Cleanup

Verify temporary resources are cleaned up
Check that no unnecessary pods or services are running
Confirm logging levels are appropriate
Ensure debugging features are disabled
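
One way to spot leftover pods is to list everything in the namespace that is not in the Running phase. This is a sketch with hypothetical helper names. Pods from finished Jobs report the Succeeded phase and will appear in the list too, so the output needs a human read.

```shell
# Sketch: list pods in the namespace whose phase is not Running.
non_running_pods() {
  kubectl get pods -n "${1:-staging}" --field-selector=status.phase!=Running --no-headers
}

count_non_running_pods() {
  # grep -c . counts non-empty lines; "|| true" keeps a zero count from failing.
  non_running_pods "$@" | grep -c . || true
}
```

Usage: count_non_running_pods staging prints a number, and non_running_pods staging shows which pods to inspect.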

[ ] 20. Checklist Completion

Mark all checklist items as complete
File any issues or blockers discovered during deployment
Update checklist based on lessons learned
Sign off on deployment readiness

Emergency Procedures

[ ] 21. Emergency Contact Verification

Verify on-call engineer is notified and available
Confirm PagerDuty/escalation paths are functional
Have emergency contacts readily available

[ ] 22. Emergency Rollback Access

Ensure all team members have necessary access for emergency rollback
Verify kubectl access to staging cluster
Confirm AWS access (if ECS fallback needed) is available
Document emergency procedures in easily accessible location
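
Rollback access can be checked ahead of time with kubectl auth can-i. The verb and resource pairs below are an assumption about what kubectl rollout undo needs; adjust them to the cluster's RBAC policy. can_rollback is a hypothetical helper.

```shell
# Sketch: check that the current identity may perform rollback-related actions.
# Assumption: these verb/resource pairs cover what a rollback needs.
can_rollback() (
  ns="${1:-staging}"
  denied=0
  for pair in "get deployments" "patch deployments" "list replicasets"; do
    verb="${pair% *}"; resource="${pair#* }"
    # kubectl auth can-i exits 0 when the action is allowed.
    if kubectl auth can-i "$verb" "$resource" -n "$ns" >/dev/null 2>&1; then
      echo "allowed: $verb $resource"
    else
      echo "DENIED:  $verb $resource"
      denied=1
    fi
  done
  exit "$denied"
)
```

Usage: each engineer on the rotation runs can_rollback staging with their own credentials. A non-zero exit means at least one permission is missing.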

[ ] 23. Communication Plan

Verify stakeholder notification procedures
Confirm status page update process
Prepare communication templates for incidents
Establish war room procedures if needed

Sign-Off

Deployment Ready: ☐ Yes ☐ No
Engineer: _________________________
Timestamp: _______________________
Notes: ________________________________________________________
Issues/Blockers: ________________________________________________