BuyWhere Deployment Runbook
Status: Active
Last Updated: 2026-04-19
Owner: DevOps (Pipe agent)
Version: 2.0
Classification: Internal
Overview
This runbook covers deployment procedures for BuyWhere API services across staging and production environments. It includes pre-deployment checks, deployment steps, rollback procedures, and emergency contacts.
Environments
| Environment | Cluster | API Endpoint | Region |
|---|---|---|---|
| Staging | buywhere-staging | https://api-staging.buywhere.io | ap-southeast-1 |
| Production | buywhere-prod | https://api.buywhere.io | ap-southeast-1 |
Deployment Strategies
Kubernetes (Primary)
- Staging: Automatic on push to main/master branch
- Production: Manual trigger on version tag (v*)
ECS (Legacy)
- Staging: Automatic on push to main/master branch
- Production: Manual trigger on version tag (v*)
Pre-Deployment Checklist
Complete before any deployment:
- All CI checks passing (lint, typecheck, tests)
- Backup created via scripts/pre-deploy-backup.sh backup
- Rollback state captured in .rollback_state
- Stakeholders notified (if production deployment)
- On-call engineer available for 30 minutes post-deployment
- Release notes prepared (for production)
Production-Specific Checks
- Version tag created and verified
- Change freeze not in effect
- Database migrations reviewed (if any)
- Feature flags configured appropriately
- Runbook reviewed for any special steps
Staging Deployment (Kubernetes)
Automated (via GitHub Actions)
Push to main or master triggers automatic deployment:
# Triggered automatically on push to main
git push origin main
Manual Deployment
# 1. Create pre-deployment backup
./scripts/pre-deploy-backup.sh backup
# 2. Run enhanced health check (baseline)
./scripts/enhanced-health-check.sh
# 3. Deploy via kustomize
kubectl config use-context staging
cd k8s/staging
kustomize edit set image ghcr.io/buywhere/buywhere-api=<new-image>
kustomize build . | kubectl apply -f -
# 4. Monitor rollout
kubectl rollout status deployment/buywhere-api -n staging --timeout=300s
# 5. Post-deployment health check
./scripts/enhanced-health-check.sh
Verify Deployment
# Check pods
kubectl get pods -n staging -l app=buywhere-api
# Check rollout history
kubectl rollout history deployment/buywhere-api -n staging
# Test health endpoint
curl -sf https://api-staging.buywhere.io/health
# Run smoke tests
./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io
Production Deployment (Kubernetes)
Trigger Options
- Tag Push: Push a version tag (v*):
  git tag v1.2.3
  git push origin v1.2.3
- Workflow Dispatch: Go to GitHub Actions → Deploy to Kubernetes Production → Run workflow
Manual Deployment Steps
# 1. Create pre-deployment backup
ENVIRONMENT=production AWS_CLUSTER=buywhere-prod ./scripts/pre-deploy-backup.sh backup
# 2. Verify rollback state
cat .rollback_state
# 3. Update production kustomization
cd k8s/production
kustomize edit set image ghcr.io/buywhere/buywhere-api=<image-tag>
# 4. Apply deployment
kubectl config use-context production
kustomize build . | kubectl apply -f -
# 5. Monitor rollout (may take up to 10 minutes)
kubectl rollout status deployment/buywhere-api -n production --timeout=600s
# 6. Enhanced health check
API_URL=https://api.buywhere.io ENVIRONMENT=production ./scripts/enhanced-health-check.sh
# 7. Run canary analysis
./scripts/canary-analysis.sh --namespace production --duration 300
Post-Deployment Verification
# Verify all pods healthy
kubectl get pods -n production -l app=buywhere-api
# Check service endpoints
kubectl get svc -n production -l app=buywhere-api
# Monitor error rates for 5 minutes (watch refreshes every 5 seconds; stop it with Ctrl-C)
watch -n 5 'curl -sf https://api.buywhere.io/metrics | grep error_rate'
# Verify data freshness
curl -sf https://api.buywhere.io/health | python3 -m json.tool
Rollback Procedures
Automatic Rollback
GitHub Actions automatically rolls back if:
- Health check fails after deployment
- Canary analysis detects error rate > 5%
- Latency P99 > 1000ms
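The three triggers above can be read as a single decision function. The sketch below is illustrative only: `DeploymentSample` and `should_auto_rollback` are hypothetical names, not part of the GitHub Actions workflow, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeploymentSample:
    """One post-deployment observation (hypothetical structure)."""
    health_ok: bool        # result of the post-deployment health check
    error_rate_pct: float  # error rate seen by canary analysis, in percent
    latency_p99_ms: float  # P99 latency, in milliseconds


# Thresholds copied from the list above.
MAX_ERROR_RATE_PCT = 5.0
MAX_LATENCY_P99_MS = 1000.0


def should_auto_rollback(sample: DeploymentSample) -> List[str]:
    """Return the reasons a rollback would trigger; an empty list means keep the release."""
    reasons = []
    if not sample.health_ok:
        reasons.append("health check failed")
    if sample.error_rate_pct > MAX_ERROR_RATE_PCT:
        reasons.append("error rate above 5%")
    if sample.latency_p99_ms > MAX_LATENCY_P99_MS:
        reasons.append("latency P99 above 1000ms")
    return reasons
```

Returning the list of reasons rather than a bare boolean makes it easy to include the cause in a rollback annotation or notification.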
Manual Rollback (Kubernetes)
# 1. Identify current and previous revision
kubectl rollout history deployment/buywhere-api -n production
# 2. Rollback to previous version
kubectl rollout undo deployment/buywhere-api -n production
# 3. Wait for rollback to complete
kubectl rollout status deployment/buywhere-api -n production --timeout=600s
# 4. Verify rollback
kubectl get pods -n production -l app=buywhere-api
curl -sf https://api.buywhere.io/health
Manual Rollback (ECS)
# 1. Source rollback state
source .rollback_state
# 2. Rollback to previous task definition
aws ecs update-service \
--cluster buywhere-prod \
--service buywhere-api \
--task-definition "$PREV_TASK" \
--force-new-deployment
# 3. Wait for stability
aws ecs wait services-stable \
--cluster buywhere-prod \
--services buywhere-api
# 4. Verify
curl -sf https://api.buywhere.io/health
Database Rollback
If database changes need to be rolled back:
# 1. List available backups
./scripts/pre-deploy-backup.sh list
# 2. Restore from backup
./scripts/pre-deploy-backup.sh restore backups/pre-deploy-<timestamp>.dump
Emergency Contacts
| Role | Contact | Escalation |
|---|---|---|
| Primary On-Call | PagerDuty | Immediate |
| DevOps Lead (Bolt) | @bolt | P1/P2 |
| CTO (Rex) | @cto | P1 only |
Escalation Path
P1 (Critical):
On-call → Bolt (10 min no response) → CTO (20 min no response)
P2 (High):
On-call → Bolt (30 min no response)
P3 (Medium):
On-call → Bolt (next business day)
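As a rough illustration of the escalation path, the sketch below returns who should have been paged after a given time without a response. It assumes the timers are measured from the initial page, which the runbook does not state explicitly. `ESCALATION_STEPS` and `contacts_to_page` are hypothetical names, and the P3 follow-up with Bolt on the next business day is not modeled.

```python
from typing import Dict, List, Tuple

# (contact, minutes without response before this contact is paged).
# Values mirror the escalation path above; the structure itself is hypothetical.
ESCALATION_STEPS: Dict[str, List[Tuple[str, int]]] = {
    "P1": [("On-call", 0), ("Bolt", 10), ("CTO", 20)],
    "P2": [("On-call", 0), ("Bolt", 30)],
    "P3": [("On-call", 0)],  # Bolt is looped in the next business day
}


def contacts_to_page(severity: str, minutes_without_response: int) -> List[str]:
    """List every contact that should have been paged by now for this severity."""
    steps = ESCALATION_STEPS[severity]
    return [name for name, after in steps if minutes_without_response >= after]
```

For example, `contacts_to_page("P1", 12)` yields the on-call engineer and Bolt, but not yet the CTO.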
Monitoring
Key Dashboards
| Dashboard | URL | Purpose |
|---|---|---|
| Grafana API | https://grafana.buywhere.io/d/api-main | API metrics |
| Loki Logs | https://grafana.buywhere.io/d/logs | Log exploration |
| Deployment | https://github.com/buywhere/buywhere-api/actions | CI/CD status |
Key Alerts
| Alert | Severity | Action |
|---|---|---|
| HighErrorRate | Critical | Page on-call |
| APIServiceDown | Critical | Page on-call |
| NoLogsReceived | Warning | Investigate |
Scripts Reference
| Script | Purpose |
|---|---|
| scripts/enhanced-health-check.sh | Comprehensive health validation with dependencies |
| scripts/pre-deploy-backup.sh | Database backup and rollback state |
| scripts/pre-deploy-smoke-test.sh | API smoke tests |
| scripts/deployment-notify.sh | Slack notifications |
| scripts/canary-analysis.sh | Production canary validation |
| scripts/deployment-monitor.sh | Post-deployment monitoring with circuit breaker |
Deployment Thresholds
Deployment behavior is configurable via config/deployment-thresholds.yaml:
| Environment | Error Rate | Latency P99 | Circuit Breaker | Monitoring Duration |
|---|---|---|---|---|
| Staging | 10% | 2000ms | 2 failures | 300s |
| Production | 5% | 1500ms | 2 failures | 600s |
Customizing Thresholds
Edit config/deployment-thresholds.yaml to adjust thresholds:
staging:
error_rate_threshold: 10
latency_p99_threshold_ms: 2000
consecutive_failures_for_rollback: 2
monitoring_duration_seconds: 300
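Before committing an edit to this file, it can help to sanity-check the values. The following is a minimal sketch, not part of the repository: `validate_thresholds` and `EXAMPLE_CONFIG` are hypothetical names, the config is written as a Python dict to avoid assuming a particular YAML library, and the production block is assumed to use the same keys as staging with the values from the table above.

```python
from typing import Dict, List

REQUIRED_KEYS = (
    "error_rate_threshold",
    "latency_p99_threshold_ms",
    "consecutive_failures_for_rollback",
    "monitoring_duration_seconds",
)

# Hypothetical in-memory mirror of config/deployment-thresholds.yaml.
EXAMPLE_CONFIG: Dict[str, Dict[str, object]] = {
    "staging": {
        "error_rate_threshold": 10,
        "latency_p99_threshold_ms": 2000,
        "consecutive_failures_for_rollback": 2,
        "monitoring_duration_seconds": 300,
    },
    "production": {  # key names assumed to match the staging block
        "error_rate_threshold": 5,
        "latency_p99_threshold_ms": 1500,
        "consecutive_failures_for_rollback": 2,
        "monitoring_duration_seconds": 600,
    },
}


def validate_thresholds(config: Dict[str, Dict[str, object]]) -> List[str]:
    """Return human-readable problems; an empty list means the values look sane."""
    problems = []
    for env, values in config.items():
        for key in REQUIRED_KEYS:
            value = values.get(key)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{env}.{key}: missing or not a number")
            elif value <= 0:
                problems.append(f"{env}.{key}: must be positive")
        rate = values.get("error_rate_threshold")
        if isinstance(rate, (int, float)) and rate > 100:
            problems.append(f"{env}.error_rate_threshold: percentage above 100")
    return problems
```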
Feature Flags
Feature Flag System
The API supports dynamic feature flags via:
- Environment Variables: FEATURE_<FLAG_NAME>=true
- Kubernetes ConfigMap: Synced every 30 seconds from the buywhere-feature-flags ConfigMap
- Runtime Overrides: Via admin API endpoints
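To make the environment-variable source concrete, the sketch below layers `FEATURE_*` variables over the documented defaults. It is an illustration under assumptions, not the service's implementation: `flags_from_environment` is a hypothetical helper, only the literal value `true` (case-insensitive) is treated as enabled, and the ConfigMap and runtime-override sources are not modeled because their precedence is not specified here.

```python
from typing import Dict, Mapping

# Defaults copied from the Available Flags table in this runbook.
DEFAULT_FLAGS: Dict[str, bool] = {
    "FEATURE_NEW_SEARCH": False,
    "FEATURE_BULK_INGEST": False,
    "FEATURE_MCP_SERVER": True,
    "FEATURE_ADVANCED_ANALYTICS": False,
    "FEATURE_EXPERIMENTAL_SCRAPER": False,
}


def flags_from_environment(environ: Mapping[str, str]) -> Dict[str, bool]:
    """Overlay FEATURE_<FLAG_NAME> variables on the defaults."""
    flags = dict(DEFAULT_FLAGS)
    for name, raw in environ.items():
        if name.startswith("FEATURE_"):
            # Assumption: anything other than "true" disables the flag.
            flags[name] = raw.strip().lower() == "true"
    return flags
```

In a real process, `os.environ` would be passed in; taking a mapping as a parameter keeps the function easy to test.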
Available Flags
| Flag | Description | Default |
|---|---|---|
| FEATURE_NEW_SEARCH | New search algorithm | false |
| FEATURE_BULK_INGEST | Bulk product ingestion | false |
| FEATURE_MCP_SERVER | MCP server integration | true |
| FEATURE_ADVANCED_ANALYTICS | Advanced analytics dashboard | false |
| FEATURE_EXPERIMENTAL_SCRAPER | Experimental scraper features | false |
Managing Feature Flags
# Via API (admin access required)
GET /admin/feature-flags # List all flags
GET /admin/feature-flags/{flag_name} # Get specific flag
POST /admin/feature-flags/override # Set runtime override
DELETE /admin/feature-flags/override/{flag_name} # Clear override
Feature Flag SDK
from app.services.feature_flags import is_feature_enabled, require_feature_flag

# `app` below is the existing API application instance.

# Check if feature is enabled
if is_feature_enabled("feature_new_search"):
    ...  # Use new search

# Require feature for endpoint
@app.get("/experimental")
@require_feature_flag("feature_experimental_scraper")
async def experimental_endpoint():
    return {"status": "experimental"}
Post-Deployment Monitoring
Automated Monitoring (Post-Deployment)
After each deployment, automated monitoring runs for 5-10 minutes:
# Monitor deployment health
./scripts/deployment-monitor.sh monitor \
--environment production \
--api-url https://api.buywhere.io \
--duration 600 \
--check-interval 30
Circuit Breaker Pattern
The monitoring system implements a circuit breaker pattern:
- Closed: Normal operation, deployments allowed
- Open: Too many failures, deployments blocked
- Threshold: 2 consecutive failures trips the breaker
- Critical Failures: Immediate rollback triggered on 5xx errors
# Check circuit breaker status
./scripts/deployment-monitor.sh circuit-breaker
# Reset circuit breaker (if needed)
rm .deployment_state/circuit_breaker_production.json
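The breaker behavior described above can be sketched as a small state machine. This is a simplified illustration, not the logic inside scripts/deployment-monitor.sh: the `CircuitBreaker` class is hypothetical, it assumes an open breaker stays open until it is explicitly reset (matching the manual reset step above), and the immediate rollback on 5xx errors is not modeled.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    failure_threshold: int = 2     # consecutive failures that trip the breaker
    consecutive_failures: int = 0
    state: str = "closed"          # "closed" allows deployments, "open" blocks them

    def record(self, check_passed: bool) -> str:
        """Record one monitoring check and return the resulting state."""
        if check_passed:
            # A success clears the failure streak but does not close an open breaker.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.state = "open"
        return self.state

    def deployments_allowed(self) -> bool:
        return self.state == "closed"

    def reset(self) -> None:
        """Equivalent in spirit to removing the breaker state file."""
        self.consecutive_failures = 0
        self.state = "closed"
```

Note that a single failure followed by a success does not trip the breaker; only back-to-back failures do.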
Enhanced Health Checks
The enhanced health check validates:
- Core endpoints (health, ready, metrics)
- JSON field validation
- Database connectivity
- Redis connectivity
- PgBouncer connectivity
- Connection pool health
- Startup time validation
# Run enhanced health check
./scripts/enhanced-health-check.sh \
--api-url https://api.buywhere.io \
--environment production
Check Current Deployment Status
# View current deployment health
./scripts/deployment-monitor.sh status
# View deployment history
./scripts/deployment-monitor.sh history
Enhanced Rollback Procedures
Standard Rollback (Kubernetes)
# Rollback to previous revision
./scripts/rollback.sh k8s production
# Rollback to specific revision
./scripts/rollback.sh k8s production --revision 3
# Rollback with annotation
./scripts/rollback.sh k8s production --annotate "Emergency rollback due to high error rate"
Quick Rollback
Rollback to the last known-good version:
./scripts/rollback.sh quick production
View Rollback History
# View all rollbacks
./scripts/rollback.sh history
# View specific namespace rollbacks
./scripts/rollback.sh history | grep production
List Available Revisions
# List all revisions
./scripts/rollback.sh k8s-list production
# Detailed revision info with timestamps
./scripts/rollback.sh k8s-revisions production
Argo Rollout Rollback (Blue-Green Deployments)
For blue-green deployments using Argo Rollouts:
# Abort running rollout and rollback to previous version
./scripts/rollback.sh k8s-argo production
# Abort a running rollout without rollback
./scripts/rollback.sh k8s-argo-abort production
# Pause an active rollout
./scripts/rollback.sh k8s-argo-pause production
# Promote (full rollout) after analysis passes
./scripts/rollback.sh k8s-argo-promote production
# Get detailed Argo Rollout status
./scripts/rollback.sh argo-status production
Argo Rollout GitHub Actions Workflow
The production deployment workflow includes:
- pre-deployment-check: Validates current production health
- deploy-kubernetes-production: Deploys new version to preview (blue-green)
- health-check: Runs enhanced health checks on preview
- canary-analysis: Analyzes preview vs stable service metrics
- promote: Promotes preview to active after canary analysis passes
- post-promotion-monitor: Continues monitoring for 10 minutes post-promotion
- rollback-post-promotion: Auto-rollback if post-promotion monitoring fails
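The ordering of these jobs can be sketched as a simple sequential runner. This is a hedged illustration rather than the workflow definition: `run_workflow` and `stage_passes` are hypothetical, and it assumes that a failure before promotion simply stops the run (the preview is never promoted), while a failure in post-promotion-monitor triggers rollback-post-promotion.

```python
from typing import Callable, List

# Job names as listed above, in execution order.
STAGES = [
    "pre-deployment-check",
    "deploy-kubernetes-production",
    "health-check",
    "canary-analysis",
    "promote",
    "post-promotion-monitor",
]


def run_workflow(stage_passes: Callable[[str], bool]) -> List[str]:
    """Return the jobs that would execute, given a pass/fail oracle per stage."""
    executed = []
    for stage in STAGES:
        executed.append(stage)
        if not stage_passes(stage):
            if stage == "post-promotion-monitor":
                # Only a post-promotion failure needs an explicit rollback job.
                executed.append("rollback-post-promotion")
            break
    return executed
```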
Rollback Criteria
Initiate rollback if ANY of:
| Metric | Threshold | Duration |
|---|---|---|
| Error Rate (5xx) | > 5% | > 5 min |
| Latency P99 | > 2000ms | > 5 min |
| API Health | Non-200 response | > 5 min |
| Pods Running | < Replicas - 1 | > 2 min |
| Circuit Breaker | 2 consecutive failures | - |
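The duration column means a metric must stay over its threshold continuously before a rollback is warranted. The sketch below shows one way to check that against timestamped samples. `sustained_breach` is a hypothetical helper, and the sample format (seconds, value) is an assumption rather than the output of any script in this repository.

```python
from typing import Iterable, Optional, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """True if value > threshold held continuously for more than min_duration_s.

    `samples` is a time-ordered iterable of (timestamp_seconds, value) pairs.
    """
    breach_started: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started > min_duration_s:
                return True
        else:
            # Any sample at or below the threshold restarts the clock.
            breach_started = None
    return False
```

For the error-rate row, this would be called with `threshold=5.0` and `min_duration_s=300`; a brief dip below 5% resets the timer.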
Appendix: Version Tags
Production deployments use semantic versioning:
# Create a release tag
git tag -a v1.2.3 -m "Release 1.2.3 - Add new features"
git push origin v1.2.3
# List tags
git tag -l
# Delete tag (if needed)
git tag -d v1.2.3
git push origin :refs/tags/v1.2.3
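Since the production trigger matches any tag beginning with `v`, it may be worth checking that a tag is a well-formed semantic version before pushing it. The following is a minimal sketch: `parse_release_tag` is a hypothetical helper, and it only accepts plain `vMAJOR.MINOR.PATCH` tags (pre-release and build suffixes are deliberately rejected).

```python
import re
from typing import Optional, Tuple

# vMAJOR.MINOR.PATCH with no leading zeros, e.g. v1.2.3.
TAG_PATTERN = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_release_tag(tag: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch) for a valid tag, or None otherwise."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

Because the result is a tuple of integers, versions compare numerically, so `v1.10.0` sorts after `v1.9.5`.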
End of Document