BuyWhere Deployment Runbook

Status: Active
Last Updated: 2026-04-19
Owner: DevOps (Pipe agent)
Version: 2.0
Classification: Internal


Overview

This runbook covers deployment procedures for BuyWhere API services across staging and production environments. It includes pre-deployment checks, deployment steps, rollback procedures, and emergency contacts.


Environments

| Environment | Cluster | API Endpoint | Region |
|---|---|---|---|
| Staging | buywhere-staging | https://api-staging.buywhere.io | ap-southeast-1 |
| Production | buywhere-prod | https://api.buywhere.io | ap-southeast-1 |

Deployment Strategies

Kubernetes (Primary)

  • Staging: Automatic on push to main/master branch
  • Production: Manual trigger on version tag (v*)

ECS (Legacy)

  • Staging: Automatic on push to main/master branch
  • Production: Manual trigger on version tag (v*)

Pre-Deployment Checklist

Complete before any deployment:

  • All CI checks passing (lint, typecheck, tests)
  • Backup created via scripts/pre-deploy-backup.sh backup
  • Rollback state captured in .rollback_state
  • Stakeholders notified (if production deployment)
  • On-call engineer available for 30 minutes post-deployment
  • Release notes prepared (for production)

Production-Specific Checks

  • Version tag created and verified
  • Change freeze not in effect
  • Database migrations reviewed (if any)
  • Feature flags configured appropriately
  • Runbook reviewed for any special steps

Staging Deployment (Kubernetes)

Automated (via GitHub Actions)

Push to main or master triggers automatic deployment:

# Triggered automatically on push to main
git push origin main

Manual Deployment

# 1. Create pre-deployment backup
./scripts/pre-deploy-backup.sh backup

# 2. Run enhanced health check (baseline)
./scripts/enhanced-health-check.sh

# 3. Deploy via kustomize
kubectl config use-context staging
cd k8s/staging
kustomize edit set image ghcr.io/buywhere/buywhere-api=<new-image>
kustomize build . | kubectl apply -f -

# 4. Monitor rollout
kubectl rollout status deployment/buywhere-api -n staging --timeout=300s

# 5. Post-deployment health check
./scripts/enhanced-health-check.sh

Verify Deployment

# Check pods
kubectl get pods -n staging -l app=buywhere-api

# Check rollout history
kubectl rollout history deployment/buywhere-api -n staging

# Test health endpoint
curl -sf https://api-staging.buywhere.io/health

# Run smoke tests
./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io

Production Deployment (Kubernetes)

Trigger Options

  1. Tag Push: Push a version tag v*

    git tag v1.2.3
    git push origin v1.2.3
    
  2. Workflow Dispatch: Go to GitHub Actions → Deploy to Kubernetes Production → Run workflow

Manual Deployment Steps

# 1. Create pre-deployment backup
ENVIRONMENT=production AWS_CLUSTER=buywhere-prod ./scripts/pre-deploy-backup.sh backup

# 2. Verify rollback state
cat .rollback_state

# 3. Update production kustomization
cd k8s/production
kustomize edit set image ghcr.io/buywhere/buywhere-api=<image-tag>

# 4. Apply deployment
kubectl config use-context production
kustomize build . | kubectl apply -f -

# 5. Monitor rollout (may take up to 10 minutes)
kubectl rollout status deployment/buywhere-api -n production --timeout=600s

# 6. Enhanced health check
API_URL=https://api.buywhere.io ENVIRONMENT=production ./scripts/enhanced-health-check.sh

# 7. Run canary analysis
./scripts/canary-analysis.sh --namespace production --duration 300

Post-Deployment Verification

# Verify all pods healthy
kubectl get pods -n production -l app=buywhere-api

# Check service endpoints
kubectl get svc -n production -l app=buywhere-api

# Monitor error rates: refreshes every 5 seconds; watch for about 5 minutes, then press Ctrl-C
watch -n 5 'curl -sf https://api.buywhere.io/metrics | grep error_rate'

# Verify data freshness
curl -sf https://api.buywhere.io/health | python3 -m json.tool
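
The `grep error_rate` spot check can be turned into a numeric comparison. This is a minimal sketch, assuming `/metrics` serves Prometheus-style text (one `name value` sample per line) and that the metric is named exactly `error_rate`. Neither detail is confirmed in this runbook, and `parse_metric` is a hypothetical helper:

```python
from typing import Optional


def parse_metric(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name`, or None if it is absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        # Drop a {label="..."} suffix before comparing names.
        # Assumption: label values contain no spaces.
        metric_name = parts[0].split("{", 1)[0]
        if metric_name == name:
            try:
                return float(parts[1])
            except ValueError:
                return None
    return None


# Hypothetical payload, for illustration only.
sample = "# HELP error_rate Share of 5xx responses\nerror_rate 1.25\nrequests_total 4000\n"
print(parse_metric(sample, "error_rate"))
```

The returned value can then be compared against the production error-rate threshold instead of being read by eye.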

Rollback Procedures

Automatic Rollback

GitHub Actions automatically rolls back if:

  • Health check fails after deployment
  • Canary analysis detects error rate > 5%
  • Latency P99 > 1000ms
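
The three triggers above amount to a simple decision rule. The sketch below illustrates that rule only and is not the workflow's actual implementation; `CanaryResult` and `rollback_reasons` are hypothetical names, and the numbers are copied from the list above:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanaryResult:
    health_check_passed: bool
    error_rate_percent: float
    latency_p99_ms: float


def rollback_reasons(result: CanaryResult) -> List[str]:
    """Return every trigger that fired; an empty list means no rollback."""
    reasons = []
    if not result.health_check_passed:
        reasons.append("health check failed")
    if result.error_rate_percent > 5.0:
        reasons.append("error rate above 5%")
    if result.latency_p99_ms > 1000.0:
        reasons.append("P99 latency above 1000ms")
    return reasons


print(rollback_reasons(CanaryResult(True, 6.2, 800.0)))
```

Returning the list of reasons rather than a bare boolean makes it easier to record why a rollback happened.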

Manual Rollback (Kubernetes)

# 1. Identify current and previous revision
kubectl rollout history deployment/buywhere-api -n production

# 2. Rollback to previous version
kubectl rollout undo deployment/buywhere-api -n production

# 3. Wait for rollback to complete
kubectl rollout status deployment/buywhere-api -n production --timeout=600s

# 4. Verify rollback
kubectl get pods -n production -l app=buywhere-api
curl -sf https://api.buywhere.io/health

Manual Rollback (ECS)

# 1. Source rollback state
source .rollback_state

# 2. Rollback to previous task definition
aws ecs update-service \
  --cluster buywhere-prod \
  --service buywhere-api \
  --task-definition "$PREV_TASK" \
  --force-new-deployment

# 3. Wait for stability
aws ecs wait services-stable \
  --cluster buywhere-prod \
  --services buywhere-api

# 4. Verify
curl -sf https://api.buywhere.io/health
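
Because the ECS rollback depends on `$PREV_TASK` being set, it can help to validate `.rollback_state` before running `aws ecs update-service`. This is a minimal sketch, assuming the file holds plain `KEY=VALUE` lines (consistent with it being sourced by the shell); the helper names and the example value are hypothetical:

```python
from typing import Dict


def parse_rollback_state(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; no shell expansion is performed."""
    state = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, value = line.split("=", 1)
        state[key.strip()] = value.strip().strip("\"'")
    return state


def require_prev_task(state: Dict[str, str]) -> str:
    """Return PREV_TASK, or raise if the rollback target is unknown."""
    prev = state.get("PREV_TASK", "")
    if not prev:
        raise ValueError("PREV_TASK missing from .rollback_state")
    return prev


# Hypothetical file content, for illustration only.
example = 'PREV_TASK="buywhere-api:41"\n'
print(require_prev_task(parse_rollback_state(example)))
```

Failing early here avoids calling `update-service` with an empty task definition.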

Database Rollback

If database changes need to be rolled back:

# 1. List available backups
./scripts/pre-deploy-backup.sh list

# 2. Restore from backup
./scripts/pre-deploy-backup.sh restore backups/pre-deploy-<timestamp>.dump

Emergency Contacts

| Role | Contact | Escalation |
|---|---|---|
| Primary On-Call | PagerDuty | Immediate |
| DevOps Lead (Bolt) | @bolt | P1/P2 |
| CTO (Rex) | @cto | P1 only |

Escalation Path

P1 (Critical):
  On-call → Bolt (10 min no response) → CTO (20 min no response)

P2 (High):
  On-call → Bolt (30 min no response)

P3 (Medium):
  On-call → Bolt (next business day)
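
The escalation path can be expressed as a lookup from severity and elapsed time to the people who should have been contacted. This sketch assumes the waiting times are measured from the initial page, which the runbook does not state; `contacts_to_page` is a hypothetical helper:

```python
from typing import List

# (contact, minutes without response before this contact is added)
ESCALATION = {
    "P1": [("On-call", 0), ("Bolt", 10), ("CTO", 20)],
    "P2": [("On-call", 0), ("Bolt", 30)],
    # P3 reaches Bolt on the next business day, which is not minute-based,
    # so only the on-call engineer is modeled here.
    "P3": [("On-call", 0)],
}


def contacts_to_page(severity: str, minutes_without_response: int) -> List[str]:
    """List everyone who should have been contacted by now."""
    steps = ESCALATION[severity]
    return [name for name, after in steps if minutes_without_response >= after]


print(contacts_to_page("P1", 12))
```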

Monitoring

Key Dashboards

| Dashboard | URL | Purpose |
|---|---|---|
| Grafana API | https://grafana.buywhere.io/d/api-main | API metrics |
| Loki Logs | https://grafana.buywhere.io/d/logs | Log exploration |
| Deployment | https://github.com/buywhere/buywhere-api/actions | CI/CD status |

Key Alerts

| Alert | Severity | Action |
|---|---|---|
| HighErrorRate | Critical | Page on-call |
| APIServiceDown | Critical | Page on-call |
| NoLogsReceived | Warning | Investigate |

Scripts Reference

| Script | Purpose |
|---|---|
| scripts/enhanced-health-check.sh | Comprehensive health validation with dependencies |
| scripts/pre-deploy-backup.sh | Database backup and rollback state |
| scripts/pre-deploy-smoke-test.sh | API smoke tests |
| scripts/deployment-notify.sh | Slack notifications |
| scripts/canary-analysis.sh | Production canary validation |
| scripts/deployment-monitor.sh | Post-deployment monitoring with circuit breaker |
| scripts/rollback.sh | Kubernetes and Argo Rollouts rollback, plus rollback history |

Deployment Thresholds

Deployment behavior is configurable via config/deployment-thresholds.yaml:

| Environment | Error Rate | Latency P99 | Circuit Breaker | Monitoring Duration |
|---|---|---|---|---|
| Staging | 10% | 2000ms | 2 failures | 300s |
| Production | 5% | 1500ms | 2 failures | 600s |

Customizing Thresholds

Edit config/deployment-thresholds.yaml to adjust thresholds:

staging:
  error_rate_threshold: 10
  latency_p99_threshold_ms: 2000
  consecutive_failures_for_rollback: 2
  monitoring_duration_seconds: 300
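
After editing the file, a quick sanity check can catch a missing or nonsensical value before the next deployment. This is a minimal sketch that works on the already-parsed mapping (for example, the result of a YAML loader, which is not imported here). It assumes a `production` block uses the same four keys as the `staging` example, and `validate_thresholds` is a hypothetical helper:

```python
from typing import Dict, List

REQUIRED_KEYS = (
    "error_rate_threshold",
    "latency_p99_threshold_ms",
    "consecutive_failures_for_rollback",
    "monitoring_duration_seconds",
)


def validate_thresholds(config: Dict[str, Dict[str, float]]) -> List[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems = []
    for env, values in config.items():
        for key in REQUIRED_KEYS:
            if key not in values:
                problems.append(f"{env}: missing {key}")
            elif values[key] <= 0:
                problems.append(f"{env}: {key} must be positive")
        # The error rate is a percentage (10 means 10%).
        if values.get("error_rate_threshold", 0) > 100:
            problems.append(f"{env}: error_rate_threshold cannot exceed 100")
    return problems


config = {
    "staging": {
        "error_rate_threshold": 10,
        "latency_p99_threshold_ms": 2000,
        "consecutive_failures_for_rollback": 2,
        "monitoring_duration_seconds": 300,
    },
}
print(validate_thresholds(config))
```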

Feature Flags

Feature Flag System

The API supports dynamic feature flags via:

  1. Environment Variables: FEATURE_<FLAG_NAME>=true
  2. Kubernetes ConfigMap: Synced every 30 seconds from buywhere-feature-flags ConfigMap
  3. Runtime Overrides: Via admin API endpoints

Available Flags

| Flag | Description | Default |
|---|---|---|
| FEATURE_NEW_SEARCH | New search algorithm | false |
| FEATURE_BULK_INGEST | Bulk product ingestion | false |
| FEATURE_MCP_SERVER | MCP server integration | true |
| FEATURE_ADVANCED_ANALYTICS | Advanced analytics dashboard | false |
| FEATURE_EXPERIMENTAL_SCRAPER | Experimental scraper features | false |
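
To illustrate how the environment-variable source combines with these defaults, here is a minimal sketch. It is not the application's `is_feature_enabled`: it covers only environment variables, it assumes that only the value `true` (case-insensitive) enables a flag, and it ignores the ConfigMap and runtime overrides, whose precedence this runbook does not specify. `flag_from_env` is a hypothetical helper:

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above.
DEFAULTS = {
    "FEATURE_NEW_SEARCH": False,
    "FEATURE_BULK_INGEST": False,
    "FEATURE_MCP_SERVER": True,
    "FEATURE_ADVANCED_ANALYTICS": False,
    "FEATURE_EXPERIMENTAL_SCRAPER": False,
}


def flag_from_env(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Resolve a flag from the environment, falling back to its default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return DEFAULTS[name]
    # Assumption: any value other than "true" disables the flag.
    return raw.strip().lower() == "true"


print(flag_from_env("FEATURE_NEW_SEARCH", {"FEATURE_NEW_SEARCH": "true"}))
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.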

Managing Feature Flags

# Via API (admin access required)
GET /admin/feature-flags              # List all flags
GET /admin/feature-flags/{flag_name}  # Get specific flag
POST /admin/feature-flags/override    # Set runtime override
DELETE /admin/feature-flags/override/{flag_name}  # Clear override
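
For scripting against these endpoints, the read and clear calls can be built with the standard library. This sketch only constructs the requests and does not send them. Authentication and the body of the `POST` override call are not documented here, so both are left out; `flag_request` is a hypothetical helper, and the expected casing of `{flag_name}` is an assumption:

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def flag_request(base_url: str, method: str, flag_name: Optional[str] = None) -> Request:
    """Build (but do not send) a request for the GET and DELETE endpoints."""
    base = base_url.rstrip("/")
    if method == "GET":
        path = "/admin/feature-flags"
        if flag_name:
            path += "/" + quote(flag_name, safe="")
    elif method == "DELETE":
        if not flag_name:
            raise ValueError("DELETE needs a flag name")
        path = "/admin/feature-flags/override/" + quote(flag_name, safe="")
    else:
        raise ValueError("only GET and DELETE are sketched here")
    return Request(base + path, method=method)


request = flag_request("https://api.buywhere.io", "GET", "FEATURE_NEW_SEARCH")
print(request.get_method(), request.full_url)
```

Admin credentials would need to be attached (for example with `Request.add_header`) before passing the object to `urllib.request.urlopen`.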

Feature Flag SDK

from app.services.feature_flags import is_feature_enabled, require_feature_flag

# Check if feature is enabled
if is_feature_enabled("feature_new_search"):
    pass  # Call the new search implementation here

# Require feature for endpoint
@app.get("/experimental")
@require_feature_flag("feature_experimental_scraper")
async def experimental_endpoint():
    return {"status": "experimental"}

Post-Deployment Monitoring

Automated Monitoring (Post-Deployment)

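After each deployment, automated monitoring runs for the environment's configured duration, by default 5 minutes (300s) on staging and 10 minutes (600s) on production: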

# Monitor deployment health
./scripts/deployment-monitor.sh monitor \
  --environment production \
  --api-url https://api.buywhere.io \
  --duration 600 \
  --check-interval 30

Circuit Breaker Pattern

The monitoring system implements a circuit breaker pattern:

  • Closed: Normal operation, deployments allowed
  • Open: Too many failures, deployments blocked
  • Threshold: 2 consecutive failures trips the breaker
  • Critical Failures: Immediate rollback triggered on 5xx errors

# Check circuit breaker status
./scripts/deployment-monitor.sh circuit-breaker

# Reset circuit breaker (if needed)
rm .deployment_state/circuit_breaker_production.json
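
The closed/open behavior described above can be modeled in a few lines. This sketch illustrates the pattern only and is not the logic inside `deployment-monitor.sh`: the class is hypothetical, state is kept in memory rather than in `.deployment_state/`, and it assumes the breaker stays open until it is reset explicitly (mirroring the manual reset step):

```python
class CircuitBreaker:
    """Trips to "open" after N consecutive failed checks."""

    def __init__(self, failure_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.state = "closed"

    def record(self, check_passed: bool) -> str:
        """Record one monitoring check and return the resulting state."""
        if check_passed:
            self.consecutive_failures = 0  # a success breaks the streak
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.state = "open"
        return self.state

    def deployments_allowed(self) -> bool:
        return self.state == "closed"

    def reset(self) -> None:
        """Equivalent in spirit to deleting the breaker's state file."""
        self.consecutive_failures = 0
        self.state = "closed"


breaker = CircuitBreaker()
for passed in (False, True, False, False):
    breaker.record(passed)
print(breaker.state, breaker.deployments_allowed())
```

Only consecutive failures count, so an isolated failed check followed by a passing one does not block deployments.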

Enhanced Health Checks

The enhanced health check validates:

  • Core endpoints (health, ready, metrics)
  • JSON field validation
  • Database connectivity
  • Redis connectivity
  • PgBouncer connectivity
  • Connection pool health
  • Startup time validation

# Run enhanced health check
./scripts/enhanced-health-check.sh \
  --api-url https://api.buywhere.io \
  --environment production
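
The core-endpoint part of the check list above can be sketched as follows. This is not the script's implementation: `/health` and `/metrics` appear elsewhere in this runbook, while `/ready` is an assumed path for the "ready" endpoint, and both function names are hypothetical. The status fetcher is injectable so the logic can be exercised without network access:

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# /ready is an assumption; the runbook only names the endpoint "ready".
CORE_PATHS = ("/health", "/ready", "/metrics")


def check_core_endpoints(base_url: str, fetch_status: Callable[[str], int]) -> Dict[str, bool]:
    """Map each core path to True when it answers with HTTP 200."""
    base = base_url.rstrip("/")
    results = {}
    for path in CORE_PATHS:
        try:
            results[path] = fetch_status(base + path) == 200
        except OSError:
            results[path] = False  # connection errors count as failures
    return results


def http_status(url: str) -> int:
    """Standard-library fetcher; 4xx/5xx codes are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Offline demonstration with a fake fetcher.
print(check_core_endpoints("https://api.buywhere.io", lambda url: 200))
```

Against a live environment, pass `http_status` as the second argument.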

Check Current Deployment Status

# View current deployment health
./scripts/deployment-monitor.sh status

# View deployment history
./scripts/deployment-monitor.sh history

Enhanced Rollback Procedures

Standard Rollback (Kubernetes)

# Rollback to previous revision
./scripts/rollback.sh k8s production

# Rollback to specific revision
./scripts/rollback.sh k8s production --revision 3

# Rollback with annotation
./scripts/rollback.sh k8s production --annotate "Emergency rollback due to high error rate"

Quick Rollback

Roll back to the last known-good version:

./scripts/rollback.sh quick production

View Rollback History

# View all rollbacks
./scripts/rollback.sh history

# View specific namespace rollbacks
./scripts/rollback.sh history | grep production

List Available Revisions

# List all revisions
./scripts/rollback.sh k8s-list production

# Detailed revision info with timestamps
./scripts/rollback.sh k8s-revisions production

Argo Rollout Rollback (Blue-Green Deployments)

For blue-green deployments using Argo Rollouts:

# Abort running rollout and rollback to previous version
./scripts/rollback.sh k8s-argo production

# Abort a running rollout without rollback
./scripts/rollback.sh k8s-argo-abort production

# Pause an active rollout
./scripts/rollback.sh k8s-argo-pause production

# Promote (full rollout) after analysis passes
./scripts/rollback.sh k8s-argo-promote production

# Get detailed Argo Rollout status
./scripts/rollback.sh argo-status production

Argo Rollout GitHub Actions Workflow

The production deployment workflow includes:

  1. pre-deployment-check: Validates current production health
  2. deploy-kubernetes-production: Deploys new version to preview (blue-green)
  3. health-check: Runs enhanced health checks on preview
  4. canary-analysis: Analyzes preview vs stable service metrics
  5. promote: Promotes preview to active after canary analysis passes
  6. post-promotion-monitor: Continues monitoring for 10 minutes post-promotion
  7. rollback-post-promotion: Auto-rollback if post-promotion monitoring fails

Rollback Criteria

Initiate a rollback if ANY of the following conditions holds for the stated duration:

| Metric | Threshold | Duration |
|---|---|---|
| Error Rate (5xx) | > 5% | > 5 min |
| Latency P99 | > 2000ms | > 5 min |
| API Health | Non-200 response | > 5 min |
| Pods Running | < Replicas - 1 | > 2 min |
| Circuit Breaker | 3 consecutive failures | - |
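
Most of these criteria require a breach to be sustained, not momentary. The sketch below shows one way to evaluate "above the threshold for more than N minutes" over timestamped samples. It assumes that any healthy sample restarts the clock, which is an interpretation the runbook does not spell out; `breached_for` is a hypothetical helper:

```python
from typing import List, Tuple


def breached_for(samples: List[Tuple[float, float]], threshold: float, min_seconds: float) -> bool:
    """samples: (unix_timestamp, value) pairs in time order.

    True when every sample in the most recent unbroken run exceeds
    `threshold` and that run spans more than `min_seconds`.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # a healthy sample restarts the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start > min_seconds


# Error rate sampled once a minute; above 5% from t=60s through t=420s.
rates = [2, 6, 7, 8, 6, 9, 7, 8]
samples = [(60.0 * i, float(rate)) for i, rate in enumerate(rates)]
print(breached_for(samples, threshold=5.0, min_seconds=300.0))
```

The same function covers the latency row by passing `threshold=2000.0` with P99 samples in milliseconds.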

Appendix: Version Tags

Production deployments use semantic versioning:

# Create a release tag
git tag -a v1.2.3 -m "Release 1.2.3 - Add new features"
git push origin v1.2.3

# List tags
git tag -l

# Delete tag (if needed)
git tag -d v1.2.3
git push origin :refs/tags/v1.2.3
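
Because the production trigger matches any tag starting with `v`, it can be useful to check that a tag is a well-formed `vMAJOR.MINOR.PATCH` version before pushing it. This is a minimal sketch that does not handle pre-release or build suffixes; `parse_release_tag` is a hypothetical helper:

```python
import re
from typing import Optional, Tuple

# vMAJOR.MINOR.PATCH, with no leading zeros in any component.
TAG_PATTERN = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_release_tag(tag: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch), or None if the tag is malformed."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


print(parse_release_tag("v1.2.3"))
print(parse_release_tag("v1.2"))
```

The returned tuples compare in version order, so `parse_release_tag("v1.10.0") > parse_release_tag("v1.9.0")` holds where a plain string comparison would not.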