BuyWhere Deployment Runbook

Status: Active
Last Updated: 2026-04-19
Owner: DevOps (Pipe agent)
Version: 2.0
Classification: Internal

Overview

This runbook covers deployment procedures for BuyWhere API services across staging and production environments. It includes pre-deployment checks, deployment steps, rollback procedures, and emergency contacts.

Environments

Environment	Cluster	API Endpoint	Region
Staging	buywhere-staging	https://api-staging.buywhere.io	ap-southeast-1
Production	buywhere-prod	https://api.buywhere.io	ap-southeast-1

Deployment Strategies

Kubernetes (Primary)

Staging: Automatic on push to main/master branch
Production: Manual trigger on version tag (v*)

ECS (Legacy)

Staging: Automatic on push to main/master branch
Production: Manual trigger on version tag (v*)

Pre-Deployment Checklist

Complete before any deployment:

All CI checks passing (lint, typecheck, tests)
Backup created via scripts/pre-deploy-backup.sh backup
Rollback state captured in .rollback_state
Stakeholders notified (if production deployment)
On-call engineer available for 30 minutes post-deployment
Release notes prepared (for production)

Production-Specific Checks

Version tag created and verified
Change freeze not in effect
Database migrations reviewed (if any)
Feature flags configured appropriately
Runbook reviewed for any special steps

Staging Deployment (Kubernetes)

Automated (via GitHub Actions)

Push to main or master triggers automatic deployment:

# Triggered automatically on push to main
git push origin main

Manual Deployment

# 1. Create pre-deployment backup
./scripts/pre-deploy-backup.sh backup

# 2. Run enhanced health check (baseline)
./scripts/enhanced-health-check.sh

# 3. Deploy via kustomize
kubectl config use-context staging
cd k8s/staging
kustomize edit set image ghcr.io/buywhere/buywhere-api=<new-image>
kustomize build . | kubectl apply -f -

# 4. Monitor rollout
kubectl rollout status deployment/buywhere-api -n staging --timeout=300s

# 5. Post-deployment health check
./scripts/enhanced-health-check.sh

Verify Deployment

# Check pods
kubectl get pods -n staging -l app=buywhere-api

# Check rollout history
kubectl rollout history deployment/buywhere-api -n staging

# Test health endpoint
curl -sf https://api-staging.buywhere.io/health

# Run smoke tests
./scripts/pre-deploy-smoke-test.sh --base-url https://api-staging.buywhere.io

Production Deployment (Kubernetes)

Trigger Options

Tag Push: Push a version tag v*
```
git tag v1.2.3
git push origin v1.2.3
```
Workflow Dispatch: Go to GitHub Actions → Deploy to Kubernetes Production → Run workflow

Manual Deployment Steps

# 1. Create pre-deployment backup
ENVIRONMENT=production AWS_CLUSTER=buywhere-prod ./scripts/pre-deploy-backup.sh backup

# 2. Verify rollback state
cat .rollback_state

# 3. Update production kustomization
cd k8s/production
kustomize edit set image ghcr.io/buywhere/buywhere-api=<image-tag>

# 4. Apply deployment
kubectl config use-context production
kustomize build . | kubectl apply -f -

# 5. Monitor rollout (may take up to 10 minutes)
kubectl rollout status deployment/buywhere-api -n production --timeout=600s

# 6. Enhanced health check
API_URL=https://api.buywhere.io ENVIRONMENT=production ./scripts/enhanced-health-check.sh

# 7. Run canary analysis
./scripts/canary-analysis.sh --namespace production --duration 300

Post-Deployment Verification

# Verify all pods healthy
kubectl get pods -n production -l app=buywhere-api

# Check service endpoints
kubectl get svc -n production -l app=buywhere-api

# Monitor error rates for about 5 minutes (refreshes every 5 seconds; press Ctrl-C to stop)
watch -n 5 'curl -sf https://api.buywhere.io/metrics | grep error_rate'

# Verify data freshness
curl -sf https://api.buywhere.io/health | python3 -m json.tool

Rollback Procedures

Automatic Rollback

GitHub Actions automatically rolls back if:

Health check fails after deployment
Canary analysis detects error rate > 5%
Latency P99 > 1000ms

Manual Rollback (Kubernetes)

# 1. Identify current and previous revision
kubectl rollout history deployment/buywhere-api -n production

# 2. Rollback to previous version
kubectl rollout undo deployment/buywhere-api -n production

# 3. Wait for rollback to complete
kubectl rollout status deployment/buywhere-api -n production --timeout=600s

# 4. Verify rollback
kubectl get pods -n production -l app=buywhere-api
curl -sf https://api.buywhere.io/health

Manual Rollback (ECS)

# 1. Source rollback state
source .rollback_state

# 2. Rollback to previous task definition
aws ecs update-service \
  --cluster buywhere-prod \
  --service buywhere-api \
  --task-definition "$PREV_TASK" \
  --force-new-deployment

# 3. Wait for stability
aws ecs wait services-stable \
  --cluster buywhere-prod \
  --services buywhere-api

# 4. Verify
curl -sf https://api.buywhere.io/health

Database Rollback

If database changes need to be rolled back:

# 1. List available backups
./scripts/pre-deploy-backup.sh list

# 2. Restore from backup
./scripts/pre-deploy-backup.sh restore backups/pre-deploy-<timestamp>.dump

Emergency Contacts

Role	Contact	Escalation
Primary On-Call	PagerDuty	Immediate
DevOps Lead (Bolt)	@bolt	P1/P2
CTO (Rex)	@cto	P1 only

Escalation Path

P1 (Critical):
  On-call → Bolt (10 min no response) → CTO (20 min no response)

P2 (High):
  On-call → Bolt (30 min no response)

P3 (Medium):
  On-call → Bolt (next business day)
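
The escalation path above can be expressed as a small lookup, which may help when wiring it into paging automation. This is only a sketch: the `ESCALATION` table and `contacts_to_page` helper are hypothetical names, and it assumes the "no response" timers are measured from the start of the incident. P3 has no timed step because Bolt is looped in the next business day rather than on a timer.

```python
# Escalation steps per priority: (contact, minutes without response before paging them).
# Assumption: minutes are counted from the start of the incident.
ESCALATION = {
    "P1": [("On-call", 0), ("Bolt", 10), ("CTO", 20)],
    "P2": [("On-call", 0), ("Bolt", 30)],
    "P3": [("On-call", 0)],  # Bolt follows up the next business day, not on a timer
}


def contacts_to_page(priority, minutes_without_response):
    """Return every contact who should have been paged by this point."""
    return [
        contact
        for contact, after_minutes in ESCALATION[priority]
        if minutes_without_response >= after_minutes
    ]
```

For example, `contacts_to_page("P1", 12)` yields the on-call engineer and Bolt, while the CTO is added only once 20 minutes have passed.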

Monitoring

Key Dashboards

Dashboard	URL	Purpose
Grafana API	https://grafana.buywhere.io/d/api-main	API metrics
Loki Logs	https://grafana.buywhere.io/d/logs	Log exploration
Deployment	https://github.com/buywhere/buywhere-api/actions	CI/CD status

Key Alerts

Alert	Severity	Action
HighErrorRate	Critical	Page on-call
APIServiceDown	Critical	Page on-call
NoLogsReceived	Warning	Investigate

Scripts Reference

Script	Purpose
`scripts/enhanced-health-check.sh`	Comprehensive health validation with dependencies
`scripts/pre-deploy-backup.sh`	Database backup and rollback state
`scripts/pre-deploy-smoke-test.sh`	API smoke tests
`scripts/deployment-notify.sh`	Slack notifications
`scripts/canary-analysis.sh`	Production canary validation
`scripts/deployment-monitor.sh`	Post-deployment monitoring with circuit breaker

Deployment Thresholds

Deployment behavior is configurable via config/deployment-thresholds.yaml:

Environment	Error Rate	Latency P99	Circuit Breaker	Monitoring Duration
Staging	10%	2000ms	2 failures	300s
Production	5%	1500ms	2 failures	600s

Customizing Thresholds

Edit config/deployment-thresholds.yaml to adjust thresholds:

staging:
  error_rate_threshold: 10
  latency_p99_threshold_ms: 2000
  consecutive_failures_for_rollback: 2
  monitoring_duration_seconds: 300
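
As a rough illustration of how these thresholds might be applied, the sketch below compares a single metrics sample against the per-environment values from the table above. The `Thresholds` dataclass and the `breaches` helper are hypothetical names, and the production entry assumes its keys mirror the staging YAML; the real evaluation lives in the deployment scripts and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    error_rate_threshold: float             # percent
    latency_p99_threshold_ms: float         # milliseconds
    consecutive_failures_for_rollback: int
    monitoring_duration_seconds: int


# Values copied from the Deployment Thresholds table.
# Assumption: the production block uses the same key names as staging.
THRESHOLDS = {
    "staging": Thresholds(10, 2000, 2, 300),
    "production": Thresholds(5, 1500, 2, 600),
}


def breaches(environment, error_rate_pct, latency_p99_ms):
    """Return the names of the thresholds exceeded by one metrics sample."""
    limits = THRESHOLDS[environment]
    exceeded = []
    if error_rate_pct > limits.error_rate_threshold:
        exceeded.append("error_rate")
    if latency_p99_ms > limits.latency_p99_threshold_ms:
        exceeded.append("latency_p99")
    return exceeded
```

The same sample can pass in staging and fail in production: a 6% error rate is within the staging limit but above the production one.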

Feature Flags

Feature Flag System

The API supports dynamic feature flags via:

Environment Variables: FEATURE_<FLAG_NAME>=true
Kubernetes ConfigMap: Synced every 30 seconds from buywhere-feature-flags ConfigMap
Runtime Overrides: Via admin API endpoints

Available Flags

Flag	Description	Default
`FEATURE_NEW_SEARCH`	New search algorithm	false
`FEATURE_BULK_INGEST`	Bulk product ingestion	false
`FEATURE_MCP_SERVER`	MCP server integration	true
`FEATURE_ADVANCED_ANALYTICS`	Advanced analytics dashboard	false
`FEATURE_EXPERIMENTAL_SCRAPER`	Experimental scraper features	false
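
To make the environment-variable source concrete, the sketch below overlays `FEATURE_<FLAG_NAME>` variables on the defaults from the table above. It is not the code in `app.services.feature_flags`: `DEFAULT_FLAGS` and `flags_from_env` are hypothetical names, the ConfigMap and runtime-override sources are left out, and treating only the literal `true` as enabled is an assumption.

```python
import os

# Defaults copied from the Available Flags table.
DEFAULT_FLAGS = {
    "FEATURE_NEW_SEARCH": False,
    "FEATURE_BULK_INGEST": False,
    "FEATURE_MCP_SERVER": True,
    "FEATURE_ADVANCED_ANALYTICS": False,
    "FEATURE_EXPERIMENTAL_SCRAPER": False,
}


def flags_from_env(environ=None):
    """Overlay FEATURE_<FLAG_NAME> environment variables on the documented defaults."""
    environ = os.environ if environ is None else environ
    resolved = dict(DEFAULT_FLAGS)
    for name in DEFAULT_FLAGS:
        raw = environ.get(name)
        if raw is not None:
            # Assumption: only the literal "true" (case-insensitive) enables a flag.
            resolved[name] = raw.strip().lower() == "true"
    return resolved
```

Passing a plain dictionary instead of `os.environ` makes it easy to check what a given set of variables would resolve to before applying them to a deployment.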

Managing Feature Flags

# Via API (admin access required)
GET /admin/feature-flags              # List all flags
GET /admin/feature-flags/{flag_name}  # Get specific flag
POST /admin/feature-flags/override    # Set runtime override
DELETE /admin/feature-flags/override/{flag_name}  # Clear override

Feature Flag SDK

from app.services.feature_flags import is_feature_enabled, require_feature_flag

# Check if feature is enabled
if is_feature_enabled("feature_new_search"):
    ...  # placeholder: call the new search path here

# Require feature for endpoint
@app.get("/experimental")
@require_feature_flag("feature_experimental_scraper")
async def experimental_endpoint():
    return {"status": "experimental"}

Post-Deployment Monitoring

Automated Monitoring (Post-Deployment)

After each deployment, automated monitoring runs for the duration set in config/deployment-thresholds.yaml (300 seconds on staging, 600 seconds on production):

# Monitor deployment health
./scripts/deployment-monitor.sh monitor \
  --environment production \
  --api-url https://api.buywhere.io \
  --duration 600 \
  --check-interval 30

Circuit Breaker Pattern

The monitoring system implements a circuit breaker pattern:

Closed: Normal operation, deployments allowed
Open: Too many failures, deployments blocked
Threshold: 2 consecutive failures trips the breaker
Critical Failures: Immediate rollback triggered on 5xx errors

# Check circuit breaker status
./scripts/deployment-monitor.sh circuit-breaker

# Reset circuit breaker (if needed)
rm .deployment_state/circuit_breaker_production.json
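
The breaker behaviour described above can be sketched in a few lines. This is an illustrative model only, not the logic in `scripts/deployment-monitor.sh`: the `CircuitBreaker` class is a hypothetical name, and it assumes an open breaker stays open until it is reset explicitly (which matches the manual reset step above).

```python
class CircuitBreaker:
    """Trips open after a run of consecutive failed health checks."""

    def __init__(self, failure_threshold=2):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.state = "closed"

    def record(self, check_passed):
        """Record one health check result and return the resulting state."""
        if check_passed:
            # A passing check resets the streak; an open breaker stays open.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.state = "open"
        return self.state

    def reset(self):
        """Close the breaker again, mirroring the manual reset step."""
        self.consecutive_failures = 0
        self.state = "closed"

    @property
    def deployments_allowed(self):
        return self.state == "closed"
```

Note that a failure followed by a success does not trip the breaker; only two failures in a row do.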

Enhanced Health Checks

The enhanced health check validates:

Core endpoints (health, ready, metrics)
JSON field validation
Database connectivity
Redis connectivity
PgBouncer connectivity
Connection pool health
Startup time validation

# Run enhanced health check
./scripts/enhanced-health-check.sh \
  --api-url https://api.buywhere.io \
  --environment production

Check Current Deployment Status

# View current deployment health
./scripts/deployment-monitor.sh status

# View deployment history
./scripts/deployment-monitor.sh history

Enhanced Rollback Procedures

Standard Rollback (Kubernetes)

# Rollback to previous revision
./scripts/rollback.sh k8s production

# Rollback to specific revision
./scripts/rollback.sh k8s production --revision 3

# Rollback with annotation
./scripts/rollback.sh k8s production --annotate "Emergency rollback due to high error rate"

Quick Rollback

Rollback to the last known-good version:

./scripts/rollback.sh quick production

View Rollback History

# View all rollbacks
./scripts/rollback.sh history

# View specific namespace rollbacks
./scripts/rollback.sh history | grep production

List Available Revisions

# List all revisions
./scripts/rollback.sh k8s-list production

# Detailed revision info with timestamps
./scripts/rollback.sh k8s-revisions production

Argo Rollout Rollback (Blue-Green Deployments)

For blue-green deployments using Argo Rollouts:

# Abort running rollout and rollback to previous version
./scripts/rollback.sh k8s-argo production

# Abort a running rollout without rollback
./scripts/rollback.sh k8s-argo-abort production

# Pause an active rollout
./scripts/rollback.sh k8s-argo-pause production

# Promote (full rollout) after analysis passes
./scripts/rollback.sh k8s-argo-promote production

# Get detailed Argo Rollout status
./scripts/rollback.sh argo-status production

Argo Rollout GitHub Actions Workflow

The production deployment workflow includes:

pre-deployment-check: Validates current production health
deploy-kubernetes-production: Deploys new version to preview (blue-green)
health-check: Runs enhanced health checks on preview
canary-analysis: Analyzes preview vs stable service metrics
promote: Promotes preview to active after canary analysis passes
post-promotion-monitor: Continues monitoring for 10 minutes post-promotion
rollback-post-promotion: Auto-rollback if post-promotion monitoring fails

Rollback Criteria

Initiate rollback if ANY of:

Metric	Threshold	Duration
Error Rate (5xx)	> 5%	> 5 min
Latency P99	> 2000ms	> 5 min
API Health	Non-200 response	> 5 min
Pods Running	< Replicas - 1	> 2 min
Circuit Breaker	2 consecutive failures	-
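
The Duration column means a threshold must be exceeded continuously before a rollback is warranted. One way to model that is sketched below; `sustained_breach` is a hypothetical helper, and it assumes samples arrive oldest-first as (timestamp in seconds, breached) pairs.

```python
def sustained_breach(samples, min_duration_s):
    """Return True if the latest unbroken run of breached samples lasts longer than min_duration_s.

    samples: list of (timestamp_seconds, breached) tuples, ordered oldest-first.
    """
    run_start = None
    for timestamp, breached in samples:
        if breached:
            if run_start is None:
                run_start = timestamp  # a new run of breaches begins here
        else:
            run_start = None  # any healthy sample breaks the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_duration_s
```

With a 300-second window (the "> 5 min" rows), a single healthy sample in the middle of an incident restarts the clock, so brief spikes do not trigger a rollback on their own.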

Appendix: Version Tags

Production deployments use semantic versioning:

# Create a release tag
git tag -a v1.2.3 -m "Release 1.2.3 - Add new features"
git push origin v1.2.3

# List tags
git tag -l

# Delete tag (if needed)
git tag -d v1.2.3
git push origin :refs/tags/v1.2.3
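
Before pushing a tag, it can be worth checking that it matches the expected shape. The sketch below is one possible check, not part of the existing tooling: `parse_release_tag` is a hypothetical helper, and it assumes release tags are exactly `v` followed by MAJOR.MINOR.PATCH with no pre-release or build suffix.

```python
import re

# Assumption: release tags look like v1.2.3, with no leading zeros and no suffix.
RELEASE_TAG = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_release_tag(tag):
    """Return (major, minor, patch) for a valid release tag, or None otherwise."""
    match = RELEASE_TAG.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())
```

Returning the parsed tuple rather than a boolean makes it straightforward to compare a candidate tag against the latest released version.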
