
Emergency API Scaling Runbook

Status: Active | Last Updated: 2026-04-18 | Owner: DevOps (Ops agent)


Overview

This runbook covers emergency procedures for scaling the BuyWhere API during high traffic events, outages, or performance degradation. Follow these procedures when receiving alerts for high latency, error spikes, or capacity issues.


Quick Reference

Alert Type                      Severity    Action
HighLatencyP95 (P95 >1s)        Warning     Monitor, prepare to scale
HighLatencyP99 (P99 >2.5s)      Critical    Scale up immediately
HighErrorRate (5xx >1%)         Critical    Scale up, investigate errors
DatabasePoolExhausted (90%+)    Warning     Scale up, check queries
APIDown                         Critical    Immediate escalation

Architecture

                          ┌─────────────────┐
                          │     Route53     │
                          │    (DNS/LB)     │
                          └────────┬────────┘
                                   │
                     ┌─────────────┴─────────────┐
                     │                           │
              ┌──────▼──────┐          ┌────────▼────────┐
              │ ALB/Staging │          │ ALB/Production  │
              └──────┬──────┘          └────────┬────────┘
                     │                          │
                     │               ┌──────────┴──────────────┐
                     │               │                         │
              ┌──────▼──────┐ ┌──────▼──────┐         ┌──────▼──────┐
              │ ECS Service │ │ ECS Service │         │ ECS Service │
              │  (Fargate)  │ │  (Fargate)  │         │  (Fargate)  │
              │ Instance 1  │ │ Instance 2  │         │ Instance N  │
              └──────┬──────┘ └──────┬──────┘         └──────┬──────┘
                     │               │                       │
                     └───────────────┼───────────────────────┘
                                     │
                          ┌──────────▼──────────┐
                          │      PgBouncer      │
                          │  (Connection Pool)  │
                          └──────────┬──────────┘
                                     │
                ┌────────────────────┼─────────────────────┐
                │                    │                     │
        ┌──────▼──────┐       ┌──────▼──────┐       ┌──────▼──────┐
        │  Primary DB │       │   Replica   │       │    Redis    │
        │  (Writer)   │       │  (Reader)   │       │   (Cache)   │
        └─────────────┘       └─────────────┘       └─────────────┘

Prerequisites

  • AWS CLI configured with ECS access
  • ECS cluster: buywhere-staging or buywhere-production
  • Service name: buywhere-api
  • Autoscaling script: scripts/ecs-autoscaling.sh
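A quick preflight can confirm these prerequisites before an incident hits. This is a sketch: the cluster and region names come from the list above, the script path is assumed relative to the repo root, and the AWS checks are skipped when the CLI is not installed.

```shell
# Preflight for the prerequisites above; prints PASS/FAIL per check.
CLUSTER=${CLUSTER:-buywhere-production}
REGION=${REGION:-ap-southeast-1}

check() {  # check <label> <command...>
  label=$1; shift
  if "$@" >/dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}

if command -v aws >/dev/null 2>&1; then
  check "aws-credentials" aws sts get-caller-identity
  check "ecs-cluster" aws ecs describe-clusters --clusters "$CLUSTER" --region "$REGION"
else
  echo "SKIP: aws CLI not installed"
fi
check "autoscaling-script" test -x scripts/ecs-autoscaling.sh
```

Run it during onboarding or a game day rather than mid-incident, so failures surface early.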

Monitoring

Check Current Service Status

# Describe ECS service
aws ecs describe-services \
  --cluster buywhere-production \
  --services buywhere-api \
  --region ap-southeast-1

# List running tasks
aws ecs list-tasks \
  --cluster buywhere-production \
  --service-name buywhere-api \
  --region ap-southeast-1

Check Autoscaling Configuration

# View current scaling targets
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/buywhere-production/buywhere-api \
  --region ap-southeast-1

# View scaling policies
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/buywhere-production/buywhere-api \
  --region ap-southeast-1

Check Metrics in Prometheus/Grafana

# Key metrics to check:
# - http_requests_total (request rate)
# - http_request_duration_seconds (latency)
# - http_errors_total (error rate)
# - db_connection_pool_checked_out (DB pool usage)

# Access Grafana at http://localhost:3000 (production) or via cloud console
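For a quick numeric read on error rate without opening Grafana, the counters listed above can be totalled straight from the /metrics text. A sketch: it assumes the counter names shown in the list and collapses all label dimensions into one ratio.

```shell
# Rough 5xx error-rate check from raw Prometheus text format on stdin.
error_rate() {
  awk '
    /^http_requests_total/ { req += $NF }   # sum over all label sets
    /^http_errors_total/   { err += $NF }
    END { if (req > 0) printf "%.2f%%\n", 100 * err / req; else print "n/a" }
  '
}

# Usage:
# curl -sf https://api.buywhere.ai/metrics | error_rate
```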

Manual Health Check

# API health
curl -sf https://api.buywhere.ai/health | python3 -m json.tool

# MCP health
curl -sf https://api.buywhere.ai/mcp/v1/health | python3 -m json.tool

Emergency Scaling Procedures

Step 1: Assess the Situation

Before scaling, identify the problem:

# Check if it's a traffic spike vs. service degradation
curl -sf https://api.buywhere.ai/metrics | grep http_requests_total

# Check latency distribution
curl -sf https://api.buywhere.ai/metrics | grep http_request_duration_seconds

# Check error rate
curl -sf https://api.buywhere.ai/metrics | grep http_errors_total

Common Causes:

  1. Traffic spike - Organic or bot traffic increase
  2. Slow queries - Database performance issues
  3. Memory pressure - OOM or garbage collection
  4. Connection pool exhaustion - PgBouncer limits
  5. Upstream dependency failure - Redis, database down

Step 2: Manual Scale-Out (Emergency)

If autoscaling is lagging or not configured:

# Get current desired count
CURRENT_COUNT=$(aws ecs describe-services \
  --cluster buywhere-production \
  --services buywhere-api \
  --query 'services[0].desiredCount' \
  --output text \
  --region ap-southeast-1)

# Scale up to handle traffic
NEW_COUNT=$((CURRENT_COUNT + 2))
aws ecs update-service \
  --cluster buywhere-production \
  --service buywhere-api \
  --desired-count $NEW_COUNT \
  --region ap-southeast-1

echo "Scaled from $CURRENT_COUNT to $NEW_COUNT"
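The snippet above scales out unconditionally. A slightly safer variant clamps the new count to the configured maximum (20 here, matching the temporary max in Step 3) and only applies the change when APPLY=1 is set; treat it as a sketch, not the tested procedure.

```shell
# Scale out by 2 but never past MAX_CAPACITY; dry-run unless APPLY=1.
CURRENT_COUNT=${CURRENT_COUNT:-3}
MAX_CAPACITY=${MAX_CAPACITY:-20}

NEW_COUNT=$(( CURRENT_COUNT + 2 ))
if [ "$NEW_COUNT" -gt "$MAX_CAPACITY" ]; then NEW_COUNT=$MAX_CAPACITY; fi

echo "Would scale from $CURRENT_COUNT to $NEW_COUNT"
if [ "${APPLY:-}" = "1" ]; then
  aws ecs update-service \
    --cluster buywhere-production \
    --service buywhere-api \
    --desired-count "$NEW_COUNT" \
    --region ap-southeast-1
fi
```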

Step 3: Adjust Autoscaling Targets (If Needed)

# Temporarily increase max capacity
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/buywhere-production/buywhere-api \
  --min-capacity 2 \
  --max-capacity 20 \
  --region ap-southeast-1

# Adjust CPU target lower for more aggressive scaling
aws application-autoscaling put-scaling-policy \
  --policy-name "buywhere-api-cpu-scaling-temp" \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/buywhere-production/buywhere-api \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 50,
    "ScaleInCooldown": 180,
    "ScaleOutCooldown": 30,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }' \
  --region ap-southeast-1

Step 4: Verify Scaling

# Wait for new tasks to start
sleep 30

# Check task status
aws ecs describe-services \
  --cluster buywhere-production \
  --services buywhere-api \
  --query 'services[0].{desiredCount:desiredCount,runningCount:runningCount,pendingCount:pendingCount}' \
  --region ap-southeast-1

# Verify API is responding
curl -sf https://api.buywhere.ai/health

# Check latency improved
curl -sf https://api.buywhere.ai/metrics | grep http_request_duration_seconds_bucket
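A fixed `sleep 30` can miss slow task starts. Polling until the running count catches up with the desired count is more reliable; this sketch reuses the same `describe-services` query shown above.

```shell
# Poll until runningCount matches desiredCount, or time out.
wait_for_steady() {  # wait_for_steady <cluster> <service> [timeout_seconds]
  cluster=$1; service=$2; timeout=${3:-300}; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    read -r desired running <<EOF
$(aws ecs describe-services --cluster "$cluster" --services "$service" \
    --query 'services[0].[desiredCount,runningCount]' \
    --output text --region ap-southeast-1)
EOF
    if [ "$desired" = "$running" ]; then
      echo "steady at $running tasks"
      return 0
    fi
    sleep 10; elapsed=$((elapsed + 10))
  done
  echo "timed out waiting for steady state" >&2
  return 1
}

# Usage:
# wait_for_steady buywhere-production buywhere-api && curl -sf https://api.buywhere.ai/health
```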

PgBouncer Emergency Adjustments

If database connection pool is exhausted:

Step 1: Check PgBouncer Status

# Connect to PgBouncer admin
# Connect to the PgBouncer admin console with psql (the "pgbouncer" database)
docker exec pgbouncer psql -h localhost -p 5432 -U buywhere pgbouncer -c "SHOW POOLS;"

# Show connection stats
docker exec pgbouncer psql -h localhost -p 5432 -U buywhere pgbouncer -c "SHOW STATS;"

Step 2: Temporarily Increase Pool Size

Edit docker-compose.prod.yml or update environment:

Baseline settings (as of BUY-2881):

  • DEFAULT_POOL_SIZE: 100
  • MIN_POOL_SIZE: 20
  • RESERVE_POOL_SIZE: 30
  • MAX_CLIENT_CONN: 1000

# Emergency increase for extreme traffic
environment:
  DEFAULT_POOL_SIZE: "150"      # baseline: 100
  MAX_CLIENT_CONN: "2000"       # baseline: 1000
  RESERVE_POOL_SIZE: "40"       # baseline: 30

Then restart:

docker-compose -f docker-compose.prod.yml up -d pgbouncer
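Before raising pool sizes, check that the bump still fits under the Postgres `max_connections` limit: roughly, PgBouncer may open up to `DEFAULT_POOL_SIZE + RESERVE_POOL_SIZE` server connections per database/user pair. The sketch below assumes a single database/user pair, and `max_connections=200` is a placeholder; check the actual server setting.

```shell
# Check an emergency pool bump against Postgres max_connections.
pool_fits() {  # pool_fits <default_pool_size> <reserve_pool_size> <max_connections>
  needed=$(( $1 + $2 ))
  if [ "$needed" -le "$3" ]; then
    echo "OK: $needed server connections <= max_connections=$3"
  else
    echo "TOO HIGH: $needed server connections > max_connections=$3"
  fi
}

# Emergency values from above against a placeholder max_connections:
pool_fits 150 40 200
```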

Redis Cache Emergency

If Redis is causing issues:

Check Redis Status

# Check Redis connectivity
docker exec redis redis-cli ping

# Check memory usage
docker exec redis redis-cli info memory

# Flush cache only if corrupted (CAREFUL: drops every key; the API will be
# slow until caches repopulate)
docker exec redis redis-cli FLUSHALL
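FLUSHALL drops every key in the instance. When only one keyspace is suspect, a scoped sweep is usually safer. This sketch assumes a hypothetical `cache:*` key prefix; confirm the application's real key naming before running it.

```shell
# Scoped alternative to FLUSHALL: delete only keys under a suspect prefix,
# batched via xargs so a large keyspace isn't deleted in one huge command.
sweep_cache_prefix() {  # sweep_cache_prefix <pattern>, e.g. 'cache:*'
  docker exec redis sh -c \
    "redis-cli --scan --pattern '$1' | xargs -r -n 100 redis-cli DEL"
}

# Usage (hypothetical prefix):
# sweep_cache_prefix 'cache:*'
```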

Testing Procedures in Staging

Before relying on scaling procedures in production, validate them in staging.

Prerequisites

  • Staging ECS cluster: buywhere-staging
  • Staging service: buywhere-api-staging
  • Load testing tool: Locust (locustfile.py)
  • AWS CLI configured with staging access

Step 1: Verify Staging Autoscaling is Configured

# Check current staging scaling configuration
AWS_CLUSTER=buywhere-staging \
AWS_REGION=ap-southeast-1 \
./scripts/ecs-autoscaling.sh

# Verify scalable targets exist
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/buywhere-staging/buywhere-api-staging \
  --region ap-southeast-1

Step 2: Baseline Performance Test

Run a light load test to establish baseline:

# Start staging API if not running
aws ecs update-service \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --desired-count 2 \
  --region ap-southeast-1

# Wait for tasks to be healthy
sleep 60

# Run baseline load test (50 users, 30s)
BUYWHERE_API_KEY=${STAGING_API_KEY} \
locust -f locustfile.py \
  --host=https://staging.api.buywhere.ai \
  --users=50 --spawn-rate=5 --run-time=30s --headless \
  --csv=/tmp/staging_baseline

# Record baseline metrics:
# - Median response time
# - p95 response time
# - Error rate
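With `--csv=/tmp/staging_baseline`, Locust writes its summary to `/tmp/staging_baseline_stats.csv`, and the aggregated p95 can be pulled out for the record. A sketch: column order varies between Locust versions, so the header row is searched for the "95%" column rather than hard-coding a position.

```shell
# Pull the aggregated p95 out of a Locust stats CSV.
p95_from_csv() {  # p95_from_csv <stats_csv>
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /95%/) col = i; next }
    $2 ~ /Aggregated/ { print $col }
  ' "$1"
}

# Usage:
# p95_from_csv /tmp/staging_baseline_stats.csv
```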

Step 3: Test Manual Scale-Out

# Record current task count
CURRENT=$(aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].desiredCount' \
  --output text --region ap-southeast-1)

echo "Current desired count: $CURRENT"

# Manual scale-out to 4 tasks
aws ecs update-service \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --desired-count 4 \
  --region ap-southeast-1

# Monitor task startup
watch -n 10 "aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}' \
  --output table --region ap-southeast-1"

# Verify all tasks are RUNNING
aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].deployments' \
  --output json --region ap-southeast-1

Step 4: Test Load-Based Scale-Out (Autoscaling Trigger)

# Generate load to trigger CPU-based scaling (Locust shown here; hey or ab
# also work for a simple traffic spike)

# Start sustained high load -- should push CPU past the scaling target
BUYWHERE_API_KEY=${STAGING_API_KEY} \
locust -f locustfile.py \
  --host=https://staging.api.buywhere.ai \
  --users=200 --spawn-rate=20 --run-time=120s --headless \
  --csv=/tmp/staging_scaleout_test &

# Monitor scaling events
sleep 30

# Check if autoscaling triggered
aws application-autoscaling describe-scaling-activities \
  --service-namespace ecs \
  --resource-id service/buywhere-staging/buywhere-api-staging \
  --region ap-southeast-1

# Check current desired count (should increase)
aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].desiredCount' \
  --output text --region ap-southeast-1

# Kill locust after test
pkill -f "locust.*staging_scaleout"

Step 5: Verify Latency Improvement

# Run post-scale load test
BUYWHERE_API_KEY=${STAGING_API_KEY} \
locust -f locustfile.py \
  --host=https://staging.api.buywhere.ai \
  --users=200 --spawn-rate=20 --run-time=60s --headless \
  --csv=/tmp/staging_postscale

# Compare results:
# - Baseline p95: ~Xms
# - Post-scale p95: ~Yms (should be lower)
# - Error rate: <1%

Step 6: Test Scale-In (Cooldown Verification)

# After scale-out test, stop load and verify scale-in

# Wait out the scale-out cooldown (60s) plus the scale-in cooldown (300s),
# with some margin
echo "Waiting 400 seconds for scale-in cooldown..."
sleep 400

# Check desired count (should return toward minimum)
aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].desiredCount' \
  --output text --region ap-southeast-1
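The fixed 400-second sleep above can be replaced by polling for the scale-in itself. A sketch using the same `describe-services` query; the target count is the staging minimum you expect to return to.

```shell
# Poll until the desired count drops back to the target, or time out.
wait_for_scale_in() {  # wait_for_scale_in <cluster> <service> <target> [timeout_seconds]
  cluster=$1; service=$2; target=$3; timeout=${4:-600}; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    count=$(aws ecs describe-services \
      --cluster "$cluster" --services "$service" \
      --query 'services[0].desiredCount' \
      --output text --region ap-southeast-1)
    if [ "$count" -le "$target" ]; then
      echo "scaled in to $count"
      return 0
    fi
    sleep 30; elapsed=$((elapsed + 30))
  done
  echo "no scale-in within ${timeout}s" >&2
  return 1
}

# Usage:
# wait_for_scale_in buywhere-staging buywhere-api-staging 2
```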

Step 7: Test Rollback Procedure

# Get current task definition
TASK_DEF=$(aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].taskDefinition' \
  --output text --region ap-southeast-1)

echo "Current task definition: $TASK_DEF"

# Force a new deployment of the current task definition -- this exercises
# the deployment path a rollback uses without changing revisions
aws ecs update-service \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --force-new-deployment \
  --region ap-southeast-1

# Verify new deployment starts
aws ecs describe-services \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --query 'services[0].deployments' \
  --output json --region ap-southeast-1

# Verify API still responds
curl -sf https://staging.api.buywhere.ai/health

Step 8: Reset Staging to Normal

After testing, reset staging to baseline configuration:

# Scale back to normal (1-2 tasks)
aws ecs update-service \
  --cluster buywhere-staging \
  --service buywhere-api-staging \
  --desired-count 2 \
  --region ap-southeast-1

# Verify normal operation
curl -sf https://staging.api.buywhere.ai/health | python3 -m json.tool

Staging Test Checklist

  • Autoscaling configured and verified
  • Baseline performance recorded
  • Manual scale-out successful (2 → 4 tasks)
  • Tasks reach RUNNING status
  • Autoscaling triggers on high load
  • Latency improves after scale-out
  • Scale-in occurs after cooldown
  • Rollback procedure works
  • Staging reset to normal after test

Scale-In Procedure (After Emergency)

Once traffic normalizes:

# Step 1: Scale back to normal levels
aws ecs update-service \
  --cluster buywhere-production \
  --service buywhere-api \
  --desired-count 3 \
  --region ap-southeast-1

# Step 2: Reset autoscaling to normal
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/buywhere-production/buywhere-api \
  --min-capacity 1 \
  --max-capacity 10 \
  --region ap-southeast-1

# Step 3: Reset CPU target
aws application-autoscaling put-scaling-policy \
  --policy-name "buywhere-api-cpu-scaling" \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/buywhere-production/buywhere-api \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70,
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    }
  }' \
  --region ap-southeast-1

# Step 4: Reset PgBouncer if changed
# Revert docker-compose.prod.yml pool settings

Rollback Procedure

If scaling causes issues:

# Get the currently deployed task definition (ARN ends in :<revision>)
CURRENT_TASK=$(aws ecs describe-services \
  --cluster buywhere-production \
  --services buywhere-api \
  --query 'services[0].taskDefinition' \
  --output text \
  --region ap-southeast-1)

# Derive the previous revision (assumes it is still registered and ACTIVE)
PREV_TASK="${CURRENT_TASK%:*}:$(( ${CURRENT_TASK##*:} - 1 ))"

# Deploy the previous revision
aws ecs update-service \
  --cluster buywhere-production \
  --service buywhere-api \
  --task-definition "$PREV_TASK" \
  --force-new-deployment \
  --region ap-southeast-1

# Or simply roll back to a lower desired count
aws ecs update-service \
  --cluster buywhere-production \
  --service buywhere-api \
  --desired-count 2 \
  --region ap-southeast-1

Alert Thresholds

Metric                  Warning     Critical     Auto-Scale
CPU Utilization         >60%        >80%         Yes (target 70%)
P95 Latency             >1s         >2s          No
P99 Latency             >1.5s       >2.5s        Yes (target 80ms via ALB)
5xx Error Rate          >0.5%       >1%          No
DB Pool Usage           >70%        >90%         No
PgBouncer Pool Usage    >70%        >85%         No
PgBouncer Wait Time     >50ms       >100ms       No
Request Rate            >800/min    >1000/min    Yes
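When eyeballing a dashboard, the thresholds above can be applied mechanically with a small helper that maps a reading onto warning/critical bands. A sketch: the value must be in the same units as the thresholds you pass in.

```shell
# Map a metric reading onto warning/critical thresholds from the table above.
severity() {  # severity <value> <warning_threshold> <critical_threshold>
  awk -v v="$1" -v w="$2" -v c="$3" \
    'BEGIN { print (v >= c ? "critical" : v >= w ? "warning" : "ok") }'
}

# Example: a 1.2% 5xx error rate against the 0.5%/1% thresholds
severity 1.2 0.5 1.0
```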

Files Reference

File                                    Purpose
scripts/ecs-autoscaling.sh              Configures ECS autoscaling policies
ecs/buywhere-api-task-definition.json   ECS task definition
docker-compose.prod.yml                 Local production deployment
prometheus_alerts.yml                   Alert rules and thresholds
deploy.sh                               Deployment automation script

Contacts

  • Primary on-call: See PagerDuty schedule
  • DevOps Lead (Bolt): Via Paperclip
  • Infrastructure: infra@buywhere.ai

Appendix: ECS Autoscaling Script Usage

# Configure autoscaling for production
AWS_CLUSTER=buywhere-production \
AWS_REGION=ap-southeast-1 \
./scripts/ecs-autoscaling.sh

# Custom parameters
AWS_CLUSTER=buywhere-production \
ECS_SERVICE=buywhere-api \
MIN_CAPACITY=2 \
MAX_CAPACITY=15 \
CPU_TARGET=60 \
LATENCY_TARGET=60 \
./scripts/ecs-autoscaling.sh

Key settings:

  • Scale-out cooldown: 60s - Aggressive scale out
  • Scale-in cooldown: 300s - Conservative scale in
  • CPU target: 70% - Scale out before saturation
  • Latency target: 80ms - Scale based on ALB response time