Emergency API Scaling Runbook
Status: Active | Last Updated: 2026-04-18 | Owner: DevOps (Ops agent)
Overview
This runbook covers emergency procedures for scaling the BuyWhere API during high traffic events, outages, or performance degradation. Follow these procedures when receiving alerts for high latency, error spikes, or capacity issues.
Quick Reference
| Alert Type | Severity | Action |
|---|---|---|
| HighLatencyP95 (P95 >1s) | Warning | Monitor, prepare to scale |
| HighLatencyP99 (P99 >2.5s) | Critical | Scale up immediately |
| HighErrorRate (5xx >1%) | Critical | Scale up, investigate errors |
| DatabasePoolExhausted (90%+) | Warning | Scale up, check queries |
| APIDown | Critical | Immediate escalation |
Architecture
┌─────────────────┐
│ Route53 │
│ (DNS/LB) │
└────────┬────────┘
│
┌─────────────┴─────────────┐
│ │
┌──────▼──────┐ ┌───────▼──────┐
│ ALB/Staging │ │ ALB/Production│
└──────┬──────┘ └───────┬──────┘
│ │
┌───────────┴───────────┐ ┌───────────┴───────────┐
│ │ │ │
┌──────▼──────┐ ┌──────▼───┐ ┌──────▼──────┐
│ ECS Service │ │ ECS Service│ │ ECS Service │
│ (Fargate) │ │ (Fargate) │ │ (Fargate) │
│ Instance │ │ Instance │ │ Instance N │
└─────────────┘ └───────────┘ └─────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌──────────▼──────────┐
│ PgBouncer │
│ (Connection Pool) │
└──────────┬──────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ Primary DB │ │ Replica │ │ Redis │
│ (Writer) │ │ (Reader) │ │ (Cache) │
└─────────────┘ └─────────────┘ └─────────────┘
Prerequisites
- AWS CLI configured with ECS access
- ECS cluster: buywhere-staging or buywhere-production
- Service name: buywhere-api
- Autoscaling script: scripts/ecs-autoscaling.sh
Monitoring
Check Current Service Status
# Describe ECS service
aws ecs describe-services \
--cluster buywhere-production \
--services buywhere-api \
--region ap-southeast-1
# List running tasks
aws ecs list-tasks \
--cluster buywhere-production \
--service-name buywhere-api \
--region ap-southeast-1
Check Autoscaling Configuration
# View current scaling targets
aws application-autoscaling describe-scalable-targets \
--service-namespace ecs \
--resource-ids service/buywhere-production/buywhere-api \
--region ap-southeast-1
# View scaling policies
aws application-autoscaling describe-scaling-policies \
--service-namespace ecs \
--resource-id service/buywhere-production/buywhere-api \
--region ap-southeast-1
Check Metrics in Prometheus/Grafana
# Key metrics to check:
# - http_requests_total (request rate)
# - http_request_duration_seconds (latency)
# - http_errors_total (error rate)
# - db_connection_pool_checked_out (DB pool usage)
# Access Grafana at http://localhost:3000 (production) or via cloud console
Manual Health Check
# API health
curl -sf https://api.buywhere.ai/health | python3 -m json.tool
# MCP health
curl -sf https://api.buywhere.ai/mcp/v1/health | python3 -m json.tool
Emergency Scaling Procedures
Step 1: Assess the Situation
Before scaling, identify the problem:
# Check if it's a traffic spike vs. service degradation
curl -sf https://api.buywhere.ai/metrics | grep http_requests_total
# Check latency distribution
curl -sf https://api.buywhere.ai/metrics | grep http_request_duration_seconds
# Check error rate
curl -sf https://api.buywhere.ai/metrics | grep http_errors_total
Common Causes:
- Traffic spike - Organic or bot traffic increase
- Slow queries - Database performance issues
- Memory pressure - OOM or garbage collection
- Connection pool exhaustion - PgBouncer limits
- Upstream dependency failure - Redis, database down
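To turn the raw counters above into a go/no-go number, the 5xx error rate can be computed and compared against the 1% critical threshold. This is a sketch: the metric names match the /metrics output used elsewhere in this runbook, but note that lifetime counters only approximate the current rate; in practice compare deltas over a short window.

```shell
#!/usr/bin/env bash
# Compute a 5xx error-rate percentage from two counter samples.
# Usage: error_rate_pct <total_requests> <error_requests>
error_rate_pct() {
  awk -v t="$1" -v e="$2" 'BEGIN {
    if (t == 0) { print "0.00"; exit }
    printf "%.2f", (e / t) * 100
  }'
}

# Sketch of pulling the counters from /metrics (summing across label sets):
check_error_rate() {
  local metrics total errors rate
  metrics=$(curl -sf https://api.buywhere.ai/metrics)
  total=$(echo "$metrics"  | awk '/^http_requests_total/ { s += $2 } END { print s + 0 }')
  errors=$(echo "$metrics" | awk '/^http_errors_total/   { s += $2 } END { print s + 0 }')
  rate=$(error_rate_pct "$total" "$errors")
  echo "5xx error rate: ${rate}%"
}
```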
Step 2: Manual Scale-Out (Emergency)
If autoscaling is lagging or not configured:
# Get current desired count
CURRENT_COUNT=$(aws ecs describe-services \
--cluster buywhere-production \
--services buywhere-api \
--query 'services[0].desiredCount' \
--output text \
--region ap-southeast-1)
# Scale up to handle traffic
NEW_COUNT=$((CURRENT_COUNT + 2))
aws ecs update-service \
--cluster buywhere-production \
--service buywhere-api \
--desired-count $NEW_COUNT \
--region ap-southeast-1
echo "Scaled from $CURRENT_COUNT to $NEW_COUNT"
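Manual scale-outs above the registered autoscaling maximum are silently clamped by ECS, so it helps to cap the increment explicitly. A minimal sketch, assuming the max is read from the same scalable target queried in the Monitoring section:

```shell
#!/usr/bin/env bash
# Return current + increment, capped at the autoscaling max capacity.
# Usage: capped_count <current> <increment> <max_capacity>
capped_count() {
  local current=$1 increment=$2 max=$3
  local target=$((current + increment))
  if [ "$target" -gt "$max" ]; then
    target=$max
  fi
  echo "$target"
}

# Sketch of wiring it into the scale-up step:
# MAX=$(aws application-autoscaling describe-scalable-targets \
#   --service-namespace ecs \
#   --resource-ids service/buywhere-production/buywhere-api \
#   --query 'ScalableTargets[0].MaxCapacity' --output text --region ap-southeast-1)
# NEW_COUNT=$(capped_count "$CURRENT_COUNT" 2 "$MAX")
```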
Step 3: Adjust Autoscaling Targets (If Needed)
# Temporarily increase max capacity
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/buywhere-production/buywhere-api \
--min-capacity 2 \
--max-capacity 20 \
--region ap-southeast-1
# Adjust CPU target lower for more aggressive scaling
aws application-autoscaling put-scaling-policy \
--policy-name "buywhere-api-cpu-scaling-temp" \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/buywhere-production/buywhere-api \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 50,
"ScaleInCooldown": 180,
"ScaleOutCooldown": 30,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
}
}' \
--region ap-southeast-1
Step 4: Verify Scaling
# Wait for new tasks to start
sleep 30
# Check task status
aws ecs describe-services \
--cluster buywhere-production \
--services buywhere-api \
--query 'services[0].{desiredCount:desiredCount,runningCount:runningCount,pendingCount:pendingCount}' \
--region ap-southeast-1
# Verify API is responding
curl -sf https://api.buywhere.ai/health
# Check latency improved
curl -sf https://api.buywhere.ai/metrics | grep http_request_duration_seconds_bucket
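If Grafana is unavailable, the histogram buckets fetched above can give a rough p95 upper bound directly. This is a sketch (PromQL `histogram_quantile` is the authoritative calculation, and extracting the `le`/count pairs depends on the exact label set):

```shell
#!/usr/bin/env bash
# Estimate the p95 latency bound from cumulative histogram buckets.
# Input on stdin: lines of "<le> <cumulative_count>" sorted by le ascending,
# with the final line being "+Inf <total>". Prints the smallest bucket
# boundary that contains at least 95% of observations.
p95_bucket() {
  awk '
    { le[NR] = $1; n[NR] = $2 }
    END {
      total = n[NR]
      target = 0.95 * total
      for (i = 1; i <= NR; i++)
        if (n[i] >= target) { print le[i]; exit }
    }'
}

# Sketch of feeding it from /metrics (label parsing will vary):
# curl -sf https://api.buywhere.ai/metrics \
#   | grep '^http_request_duration_seconds_bucket' \
#   | sed 's/.*le="\([^"]*\)".* \([0-9.]*\)$/\1 \2/' \
#   | p95_bucket
```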
PgBouncer Emergency Adjustments
If database connection pool is exhausted:
Step 1: Check PgBouncer Status
# Connect to the PgBouncer admin console (the special "pgbouncer" database;
# 6432 is PgBouncer's default listen port - adjust if configured differently)
docker exec pgbouncer psql -h localhost -p 6432 -U buywhere pgbouncer -c "SHOW POOLS;"
# Show connection stats
docker exec pgbouncer psql -h localhost -p 6432 -U buywhere pgbouncer -c "SHOW STATS;"
Step 2: Temporarily Increase Pool Size
Edit docker-compose.prod.yml or update environment:
Baseline settings (as of BUY-2881):
- DEFAULT_POOL_SIZE: 100
- MIN_POOL_SIZE: 20
- RESERVE_POOL_SIZE: 30
- MAX_CLIENT_CONN: 1000
# Emergency increase for extreme traffic
environment:
DEFAULT_POOL_SIZE: "150" # baseline: 100
MAX_CLIENT_CONN: "2000" # baseline: 1000
RESERVE_POOL_SIZE: "40" # baseline: 30
Then restart:
docker-compose -f docker-compose.prod.yml up -d pgbouncer
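Before and after the restart, pool pressure can be checked numerically against the thresholds in the Alert Thresholds table (warning >70%, critical >85%). A sketch; the active-connection count would come from the `sv_active` column of `SHOW POOLS`, which is an assumption about how the pool is being measured:

```shell
#!/usr/bin/env bash
# Classify PgBouncer pool utilization against the runbook thresholds.
# Usage: pool_status <active_connections> <pool_size>
pool_status() {
  local active=$1 size=$2
  local pct=$((active * 100 / size))
  if [ "$pct" -gt 85 ]; then
    echo "critical ${pct}%"
  elif [ "$pct" -gt 70 ]; then
    echo "warning ${pct}%"
  else
    echo "ok ${pct}%"
  fi
}

# Example: pool_status 92 100  -> before the emergency increase
#          pool_status 92 150  -> after DEFAULT_POOL_SIZE is raised to 150
```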
Redis Cache Emergency
If Redis is causing issues:
Check Redis Status
# Check Redis connectivity
docker exec redis redis-cli ping
# Check memory usage
docker exec redis redis-cli info memory
# Flush cache if corrupted (CAREFUL - this will slow down API)
docker exec redis redis-cli FLUSHALL
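Because FLUSHALL empties every key and forces the API to rebuild the cache cold, it is worth gating it behind an explicit confirmation. A minimal sketch; the key-pattern in the scoped alternative is hypothetical and depends on how cache keys are actually namespaced:

```shell
#!/usr/bin/env bash
# Require an explicit typed confirmation before flushing the cache.
# Returns 0 only when the operator types exactly "FLUSH".
confirm_flush() {
  local answer
  read -r -p "Type FLUSH to clear the entire Redis cache: " answer
  [ "$answer" = "FLUSH" ]
}

# Gated full flush:
# confirm_flush && docker exec redis redis-cli FLUSHALL

# Gentler alternative: delete only a suspect namespace (pattern is hypothetical)
# docker exec redis sh -c \
#   "redis-cli --scan --pattern 'cache:products:*' | xargs -r -L 100 redis-cli DEL"
```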
Testing Procedures in Staging
Before relying on scaling procedures in production, validate them in staging.
Prerequisites
- Staging ECS cluster: buywhere-staging
- Staging service: buywhere-api-staging
- Load testing tool: Locust (locustfile.py)
- AWS CLI configured with staging access
Step 1: Verify Staging Autoscaling is Configured
# Check current staging scaling configuration
AWS_CLUSTER=buywhere-staging \
AWS_REGION=ap-southeast-1 \
./scripts/ecs-autoscaling.sh
# Verify scalable targets exist
aws application-autoscaling describe-scalable-targets \
--service-namespace ecs \
--resource-ids service/buywhere-staging/buywhere-api-staging \
--region ap-southeast-1
Step 2: Baseline Performance Test
Run a light load test to establish baseline:
# Start staging API if not running
aws ecs update-service \
--cluster buywhere-staging \
--service buywhere-api-staging \
--desired-count 2 \
--region ap-southeast-1
# Wait for tasks to be healthy
sleep 60
# Run baseline load test (50 users, 30s)
BUYWHERE_API_KEY=${STAGING_API_KEY} \
locust -f locustfile.py \
--host=https://staging.api.buywhere.ai \
--users=50 --spawn-rate=5 --run-time=30s --headless \
--csv=/tmp/staging_baseline
# Record baseline metrics:
# - Median response time
# - p95 response time
# - Error rate
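The baseline numbers to record live in the `_stats.csv` file Locust writes (here `/tmp/staging_baseline_stats.csv`). Since column positions vary between Locust versions, a header-driven parse is safer than hard-coded field numbers; this is a sketch under that assumption:

```shell
#!/usr/bin/env bash
# Extract the aggregate p95 (ms) from a Locust _stats.csv by header lookup.
# Usage: locust_p95 <stats_csv>
locust_p95() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "95%" || $i == "\"95%\"") col = i }
    $2 ~ /Aggregated/ { print $col }
  ' "$1"
}

# Example: record the baseline p95
# locust_p95 /tmp/staging_baseline_stats.csv
```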
Step 3: Test Manual Scale-Out
# Record current task count
CURRENT=$(aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].desiredCount' \
--output text --region ap-southeast-1)
echo "Current desired count: $CURRENT"
# Manual scale-out to 4 tasks
aws ecs update-service \
--cluster buywhere-staging \
--service buywhere-api-staging \
--desired-count 4 \
--region ap-southeast-1
# Monitor task startup
watch -n 10 "aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}' \
--output table --region ap-southeast-1"
# Verify all tasks are RUNNING
aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].deployments' \
--output json --region ap-southeast-1
Step 4: Test Load-Based Scale-Out (Autoscaling Trigger)
# Generate sustained load to trigger CPU-based autoscaling
# (Locust here; hey or ab would also work for a raw traffic spike)
BUYWHERE_API_KEY=${STAGING_API_KEY} \
locust -f locustfile.py \
--host=https://staging.api.buywhere.ai \
--users=200 --spawn-rate=20 --run-time=120s --headless \
--csv=/tmp/staging_scaleout_test &
# Monitor scaling events
sleep 30
# Check if autoscaling triggered
aws application-autoscaling describe-scaling-activities \
--service-namespace ecs \
--resource-id service/buywhere-staging/buywhere-api-staging \
--region ap-southeast-1
# Check current desired count (should increase)
aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].desiredCount' \
--output text --region ap-southeast-1
# Kill locust after test
pkill -f "locust.*staging_scaleout"
Step 5: Verify Latency Improvement
# Run post-scale load test
BUYWHERE_API_KEY=${STAGING_API_KEY} \
locust -f locustfile.py \
--host=https://staging.api.buywhere.ai \
--users=200 --spawn-rate=20 --run-time=60s --headless \
--csv=/tmp/staging_postscale
# Compare results:
# - Baseline p95: ~Xms
# - Post-scale p95: ~Yms (should be lower)
# - Error rate: <1%
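The baseline-versus-post-scale comparison can be scripted so the verdict is recorded mechanically rather than eyeballed. A small sketch; the two p95 values are whatever was recorded from the baseline and post-scale CSV outputs:

```shell
#!/usr/bin/env bash
# Compare two p95 values (ms) and report whether scaling helped.
# Usage: compare_p95 <baseline_ms> <postscale_ms>
compare_p95() {
  awk -v base="$1" -v post="$2" 'BEGIN {
    delta = base - post
    pct = (base > 0) ? delta / base * 100 : 0
    verdict = (post < base) ? "improved" : "no improvement"
    printf "%s: p95 %sms -> %sms (%.1f%%)\n", verdict, base, post, pct
  }'
}
```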
Step 6: Test Scale-In (Cooldown Verification)
# After scale-out test, stop load and verify scale-in
# Wait out the scale-out cooldown (60s) + scale-in cooldown (300s), plus buffer
echo "Waiting 400 seconds for scale-in cooldown..."
sleep 400
# Check desired count (should return toward minimum)
aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].desiredCount' \
--output text --region ap-southeast-1
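Instead of a fixed sleep, the scale-in can be polled for, which both finishes sooner and catches the case where scale-in never happens. A generic sketch; the example condition reuses the desired-count query from this step:

```shell
#!/usr/bin/env bash
# Poll a condition command until it succeeds or a timeout elapses.
# Usage: poll_until <timeout_s> <interval_s> <command...>
poll_until() {
  local timeout=$1 interval=$2; shift 2
  local elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@"; then return 0; fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Example: wait up to 480s for staging to scale back to <= 2 tasks
# poll_until 480 30 sh -c '[ "$(aws ecs describe-services \
#   --cluster buywhere-staging --services buywhere-api-staging \
#   --query "services[0].desiredCount" --output text \
#   --region ap-southeast-1)" -le 2 ]' || echo "scale-in did not occur"
```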
Step 7: Test Rollback Procedure
# Get current task definition
TASK_DEF=$(aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].taskDefinition' \
--output text --region ap-southeast-1)
echo "Current task definition: $TASK_DEF"
# Trigger rollback
aws ecs update-service \
--cluster buywhere-staging \
--service buywhere-api-staging \
--force-new-deployment \
--region ap-southeast-1
# Verify new deployment starts
aws ecs describe-services \
--cluster buywhere-staging \
--services buywhere-api-staging \
--query 'services[0].deployments' \
--output json --region ap-southeast-1
# Verify API still responds
curl -sf https://staging.api.buywhere.ai/health
Step 8: Reset Staging to Normal
After testing, reset staging to baseline configuration:
# Scale back to normal (1-2 tasks)
aws ecs update-service \
--cluster buywhere-staging \
--service buywhere-api-staging \
--desired-count 2 \
--region ap-southeast-1
# Verify normal operation
curl -sf https://staging.api.buywhere.ai/health | python3 -m json.tool
Staging Test Checklist
- Autoscaling configured and verified
- Baseline performance recorded
- Manual scale-out successful (2 → 4 tasks)
- Tasks reach RUNNING status
- Autoscaling triggers on high load
- Latency improves after scale-out
- Scale-in occurs after cooldown
- Rollback procedure works
- Staging reset to normal after test
Scale-In Procedure (After Emergency)
Once traffic normalizes:
# Step 1: Scale back to normal levels
aws ecs update-service \
--cluster buywhere-production \
--service buywhere-api \
--desired-count 3 \
--region ap-southeast-1
# Step 2: Reset autoscaling to normal
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/buywhere-production/buywhere-api \
--min-capacity 1 \
--max-capacity 10 \
--region ap-southeast-1
# Step 3: Reset CPU target
aws application-autoscaling put-scaling-policy \
--policy-name "buywhere-api-cpu-scaling" \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/buywhere-production/buywhere-api \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70,
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
}
}' \
--region ap-southeast-1
# Step 4: Reset PgBouncer if changed
# Revert docker-compose.prod.yml pool settings
Rollback Procedure
If scaling causes issues:
# Capture the task definition currently in use (to roll back to an earlier
# version, substitute its family:revision for this ARN below)
PREV_TASK=$(aws ecs describe-services \
--cluster buywhere-production \
--services buywhere-api \
--query 'services[0].taskDefinition' \
--output text \
--region ap-southeast-1)
# Force new deployment with current task definition
aws ecs update-service \
--cluster buywhere-production \
--service buywhere-api \
--task-definition "$PREV_TASK" \
--force-new-deployment \
--region ap-southeast-1
# Or rollback to specific count
aws ecs update-service \
--cluster buywhere-production \
--service buywhere-api \
--desired-count 2 \
--region ap-southeast-1
Alert Thresholds
| Metric | Warning | Critical | Auto-Scale |
|---|---|---|---|
| CPU Utilization | >60% | >80% | Yes (target 70%) |
| P95 Latency | >1s | >2s | No |
| P99 Latency | >1.5s | >2.5s | Yes (target 80ms via ALB) |
| 5xx Error Rate | >0.5% | >1% | No |
| DB Pool Usage | >70% | >90% | No |
| PgBouncer Pool Usage | >70% | >85% | No |
| PgBouncer Wait Time | >50ms | >100ms | No |
| Request Rate | >800/min | >1000/min | Yes |
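When scripting ad-hoc checks, the table above can be applied mechanically with a small classifier. A sketch covering the higher-is-worse metrics only (latency, error rate, pool usage); the warning/critical pairs come straight from the table:

```shell
#!/usr/bin/env bash
# Classify a metric value against warning/critical thresholds (higher is worse).
# Usage: classify <value> <warning> <critical>
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v > c) print "critical";
    else if (v > w) print "warning";
    else print "ok";
  }'
}

# Examples, using rows from the table above:
# classify "$CPU_PCT" 60 80        # CPU Utilization
# classify "$ERROR_RATE" 0.5 1     # 5xx Error Rate (%)
# classify "$POOL_PCT" 70 90       # DB Pool Usage (%)
```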
Files Reference
| File | Purpose |
|---|---|
| scripts/ecs-autoscaling.sh | Configures ECS autoscaling policies |
| ecs/buywhere-api-task-definition.json | ECS task definition |
| docker-compose.prod.yml | Local production deployment |
| prometheus_alerts.yml | Alert rules and thresholds |
| deploy.sh | Deployment automation script |
Contacts
- Primary on-call: See PagerDuty schedule
- DevOps Lead (Bolt): Via Paperclip
- Infrastructure: infra@buywhere.ai
Appendix: ECS Autoscaling Script Usage
# Configure autoscaling for production
AWS_CLUSTER=buywhere-production \
AWS_REGION=ap-southeast-1 \
./scripts/ecs-autoscaling.sh
# Custom parameters
AWS_CLUSTER=buywhere-production \
ECS_SERVICE=buywhere-api \
MIN_CAPACITY=2 \
MAX_CAPACITY=15 \
CPU_TARGET=60 \
LATENCY_TARGET=60 \
./scripts/ecs-autoscaling.sh
Key settings:
- Scale-out cooldown: 60s (aggressive scale-out)
- Scale-in cooldown: 300s (conservative scale-in)
- CPU target: 70% (scale out before saturation)
- Latency target: 80ms (scale based on ALB response time)