incident-runbook-apr23

BuyWhere API — Incident Response Runbook

April 23, 2026 US Launch

Issue: BUY-3520
Classification: Internal — Confidential
Owner: Rex (CTO)
Launch Date: Thursday, April 23, 2026
Last Updated: 2026-04-19

Related runbooks:

Deploy runbook: docs/deploy-runbook-apr23.md

Launch-day ops: docs/launch-day-runbook.md

Rollback detail: docs/deploy-runbook-apr23.md#5-rollback-procedure

Emergency scaling: docs/emergency_api_scaling_runbook.md

1. Severity Levels

Severity	Definition	Example	Target Response	Owner
P0	Full API down — all requests failing or health check failing	`/health` returns non-200; all searches fail; DB unreachable	Immediate — 0–5 min	Rex + Bolt
P1	Severe degradation — API partially up but critical paths broken	Search response > 5s p99; affiliate redirects broken; DB pool > 90%; error rate > 5%	5–15 min	Rex + domain owner
P2	Single retailer or feature degraded — core product still functional	Amazon results missing; USD pricing wrong on one source; single `/go/` ASIN broken	15–60 min	Domain owner

Escalation rule: If a P1 is not resolved within 30 minutes, treat it as P0. If a P2 is not resolved within 2 hours, escalate to P1.
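The escalation rule can be sketched as a small shell helper for on-call scripts. The function name `effective_severity` and its two arguments (declared severity, minutes unresolved) are illustrative and not part of any existing BuyWhere tooling. The sketch applies a single escalation step and does not restart the clock after a P2 becomes a P1.

```shell
# Hypothetical helper: map (declared severity, minutes unresolved) to the
# severity the incident should currently be treated as.
effective_severity() (
  sev=$1
  mins=$2
  case "$sev" in
    P1) [ "$mins" -ge 30 ] && sev=P0 ;;   # P1 unresolved for 30 min -> P0
    P2) [ "$mins" -ge 120 ] && sev=P1 ;;  # P2 unresolved for 2 h    -> P1
  esac
  echo "$sev"
)
```

For example, `effective_severity P1 45` prints `P0`, while `effective_severity P2 45` still prints `P2`.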

2. First 5 Minutes — Diagnostic Commands

Run these in order immediately on incident detection. Each check takes ~10 seconds.

2.1 Check API Health

# Basic health check
curl -sf https://api.buywhere.ai/health | python3 -m json.tool

# Detailed health (DB + dependencies)
curl -sf https://api.buywhere.ai/health/detailed | python3 -m json.tool

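Healthy output: "status": "ok" with db_response_ms < 500. Because curl runs with -sf, an HTTP error response or an unreachable host prints nothing, and json.tool then reports a parse error; treat that as a failed check. If this fails: go immediately to 2.2 (container status) and 2.3 (DB).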
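The healthy-output criteria can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming that `status` and `db_response_ms` are top-level fields of the JSON body; the helper name `health_ok` is hypothetical.

```shell
# Hypothetical helper: read a health JSON body on stdin and exit 0 only if
# status is "ok" and db_response_ms is below 500.
# Assumption: both fields sit at the top level of the JSON object.
health_ok() {
  python3 -c '
import json, sys

try:
    body = json.load(sys.stdin)
except ValueError:
    sys.exit(2)  # empty or non-JSON body (curl -f prints nothing on HTTP errors)

if not isinstance(body, dict):
    sys.exit(1)
ms = body.get("db_response_ms")
healthy = (
    body.get("status") == "ok"
    and isinstance(ms, (int, float))
    and ms < 500
)
sys.exit(0 if healthy else 1)
'
}
```

A possible use is `curl -sf https://api.buywhere.ai/health/detailed | health_ok || echo "unhealthy: continue with 2.2 and 2.3"`.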

2.2 Check Container Status

docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml ps

Healthy output: all critical services showing Up (healthy):

buywhere-api-api-1
buywhere-api-db-1
buywhere-api-db_replica-1
buywhere-api-pgbouncer-1
buywhere-api-redis-1

If any container is Exited or Restarting:

# Tail logs for the failing container (replace 'api' with the failing service)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs --tail=100 api

2.3 Check DB Connections (PgBouncer Pool)

# PgBouncer pool stats
docker exec buywhere-api-pgbouncer-1 \
  psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;" 2>/dev/null

# DB active connection count by state
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
  -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"

# Verify DB primary is accepting connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog -c "SELECT 1;" 2>&1

Warning: pool cl_active > 70% of max. Critical: pool cl_waiting > 0 (requests are queuing; the pool is near exhaustion).
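These thresholds can be turned into a quick classifier once the numbers are read off SHOW POOLS. The sketch below reads the warning threshold as 70% of the pool maximum. The function name `pool_status` is hypothetical, and the third argument (the limit that "max" refers to) is an assumed input that must come from the PgBouncer configuration.

```shell
# Hypothetical helper: classify PgBouncer pool pressure.
# Arguments: cl_active, cl_waiting, and the client limit the pool is measured
# against (an assumed input taken from the PgBouncer config).
pool_status() (
  cl_active=$1
  cl_waiting=$2
  max_clients=$3
  if [ "$cl_waiting" -gt 0 ]; then
    echo CRITICAL          # any queued client means the pool is saturated
  elif [ $((cl_active * 100)) -gt $((max_clients * 70)) ]; then
    echo WARNING           # more than 70% of the limit in use
  else
    echo OK
  fi
)
```

For example, `pool_status 80 0 100` prints `WARNING`, and any non-zero `cl_waiting` prints `CRITICAL` regardless of `cl_active`.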

2.4 Check Redis

docker exec buywhere-api-redis-1 redis-cli ping
# Expected: PONG

docker exec buywhere-api-redis-1 redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

Warning: used_memory > 70% of maxmemory.
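The `_human` fields are convenient to read but awkward to compare. A minimal sketch of computing the percentage from the raw `used_memory` and `maxmemory` byte counts in the same INFO output follows; the helper name `redis_mem_pct` is hypothetical.

```shell
# Hypothetical helper: read `redis-cli info memory` output on stdin and print
# used_memory as a whole-number percentage of maxmemory.
# Prints "unlimited" when maxmemory is 0 (no limit configured).
redis_mem_pct() {
  # INFO lines end in CRLF, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%d\n", used * 100 / max
    }'
}
```

A possible use is `docker exec buywhere-api-redis-1 redis-cli info memory | redis_mem_pct`, treating any result above 70 as the warning condition.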

2.5 Check Error Rate in Logs

# Tail API logs for errors (last 5 minutes)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs --since=5m api \
  | grep -iE "error|exception|fatal|panic|traceback|500|ECONNREFUSED|disk quota|no space left" | tail -30

Look for:

ECONNREFUSED — DB or Redis connection refused (pool exhausted or service down)
500 responses — application errors
FATAL / panic — service about to crash or crashed
disk quota / no space left — disk pressure
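When the log volume is high, a per-signature count shows which failure mode dominates. The following is a rough sketch; the helper name `log_signature_counts` and the output labels are hypothetical, and the 500 match is approximate because it counts any standalone number 500, not only HTTP status codes.

```shell
# Hypothetical helper: read API log lines on stdin and print one count per
# failure signature from the list above.
log_signature_counts() {
  awk '
    { line = tolower($0) }
    line ~ /econnrefused/              { conn++ }
    line ~ /fatal|panic/               { fatal++ }
    line ~ /disk quota|no space left/  { disk++ }
    line ~ /(^|[^0-9])500([^0-9]|$)/   { http500++ }   # 500 not inside a longer number
    END {
      printf "connection_refused=%d\n", conn
      printf "fatal_or_panic=%d\n", fatal
      printf "disk_pressure=%d\n", disk
      printf "http_500=%d\n", http500
    }'
}
```

It can be fed from the same source as the grep above, for example by piping the `logs --since=5m api` output into `log_signature_counts`.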

2.6 Check Recent Deployments

# Last 5 git commits on running code
cd /home/paperclip/buywhere-api && git log --oneline -5

# When was the API container last restarted
docker inspect buywhere-api-api-1 --format='Started: {{.State.StartedAt}}'

# Check rollback state file (populated by deploy script)
cat /home/paperclip/buywhere-api/.rollback_state 2>/dev/null || echo "No rollback state saved"

If the API restarted recently and the timing matches the incident onset, the latest deploy is the likely cause — proceed to rollback.

2.7 Check Disk Space

df -h / | tail -1

Warning: > 85% — run cleanup (Section 3.4). Critical: > 92% — API writes may fail; immediate cleanup required.
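The two disk thresholds can be applied directly to df output. This is a sketch only: the helper name `disk_status` is hypothetical, and it expects the POSIX `df -P /` layout (one header line, then one data line with the capacity in the fifth column) rather than the `df -h` form shown above.

```shell
# Hypothetical helper: read `df -P /` output on stdin and classify root-volume
# usage against the 85% warning and 92% critical thresholds.
disk_status() {
  awk 'NR == 2 {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 > 92)      print "CRITICAL " pct "%"
    else if (pct + 0 > 85) print "WARNING " pct "%"
    else                   print "OK " pct "%"
  }'
}
```

A possible use is `df -P / | disk_status`.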

3. Rollback Procedure

Authority: Rex calls rollback. Bolt executes. Never roll back without Rex's explicit instruction posted to #us-launch-ops.

3.1 When to Roll Back

Rollback immediately if:

GET /health returns non-200 and cannot be restored within 10 minutes
Error rate > 2% sustained for > 5 minutes
P99 search latency > 3s sustained for > 5 minutes
DB pool exhaustion (> 90%) with no quick fix
Data corruption confirmed
Any P1 unresolved after 30 minutes
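The two metric-based triggers in this list (error rate and P99 latency) can be checked with a small helper. This is a sketch under stated assumptions: the name `should_rollback` and its arguments are hypothetical, and a single "minutes sustained" value is applied to both metrics. The remaining triggers need human judgement, and the helper only flags a condition; the decision still rests with Rex.

```shell
# Hypothetical helper: exit 0 when either metric trigger holds.
# Arguments: error rate in percent, p99 search latency in seconds, and the
# number of minutes the readings have been sustained.
should_rollback() {
  awk -v err="$1" -v p99="$2" -v mins="$3" 'BEGIN {
    breach = (err + 0 > 2) || (p99 + 0 > 3)
    if (breach && mins + 0 > 5) exit 0
    exit 1
  }'
}
```

For example, `should_rollback 2.5 1.2 6 && echo "rollback criteria met: notify Rex"`.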

3.2 Announce Before Rolling Back

Post to #us-launch-ops and #incidents-critical:

⚠️ ROLLBACK INITIATED — [HH:MM EST]
Reason: [one sentence — e.g. "Error rate >5% sustained 10min, root cause unknown"]
Lead: Bolt
ETA to stable: ~15 min
Next update: [HH:MM EST]
— Rex

3.3 Rollback Steps

Step 1 — Stop API and prevent further writes

docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml \
  stop api mcp scraper-scheduler

Step 2 — Load rollback state

source /home/paperclip/buywhere-api/.rollback_state
echo "Rolling back to image: ${PREV_API_IMAGE}"
echo "Prior deploy SHA: ${PREV_SHA}"

If .rollback_state is missing (should not happen if deploy script was followed):

# List SHA-tagged images to identify the prior build
docker images buywhere-api --format "table {{.Tag}}\t{{.CreatedAt}}" | grep sha-
# Use the second-most-recent sha tag as PREV_API_IMAGE
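Picking the second-most-recent tag by eye is error-prone under pressure, so it can be scripted. The sketch below assumes tab-separated `tag<TAB>created` lines (the same format string without the `table` prefix) and that all CreatedAt values share one timezone, so they sort correctly as text. The helper name `prev_sha_tag` is hypothetical.

```shell
# Hypothetical helper: read "tag<TAB>created" lines on stdin and print the
# second-newest tag.
# Assumption: all CreatedAt values share one timezone, so they sort as text.
prev_sha_tag() {
  sort -t "$(printf '\t')" -k2,2r | sed -n '2p' | cut -f1
}
```

A possible use, which leaves the result for a human to confirm before re-tagging, is `docker images buywhere-api --format '{{.Tag}}\t{{.CreatedAt}}' | grep sha- | prev_sha_tag`. The printed tag would then be combined with the repository name to form PREV_API_IMAGE.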

Step 3 — Re-tag prior image as latest

docker tag "${PREV_API_IMAGE}" buywhere-api:latest
docker tag "${PREV_MCP_IMAGE}" buywhere-mcp:latest
echo "Re-tagged prior image as :latest"

Step 4 — Restart API with prior image

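# Restarts only api. mcp and scraper-scheduler (stopped in Step 1) stay down
# until they are restarted separately, e.g. after Step 5 passes.
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml up -d api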

Wait for health (up to 2 minutes):

for i in $(seq 1 30); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "Rollback API healthy after ${i}x4s" && break
  [ $i -eq 30 ] && echo "ROLLBACK HEALTH TIMEOUT — escalate to Rex" && break
  sleep 4
done

Step 5 — Verify rollback

# Health
curl -sf https://api.buywhere.ai/health | python3 -m json.tool

# Search functional
curl -sf "https://api.buywhere.ai/v1/search?q=laptop&limit=1&currency=USD" | python3 -m json.tool

# Affiliate redirect
curl -Ls -o /dev/null -w "%{url_effective}\n" https://api.buywhere.ai/go/B09G9HDHJT

Pass criteria: /health returns 200, search returns at least one result, and the final redirect URL contains buywhere-20.
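The redirect criterion is the easiest of the three to misread in a long URL, so a one-line check can help. A minimal sketch with a hypothetical helper name, `redirect_has_tag`:

```shell
# Hypothetical helper: exit 0 when a final redirect URL carries the affiliate
# tag named in the pass criteria.
redirect_has_tag() {
  case "$1" in
    *buywhere-20*) return 0 ;;
    *)             return 1 ;;
  esac
}
```

A possible use is to capture the output of the affiliate redirect command above into a variable and run `redirect_has_tag "$url" || echo "affiliate tag missing: notify Link"`.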

Step 6 — If rollback itself fails (migration rollback)

Only if the new deploy introduced a breaking database migration:

# Identify previous migration version from .rollback_state PREV_SHA
# Then downgrade alembic:
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml run --rm \
  migrate alembic downgrade -1

Warning: Only execute if Bolt has confirmed the migration is reversible. Read the migration script first.

3.4 Disk Emergency Cleanup

If disk > 85% is causing the incident:

# Identify large consumers
du -sh /home/paperclip/buywhere-api/* | sort -rh | head -10

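# Clear old scrape data (.jsonl older than 3 days, .json older than 7 days).
# These patterns match every such file under the repo, including config files;
# preview with -print in place of -delete before running.
find /home/paperclip/buywhere-api -name "*.jsonl" -mtime +3 -delete
find /home/paperclip/buywhere-api -name "*.json" -mtime +7 -delete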

# Clear Docker build cache (does NOT affect running containers)
docker builder prune -f

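# Check the API container's log file size (assumes the default json-file log driver)
LOG_PATH=$(docker inspect --format='{{.LogPath}}' buywhere-api-api-1)
sudo du -h "$LOG_PATH"
# If excessive, truncate in place (the container keeps running):
# sudo truncate -s 0 "$LOG_PATH"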

4. Escalation Path

4.1 Who to Contact

Role	Agent	Contact	Escalate when
Incident lead	Rex (CTO)	`#incidents-critical` + DM	All P0/P1 — Rex auto-leads
Infra	Bolt	`#incidents-critical` + DM	Any container/DB/disk issue
Frontend	Sol	`#incidents` + DM	UI errors, USD formatting, 404s
Affiliate	Link	`#incidents` + DM	`/go/` redirect failures
QA	Atlas	`#us-launch-ops`	Smoke test failures, error triage
CEO	Vera	DM	P1 unresolved > 30 min; data loss risk; public comms needed

4.2 Escalation Timeline

P0 — API down:

T+0m   Rex acknowledges → posts to #incidents-critical
T+5m   Bolt engaged on infra diagnosis
T+10m  Domain owner (Sol/Link/Atlas) engaged as needed
T+20m  Vera notified via DM if not resolved
T+30m  All-hands war room; consider public status update

P1 — Severe degradation:

T+0m   Domain owner notified in #incidents
T+10m  Rex engaged if not resolved
T+20m  Rex posts status to #us-launch-ops
T+30m  Treat as P0 if unresolved — follow P0 escalation

P2 — Single feature/retailer broken:

T+0m   Domain owner triages in #us-launch-ops
T+30m  Rex aware (check-in comment)
T+2h   Escalate to P1 if unresolved

4.3 Slack Channels

Channel	Use
`#us-launch-ops`	Primary launch war room — all status updates here
`#incidents`	P1/P2 incidents
`#incidents-critical`	P0 only — all-hands
`#us-alerts`	Auto-alerts from uptime monitor

5. Communication Template

5.1 Internal Update (post to `#us-launch-ops` every 15 min during active incident)

[HH:MM EST] Incident update — [P0/P1/P2]
Status: 🔴 DOWN / 🟡 DEGRADED / 🟢 RECOVERING

- Impact: [what users are experiencing]
- Root cause: [known / investigating]
- Action taken: [what we've done]
- ETA to resolution: [estimate or "unknown — next update in 15 min"]

— Rex

5.2 Public Status Post (if outage lasts > 15 minutes)

Post to the BuyWhere status page and/or the developer Slack if applicable:

[HH:MM UTC] Investigating — API degradation

We are aware of issues affecting the BuyWhere API. Our team is actively
investigating. Product search and affiliate redirects may be slow or
unavailable during this period.

We will post an update in 15 minutes.
— BuyWhere Engineering

Resolution post:

[HH:MM UTC] Resolved — API restored

The API is fully operational. Impact window: [HH:MM]–[HH:MM] UTC.
Root cause: [one sentence].
We will publish a post-mortem within 24 hours.
— BuyWhere Engineering

5.3 All-Clear (internal)