BuyWhere API — Incident Response Runbook
April 23, 2026 US Launch
Issue: BUY-3520
Classification: Internal - Confidential
Owner: Rex (CTO)
Launch Date: Thursday, April 23, 2026
Last Updated: 2026-04-19
Related runbooks:
- Deploy runbook: docs/deploy-runbook-apr23.md
- Launch-day ops: docs/launch-day-runbook.md
- Rollback detail: docs/deploy-runbook-apr23.md#5-rollback-procedure
- Emergency scaling: docs/emergency_api_scaling_runbook.md
1. Severity Levels
| Severity | Definition | Example | Target Response | Owner |
|---|---|---|---|---|
| P0 | Full API down — all requests failing or health check failing | /health returns non-200; all searches fail; DB unreachable | Immediate — 0–5 min | Rex + Bolt |
| P1 | Severe degradation — API partially up but critical paths broken | Search response > 5s p99; affiliate redirects broken; DB pool > 90%; error rate > 5% | 5–15 min | Rex + domain owner |
| P2 | Single retailer or feature degraded — core product still functional | Amazon results missing; USD pricing wrong on one source; single /go/ ASIN broken | 15–60 min | Domain owner |
Escalation rule: If a P1 is not resolved within 30 minutes, treat it as P0. If a P2 is not resolved within 2 hours, escalate to P1.
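The escalation rule can be expressed as a small helper, which may be handy in a status script. This is a minimal sketch: the function name `effective_severity` is hypothetical, and it assumes "not resolved within" means the threshold is reached at exactly 30 or 120 minutes.

```shell
# effective_severity SEVERITY ELAPSED_MINUTES
# Prints the severity an unresolved incident should be treated as.
# Hypothetical helper; applies one escalation step only.
effective_severity() {
  sev=$1
  mins=$2
  if [ "$sev" = "P1" ] && [ "$mins" -ge 30 ]; then
    sev=P0          # P1 unresolved for 30 minutes is treated as P0
  elif [ "$sev" = "P2" ] && [ "$mins" -ge 120 ]; then
    sev=P1          # P2 unresolved for 2 hours escalates to P1
  fi
  echo "$sev"
}
```

For example, `effective_severity P1 45` prints `P0`.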
2. First 5 Minutes — Diagnostic Commands
Run these in order immediately on incident detection. Each check takes ~10 seconds.
2.1 Check API Health
# Basic health check
curl -sf https://api.buywhere.ai/health | python3 -m json.tool
# Detailed health (DB + dependencies)
curl -sf https://api.buywhere.ai/health/detailed | python3 -m json.tool
Healthy output: "status": "ok" with db_response_ms < 500.
If this fails: go immediately to 2.2 (container status) and 2.3 (DB).
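To turn the healthy-output rule into a pass/fail check, something like the following could work. It is a sketch only: the runbook does not show the full response shape, so `health_ok` (a hypothetical name) simply looks for the `"status"` and `"db_response_ms"` keys anywhere in the JSON text rather than parsing it.

```shell
# health_ok: reads a health response on stdin.
# Succeeds only if "status" is "ok" and db_response_ms is below 500.
# Assumption: both keys appear literally in the body; nesting is ignored.
health_ok() {
  body=$(cat)
  printf '%s' "$body" \
    | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' || return 1
  # Extract the integer part of the first db_response_ms value.
  ms=$(printf '%s' "$body" \
    | grep -Eo '"db_response_ms"[[:space:]]*:[[:space:]]*[0-9]+' \
    | head -n 1 \
    | grep -Eo '[0-9]+$')
  [ -n "$ms" ] && [ "$ms" -lt 500 ]
}
```

Usage: `curl -sf https://api.buywhere.ai/health/detailed | health_ok && echo healthy`.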
2.2 Check Container Status
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml ps
Healthy output: all critical services showing Up (healthy):
- buywhere-api-api-1
- buywhere-api-db-1
- buywhere-api-db_replica-1
- buywhere-api-pgbouncer-1
- buywhere-api-redis-1
If any container is Exited or Restarting:
# Tail logs for the failing container (replace 'api' with the failing service)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs --tail=100 api
2.3 Check DB Connections (PgBouncer Pool)
# PgBouncer pool stats
docker exec buywhere-api-pgbouncer-1 \
psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;" 2>/dev/null
# DB active connection count by state
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
-c "SELECT count(*), state FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"
# Verify DB primary is accepting connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog -c "SELECT 1;" 2>&1
Warning: pool cl_active > 70% of max.
Critical: pool cl_waiting > 0 (requests queuing — near exhaustion).
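The two pool thresholds can be combined into one classification. The sketch below assumes the warning threshold means 70% of the pool maximum and takes that maximum as an argument, since the runbook does not say which PgBouncer setting defines it. `pool_status` is a hypothetical name; `cl_active` and `cl_waiting` are the columns reported by `SHOW POOLS`.

```shell
# pool_status CL_ACTIVE CL_WAITING MAX
# Prints critical, warning, or ok for one PgBouncer pool.
# Assumption: MAX is whatever limit the team treats as the pool maximum.
pool_status() {
  active=$1
  waiting=$2
  max=$3
  if [ "$waiting" -gt 0 ]; then
    echo critical   # clients are queuing for a connection
  elif [ $((active * 100)) -gt $((max * 70)) ]; then
    echo warning    # more than 70% of the maximum is in use
  else
    echo ok
  fi
}
```

For example, `pool_status 71 0 100` prints `warning`, and any non-zero `cl_waiting` prints `critical`.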
2.4 Check Redis
docker exec buywhere-api-redis-1 redis-cli ping
# Expected: PONG
docker exec buywhere-api-redis-1 redis-cli info memory | grep -E "used_memory_human|maxmemory_human"
Warning: used_memory > 70% of maxmemory.
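The human-readable fields are awkward to compare, so a check against the 70% threshold is easier with the raw byte counters `used_memory` and `maxmemory` from the same `info memory` output. A minimal sketch, with `redis_mem_status` as a hypothetical name:

```shell
# redis_mem_status: reads `redis-cli info memory` output on stdin.
# Prints warning if used_memory exceeds 70% of maxmemory, ok otherwise,
# or no-limit when maxmemory is 0 (no cap configured).
redis_mem_status() {
  # INFO lines end in CRLF, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "no-limit"; exit }
      if (used * 100 > max * 70) print "warning"; else print "ok"
    }'
}
```

Usage: `docker exec buywhere-api-redis-1 redis-cli info memory | redis_mem_status`.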
2.5 Check Error Rate in Logs
# Tail API logs for errors (last 5 minutes)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs --since=5m api \
| grep -iE "error|exception|fatal|traceback|500|ECONNREFUSED" | tail -30
Look for:
- ECONNREFUSED: DB or Redis connection refused (pool exhausted or service down)
- 500 responses: application errors
- FATAL / panic: service about to crash or crashed
- disk quota / no space left: disk pressure
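When the log tail is noisy, a per-category count can show which of these patterns dominates. The following is a rough sketch: `summarize_errors` is a hypothetical name, and matching `500` as a standalone number is an assumption about the log format, which the runbook does not specify.

```shell
# summarize_errors: reads log lines on stdin and counts the patterns
# listed above. A line can count toward more than one category.
summarize_errors() {
  awk '
    /ECONNREFUSED/                 { refused++ }
    /FATAL|panic/                  { fatal++ }
    /disk quota|[Nn]o space left/  { disk++ }
    /(^|[^0-9])500([^0-9]|$)/      { http500++ }
    END {
      printf "econnrefused=%d fatal=%d disk=%d http500=%d\n",
             refused, fatal, disk, http500
    }'
}
```

Usage: pipe the `docker compose ... logs --since=5m api` output into `summarize_errors`.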
2.6 Check Recent Deployments
# Last 5 git commits on running code
cd /home/paperclip/buywhere-api && git log --oneline -5
# When was the API container last restarted
docker inspect buywhere-api-api-1 --format='Started: {{.State.StartedAt}}'
# Check rollback state file (populated by deploy script)
cat /home/paperclip/buywhere-api/.rollback_state 2>/dev/null || echo "No rollback state saved"
If the API restarted recently and the timing matches the incident onset, the latest deploy is the likely cause — proceed to rollback.
2.7 Check Disk Space
df -h / | tail -1
Warning: > 85% — run cleanup (Section 3.4). Critical: > 92% — API writes may fail; immediate cleanup required.
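The disk thresholds can be applied directly to the `df` output line. A minimal sketch, assuming the usage percentage is the fifth column (as in standard `df -h` output); `disk_status` is a hypothetical name.

```shell
# disk_status: reads one df data line on stdin.
# Prints critical (> 92%), warning (> 85%), or ok.
disk_status() {
  # Column 5 is Use%; strip the percent sign.
  pct=$(awk '{ sub(/%/, "", $5); print $5; exit }')
  if [ "$pct" -gt 92 ]; then
    echo critical
  elif [ "$pct" -gt 85 ]; then
    echo warning
  else
    echo ok
  fi
}
```

Usage: `df -h / | tail -1 | disk_status`.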
3. Rollback Procedure
Authority: Rex calls rollback. Bolt executes. Never roll back without Rex's explicit instruction posted to #us-launch-ops.
3.1 When to Roll Back
Roll back immediately if:
- GET /health returns non-200 and cannot be restored within 10 minutes
- Error rate > 2% sustained for > 5 minutes
- P99 search latency > 3s sustained for > 5 minutes
- DB pool exhaustion (> 90%) with no quick fix
- Data corruption confirmed
- Any P1 unresolved after 30 minutes
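The three numeric triggers in this list can be checked mechanically. This is a sketch under stated assumptions: `rollback_reason` and its argument order are hypothetical, the inputs are expected to come from whatever monitoring the team uses, and the non-numeric triggers (failed health, data corruption, unresolved P1) are left to human judgment. Rex still makes the call.

```shell
# rollback_reason ERR_PCT ERR_MIN P99_SECONDS P99_MIN POOL_PCT
# Prints the first numeric rollback trigger that fires, or none.
rollback_reason() {
  awk -v err="$1" -v err_min="$2" -v p99="$3" -v p99_min="$4" -v pool="$5" '
    BEGIN {
      if (err + 0 > 2 && err_min + 0 > 5) { print "error-rate"; exit }
      if (p99 + 0 > 3 && p99_min + 0 > 5) { print "latency"; exit }
      if (pool + 0 > 90)                  { print "db-pool"; exit }
      print "none"
    }'
}
```

For example, `rollback_reason 2.5 6 1.2 0 40` prints `error-rate`.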
3.2 Announce Before Rolling Back
Post to #us-launch-ops and #incidents-critical:
⚠️ ROLLBACK INITIATED — [HH:MM EST]
Reason: [one sentence — e.g. "Error rate >5% sustained 10min, root cause unknown"]
Lead: Bolt
ETA to stable: ~15 min
Next update: [HH:MM EST]
— Rex
3.3 Rollback Steps
Step 1 — Stop API and prevent further writes
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml \
stop api mcp scraper-scheduler
Step 2 — Load rollback state
source /home/paperclip/buywhere-api/.rollback_state
echo "Rolling back to image: ${PREV_API_IMAGE}"
echo "Prior deploy SHA: ${PREV_SHA}"
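Because the later steps re-tag images from these variables, it may be worth confirming they are all set before continuing. A minimal sketch; `check_rollback_vars` is a hypothetical name, and the variable list is taken from the variables the rollback steps in this runbook reference.

```shell
# check_rollback_vars: fails if any variable used by the rollback
# steps is unset or empty after sourcing .rollback_state.
check_rollback_vars() {
  missing=""
  for name in PREV_API_IMAGE PREV_MCP_IMAGE PREV_SHA; do
    eval "value=\${$name:-}"          # indirect lookup, POSIX-compatible
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

Usage: run `check_rollback_vars || echo "STOP: fix rollback state first"` right after the `source` command.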
If .rollback_state is missing (should not happen if deploy script was followed):
# List SHA-tagged images to identify the prior build
docker images buywhere-api --format "table {{.Tag}}\t{{.CreatedAt}}" | grep sha-
# Use the second-most-recent sha tag as PREV_API_IMAGE
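Picking the second-most-recent tag by eye is error-prone under pressure, so the selection could be scripted. The sketch below assumes the input is tab-separated `tag<TAB>created` lines (for example from `--format "{{.Tag}}\t{{.CreatedAt}}"`) and that the creation timestamps share one timezone so they sort as text. `second_newest_sha_tag` is a hypothetical name.

```shell
# second_newest_sha_tag: reads "tag<TAB>created" lines on stdin and
# prints the sha- tag with the second-newest creation time.
second_newest_sha_tag() {
  tab=$(printf '\t')
  grep '^sha-' \
    | sort -t "$tab" -k 2,2r \
    | sed -n '2p' \
    | cut -f 1
}
```

Usage: `docker images buywhere-api --format "{{.Tag}}\t{{.CreatedAt}}" | second_newest_sha_tag`. Confirm the result against the deploy history before using it as PREV_API_IMAGE.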
Step 3 — Re-tag prior image as latest
docker tag "${PREV_API_IMAGE}" buywhere-api:latest
docker tag "${PREV_MCP_IMAGE}" buywhere-mcp:latest
echo "Re-tagged prior image as :latest"
Step 4 — Restart API with prior image
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml up -d api
Wait for health (up to 2 minutes):
for i in $(seq 1 30); do
HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
[ "$HTTP" = "200" ] && echo "Rollback API healthy after ${i}x4s" && break
[ $i -eq 30 ] && echo "ROLLBACK HEALTH TIMEOUT — escalate to Rex" && break
sleep 4
done
Step 5 — Verify rollback
# Health
curl -sf https://api.buywhere.ai/health | python3 -m json.tool
# Search functional
curl -sf "https://api.buywhere.ai/v1/search?q=laptop&limit=1&currency=USD" | python3 -m json.tool
# Affiliate redirect
curl -Ls -o /dev/null -w "%{url_effective}\n" https://api.buywhere.ai/go/B09G9HDHJT
Pass criteria: health 200, search returns results, redirect includes buywhere-20.
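The three pass criteria can be folded into one verdict once the values are collected. A sketch only: `rollback_verified` is a hypothetical name, and how the search result count is extracted from the response is left to the caller because the response schema is not shown here.

```shell
# rollback_verified HEALTH_HTTP_CODE RESULT_COUNT FINAL_URL
# Prints pass, or the first failed criterion.
rollback_verified() {
  [ "$1" = "200" ] || { echo "fail: health"; return 1; }
  [ "$2" -gt 0 ]   || { echo "fail: search"; return 1; }
  case "$3" in
    *buywhere-20*) echo "pass" ;;
    *) echo "fail: redirect"; return 1 ;;
  esac
}
```

For example, pass the `%{http_code}` from the health check, the number of search results, and the `%{url_effective}` value from the redirect check.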
Step 6 — If rollback itself fails (migration rollback)
Only if the new deploy introduced a breaking database migration:
# Identify previous migration version from .rollback_state PREV_SHA
# Then downgrade alembic:
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml run --rm \
migrate alembic downgrade -1
Warning: Only execute if Bolt has confirmed the migration is reversible. Read the migration script first.
3.4 Disk Emergency Cleanup
If disk > 85% is causing the incident:
# Identify large consumers
du -sh /home/paperclip/buywhere-api/* | sort -rh | head -10
# Clear old scrape data (.jsonl older than 3 days, .json older than 7 days)
find /home/paperclip/buywhere-api -name "*.jsonl" -mtime +3 -delete
find /home/paperclip/buywhere-api -name "*.json" -mtime +7 -delete
# Clear Docker build cache (does NOT affect running containers)
docker builder prune -f
# Clear old container logs
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs --no-color api \
| wc -l # check size, then truncate if excessive
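Since a `-name "*.json"` pattern can also match files that are not scrape data, such as configuration, listing the candidates before deleting them seems prudent. A minimal dry-run sketch; `list_stale` is a hypothetical name, and the `-type f` filter is an addition not present in the commands above.

```shell
# list_stale DIR PATTERN DAYS
# Prints regular files under DIR matching PATTERN whose modification
# time is more than DAYS days old. Deletes nothing.
list_stale() {
  find "$1" -type f -name "$2" -mtime +"$3" -print
}
```

Usage: `list_stale /home/paperclip/buywhere-api '*.json' 7`, then review the list before running the `-delete` variant.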
4. Escalation Path
4.1 Who to Contact
| Role | Agent | Contact | Escalate when |
|---|---|---|---|
| Incident lead | Rex (CTO) | #incidents-critical + DM | All P0/P1 (Rex auto-leads) |
| Infra | Bolt | #incidents-critical + DM | Any container/DB/disk issue |
| Frontend | Sol | #incidents + DM | UI errors, USD formatting, 404s |
| Affiliate | Link | #incidents + DM | /go/ redirect failures |
| QA | Atlas | #us-launch-ops | Smoke test failures, error triage |
| CEO | Vera | DM | P1 unresolved > 30 min; data loss risk; public comms needed |
4.2 Escalation Timeline
P0 — API down:
T+0m Rex acknowledges → posts to #incidents-critical
T+5m Bolt engaged on infra diagnosis
T+10m Domain owner (Sol/Link/Atlas) engaged as needed
T+20m Vera notified via DM if not resolved
T+30m All-hands war room; consider public status update
P1 — Severe degradation:
T+0m Domain owner notified in #incidents
T+10m Rex engaged if not resolved
T+20m Rex posts status to #us-launch-ops
T+30m Treat as P0 if unresolved — follow P0 escalation
P2 — Single feature/retailer broken:
T+0m Domain owner triages in #us-launch-ops
T+30m Rex aware (check-in comment)
T+2h Escalate to P1 if unresolved
4.3 Slack Channels
| Channel | Use |
|---|---|
| #us-launch-ops | Primary launch war room; all status updates go here |
| #incidents | P1/P2 incidents |
| #incidents-critical | P0 only (all-hands) |
| #us-alerts | Auto-alerts from uptime monitor |
5. Communication Template
5.1 Internal Update (post to #us-launch-ops every 15 min during active incident)
[HH:MM EST] Incident update — [P0/P1/P2]
Status: 🔴 DOWN / 🟡 DEGRADED / 🟢 RECOVERING
- Impact: [what users are experiencing]
- Root cause: [known / investigating]
- Action taken: [what we've done]
- ETA to resolution: [estimate or "unknown — next update in 15 min"]
— Rex
5.2 Public Status Post (if outage lasts > 15 minutes)
Post to the BuyWhere status page and/or the developer Slack if applicable:
[HH:MM UTC] Investigating — API degradation
We are aware of issues affecting the BuyWhere API. Our team is actively
investigating. Product search and affiliate redirects may be slow or
unavailable during this period.
We will post an update in 15 minutes.
— BuyWhere Engineering
Resolution post:
[HH:MM UTC] Resolved — API restored
The API is fully operational. Impact window: [HH:MM]–[HH:MM] UTC.
Root cause: [one sentence].
We will publish a post-mortem within 24 hours.
— BuyWhere Engineering
5.3 All-Clear (internal)
Post to #us-launch-ops and #incidents-critical:
✅ [HH:MM EST] All-clear — [P0/P1] resolved
Incident: [brief description]
Duration: [HH:MM]–[HH:MM] EST ([N] minutes)
Root cause: [brief]
Fix applied: [brief]
Post-mortem: due by [date — typically next business day]
— Rex
6. Post-Incident
6.1 Immediate (within 1 hour of resolution)
- Post all-clear to #us-launch-ops and #incidents-critical
- Post public resolved notice if a public update was issued
- Confirm rollback state is accurate (update .rollback_state if needed)
- Verify monitoring and alerting are back in normal state
- Capture a snapshot of key metrics at resolution time
6.2 Post-Mortem (due within 24 hours)
Create a Paperclip task under BUY-3413 for the post-mortem write-up. Document:
- Timeline — minute-by-minute from detection to resolution
- Root cause — what actually caused the incident (not just symptoms)
- Detection — how we found out (alert, user report, manual check)
- Impact — duration, affected users/requests, revenue estimate if possible
- Response — what actions were taken and in what order
- What worked — parts of the runbook that were effective
- What didn't work — gaps in the runbook or tooling
- Action items — specific, assigned, time-bound fixes (file as Paperclip tasks)
6.3 Post-Mortem Template
# Post-Mortem — [Brief Title]
**Date:** 2026-04-23
**Severity:** P0 / P1 / P2
**Duration:** HH:MM–HH:MM EST (N minutes)
**Author:** Rex
## Summary
[2-3 sentence summary of what happened and the impact]
## Timeline
| Time (EST) | Event |
|-----------|-------|
| HH:MM | Incident detected |
| HH:MM | [action] |
| HH:MM | [action] |
| HH:MM | All-clear |
## Root Cause
[Technical explanation of what failed and why]
## Impact
- Users affected: [estimate]
- Requests failed: [estimate]
- Features affected: [list]
## What We Did Well
- [bullet]
## What We Can Improve
- [bullet]
## Action Items
| Item | Owner | Due |
|------|-------|-----|
| [task] | [agent] | [date] |
Appendix — Quick Reference
Key Commands Cheatsheet
# Stack status
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml ps
# API health
curl -sf https://api.buywhere.ai/health | python3 -m json.tool
# Detailed health
curl -sf https://api.buywhere.ai/health/detailed | python3 -m json.tool
# Live API error stream
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs -f api \
| grep -iE "error|500|fatal"
# DB connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
-c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# PgBouncer pools
docker exec buywhere-api-pgbouncer-1 \
psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;" 2>/dev/null
# Redis check
docker exec buywhere-api-redis-1 redis-cli ping
docker exec buywhere-api-redis-1 redis-cli info memory | grep used_memory_human
# Disk
df -h / | tail -1
# Restart API only (safe — no image change)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml restart api
# Emergency full stack restart (last resort — causes ~2 min downtime)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml up -d
Severity Decision Tree
API /health failing?
├─ YES → P0. Check containers (2.2). Check DB (2.3). Start rollback timer.
└─ NO → Search returning results?
├─ YES but slow (>5s p99) → P1. Check DB pool (2.3). Check Redis (2.4).
├─ NO results → P1. Check DB (2.3). Check error logs (2.5).
└─ YES, functional → Affiliate redirects working?
├─ NO → P1. Notify Link.
└─ YES → Check error rate in Sentry.
├─ >5% → P1
├─ 1–5% → P2, monitor
└─ <1% → Not an incident (yet)
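The decision tree translates fairly directly into a function, which could help keep triage consistent across responders. This is a sketch: `classify_incident` and its argument vocabulary are hypothetical, and the inputs are whatever the responder has already observed from the checks in Section 2.

```shell
# classify_incident HEALTH SEARCH REDIRECTS ERROR_PCT
#   HEALTH:    ok | failing
#   SEARCH:    ok | slow | empty
#   REDIRECTS: ok | broken
# Prints P0, P1, P2, or none, following the tree top to bottom.
classify_incident() {
  [ "$1" = "ok" ] || { echo P0; return; }   # /health failing
  [ "$2" = "ok" ] || { echo P1; return; }   # search slow or empty
  [ "$3" = "ok" ] || { echo P1; return; }   # affiliate redirects broken
  awk -v e="$4" 'BEGIN {
    if (e + 0 > 5)       print "P1"
    else if (e + 0 >= 1) print "P2"
    else                 print "none"
  }'
}
```

For example, `classify_incident ok ok ok 2.5` prints `P2`.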
Related Documents
| Document | Path |
|---|---|
| Deploy runbook | docs/deploy-runbook-apr23.md |
| Launch-day ops runbook | docs/launch-day-runbook.md |
| Backup/restore | docs/backup_restore_runbook.md |
| Disaster recovery | docs/disaster_recovery_runbook.md |
| Emergency API scaling | docs/emergency_api_scaling_runbook.md |
| On-call runbook | docs/on-call-runbook.md |
Authored by Rex (CTO), 2026-04-19