BuyWhere Production Deploy Runbook — April 23, 2026
Issue: BUY-3511
Classification: Internal, Confidential
Owner: Rex (CTO) / Bolt (Infra)
Deploy Window: April 23, 2026, 07:00–08:30 EST (pre-launch; launch is 09:00 EST)
Last Updated: 2026-04-19
This runbook covers the production deploy sequence only. For launch-day ops, comms, and incident escalation, see docs/launch-day-runbook.md.
Quick Reference
| Resource | Value |
|---|---|
| Repo root | /home/paperclip/buywhere-api/ |
| Compose file | docker-compose.prod.yml |
| Deploy script | ./deploy.sh |
| API (local) | http://localhost:8000 |
| API (public) | https://api.buywhere.ai |
| MCP (local) | http://localhost:8080 |
| Rollback state file | .rollback_state |
1. Pre-Deploy Checklist
Run this before issuing any deploy commands. All boxes must be checked.
1.1 Git State
cd /home/paperclip/buywhere-api
# Confirm you are on master and fully synced
git status
git log --oneline -3
- Working tree is clean (nothing to commit)
- Latest commit matches the build you intend to ship
- Note the commit SHA; you will need it for rollback tagging:
export DEPLOY_SHA=$(git rev-parse --short HEAD)
echo "Deploy SHA: $DEPLOY_SHA"
1.2 Environment File
# Verify .env exists and is not zero-length
ls -lh .env
- .env present (not .env.example)
- All required vars set (run a quick spot-check):
grep -E "DATABASE_URL|REDIS_URL|JWT_SECRET_KEY|POSTGRES_PASSWORD|AFFILIATE_TAG|USD_DEFAULT|US_REGION" .env
Expected: all 7 keys present with non-empty values.
- USD_DEFAULT is set (controls US dollar pricing)
- US_REGION=us is set
- AFFILIATE_TAG=buywhere-20 is set
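The grep spot-check prints matching lines but does not flag a key that is absent or set to an empty value. A minimal sketch of a stricter check follows. check_env_keys is a hypothetical helper (not part of deploy.sh), and it assumes plain KEY=value lines with no export prefix and no multi-line values.

```bash
# Hypothetical helper, not part of deploy.sh.
# Prints every required key that is missing or empty in the given env file
# and returns non-zero if any were found.
check_env_keys() {
  local env_file="$1"; shift
  local key value missing=0
  for key in "$@"; do
    # Use the last assignment of the key, if any
    value=$(grep -E "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2- || true)
    # Strip one pair of optional surrounding double quotes
    value="${value%\"}"
    value="${value#\"}"
    if [ -z "$value" ]; then
      echo "MISSING or empty: $key"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage on the host:
#   check_env_keys .env DATABASE_URL REDIS_URL JWT_SECRET_KEY POSTGRES_PASSWORD \
#     AFFILIATE_TAG USD_DEFAULT US_REGION || echo "Fix .env before deploying"
```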
1.3 Database Health (running stack)
Only if the stack is already running:
# DB primary accepting connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog -c "SELECT 1;" 2>&1
# Expected: a one-row result containing 1, followed by "(1 row)"
# DB replica accepting connections (SELECT 1 confirms reachability only, not replication lag)
docker exec buywhere-api-db_replica-1 psql -U buywhere -d catalog -c "SELECT 1;" 2>&1
# PgBouncer pool health
docker exec buywhere-api-pgbouncer-1 psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;" 2>/dev/null | head -10
# Redis responsive
docker exec buywhere-api-redis-1 redis-cli ping
# Expected: PONG
- DB primary: SELECT 1 returns successfully
- DB replica: SELECT 1 returns successfully
- PgBouncer: pool usage < 70%
- Redis: returns PONG
1.4 Disk Space
df -h / | tail -1
- Disk usage < 85%. If ≥ 85%, run cleanup before proceeding:
docker system prune -f --volumes=false
find /home/paperclip/buywhere-api -name "*.log" -mtime +7 -delete
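The 85% threshold can be checked mechanically rather than by eye. The sketch below assumes POSIX df -P output, where the fifth column of the data row is the usage percentage. parse_use_pct and disk_ok are hypothetical names, not existing scripts.

```bash
# Hypothetical helpers, not part of deploy.sh.
# Read `df -P` output on stdin and print the usage of the first
# filesystem row as a bare integer (the "%" sign is stripped).
parse_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is strictly below the limit (default 85, per the checklist).
disk_ok() {
  local pct="$1" limit="${2:-85}"
  [ "$pct" -lt "$limit" ]
}

# Usage on the host:
#   pct=$(df -P / | parse_use_pct)
#   disk_ok "$pct" || echo "Disk at ${pct}%: run cleanup before proceeding"
```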
1.5 Last Backup
ls -lh /home/paperclip/buywhere-api/backups/ | tail -5
- A backup exists dated within the last 24 hours. If not, run manual backup:
./deploy.sh backup
Do not proceed without a backup.
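The 24-hour rule can be tested directly instead of reading dates off the ls output. This is a sketch under one assumption: that a backup file's modification time reflects when the backup was taken. has_recent_backup is a hypothetical helper.

```bash
# Hypothetical helper, not part of deploy.sh.
# Succeeds if the directory holds at least one regular file modified
# within the last max_age_min minutes (default 1440 = 24 hours).
has_recent_backup() {
  local dir="$1" max_age_min="${2:-1440}"
  local hit
  hit=$(find "$dir" -maxdepth 1 -type f -mmin "-${max_age_min}" 2>/dev/null | head -n 1)
  [ -n "$hit" ]
}

# Usage on the host:
#   has_recent_backup /home/paperclip/buywhere-api/backups \
#     || echo "No backup in the last 24h: run ./deploy.sh backup"
```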
1.6 Pending Migrations Check
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
-c "SELECT version_num FROM alembic_version;"
Note the current migration version. After deploy, it must advance to the latest head.
2. Deploy Command Sequence
Run as Bolt on the production host. Run all commands from /home/paperclip/buywhere-api/.
Step 1 — Save Rollback State
cd /home/paperclip/buywhere-api
# Capture the current image IDs before the build overwrites :latest
PREV_IMAGE=$(docker inspect buywhere-api:latest --format='{{.Id}}' 2>/dev/null || echo "none")
PREV_MCP_IMAGE=$(docker inspect buywhere-mcp:latest --format='{{.Id}}' 2>/dev/null || echo "none")
DEPLOY_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
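# HEAD is the commit being deployed; it is stored below as PREV_SHA, so read that field as "SHA at deploy time"
DEPLOY_SHA=$(git rev-parse --short HEAD)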
cat > .rollback_state <<EOF
PREV_API_IMAGE="${PREV_IMAGE}"
PREV_MCP_IMAGE="${PREV_MCP_IMAGE}"
PREV_SHA="${DEPLOY_SHA}"
DEPLOY_TIME="${DEPLOY_TIME}"
EOF
echo "Rollback state saved:"
cat .rollback_state
Step 2 — Build Docker Images
docker compose -f docker-compose.prod.yml build --parallel 2>&1 | tee /tmp/build-${DEPLOY_SHA}.log
# $? after a pipe would report tee's status; PIPESTATUS[0] (bash) is the build's
echo "Build exit code: ${PIPESTATUS[0]}"
- Build exits 0. If non-zero, stop; do not proceed.
- Check the build log for warnings about missing packages or failed test steps.
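Because the build output is piped through tee, the status that matters is the one from the first command in the pipeline, not from tee. A small wrapper keeps the log and preserves that status. run_logged is a hypothetical helper and relies on bash's PIPESTATUS array, so it will not work under plain sh.

```bash
# Hypothetical helper, not part of deploy.sh. Requires bash.
# Runs a command, copies its combined stdout/stderr to a log file,
# and returns the command's own exit status rather than tee's.
run_logged() {
  local log_file="$1"; shift
  "$@" 2>&1 | tee "$log_file"
  return "${PIPESTATUS[0]}"
}

# Usage on the host:
#   run_logged "/tmp/build-${DEPLOY_SHA}.log" \
#     docker compose -f docker-compose.prod.yml build --parallel \
#     || echo "Build failed: stop here"
```

The same wrapper fits the migration step, where a masked failure would be more costly.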
Tag the image with the deploy SHA for traceability:
docker tag buywhere-api:latest buywhere-api:sha-${DEPLOY_SHA}
docker tag buywhere-mcp:latest buywhere-mcp:sha-${DEPLOY_SHA}
echo "Tagged: buywhere-api:sha-${DEPLOY_SHA}"
Step 3 — Start DB and Redis
docker compose -f docker-compose.prod.yml up -d db db_replica pgbouncer redis
Wait for all to be healthy (up to 60s):
for svc in db db_replica pgbouncer redis; do
  echo -n "Waiting for $svc... "
  for i in $(seq 1 30); do
    STATUS=$(docker inspect --format='{{.State.Health.Status}}' "buywhere-api-${svc}-1" 2>/dev/null || echo "unknown")
    [ "$STATUS" = "healthy" ] && echo "OK" && break
    [ "$i" -eq 30 ] && echo "TIMEOUT, check: docker logs buywhere-api-${svc}-1"
    sleep 2
  done
done
- All 4 services: healthy
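The health waits in this runbook all share one shape: try a check, sleep, give up after a fixed number of attempts. A generic version might look like the sketch below. wait_until and container_healthy are hypothetical helpers; the container naming follows the buywhere-api-<service>-1 pattern used above.

```bash
# Hypothetical helpers, not part of deploy.sh.
# Retry a command up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
wait_until() {
  local attempts="$1" delay="$2"; shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      echo "OK after ${i} attempt(s)"
      return 0
    fi
    # Do not sleep after the final failed attempt
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  echo "TIMEOUT after ${attempts} attempt(s)" >&2
  return 1
}

# True when Docker reports the container's health check as healthy.
container_healthy() {
  [ "$(docker inspect --format='{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}

# Usage on the host (30 tries x 2s = 60s per service):
#   for svc in db db_replica pgbouncer redis; do
#     wait_until 30 2 container_healthy "buywhere-api-${svc}-1" || break
#   done
```

Unlike the inline loop, a non-zero return lets a calling script stop instead of continuing past a timeout.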
Step 4 — Run Database Migrations
docker compose -f docker-compose.prod.yml run --rm migrate 2>&1 | tee /tmp/migrate-${DEPLOY_SHA}.log
# $? after a pipe would report tee's status; PIPESTATUS[0] (bash) is the migration's
echo "Migration exit code: ${PIPESTATUS[0]}"
- Exit code 0. If non-zero, stop immediately; do not bring up the API.
- Verify migration applied:
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
  -c "SELECT version_num FROM alembic_version;"
Confirm the version matches the latest migration file in alembic/versions/.
Step 5 — Start API
docker compose -f docker-compose.prod.yml up -d api
Wait for health (up to 120s — the API has a 120s start_period):
echo "Waiting for API health..."
for i in $(seq 1 60); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "API healthy after ${i}x2s" && break
  [ "$i" -eq 60 ] && echo "API health TIMEOUT" && docker logs --tail=50 buywhere-api-api-1
  sleep 2
done
- API returns HTTP 200 on /health
Step 6 — Start MCP and Supporting Services
docker compose -f docker-compose.prod.yml up -d mcp scraper-scheduler
Wait for MCP (30s):
for i in $(seq 1 15); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8080/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "MCP healthy" && break
  [ "$i" -eq 15 ] && echo "WARNING: MCP health timeout (non-blocking, continue)"
  sleep 2
done
MCP health failure is non-blocking for launch but log it.
Step 7 — Start Monitoring and Cron Services
docker compose -f docker-compose.prod.yml up -d \
backup-cron \
metrics-collector \
blackbox-exporter \
alertmanager \
loki \
fluent-bit \
grafana
Step 8 — Verify Full Stack
docker compose -f docker-compose.prod.yml ps
Expected output: all critical services Up with (healthy):
- buywhere-api-api-1: Up (healthy)
- buywhere-api-db-1: Up (healthy)
- buywhere-api-db_replica-1: Up (healthy)
- buywhere-api-pgbouncer-1: Up (healthy)
- buywhere-api-redis-1: Up (healthy)
- buywhere-api-mcp-1: Up (healthy) or Up (non-blocking)
3. Smoke Test Suite
Run these after deploy completes. All must pass before signalling Go to Rex.
Set the base URL once:
API="https://api.buywhere.ai"
Test 1 — Health Endpoint
curl -sf "${API}/health" | python3 -m json.tool
Pass: HTTP 200, "status": "ok" in response body.
Test 2 — Detailed Health (DB + dependencies)
curl -sf "${API}/health/detailed" | python3 -m json.tool
Pass: HTTP 200, "status": "healthy", db_response_ms present and < 500ms.
Test 3 — Product Listing
curl -sf "${API}/v1/products?limit=1" | python3 -m json.tool
Pass: HTTP 200, total field > 0, products array non-empty.
Test 4 — Search
curl -sf "${API}/v1/search?q=laptop&limit=3&currency=USD" | python3 -m json.tool
Pass: HTTP 200, total > 0, at least 1 product in results with currency: "USD".
Test 5 — Search (mobile)
curl -sf "${API}/v1/search?q=iphone&limit=1&currency=USD" | jq '{total, first_product: .products[0].name}'
Pass: Returns a named product result.
Test 6 — Affiliate Redirect
curl -Ls -o /dev/null -w "%{url_effective}\n" "${API}/go/B09G9HDHJT"
Pass: Final URL is an amazon.com URL containing tag=buywhere-20.
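The pass condition for the redirect can be checked in the shell rather than by reading the URL. This is a sketch with stated assumptions: the accepted hosts are amazon.com and www.amazon.com over HTTPS, and the tag arrives as a tag= query parameter. affiliate_url_ok is a hypothetical helper.

```bash
# Hypothetical helper, not part of deploy.sh.
# Succeeds when the URL is on an accepted Amazon host and carries the exact
# affiliate tag as a query parameter (default tag: buywhere-20).
affiliate_url_ok() {
  local url="$1" tag="${2:-buywhere-20}"
  # Host check: assumed list of accepted hosts
  case "$url" in
    https://amazon.com/*|https://www.amazon.com/*) ;;
    *) return 1 ;;
  esac
  # Tag check: must be a whole parameter, so buywhere-200 does not match
  case "$url" in
    *"?tag=${tag}"|*"?tag=${tag}&"*|*"&tag=${tag}"|*"&tag=${tag}&"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   final=$(curl -Ls -o /dev/null -w "%{url_effective}" "${API}/go/B09G9HDHJT")
#   affiliate_url_ok "$final" || echo "FAIL: $final"
```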
Test 7 — MCP Health
curl -sf "http://localhost:8080/health" | python3 -m json.tool
Pass: HTTP 200, "status": "ok".
Test 8 — Catalog Status
curl -sf "${API}/v1/status" \
-H "Authorization: Bearer ${BUYWHERE_API_KEY}" | python3 -m json.tool
Pass: HTTP 200, total_active_products > 1,000,000.
Test 9 — Latency Baseline
for i in 1 2 3; do
  curl -sf -o /dev/null -w "Search latency: %{time_total}s\n" \
    "${API}/v1/search?q=phone&limit=5&currency=USD"
done
Pass: All 3 requests complete in < 2s. If any > 2s, flag to Rex before Go.
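curl reports time_total as a decimal number of seconds, which the shell's integer test cannot compare. One way to apply the 2s limit is to delegate the comparison to awk. latency_ok is a hypothetical helper.

```bash
# Hypothetical helper, not part of deploy.sh.
# Succeeds when the measured time (decimal seconds) is strictly below
# the limit (default 2 seconds, per the pass condition).
latency_ok() {
  awk -v t="$1" -v max="${2:-2}" 'BEGIN { exit !(t + 0 < max + 0) }'
}

# Usage:
#   t=$(curl -sf -o /dev/null -w "%{time_total}" "${API}/v1/search?q=phone&limit=5&currency=USD")
#   latency_ok "$t" || echo "Slow search (${t}s): flag to Rex before Go"
```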
Test 10 — SSL Certificate
echo | openssl s_client -connect api.buywhere.ai:443 2>/dev/null \
| openssl x509 -noout -dates
Pass: notAfter is > 30 days from today (> 2026-05-23).
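The 30-day margin can be computed from the notAfter value instead of by eye. The sketch below assumes GNU date (its -d option parses the timestamp format that openssl prints); BSD or BusyBox date would need a different invocation. days_until is a hypothetical helper.

```bash
# Hypothetical helper, not part of deploy.sh. Assumes GNU date.
# Prints the whole number of days between now (or the epoch given as the
# second argument) and the given timestamp.
days_until() {
  local not_after="$1" now_epoch="${2:-$(date -u +%s)}"
  local end_epoch
  end_epoch=$(date -u -d "$not_after" +%s) || return 2
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage:
#   not_after=$(echo | openssl s_client -connect api.buywhere.ai:443 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   [ "$(days_until "$not_after")" -gt 30 ] || echo "Certificate expires too soon"
```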
4. Go/No-Go Decision Points
Rex calls Go or Hold at 08:30 EST based on Bolt's smoke test report.
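For the report to Rex, a running pass/fail tally is easier to read than ten separate command outputs. The sketch below is one possible shape. check and summary are hypothetical helpers that judge a test only by its command's exit status, so body-level conditions (for example total > 0) still need their own command.

```bash
# Hypothetical helpers, not part of deploy.sh.
PASS=0
FAIL=0

# Run one named check; count it as passed if the command exits 0.
check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1))
    echo "PASS  $name"
  else
    FAIL=$((FAIL + 1))
    echo "FAIL  $name"
  fi
}

# Print the tally; succeeds only when nothing failed.
summary() {
  echo "passed=$PASS failed=$FAIL"
  [ "$FAIL" -eq 0 ]
}

# Usage:
#   check "health"   curl -sf "${API}/health"
#   check "products" curl -sf "${API}/v1/products?limit=1"
#   summary || echo "HOLD"
```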
Mandatory Go Conditions
All must be met to proceed to launch:
| # | Condition | Threshold | Action if failed |
|---|---|---|---|
| 1 | API health endpoint | HTTP 200, status: ok | HOLD — page Bolt |
| 2 | DB healthy | Detailed health shows healthy | HOLD — page Bolt |
| 3 | Search functional | Returns results with USD pricing | HOLD — page Bolt |
| 4 | Affiliate redirects | /go/ appends tag=buywhere-20 | HOLD — page Sol/Link |
| 5 | Backup completed | < 24h old | HOLD — run backup first |
| 6 | Disk usage | < 85% | HOLD — emergency cleanup |
| 7 | P99 latency | < 2s on search | HOLD — investigate warm-up |
Rollback Triggers (Post-Launch)
Initiate rollback if any condition is sustained:
| Metric | Warning (monitor) | Critical (rollback) |
|---|---|---|
| 5xx error rate | > 0.5% for 5 min | > 2% for 5 min |
| P99 search latency | > 1s for 5 min | > 3s for 5 min |
| API health check | 1 failure | 3 consecutive failures |
| DB pool usage | > 70% | > 90% |
| Redis memory | > 70% | > 85% |
| Disk | > 85% | > 92% |
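Each row of the table maps a reading to one of three states. A minimal sketch of that mapping follows. classify is a hypothetical helper; it evaluates a single sample and does not model the "sustained for 5 min" or "consecutive failures" conditions, which need a time series from the monitoring stack.

```bash
# Hypothetical helper, not part of deploy.sh.
# Prints ok, warning, or critical for a value against the two thresholds
# of one table row. Values exactly at a threshold stay in the lower state,
# matching the ">" comparisons in the table.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}

# Usage, with thresholds taken from the table:
#   classify "$error_rate_pct" 0.5 2    # 5xx error rate
#   classify "$disk_pct" 85 92          # disk usage
```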
5. Rollback Procedure
5.1 When to Roll Back
Call rollback if:
- Any mandatory Go condition fails after deploy and cannot be fixed in < 15 min
- 5xx error rate > 2% sustained for > 5 minutes post-launch
- P99 latency > 3s sustained for > 5 minutes
- DB pool exhaustion (> 90%) with no quick fix
- Data corruption confirmed
Rex calls the rollback. Post to #us-launch-ops before executing:
⚠️ ROLLBACK INITIATED — [HH:MM EST]
Reason: [one sentence]
Lead: Bolt
ETA: [estimate]
5.2 Rollback Steps
Step 1 — Stop API and MCP (to prevent further writes)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml stop api mcp scraper-scheduler
Step 2 — Load rollback state
source /home/paperclip/buywhere-api/.rollback_state
echo "Rolling back to image ID: ${PREV_API_IMAGE}"
If .rollback_state is missing, use the tagged SHA image:
# List recent SHA-tagged images
docker images buywhere-api --format "table {{.Tag}}\t{{.CreatedAt}}" | grep sha
# Use the most recent prior sha tag:
# PREV_API_IMAGE=buywhere-api:sha-<previous-sha>
Step 3 — Restore previous image tag
# Re-tag the prior image as :latest
docker tag "${PREV_API_IMAGE}" buywhere-api:latest
docker tag "${PREV_MCP_IMAGE}" buywhere-mcp:latest
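The deploy step that saves rollback state writes the literal value none when no prior image exists, and docker tag fails on that value. A guard before re-tagging can catch this early. load_rollback_state is a hypothetical helper, sketched on the assumption that the state file contains only the shell variable assignments written at deploy time.

```bash
# Hypothetical helper, not part of deploy.sh or scripts/rollback.sh.
# Sources the state file and fails if it is missing, empty, or records
# no usable previous API image.
load_rollback_state() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "Rollback state missing or empty: $file" >&2
    return 1
  fi
  # The file holds plain VAR="value" assignments
  . "$file"
  if [ -z "${PREV_API_IMAGE:-}" ] || [ "$PREV_API_IMAGE" = "none" ]; then
    echo "No previous API image recorded in $file" >&2
    return 1
  fi
}

# Usage on the host:
#   load_rollback_state /home/paperclip/buywhere-api/.rollback_state \
#     && docker tag "${PREV_API_IMAGE}" buywhere-api:latest
```

If the guard fails, fall back to the SHA-tagged images listed in Step 2.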
Step 4 — Restart services with prior image
# Bring back every service stopped in Step 1
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml up -d api mcp scraper-scheduler
Wait for health:
for i in $(seq 1 30); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "Rollback API healthy" && break
  [ "$i" -eq 30 ] && echo "Rollback API health TIMEOUT" && docker logs --tail=50 buywhere-api-api-1
  sleep 2
done
Step 5 — If rollback also fails, run migration rollback
Only if the deployment introduced a breaking migration:
# .rollback_state records image IDs and the git SHA (PREV_SHA), not the Alembic version.
# Use the version_num noted in Section 1.6 as the downgrade target:
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml run --rm \
-e ALEMBIC_TARGET=<previous-version-num> \
migrate alembic downgrade <previous-version-num>
Warning: Only downgrade if the new migration is confirmed reversible and Bolt has read the migration script.
Step 6 — Verify rollback
Re-run smoke tests 1–5 from Section 3. Confirm all pass before declaring stable.
Step 7 — Communicate
# Post to Slack #us-launch-ops
echo "Post status update to #us-launch-ops and DM Rex"
6. Rollback Verification Checklist
After rollback completes:
- GET /health → HTTP 200, status: ok
- GET /health/detailed → all dependencies healthy
- GET /v1/search?q=laptop&limit=1 → returns results
- GET /go/B09G9HDHJT → redirects with buywhere-20 tag
- Error rate in Sentry < 0.5%
- No new P1 alerts in #us-alerts
- Post rollback complete status to #us-launch-ops
7. Useful Commands Reference
# Full stack status
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml ps
# Tail API logs (live)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs -f api
# Tail MCP logs
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs -f mcp
# DB active connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
-c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# PgBouncer pool stats
docker exec buywhere-api-pgbouncer-1 psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
# Redis memory usage
docker exec buywhere-api-redis-1 redis-cli info memory | grep -E "used_memory_human|maxmemory_human"
# Disk usage
df -h / | tail -1
# Docker stats snapshot
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Emergency: clear Docker build cache (does NOT affect running containers)
docker builder prune -f
# Emergency: restart only the API container
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml restart api
# Emergency: full stack restart (last resort)
./deploy.sh restart
Appendix — Related Documents
| Document | Path |
|---|---|
| Launch-day ops + war room | docs/launch-day-runbook.md |
| Existing generic deploy runbook | DEPLOYMENT_RUNBOOK.md |
| Backup + restore | docs/backup_restore_runbook.md |
| DB architecture | docs/DATABASE_ARCHITECTURE.md |
| Emergency API scaling | docs/emergency_api_scaling_runbook.md |
| Pre-deploy backup script | scripts/pre-deploy-backup.sh |
| Rollback helper script | scripts/rollback.sh |
| Deploy state tracker | scripts/deployment-state.sh |
Authored by Rex (CTO) — 2026-04-19 BuyWhere Production Deploy — April 23, 2026