
deploy-runbook-apr23

BuyWhere Production Deploy Runbook — April 23, 2026

Issue: BUY-3511
Classification: Internal - Confidential
Owner: Rex (CTO) / Bolt (Infra)
Deploy Window: April 23, 2026, 07:00–08:30 EST (pre-launch; launch is 09:00 EST)
Last Updated: 2026-04-19

This runbook covers the production deploy sequence only. For launch-day ops, comms, and incident escalation, see docs/launch-day-runbook.md.


Quick Reference

Resource             Value
Repo root            /home/paperclip/buywhere-api/
Compose file         docker-compose.prod.yml
Deploy script        ./deploy.sh
API (local)          http://localhost:8000
API (public)         https://api.buywhere.ai
MCP (local)          http://localhost:8080
Rollback state file  .rollback_state

1. Pre-Deploy Checklist

Run this before issuing any deploy commands. All boxes must be checked.

1.1 Git State

cd /home/paperclip/buywhere-api

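# Confirm you are on master and fully synced
# (fetch first: git status only compares against the last-fetched remote refs;
#  this assumes the remote is named origin)
git fetch origin
git status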
git log --oneline -3
  • Working tree is clean (nothing to commit)
  • Latest commit matches the build you intend to ship
  • Note the commit SHA — you will need it for rollback tagging:
    export DEPLOY_SHA=$(git rev-parse --short HEAD)
    echo "Deploy SHA: $DEPLOY_SHA"
    

1.2 Environment File

# Verify .env exists and is not zero-length
ls -lh .env
  • .env present (not .env.example)
  • All required vars set (run a quick spot-check):
    grep -E "DATABASE_URL|REDIS_URL|JWT_SECRET_KEY|POSTGRES_PASSWORD|AFFILIATE_TAG|USD_DEFAULT|US_REGION" .env
    
    Expected: all 7 keys present with non-empty values.
  • USD_DEFAULT is set (controls US dollar pricing)
  • US_REGION=us is set
  • AFFILIATE_TAG=buywhere-20 is set
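The grep spot-check above prints any line that mentions a key, including commented-out lines and keys with an empty value, so the output still has to be read by eye. A minimal sketch of a stricter check follows. It assumes .env uses plain KEY=value lines (no "export" prefix), and check_env_keys is a hypothetical helper name, not part of deploy.sh.

```shell
#!/usr/bin/env bash
# check_env_keys FILE KEY...
# Prints every KEY that is absent from FILE or assigned an empty value.
# Returns 0 only when all keys have a non-empty value.
check_env_keys() {
  local file=$1 key value missing=0
  shift
  for key in "$@"; do
    # Take the last uncommented KEY=... line, then strip optional quotes.
    value=$(grep -E "^${key}=" "$file" | tail -n 1 | cut -d= -f2-)
    value=${value%\"}; value=${value#\"}
    value=${value%\'}; value=${value#\'}
    if [ -z "$value" ]; then
      echo "MISSING OR EMPTY: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the production host (not run here):
#   check_env_keys .env DATABASE_URL REDIS_URL JWT_SECRET_KEY \
#     POSTGRES_PASSWORD AFFILIATE_TAG USD_DEFAULT US_REGION
```

The function only confirms that a value exists; whether US_REGION is actually "us" or AFFILIATE_TAG is "buywhere-20" still needs the manual checks listed above.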

1.3 Database Health (running stack)

Only if the stack is already running:

# DB primary accepting connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog -c "SELECT 1;" 2>&1
# Expected: one result row containing 1, followed by "(1 row)"

# DB replica accepting connections (SELECT 1 does not measure replication lag)
docker exec buywhere-api-db_replica-1 psql -U buywhere -d catalog -c "SELECT 1;" 2>&1

# PgBouncer pool health
docker exec buywhere-api-pgbouncer-1 psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;" 2>/dev/null | head -10

# Redis responsive
docker exec buywhere-api-redis-1 redis-cli ping
# Expected: PONG
  • DB primary: SELECT 1 returns successfully
  • DB replica: SELECT 1 returns successfully
  • PgBouncer: pool usage < 70%
  • Redis: returns PONG
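A successful SELECT 1 on the replica shows that it accepts connections, not that it has caught up with the primary. The sketch below is one possible lag check built on the standard PostgreSQL functions pg_is_in_recovery() and pg_last_xact_replay_timestamp(). The helper name replica_in_sync and the 5-second threshold are assumptions for illustration; the runbook does not define a lag limit.

```shell
#!/usr/bin/env bash
# Query to run on the replica: is it a standby, and how many seconds behind?
REPLICA_LAG_SQL="SELECT pg_is_in_recovery(), COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);"

# replica_in_sync "<t|f>|<lag_seconds>" MAX_LAG_SECONDS
# Expects psql -tA output such as "t|0.42". Returns 0 when the server is a
# standby and its replay lag is at or below MAX_LAG_SECONDS.
replica_in_sync() {
  local in_recovery=${1%%|*} lag=${1#*|} max=$2
  [ "$in_recovery" = "t" ] || return 1
  awk -v lag="$lag" -v max="$max" 'BEGIN { exit !(lag + 0 <= max + 0) }'
}

# Usage (not run here; 5 seconds is an illustrative threshold):
#   row=$(docker exec buywhere-api-db_replica-1 psql -U buywhere -d catalog -tAc "$REPLICA_LAG_SQL")
#   replica_in_sync "$row" 5 && echo "replica OK" || echo "replica lagging: $row"
```

One caveat: when the primary is idle, the replay timestamp stops advancing, so this measure can report a growing lag even though the replica is fully caught up. Treat a high value as a prompt to investigate rather than proof of a problem.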

1.4 Disk Space

df -h / | tail -1
  • Disk usage < 85%. If ≥ 85%, run cleanup before proceeding:
    docker system prune -f --volumes=false
    find /home/paperclip/buywhere-api -name "*.log" -mtime +7 -delete
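The 85% threshold can also be checked mechanically instead of by reading the df output. A small sketch, assuming POSIX-format output from df -P; disk_used_pct and disk_ok are hypothetical helper names.

```shell
#!/usr/bin/env bash
# disk_used_pct: reads `df -P <path>` output on stdin and prints the
# Capacity column of the first filesystem row as a bare integer.
disk_used_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# disk_ok PCT [LIMIT]: returns 0 when PCT is below LIMIT (default 85).
disk_ok() {
  [ "$1" -lt "${2:-85}" ]
}

# Usage (not run here):
#   pct=$(df -P / | disk_used_pct)
#   disk_ok "$pct" || echo "Disk at ${pct}%: run cleanup before deploying"
```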
    

1.5 Last Backup

ls -lh /home/paperclip/buywhere-api/backups/ | tail -5
  • A backup exists dated within the last 24 hours. If not, run manual backup:
    ./deploy.sh backup
    
    Do not proceed without a backup.
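The 24-hour rule can be expressed as a find query on file modification time. This is a sketch only: has_recent_backup is a hypothetical helper, and it checks file age, not whether the backup is complete or restorable.

```shell
#!/usr/bin/env bash
# has_recent_backup DIR [MAX_AGE_MINUTES]
# Returns 0 when DIR contains at least one regular file modified within
# MAX_AGE_MINUTES (default 1440, i.e. 24 hours).
has_recent_backup() {
  local dir=$1 max_age=${2:-1440}
  [ -d "$dir" ] || return 1
  [ -n "$(find "$dir" -maxdepth 1 -type f -mmin "-${max_age}" -print -quit)" ]
}

# Usage (not run here):
#   has_recent_backup /home/paperclip/buywhere-api/backups \
#     || echo "No backup in the last 24h: run ./deploy.sh backup"
```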

1.6 Pending Migrations Check

docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
  -c "SELECT version_num FROM alembic_version;"

Note the current migration version. After deploy, it must advance to the latest head.


2. Deploy Command Sequence

Run as Bolt on the production host. Run all commands from /home/paperclip/buywhere-api/.

Step 1 — Save Rollback State

cd /home/paperclip/buywhere-api

# Capture current image digest before overwriting
PREV_IMAGE=$(docker inspect buywhere-api:latest --format='{{.Id}}' 2>/dev/null || echo "none")
PREV_MCP_IMAGE=$(docker inspect buywhere-mcp:latest --format='{{.Id}}' 2>/dev/null || echo "none")
DEPLOY_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)
DEPLOY_SHA=$(git rev-parse --short HEAD)

cat > .rollback_state <<EOF
PREV_API_IMAGE="${PREV_IMAGE}"
PREV_MCP_IMAGE="${PREV_MCP_IMAGE}"
PREV_SHA="${DEPLOY_SHA}"
DEPLOY_TIME="${DEPLOY_TIME}"
EOF

echo "Rollback state saved:"
cat .rollback_state

Step 2 — Build Docker Images

docker compose -f docker-compose.prod.yml build --parallel 2>&1 | tee /tmp/build-${DEPLOY_SHA}.log
# PIPESTATUS[0] is the build's exit code; a plain $? here would report tee's
echo "Build exit code: ${PIPESTATUS[0]}"
  • Build exits 0. If non-zero, stop — do not proceed.
  • Check the build log for warnings about missing packages or failed test steps.
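When a command is piped through tee, the shell's $? reflects tee's exit status, which is almost always 0, so a failed build can look successful. One way to keep both the log file and the real exit code is a small bash wrapper like the sketch below; run_logged is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# run_logged LOGFILE CMD [ARGS...]
# Runs CMD, mirrors its stdout and stderr to LOGFILE via tee, and returns
# CMD's own exit status rather than tee's.
run_logged() {
  local log=$1
  shift
  "$@" 2>&1 | tee "$log"
  return "${PIPESTATUS[0]}"
}

# Usage (not run here):
#   run_logged "/tmp/build-${DEPLOY_SHA}.log" \
#     docker compose -f docker-compose.prod.yml build --parallel \
#     || { echo "Build failed: stop, do not proceed"; }
```

The same wrapper would fit the migration step, which is also piped through tee.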

Tag the image with the deploy SHA for traceability:

docker tag buywhere-api:latest buywhere-api:sha-${DEPLOY_SHA}
docker tag buywhere-mcp:latest buywhere-mcp:sha-${DEPLOY_SHA}
echo "Tagged: buywhere-api:sha-${DEPLOY_SHA}"

Step 3 — Start DB and Redis

docker compose -f docker-compose.prod.yml up -d db db_replica pgbouncer redis

Wait for all to be healthy (up to 60s):

for svc in db db_replica pgbouncer redis; do
  echo -n "Waiting for $svc... "
  for i in $(seq 1 30); do
    STATUS=$(docker inspect --format='{{.State.Health.Status}}' "buywhere-api-${svc}-1" 2>/dev/null || echo "unknown")
    [ "$STATUS" = "healthy" ] && echo "OK" && break
    [ $i -eq 30 ] && echo "TIMEOUT — check: docker logs buywhere-api-${svc}-1"
    sleep 2
  done
done
  • All 4 services: healthy

Step 4 — Run Database Migrations

docker compose -f docker-compose.prod.yml run --rm migrate 2>&1 | tee /tmp/migrate-${DEPLOY_SHA}.log
# PIPESTATUS[0] is the migration's exit code; a plain $? here would report tee's
echo "Migration exit code: ${PIPESTATUS[0]}"
  • Exit code 0. If non-zero, stop immediately — do not bring up the API.
  • Verify migration applied:
    docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
      -c "SELECT version_num FROM alembic_version;"
    
    Confirm the version matches the latest migration file in alembic/versions/.

Step 5 — Start API

docker compose -f docker-compose.prod.yml up -d api

Wait for health (up to 120s — the API has a 120s start_period):

echo "Waiting for API health..."
for i in $(seq 1 60); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "API healthy after ${i}x2s" && break
  [ $i -eq 60 ] && echo "API health TIMEOUT" && docker logs --tail=50 buywhere-api-api-1
  sleep 2
done
  • API returns HTTP 200 on /health
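The polling loops in Steps 3, 5, and 6 print a message on timeout but then carry on. A reusable helper that returns a non-zero status lets the operator, or a wrapping script, stop on timeout. The following is a sketch under that assumption; wait_until and api_healthy are hypothetical names.

```shell
#!/usr/bin/env bash
# wait_until ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds. Returns 0 on the first success, or 1 once
# ATTEMPTS tries have failed, so the caller can stop the deploy.
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# api_healthy: succeeds when curl -f gets a non-error response from the
# local API health endpoint.
api_healthy() {
  curl -sf -o /dev/null http://localhost:8000/health
}

# Usage (not run here): 60 tries, 2 s apart, matching the loop above.
#   wait_until 60 2 api_healthy \
#     || { echo "API health TIMEOUT"; docker logs --tail=50 buywhere-api-api-1; }
```

Note that curl -f accepts any status below 400, which is slightly looser than the exact 200 comparison in the loop above.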

Step 6 — Start MCP and Supporting Services

docker compose -f docker-compose.prod.yml up -d mcp scraper-scheduler

Wait for MCP (30s):

for i in $(seq 1 15); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8080/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "MCP healthy" && break
  [ $i -eq 15 ] && echo "WARNING: MCP health timeout — non-blocking, continue"
  sleep 2
done

MCP health failure is non-blocking for launch but log it.

Step 7 — Start Monitoring and Cron Services

docker compose -f docker-compose.prod.yml up -d \
  backup-cron \
  metrics-collector \
  blackbox-exporter \
  alertmanager \
  loki \
  fluent-bit \
  grafana

Step 8 — Verify Full Stack

docker compose -f docker-compose.prod.yml ps

Expected output: every critical service shows Up (healthy):

  • buywhere-api-api-1: Up (healthy)
  • buywhere-api-db-1: Up (healthy)
  • buywhere-api-db_replica-1: Up (healthy)
  • buywhere-api-pgbouncer-1: Up (healthy)
  • buywhere-api-redis-1: Up (healthy)
  • buywhere-api-mcp-1: Up (healthy), or Up (non-blocking)

3. Smoke Test Suite

Run these after deploy completes. All must pass before signalling Go to Rex.

Set the base URL once:

API="https://api.buywhere.ai"

Test 1 — Health Endpoint

curl -sf "${API}/health" | python3 -m json.tool

Pass: HTTP 200, "status": "ok" in response body.

Test 2 — Detailed Health (DB + dependencies)

curl -sf "${API}/health/detailed" | python3 -m json.tool

Pass: HTTP 200, "status": "healthy", db_response_ms present and < 500ms.

Test 3 — Product Listing

curl -sf "${API}/v1/products?limit=1" | python3 -m json.tool

Pass: HTTP 200, total field > 0, products array non-empty.

Test 4 — Search

curl -sf "${API}/v1/search?q=laptop&limit=3&currency=USD" | python3 -m json.tool

Pass: HTTP 200, total > 0, at least 1 product in results with currency: "USD".

Test 5 — Search (mobile)

curl -sf "${API}/v1/search?q=iphone&limit=1&currency=USD" | jq '{total, first_product: .products[0].name}'

Pass: Returns a named product result.

Test 6 — Affiliate Redirect

curl -Ls -o /dev/null -w "%{url_effective}\n" "${API}/go/B09G9HDHJT"

Pass: Final URL is an amazon.com URL containing tag=buywhere-20.
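The pass condition can be checked with two regular expressions instead of by eye. A minimal sketch, assuming the redirect may land on any amazon.com subdomain; has_affiliate_tag is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# has_affiliate_tag URL [TAG]
# Returns 0 when URL points at amazon.com (or a subdomain) and its query
# string carries tag=TAG (default buywhere-20) as a complete parameter.
has_affiliate_tag() {
  local url=$1 tag=${2:-buywhere-20}
  local host_re='^https?://([a-z0-9-]+\.)*amazon\.com(/|\?|$)'
  local tag_re="[?&]tag=${tag}(&|#|\$)"
  [[ $url =~ $host_re ]] && [[ $url =~ $tag_re ]]
}

# Usage (not run here):
#   final=$(curl -Ls -o /dev/null -w "%{url_effective}" "${API}/go/B09G9HDHJT")
#   has_affiliate_tag "$final" || echo "Affiliate tag missing: $final"
```

Matching the tag as a whole parameter avoids false passes on values such as tag=buywhere-200, and anchoring the host avoids look-alike domains such as amazon.com.example.org.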

Test 7 — MCP Health

curl -sf "http://localhost:8080/health" | python3 -m json.tool

Pass: HTTP 200, "status": "ok".

Test 8 — Catalog Status

curl -sf "${API}/v1/status" \
  -H "Authorization: Bearer ${BUYWHERE_API_KEY}" | python3 -m json.tool

Pass: HTTP 200, total_active_products > 1,000,000.

Test 9 — Latency Baseline

for i in 1 2 3; do
  curl -sf -o /dev/null -w "Search latency: %{time_total}s\n" \
    "${API}/v1/search?q=phone&limit=5&currency=USD"
done

Pass: All 3 requests complete in < 2s. If any > 2s, flag to Rex before Go.
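Since curl reports time_total as a decimal, the 2-second comparison needs floating-point arithmetic, which plain shell tests do not provide. The sketch below delegates the comparison to awk; within_limit and check_latencies are hypothetical helper names.

```shell
#!/usr/bin/env bash
# within_limit SECONDS LIMIT: returns 0 when SECONDS < LIMIT (decimals allowed).
within_limit() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 < max + 0) }'
}

# check_latencies LIMIT TIME...
# Prints a verdict per sample and returns non-zero if any sample is at or
# above LIMIT.
check_latencies() {
  local limit=$1 t failed=0
  shift
  for t in "$@"; do
    if within_limit "$t" "$limit"; then
      echo "OK   ${t}s"
    else
      echo "SLOW ${t}s (limit ${limit}s)"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (not run here):
#   times=()
#   for i in 1 2 3; do
#     times+=("$(curl -sf -o /dev/null -w '%{time_total}' \
#       "${API}/v1/search?q=phone&limit=5&currency=USD")")
#   done
#   check_latencies 2 "${times[@]}" || echo "Flag to Rex before Go"
```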

Test 10 — SSL Certificate

echo | openssl s_client -connect api.buywhere.ai:443 2>/dev/null \
  | openssl x509 -noout -dates

Pass: notAfter is > 30 days from today (> 2026-05-23).
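The 30-day margin can be computed from the notAfter value rather than compared by eye. This sketch assumes GNU date, which can parse the date format openssl prints; days_until and cert_expiry_ok are hypothetical helper names.

```shell
#!/usr/bin/env bash
# days_until "<date string>": whole days from now until the given date.
# Relies on GNU date's -d parsing (e.g. "May 23 12:00:00 2026 GMT").
days_until() {
  local end now
  end=$(date -u -d "$1" +%s) || return 2
  now=$(date -u +%s)
  echo $(( (end - now) / 86400 ))
}

# cert_expiry_ok "<notAfter date>" [MIN_DAYS]
# Returns 0 when the date is more than MIN_DAYS (default 30) days away.
cert_expiry_ok() {
  local days
  days=$(days_until "$1") || return 2
  [ "$days" -gt "${2:-30}" ]
}

# Usage (not run here):
#   not_after=$(echo | openssl s_client -connect api.buywhere.ai:443 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   cert_expiry_ok "$not_after" 30 || echo "Certificate expires soon: $not_after"
```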


4. Go/No-Go Decision Points

Rex calls Go or Hold at 08:30 EST based on Bolt's smoke test report.

Mandatory Go Conditions

All must be met to proceed to launch:

#  Condition            Threshold                         Action if failed
1  API health endpoint  HTTP 200, status: ok              HOLD: page Bolt
2  DB healthy           Detailed health shows healthy     HOLD: page Bolt
3  Search functional    Returns results with USD pricing  HOLD: page Bolt
4  Affiliate redirects  /go/ appends tag=buywhere-20      HOLD: page Sol/Link
5  Backup completed     < 24h old                         HOLD: run backup first
6  Disk usage           < 85%                             HOLD: emergency cleanup
7  P99 latency          < 2s on search                    HOLD: investigate warm-up

Rollback Triggers (Post-Launch)

Initiate rollback if any condition is sustained:

Metric              Warning (monitor)  Critical (rollback)
5xx error rate      > 0.5% for 5 min   > 2% for 5 min
P99 search latency  > 1s for 5 min     > 3s for 5 min
API health check    1 failure          3 consecutive failures
DB pool usage       > 70%              > 90%
Redis memory        > 70%              > 85%
Disk                > 85%              > 92%
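Each numeric row of the table is a two-threshold comparison, which can be sketched as a single function. classify is a hypothetical helper; it evaluates one sample only and does not model the "sustained for 5 min" condition, which would need to come from the monitoring stack.

```shell
#!/usr/bin/env bash
# classify VALUE WARN CRIT
# Prints "critical" when VALUE > CRIT, "warning" when VALUE > WARN,
# otherwise "ok". Decimals are allowed.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}

# Examples with thresholds from the table (values are illustrative):
#   classify 1.2 0.5 2     # 5xx error rate in percent
#   classify 93 85 92      # disk usage in percent
```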

5. Rollback Procedure

5.1 When to Roll Back

Call rollback if:

  • Any mandatory Go condition fails after deploy and cannot be fixed in < 15 min
  • 5xx error rate > 2% sustained for > 5 minutes post-launch
  • P99 latency > 3s sustained for > 5 minutes
  • DB pool exhaustion (> 90%) with no quick fix
  • Data corruption confirmed

Rex calls the rollback. Post to #us-launch-ops before executing:

⚠️ ROLLBACK INITIATED — [HH:MM EST]
Reason: [one sentence]
Lead: Bolt
ETA: [estimate]

5.2 Rollback Steps

Step 1 — Stop API and MCP (to prevent further writes)

docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml stop api mcp scraper-scheduler

Step 2 — Load rollback state

source /home/paperclip/buywhere-api/.rollback_state
echo "Rolling back to image digest: ${PREV_API_IMAGE}"

If .rollback_state is missing, use the tagged SHA image:

# List recent SHA-tagged images
docker images buywhere-api --format "table {{.Tag}}\t{{.CreatedAt}}" | grep sha
# Use the most recent prior sha tag:
# PREV_API_IMAGE=buywhere-api:sha-<previous-sha>
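Step 1 of the deploy writes "none" into the state file when no previous image existed, and the file itself may be missing, so it helps to validate the state before retagging anything. A sketch of such a guard follows; load_rollback_state is a hypothetical helper and is not the same as scripts/rollback.sh.

```shell
#!/usr/bin/env bash
# load_rollback_state FILE
# Sources FILE and returns 0 only when PREV_API_IMAGE is set to something
# other than "none" (the placeholder written when no prior image existed).
load_rollback_state() {
  local file=$1
  if [ ! -s "$file" ]; then
    echo "Rollback state file missing or empty: $file" >&2
    return 1
  fi
  # shellcheck disable=SC1090
  . "$file"
  if [ -z "${PREV_API_IMAGE:-}" ] || [ "$PREV_API_IMAGE" = "none" ]; then
    echo "No previous API image recorded in $file" >&2
    return 1
  fi
  echo "Rolling back to image: ${PREV_API_IMAGE}"
}

# Usage (not run here):
#   load_rollback_state /home/paperclip/buywhere-api/.rollback_state \
#     || echo "Fall back to the most recent prior sha-tagged image"
```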

Step 3 — Restore previous image tag

# Re-tag the prior image as :latest
docker tag "${PREV_API_IMAGE}" buywhere-api:latest
docker tag "${PREV_MCP_IMAGE}" buywhere-mcp:latest

Step 4 — Restart services with prior image

# Bring back every service stopped in Step 1, not only the API
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml up -d api mcp scraper-scheduler

Wait for health:

for i in $(seq 1 30); do
  HTTP=$(curl -sf -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)
  [ "$HTTP" = "200" ] && echo "Rollback API healthy" && break
  sleep 2
done

Step 5 — If rollback also fails, run migration rollback

Only if the deployment introduced a breaking migration:

# Use the migration version you noted in Section 1.6 before the deploy.
# (.rollback_state records image IDs and the git SHA, not the Alembic version.)
# Then downgrade:
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml run --rm \
  -e ALEMBIC_TARGET=<previous-version-num> \
  migrate alembic downgrade <previous-version-num>

Warning: Only downgrade if the new migration is confirmed reversible and Bolt has read the migration script.

Step 6 — Verify rollback

Re-run smoke tests 1–5 from Section 3. Confirm all pass before declaring stable.

Step 7 — Communicate

# Post to Slack #us-launch-ops
echo "Post status update to #us-launch-ops and DM Rex"

6. Rollback Verification Checklist

After rollback completes:

  • GET /health → HTTP 200, status: ok
  • GET /health/detailed → all dependencies healthy
  • GET /v1/search?q=laptop&limit=1 → returns results
  • GET /go/B09G9HDHJT → redirects with buywhere-20 tag
  • Error rate in Sentry < 0.5%
  • No new P1 alerts in #us-alerts
  • Post rollback complete status to #us-launch-ops

7. Useful Commands Reference

# Full stack status
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml ps

# Tail API logs (live)
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs -f api

# Tail MCP logs
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml logs -f mcp

# DB active connections
docker exec buywhere-api-db-1 psql -U buywhere -d catalog \
  -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# PgBouncer pool stats
docker exec buywhere-api-pgbouncer-1 psql -h localhost -p 5432 -U pgbouncer pgbouncer -c "SHOW POOLS;"

# Redis memory usage
docker exec buywhere-api-redis-1 redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

# Disk usage
df -h / | tail -1

# Docker stats snapshot
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

# Emergency: clear Docker build cache (does NOT affect running containers)
docker builder prune -f

# Emergency: restart only the API container
docker compose -f /home/paperclip/buywhere-api/docker-compose.prod.yml restart api

# Emergency: full stack restart (last resort)
./deploy.sh restart

Appendix — Related Documents

Document                        Path
Launch-day ops + war room       docs/launch-day-runbook.md
Existing generic deploy runbook DEPLOYMENT_RUNBOOK.md
Backup + restore                docs/backup_restore_runbook.md
DB architecture                 docs/DATABASE_ARCHITECTURE.md
Emergency API scaling           docs/emergency_api_scaling_runbook.md
Pre-deploy backup script        scripts/pre-deploy-backup.sh
Rollback helper script          scripts/rollback.sh
Deploy state tracker            scripts/deployment-state.sh

Authored by Rex (CTO) — 2026-04-19 BuyWhere Production Deploy — April 23, 2026