
US Launch Ops Runbook

Document Version: 1.0 | Last Updated: 2026-04-18 | Owner: Ops Team | Launch Date: April 23, 2026 | Classification: Internal - Confidential


Overview

This runbook covers all operational procedures for the BuyWhere US launch on April 23, 2026. It includes monitoring thresholds, incident response protocols, launch day checklists, and rollback procedures specific to US market operations.


Quick Reference

Incident Severity Levels

| Severity | Definition | Response Time | Example |
|---|---|---|---|
| P1 - Critical | Complete service outage or data loss | 15 minutes | API down, DB failure, all scrapers failing |
| P2 - High | Major feature broken, significant impact | 30 minutes | Product search failing, auth issues |
| P3 - Medium | Degraded performance, minor feature broken | 2 hours | Slow responses, non-critical errors |
| P4 - Low | Minor issues, cosmetic problems | Next business day | UI typos, non-urgent warnings |

Response Times by Severity

  • P1 Critical: Acknowledge in 5 min, engage in 15 min, resolve in 1 hour
  • P2 High: Acknowledge in 15 min, engage in 30 min, resolve in 4 hours
  • P3 Medium: Acknowledge in 1 hour, engage in 2 hours, resolve in 24 hours
  • P4 Low: Acknowledge in 8 hours, resolve in 48 hours

US Launch Monitoring Thresholds

API Performance

| Metric | Warning | Critical | Auto-Action |
|---|---|---|---|
| API Latency P50 | >200ms | >500ms | Scale up |
| API Latency P95 | >500ms | >1s | Scale up |
| API Latency P99 | >1s | >2s | Scale up + alert |
| Error Rate (5xx) | >0.5% | >1% | Page on-call |
| Request Rate | >1000/min | >2000/min | Scale up |
| CPU Utilization | >60% | >80% | Scale up |

Data Ingestion (US Sources)

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Last Run Age | >12 hours | >24 hours | Trigger re-scrape |
| Error Rate | >5% | >10% | Investigate |
| Quality Score | <0.7 | <0.5 | Pause source |
| Stale Data % | >20% | >40% | Refresh data |
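The Last Run Age threshold can also be checked ad hoc against Prometheus. A minimal sketch, assuming the `hours_since_last_run` metric used in the alert rules later in this runbook; the Prometheus base URL is hypothetical:

```shell
# List US sources past the 12-hour staleness warning threshold.
# Assumptions: Prometheus reachable at this (hypothetical) URL and
# exposing hours_since_last_run as in the alert rules below.
curl -sG 'https://prometheus.buywhere.ai/api/v1/query' \
  --data-urlencode 'query=hours_since_last_run{source=~"us_.*"} > 12' \
  || echo "Prometheus unreachable"
```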

Fleet Health

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Unhealthy Scrapers | >10% | >25% | Auto-heartbeat |
| Fleet Error Concentration | >15% | >25% | Auto-heartbeat |
| Replication Lag | >30s | >60s | Page on-call |

Infrastructure (US Region)

| Metric | Warning | Critical | Action |
|---|---|---|---|
| DB Connections | >70% | >90% | Scale or optimize |
| DB CPU | >60% | >85% | Scale up |
| Redis Memory | >70% | >85% | Evict or scale |
| Disk Space | >70% | >85% | Clean up |
| Backup Age | >24 hours | >48 hours | Immediate backup |
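For quick manual checks, the Disk Space thresholds above can be applied directly to `df` output. A minimal sketch; `disk_status` is a hypothetical helper for illustration, not an existing script:

```shell
# Map a usage percentage to the infrastructure thresholds above
# (warning >70%, critical >85%). disk_status is illustrative only.
disk_status() {
  pct="$1"
  if [ "$pct" -gt 85 ]; then echo critical
  elif [ "$pct" -gt 70 ]; then echo warning
  else echo ok
  fi
}

# Check the root filesystem (POSIX df; strip the '%' from column 5).
usage=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')
echo "/ usage: ${usage}% -> $(disk_status "$usage")"
```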

Escalation Path

Primary On-Call Rotation

| Time (ET) | Primary | Secondary |
|---|---|---|
| 00:00 - 08:00 | @ops-night-oncall | @ops-day-oncall |
| 08:00 - 16:00 | @ops-day-oncall | @ops-evening-oncall |
| 16:00 - 24:00 | @ops-evening-oncall | @ops-night-oncall |

Escalation Chain

```
P1 (Critical):
  → @ops-primary-oncall (PagerDuty)
  → @ops-lead (Bolt) (if no response in 10 min)
  → @cto (if no response in 20 min)
  → All hands (if no response in 30 min)

P2 (High):
  → @ops-primary-oncall (Slack + PagerDuty)
  → @ops-lead (if no response in 30 min)

P3 (Medium):
  → @ops-oncall (Slack)
  → @ops-lead (next morning if unresolved)

P4 (Low):
  → @ops-ticket-queue (next business day)
```

Slack Channels

| Channel | Purpose | Access |
|---|---|---|
| #us-launch-ops | US launch operations, real-time monitoring | All ops |
| #us-alerts | US-specific alerts (auto-created) | On-call |
| #incidents | General incidents | All ops |
| #incidents-critical | P1 incidents only | On-call + leads |
| #ops-escalation | Escalations and handoffs | Leads |

Launch Day Checklist

T-30 Minutes (Before Launch)

  • Verify all US data sources are healthy
  • Confirm DB replication is within acceptable lag (<30s)
  • Check backup verification passed for last 24 hours
  • Verify Redis cache is warm and responsive
  • Confirm PgBouncer connection pool healthy (<70% usage)
  • Test API health endpoint: `curl -sf https://api.buywhere.ai/health`
  • Verify monitoring dashboards accessible
  • Confirm Slack #us-launch-ops channel is active
  • Verify on-call rotation is set in PagerDuty
  • Check US-specific feature flags are configured
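The manual checks above can be scripted so the T-30 sweep is repeatable. A minimal sketch, using only the health endpoint named in the checklist; the `check` helper is hypothetical:

```shell
# Run a named check and report PASS/FAIL; returns nonzero on failure so
# callers can count failures. check() is illustrative, not a real script.
check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

failures=0
check "shell sanity" true || failures=$((failures + 1))
# Network checks from the list above (uncomment when running for real):
# check "API health" curl -sf https://api.buywhere.ai/health || failures=$((failures + 1))
# check "Grafana up" curl -sf https://grafana.buywhere.ai/d/us-launch || failures=$((failures + 1))
echo "failed checks: $failures"
```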

T-0 (Launch - 00:00 ET April 23)

  • Announce in #us-launch-ops: "US Launch commencing"
  • Monitor initial traffic spike for first 5 minutes
  • Watch API latency metrics (P50, P95, P99)
  • Monitor error rate for first 15 minutes
  • Verify first US ingestion runs complete successfully
  • Check fleet health after first batch of scrapers
  • Confirm auto-scaling triggered if needed

T+15 Minutes

  • Verify P50 latency <200ms
  • Verify error rate <0.5%
  • Confirm all US sources running
  • Check whether any new alerts have fired
  • Update #us-launch-ops with status: "Launch nominal"

T+1 Hour

  • Run full health check across all systems
  • Verify DB metrics stable
  • Confirm backup cron ran successfully
  • Monitor for any latency degradation
  • Check for any stalled ingestion runs

T+4 Hours

  • Review system performance baseline
  • Confirm all alerts are resolved or acknowledged
  • Verify US data quality scores meet threshold (>0.7)
  • Update #us-launch-ops with milestone: "First US data refresh complete"

T+8 Hours

  • Confirm stable operation for extended period
  • Review logs for any warning patterns
  • Verify next backup scheduled
  • Hand off from the launch team to the regular on-call rotation, if different

T+24 Hours (Post-Launch Day 1)

  • Confirm all US sources fresh (last run <24h)
  • Verify no data quality regressions
  • Confirm no P1/P2 incidents occurred
  • Update PagerDuty schedule for Day 2
  • Post launch summary to #us-launch-ops

T+48 Hours (Post-Launch Day 2)

  • Review all monitoring trends
  • Confirm auto-scaling behaved correctly
  • Verify no long-term issues from launch traffic
  • Archive launch monitoring dashboard
  • Capture quick retrospective notes

US-Specific Monitoring Setup

Slack Alert Configuration

The #us-alerts channel is configured via webhook integration. To set up:

```shell
# 1. Create a Slack webhook for US alerts
# Go to: https://api.slack.com/apps/<app-id>/incoming-webhooks

# 2. Add the webhook URL to the environment
SLACK_US_ALERTS_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ

# 3. Verify the webhook configuration
curl -X POST "$SLACK_US_ALERTS_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text": "US alerts webhook test"}'
```

Alert Routing Rules

```yaml
# prometheus_alerts.yml additions for the US launch

groups:
  - name: us_launch_alerts
    rules:
      - alert: USDataIngestionStale
        expr: hours_since_last_run{source=~"us_.*"} > 12
        for: 5m
        labels:
          severity: warning
          team: us-ops
        annotations:
          summary: "US source {{ $labels.source }} data is stale"
          channel: "#us-alerts"

      - alert: USFleetErrorConcentration
        expr: fleet_error_concentration{region="us-east-1"} > 0.15
        for: 5m
        labels:
          severity: critical
          team: us-ops
        annotations:
          summary: "US fleet error concentration at {{ $value | humanizePercentage }}"
          channel: "#us-alerts"
```
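Before deploying, the rule file can be syntax-checked with `promtool`, which ships with the standard Prometheus distribution:

```shell
# Validate the US alert rule additions before reloading Prometheus.
# The file path assumes the snippet above was saved as prometheus_alerts.yml.
promtool check rules prometheus_alerts.yml \
  || echo "rule check failed (or promtool not installed)"
```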

Monitoring Dashboard

Access at: https://grafana.buywhere.ai/d/us-launch

Key panels:

  • US API Latency (P50/P95/P99)
  • US Error Rate
  • US Data Freshness by Source
  • US Fleet Health
  • US Infrastructure Metrics

Rollback Plan

Criteria for Rollback

Initiate rollback if ANY of:

  • P1 incident active for >30 minutes
  • 50% of US sources failed
  • Data corruption affecting US market
  • Security incident affecting US user data
  • P95 latency >5s for >15 minutes
  • Error rate >10% for >10 minutes
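For clarity, the two metric-based criteria can be expressed as a small decision helper. A sketch only; `should_rollback` is hypothetical, and the real decision also weighs the non-metric criteria above:

```shell
# Evaluate the metric-based rollback criteria: P95 latency >5s or
# error rate >10% (both must be sustained; duration tracking omitted).
# Arguments are integers: latency in seconds, error rate in percent.
should_rollback() {
  p95_seconds="$1"
  error_rate_pct="$2"
  if [ "$p95_seconds" -gt 5 ] || [ "$error_rate_pct" -gt 10 ]; then
    echo yes
  else
    echo no
  fi
}

should_rollback 6 2   # latency breach -> yes
should_rollback 2 3   # nominal -> no
```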

Rollback Procedure

```shell
# 1. Announce rollback initiation
# Post in #us-launch-ops: "ROLLBACK INITIATED - Reason: <brief description>"

# 2. Disable US-specific features (feature flags)
curl -X PATCH "$API_URL/admin/features" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"us_market_enabled": false}'

# 3. Route traffic away from the US region (if multi-region)
# Update Route53 to point to the staging/sg region only

# 4. Pause US ingestion
curl -X POST "$API_URL/admin/ingestion/pause" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sources": ["us_*"]}'

# 5. Verify rollback
curl -sf https://api.buywhere.ai/health
curl -sf https://api.buywhere.ai/metrics | grep error_rate

# 6. Post rollback status to #incidents
```

Data Rollback

If data corruption detected:

```shell
# 1. Stop all US ingestion immediately
curl -X POST "$API_URL/admin/ingestion/stop" \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 2. Identify the last known good backup
./scripts/backup.sh list daily | grep -E "^us_.*good"

# 3. Restore US-specific data from backup
./scripts/backup.sh restore /var/backups/buywhere/daily/us_backup_YYYYMMDD.gz us_catalog

# 4. Verify data integrity
psql -h localhost -U buywhere -d catalog -c "SELECT COUNT(*) FROM products WHERE source LIKE 'us_%';"

# 5. Resume ingestion with verification
curl -X POST "$API_URL/admin/ingestion/resume" \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```

Incident Response Procedures

P1 (Critical) Response

```shell
# 1. Acknowledge in PagerDuty within 5 minutes
pd incident ack <incident-id>

# 2. Join the #incidents-critical channel and post an initial message:
#
#   🔴 P1 INCIDENT: <brief description>
#   Impact: <what's affected>
#   Time: <when started>
#   On-call: <your name>
#   Initial assessment: <2 sentence assessment>

# 3. Immediate diagnostics
curl -sf https://api.buywhere.ai/health
curl -sf https://api.buywhere.ai/metrics | grep -E "error_rate|latency"
docker-compose -f docker-compose.prod.yml ps

# 4. If the API is down:
docker-compose -f docker-compose.prod.yml restart api
curl -sf https://api.buywhere.ai/health

# 5. If it is a DB issue:
./scripts/failover_replica.sh --check-only
# Follow disaster_recovery_runbook.md procedures

# 6. If it is a scraper fleet issue:
./scripts/trigger_us_scrapers.sh --all

# 7. Post updates every 15 minutes in #incidents-critical
# Resolution update every 30 minutes
```

P2 (High) Response

```shell
# 1. Acknowledge within 15 minutes
pd incident ack <incident-id>

# 2. Post in #incidents:
#
#   🟠 P2 INCIDENT: <brief description>
#   Impact: <what's affected>
#   On-call: <your name>
#   Initial assessment: <2 sentence assessment>

# 3. Investigate root cause
# Check logs:    docker-compose logs --tail=100 api | grep ERROR
# Check metrics: curl -sf https://api.buywhere.ai/metrics
# Check DB:      docker-compose ps db

# 4. Apply a fix or escalate
# If a fix is available within 30 min, apply and verify
# If not, escalate to P1

# 5. Update status every 30 minutes
```

Post-Incident Documentation

Within 24 hours of any P1/P2 incident:

```markdown
# Incident Report: [BUY-XXXX] - [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P1 / P2
**Root Cause:** [Brief description]

## Timeline
- HH:MM - Alert fired / Incident detected
- HH:MM - On-call acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Incident resolved

## Impact
- Users affected: [number/estimate]
- Data loss: [yes/no/amount]
- Downtime: [duration]

## Root Cause
[Detailed explanation of what went wrong]

## Resolution
[What was done to fix the issue]

## Action Items
- [ ] Action item 1 (owner: @name)
- [ ] Action item 2 (owner: @name)

## Lessons Learned
[What we learned from this incident]
```

Pre-Launch Verification

Complete this checklist by April 21 (2 days before launch):

Infrastructure

  • All ECS tasks healthy
  • Database replication lag <30s
  • Redis cluster responsive
  • PgBouncer pool usage <70%
  • Backup verification passing
  • Disaster recovery tested

Monitoring

  • Grafana dashboards created and accessible
  • Alert rules deployed and firing correctly
  • PagerDuty schedule configured
  • Slack channels created and permissions set
  • Webhook integrations tested

Data

  • US data sources configured
  • Ingestion pipeline tested
  • Data quality checks passing (>0.7)
  • US feature flags configured

Runbooks

  • This runbook reviewed and approved
  • All team members have access
  • Escalation contacts verified
  • On-call training completed

Contact Information

| Role | Name | Contact |
|---|---|---|
| Primary On-Call | @ops-primary-oncall | PagerDuty |
| Secondary On-Call | @ops-secondary-oncall | PagerDuty |
| Ops Lead (Bolt) | @bolt | Slack DM, @bolt |
| Engineering Manager | @eng-manager | Slack DM |
| CTO | @cto | Slack DM (P1 only) |
| Security | security@buywhere.ai | Email |

Appendix

Key Endpoints

| Endpoint | Purpose |
|---|---|
| https://api.buywhere.ai/health | API health check |
| https://api.buywhere.ai/metrics | Prometheus metrics |
| https://grafana.buywhere.ai/d/us-launch | US launch dashboard |

Key Scripts

| Script | Location | Purpose |
|---|---|---|
| trigger_us_scrapers.sh | /app/scripts/ | Trigger US data refresh |
| failover_replica.sh | /app/scripts/ | DB failover |
| backup.sh | /app/scripts/ | Backup operations |
| ecs-autoscaling.sh | /app/scripts/ | ECS scaling |

Runbook Maintenance

  • Review Frequency: Before each major launch
  • Last Reviewed: 2026-04-18
  • Version Control: Git (docs/us_launch_runbook.md)

End of Document