US Launch Ops Runbook
- Document Version: 1.0
- Last Updated: 2026-04-18
- Owner: Ops Team
- Launch Date: April 23, 2026
- Classification: Internal - Confidential
Overview
This runbook covers all operational procedures for the BuyWhere US launch on April 23, 2026. It includes monitoring thresholds, incident response protocols, launch day checklists, and rollback procedures specific to US market operations.
Quick Reference
Incident Severity Levels
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| P1 - Critical | Complete service outage or data loss | 15 minutes | API down, DB failure, all scrapers failing |
| P2 - High | Major feature broken, significant impact | 30 minutes | Product search failing, auth issues |
| P3 - Medium | Degraded performance, minor feature broken | 2 hours | Slow responses, non-critical errors |
| P4 - Low | Minor issues, cosmetic problems | Next business day | UI typos, non-urgent warnings |
Response Times by Severity
- P1 Critical: Acknowledge in 5 min, engage in 15 min, resolve in 1 hour
- P2 High: Acknowledge in 15 min, engage in 30 min, resolve in 4 hours
- P3 Medium: Acknowledge in 1 hour, engage in 2 hours, resolve in 24 hours
- P4 Low: Acknowledge in 8 hours, resolve in 48 hours
US Launch Monitoring Thresholds
API Performance
| Metric | Warning | Critical | Auto-Action |
|---|---|---|---|
| API Latency P50 | >200ms | >500ms | Scale up |
| API Latency P95 | >500ms | >1s | Scale up |
| API Latency P99 | >1s | >2s | Scale up + alert |
| Error Rate (5xx) | >0.5% | >1% | Page on-call |
| Request Rate | >1000/min | >2000/min | Scale up |
| CPU Utilization | >60% | >80% | Scale up |
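As a minimal sketch, the latency thresholds above can be checked mechanically. The helper below is illustrative (not an existing script) and classifies a sampled P95 latency in milliseconds against the warning/critical levels in the table:

```shell
# Classify a sampled P95 latency (ms) against the thresholds above.
# Illustrative helper; the real alerting lives in Prometheus rules.
classify_p95() {
  local ms=$1
  if   [ "$ms" -gt 1000 ]; then echo "critical"   # >1s: scale up + alert
  elif [ "$ms" -gt 500 ];  then echo "warning"    # >500ms: scale up
  else                          echo "ok"
  fi
}
```

For example, `classify_p95 742` prints `warning`.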
Data Ingestion (US Sources)
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Last Run Age | >12 hours | >24 hours | Trigger re-scrape |
| Error Rate | >5% | >10% | Investigate |
| Quality Score | <0.7 | <0.5 | Pause source |
| Stale Data % | >20% | >40% | Refresh data |
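The Last Run Age thresholds lend themselves to the same treatment; the sketch below (function names are illustrative, not existing scripts) computes a source's run age from epoch-second timestamps and maps it onto the actions in the table:

```shell
# Hours elapsed since a source's last run, from epoch-second timestamps.
hours_since_run() {
  local last_run=$1 now=${2:-$(date +%s)}
  echo $(( (now - last_run) / 3600 ))
}

# Map that age onto the Last Run Age row above.
ingestion_action() {
  local hours=$1
  if   [ "$hours" -gt 24 ]; then echo "critical: trigger re-scrape"
  elif [ "$hours" -gt 12 ]; then echo "warning: trigger re-scrape"
  else                           echo "ok"
  fi
}
```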
Fleet Health
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Unhealthy Scrapers | >10% | >25% | Auto-heartbeat |
| Fleet Error Concentration | >15% | >25% | Auto-heartbeat |
| Replication Lag | >30s | >60s | Page on-call |
Infrastructure (US Region)
| Metric | Warning | Critical | Action |
|---|---|---|---|
| DB Connections | >70% | >90% | Scale or optimize |
| DB CPU | >60% | >85% | Scale up |
| Redis Memory | >70% | >85% | Evict or scale |
| Disk Space | >70% | >85% | Clean up |
| Backup Age | >24 hours | >48 hours | Immediate backup |
Escalation Path
Primary On-Call Rotation
| Time (ET) | Primary | Secondary |
|---|---|---|
| 00:00 - 08:00 | @ops-night-oncall | @ops-day-oncall |
| 08:00 - 16:00 | @ops-day-oncall | @ops-evening-oncall |
| 16:00 - 24:00 | @ops-evening-oncall | @ops-night-oncall |
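For quick lookups during an incident, the rotation table can be expressed as a small helper. This is an illustrative sketch only; the PagerDuty schedule remains the source of truth:

```shell
# Map an ET hour (0-23) to the primary on-call handle per the table above.
primary_oncall() {
  local h=$1
  if   [ "$h" -lt 8 ];  then echo "@ops-night-oncall"
  elif [ "$h" -lt 16 ]; then echo "@ops-day-oncall"
  else                       echo "@ops-evening-oncall"
  fi
}
```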
Escalation Chain
P1 (Critical):
→ @ops-primary-oncall (PagerDuty)
→ @ops-lead (Bolt) (if no response in 10 min)
→ @cto (if no response in 20 min)
→ All hands (if no response in 30 min)
P2 (High):
→ @ops-primary-oncall (Slack + PagerDuty)
→ @ops-lead (if no response in 30 min)
P3 (Medium):
→ @ops-oncall (Slack)
→ @ops-lead (next morning if unresolved)
P4 (Low):
→ @ops-ticket-queue (next business day)
Slack Channels
| Channel | Purpose | Access |
|---|---|---|
| #us-launch-ops | US launch operations, real-time monitoring | All ops |
| #us-alerts | US-specific alerts (auto-created) | On-call |
| #incidents | General incidents | All ops |
| #incidents-critical | P1 incidents only | On-call + leads |
| #ops-escalation | Escalations and handoffs | Leads |
Launch Day Checklist
T-30 Minutes (Before Launch)
- Verify all US data sources are healthy
- Confirm DB replication is within acceptable lag (<30s)
- Check backup verification passed for last 24 hours
- Verify Redis cache is warm and responsive
- Confirm PgBouncer connection pool healthy (<70% usage)
- Test API health endpoint: `curl -sf https://api.buywhere.ai/health`
- Verify monitoring dashboards accessible
- Confirm Slack #us-launch-ops channel is active
- Verify on-call rotation is set in PagerDuty
- Check US-specific feature flags are configured
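The checklist above can be driven by a small pass/fail loop so nothing gets skipped under launch pressure. This is a sketch under stated assumptions: the `run_checks` helper does not exist today, and the example check commands are taken from elsewhere in this runbook:

```shell
# Read "name|command" pairs on stdin, run each, print PASS/FAIL,
# and return non-zero if any check failed.
run_checks() {
  local failed=0 name cmd
  while IFS='|' read -r name cmd; do
    if eval "$cmd" >/dev/null 2>&1; then
      printf 'PASS %s\n' "$name"
    else
      printf 'FAIL %s\n' "$name"
      failed=1
    fi
  done
  return $failed
}

# Example T-30 usage (commands assumed from this runbook):
# run_checks <<'EOF'
# api_health|curl -sf https://api.buywhere.ai/health
# dashboard|curl -sf https://grafana.buywhere.ai/d/us-launch
# EOF
```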
T-0 (Launch - 00:00 ET April 23)
- Announce in #us-launch-ops: "US Launch commencing"
- Monitor initial traffic spike for first 5 minutes
- Watch API latency metrics (P50, P95, P99)
- Monitor error rate for first 15 minutes
- Verify first US ingestion runs complete successfully
- Check fleet health after first batch of scrapers
- Confirm auto-scaling triggered if needed
T+15 Minutes
- Verify P50 latency <200ms
- Verify error rate <0.5%
- Confirm all US sources running
- Check whether any new alerts fired
- Update #us-launch-ops with status: "Launch nominal"
T+1 Hour
- Run full health check across all systems
- Verify DB metrics stable
- Confirm backup cron ran successfully
- Watch for any latency degradation
- Check for any stalled ingestion runs
T+4 Hours
- Review system performance baseline
- Confirm all alerts are resolved or acknowledged
- Verify US data quality scores meet threshold (>0.7)
- Update #us-launch-ops with milestone: "First US data refresh complete"
T+8 Hours
- Confirm stable operation for extended period
- Review logs for any warning patterns
- Verify next backup scheduled
- Hand off from the launch team to on-call, if staffed by different people
T+24 Hours (Post-Launch Day 1)
- Confirm all US sources fresh (last run <24h)
- Verify no data quality regressions
- Confirm no P1/P2 incidents occurred
- Update PagerDuty schedule for Day 2
- Post launch summary to #us-launch-ops
T+48 Hours (Post-Launch Day 2)
- Review all monitoring trends
- Confirm auto-scaling behaved correctly
- Verify no long-term issues from launch traffic
- Archive launch monitoring dashboard
- Capture quick retrospective notes
US-Specific Monitoring Setup
Slack Alert Configuration
The #us-alerts channel is configured via webhook integration. To set up:

```shell
# 1. Create a Slack webhook for US alerts
#    Go to: https://api.slack.com/apps/<app-id>/incoming-webhooks

# 2. Add the webhook URL to the environment
SLACK_US_ALERTS_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ

# 3. Verify the webhook configuration
curl -X POST $SLACK_US_ALERTS_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d '{"text": "US alerts webhook test"}'
```
Alert Routing Rules
```yaml
# prometheus_alerts.yml additions for US launch
groups:
  - name: us_launch_alerts
    rules:
      - alert: USDataIngestionStale
        expr: hours_since_last_run{source=~"us_.*"} > 12
        for: 5m
        labels:
          severity: warning
          team: us-ops
        annotations:
          summary: "US source {{ $labels.source }} data is stale"
          channel: "#us-alerts"
      - alert: USFleetErrorConcentration
        # 0.15 is the warning threshold from the Fleet Health table (>15%)
        expr: fleet_error_concentration{region="us-east-1"} > 0.15
        for: 5m
        labels:
          severity: warning
          team: us-ops
        annotations:
          summary: "US fleet error concentration {{ $value | humanizePercentage }}"
          channel: "#us-alerts"
```
Monitoring Dashboard
Access at: https://grafana.buywhere.ai/d/us-launch
Key panels:
- US API Latency (P50/P95/P99)
- US Error Rate
- US Data Freshness by Source
- US Fleet Health
- US Infrastructure Metrics
Rollback Plan
Criteria for Rollback
Initiate rollback if ANY of:
- P1 incident active for >30 minutes
- >50% of US sources failed
- Data corruption affecting US market
- Security incident affecting US user data
- P95 latency >5s for >15 minutes
- Error rate >10% for >10 minutes
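The quantitative criteria above can be condensed into a quick decision helper. This is an illustrative sketch (the function does not exist as a script, and it ignores the sustained-duration qualifiers for brevity):

```shell
# should_rollback <p1_active_min> <sources_failed_pct> <error_rate_pct>
# Returns 0 (yes) if any quantitative rollback criterion is breached.
should_rollback() {
  local p1_min=$1 failed_pct=$2 err_pct=$3
  [ "$p1_min" -gt 30 ]     && return 0   # P1 active for >30 minutes
  [ "$failed_pct" -ge 50 ] && return 0   # >50% of US sources failed
  [ "$err_pct" -gt 10 ]    && return 0   # error rate >10%
  return 1
}
```

The non-quantitative criteria (data corruption, security incidents) still require human judgment.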
Rollback Procedure
```shell
# 1. Announce rollback initiation
#    Post in #us-launch-ops: "ROLLBACK INITIATED - Reason: <brief description>"

# 2. Disable US-specific features (feature flags)
curl -X PATCH $API_URL/admin/features \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"us_market_enabled": false}'

# 3. Route traffic away from US region (if multi-region)
#    Update Route53 to point to staging/sg region only

# 4. Pause US ingestion
curl -X POST $API_URL/admin/ingestion/pause \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"sources": ["us_*"]}'

# 5. Verify rollback
curl -sf https://api.buywhere.ai/health
curl -sf https://api.buywhere.ai/metrics | grep error_rate

# 6. Post rollback status to #incidents
```
Data Rollback
If data corruption detected:
```shell
# 1. Stop all US ingestion immediately
curl -X POST $API_URL/admin/ingestion/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 2. Identify last known good backup
./scripts/backup.sh list daily | grep -E "^us_.*good"

# 3. Restore US-specific data from backup
./scripts/backup.sh restore /var/backups/buywhere/daily/us_backup_YYYYMMDD.gz us_catalog

# 4. Verify data integrity
psql -h localhost -U buywhere -d catalog \
  -c "SELECT COUNT(*) FROM products WHERE source LIKE 'us_%';"

# 5. Resume ingestion with verification
curl -X POST $API_URL/admin/ingestion/resume \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
Incident Response Procedures
P1 (Critical) Response
```shell
# 1. Acknowledge in PagerDuty within 5 minutes
pd incident ack <incident-id>

# 2. Join #incidents-critical and post the initial message:
#    🔴 P1 INCIDENT: <brief description>
#    Impact: <what's affected>
#    Time: <when started>
#    On-call: <your name>
#    Initial assessment: <2 sentence assessment>

# 3. Immediate diagnostics
curl -sf https://api.buywhere.ai/health
curl -sf https://api.buywhere.ai/metrics | grep -E "error_rate|latency"
docker-compose -f docker-compose.prod.yml ps

# 4. If API down:
docker-compose -f docker-compose.prod.yml restart api
curl -sf https://api.buywhere.ai/health

# 5. If DB issue:
./scripts/failover_replica.sh --check-only
# Follow disaster_recovery_runbook.md procedures

# 6. If scraper fleet issue:
./scripts/trigger_us_scrapers.sh --all

# 7. Post an update in #incidents-critical every 15 minutes,
#    and a resolution-progress update every 30 minutes
```
P2 (High) Response
```shell
# 1. Acknowledge within 15 minutes
pd incident ack <incident-id>

# 2. Post in #incidents:
#    🟠 P2 INCIDENT: <brief description>
#    Impact: <what's affected>
#    On-call: <your name>
#    Initial assessment: <2 sentence assessment>

# 3. Investigate root cause
docker-compose logs api --tail=100 | grep ERROR   # check logs
curl -sf https://api.buywhere.ai/metrics          # check metrics
docker-compose ps db                              # check DB

# 4. Apply fix or escalate
#    If a fix is available within 30 min, apply and verify
#    If not, escalate to P1

# 5. Update status every 30 minutes
```
Post-Incident Documentation
Within 24 hours of any P1/P2 incident:
```markdown
# Incident Report: [BUY-XXXX] - [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P1 / P2
**Root Cause:** [Brief description]

## Timeline
- HH:MM - Alert fired / Incident detected
- HH:MM - On-call acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Incident resolved

## Impact
- Users affected: [number/estimate]
- Data loss: [yes/no/amount]
- Downtime: [duration]

## Root Cause
[Detailed explanation of what went wrong]

## Resolution
[What was done to fix the issue]

## Action Items
- [ ] Action item 1 (owner: @name)
- [ ] Action item 2 (owner: @name)

## Lessons Learned
[What we learned from this incident]
```
Pre-Launch Verification
Complete this checklist by April 21 (2 days before launch):
Infrastructure
- All ECS tasks healthy
- Database replication lag <30s
- Redis cluster responsive
- PgBouncer pool usage <70%
- Backup verification passing
- Disaster recovery tested
Monitoring
- Grafana dashboards created and accessible
- Alert rules deployed and firing correctly
- PagerDuty schedule configured
- Slack channels created and permissions set
- Webhook integrations tested
Data
- US data sources configured
- Ingestion pipeline tested
- Data quality checks passing (>0.7)
- US feature flags configured
Runbooks
- This runbook reviewed and approved
- All team members have access
- Escalation contacts verified
- On-call training completed
Contact Information
| Role | Name | Contact |
|---|---|---|
| Primary On-Call | @ops-primary-oncall | PagerDuty |
| Secondary On-Call | @ops-secondary-oncall | PagerDuty |
| Ops Lead (Bolt) | @bolt | Slack DM, @bolt |
| Engineering Manager | @eng-manager | Slack DM |
| CTO | @cto | Slack DM (P1 only) |
| Security | — | security@buywhere.ai |
Appendix
Key Endpoints
| Endpoint | Purpose |
|---|---|
| https://api.buywhere.ai/health | API health check |
| https://api.buywhere.ai/metrics | Prometheus metrics |
| https://grafana.buywhere.ai/d/us-launch | US launch dashboard |
Key Scripts
| Script | Location | Purpose |
|---|---|---|
| trigger_us_scrapers.sh | /app/scripts/ | Trigger US data refresh |
| failover_replica.sh | /app/scripts/ | DB failover |
| backup.sh | /app/scripts/ | Backup operations |
| ecs-autoscaling.sh | /app/scripts/ | ECS scaling |
Runbook Maintenance
- Review Frequency: Before each major launch
- Last Reviewed: 2026-04-18
- Version Control: Git (docs/us_launch_runbook.md)
End of Document