US Launch Ops Runbook
- Document Version: 1.0
- Last Updated: 2026-04-18
- Owner: Ops Team
- Launch Date: April 23, 2026
- Classification: Internal - Confidential
Overview
This runbook covers all operational procedures for the BuyWhere US launch on April 23, 2026. It includes monitoring thresholds, incident response protocols, launch day checklists, and rollback procedures specific to US market operations.
Quick Reference
Incident Severity Levels
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| P1 - Critical | Complete service outage or data loss | 15 minutes | API down, DB failure, all scrapers failing |
| P2 - High | Major feature broken, significant impact | 30 minutes | Product search failing, auth issues |
| P3 - Medium | Degraded performance, minor feature broken | 2 hours | Slow responses, non-critical errors |
| P4 - Low | Minor issues, cosmetic problems | Next business day | UI typos, non-urgent warnings |
Response Times by Severity
- P1 Critical: Acknowledge in 5 min, engage in 15 min, resolve in 1 hour
- P2 High: Acknowledge in 15 min, engage in 30 min, resolve in 4 hours
- P3 Medium: Acknowledge in 1 hour, engage in 2 hours, resolve in 24 hours
- P4 Low: Acknowledge in 8 hours, resolve in 48 hours
US Launch Monitoring Thresholds
API Performance
| Metric | Warning | Critical | Auto-Action |
|---|---|---|---|
| API Latency P50 | >200ms | >500ms | Scale up |
| API Latency P95 | >500ms | >1s | Scale up |
| API Latency P99 | >1s | >2s | Scale up + alert |
| Error Rate (5xx) | >0.5% | >1% | Page on-call |
| Request Rate | >1000/min | >2000/min | Scale up |
| CPU Utilization | >60% | >80% | Scale up |
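As a minimal sketch, the latency thresholds above can be checked mechanically. The helper below is illustrative (not an existing script) and classifies a sampled P95 latency in milliseconds against the warning/critical levels in the table:

```shell
# Classify a sampled P95 latency (ms) against the thresholds above.
# Illustrative helper; the real alerting lives in Prometheus rules.
classify_p95() {
  local ms=$1
  if   [ "$ms" -gt 1000 ]; then echo "critical"   # >1s: scale up + alert
  elif [ "$ms" -gt 500 ];  then echo "warning"    # >500ms: scale up
  else                          echo "ok"
  fi
}
```

For example, `classify_p95 742` prints `warning`.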
Data Ingestion (US Sources)
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Last Run Age | >12 hours | >24 hours | Trigger re-scrape |
| Error Rate | >5% | >10% | Investigate |
| Quality Score | <0.7 | <0.5 | Pause source |
| Stale Data % | >20% | >40% | Refresh data |
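The Last Run Age thresholds lend themselves to the same treatment; the sketch below (function names are illustrative, not existing scripts) computes a source's run age from epoch-second timestamps and maps it onto the actions in the table:

```shell
# Hours elapsed since a source's last run, from epoch-second timestamps.
hours_since_run() {
  local last_run=$1 now=${2:-$(date +%s)}
  echo $(( (now - last_run) / 3600 ))
}

# Map that age onto the Last Run Age row above.
ingestion_action() {
  local hours=$1
  if   [ "$hours" -gt 24 ]; then echo "critical: trigger re-scrape"
  elif [ "$hours" -gt 12 ]; then echo "warning: trigger re-scrape"
  else                           echo "ok"
  fi
}
```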
Fleet Health
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Unhealthy Scrapers | >10% | >25% | Auto-heartbeat |
| Fleet Error Concentration | >15% | >25% | Auto-heartbeat |
| Replication Lag | >30s | >60s | Page on-call |
Infrastructure (US Region)
| Metric | Warning | Critical | Action |
|---|---|---|---|
| DB Connections | >70% | >90% | Scale or optimize |
| DB CPU | >60% | >85% | Scale up |
| Redis Memory | >70% | >85% | Evict or scale |
| Disk Space | >70% | >85% | Clean up |
| Backup Age | >24 hours | >48 hours | Immediate backup |
Escalation Path
Primary On-Call Rotation
| Time (ET) | Primary | Secondary |
|---|---|---|
| 00:00 - 08:00 | @ops-night-oncall | @ops-day-oncall |
| 08:00 - 16:00 | @ops-day-oncall | @ops-evening-oncall |
| 16:00 - 24:00 | @ops-evening-oncall | @ops-night-oncall |
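For quick lookups during an incident, the rotation table can be expressed as a small helper. This is an illustrative sketch only; the PagerDuty schedule remains the source of truth:

```shell
# Map an ET hour (0-23) to the primary on-call handle per the table above.
primary_oncall() {
  local h=$1
  if   [ "$h" -lt 8 ];  then echo "@ops-night-oncall"
  elif [ "$h" -lt 16 ]; then echo "@ops-day-oncall"
  else                       echo "@ops-evening-oncall"
  fi
}
```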
Escalation Chain
P1 (Critical):
→ @ops-primary-oncall (PagerDuty)
→ @ops-lead (Bolt) (if no response in 10 min)
→ @cto (if no response in 20 min)
→ All hands (if no response in 30 min)
P2 (High):
→ @ops-primary-oncall (Slack + PagerDuty)
→ @ops-lead (if no response in 30 min)
P3 (Medium):
→ @ops-oncall (Slack)
→ @ops-lead (next morning if unresolved)
P4 (Low):
→ @ops-ticket-queue (next business day)
Slack Channels
| Channel | Purpose | Access |
|---|---|---|
| #us-launch-ops | US launch operations, real-time monitoring | All ops |
| #us-alerts | US-specific alerts (auto-created) | On-call |
| #incidents | General incidents | All ops |
| #incidents-critical | P1 incidents only | On-call + leads |
| #ops-escalation | Escalations and handoffs | Leads |
Launch Day Checklist
T-30 Minutes (Before Launch)
- Verify all US data sources are healthy
- Confirm DB replication is within acceptable lag (<30s)
- Check backup verification passed for last 24 hours
- Verify Redis cache is warm and responsive
- Confirm PgBouncer connection pool healthy (<70% usage)
- Test API health endpoint: `curl -sf https://api.buywhere.ai/health`
- Verify monitoring dashboards accessible
- Confirm Slack #us-launch-ops channel is active
- Verify on-call rotation is set in PagerDuty
- Check US-specific feature flags are configured
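The checklist above can be driven by a small pass/fail loop so nothing gets skipped under launch pressure. This is a sketch under stated assumptions: the `run_checks` helper does not exist today, and the example check commands are taken from elsewhere in this runbook:

```shell
# Read "name|command" pairs on stdin, run each, print PASS/FAIL,
# and return non-zero if any check failed.
run_checks() {
  local failed=0 name cmd
  while IFS='|' read -r name cmd; do
    if eval "$cmd" >/dev/null 2>&1; then
      printf 'PASS %s\n' "$name"
    else
      printf 'FAIL %s\n' "$name"
      failed=1
    fi
  done
  return $failed
}

# Example T-30 usage (commands assumed from this runbook):
# run_checks <<'EOF'
# api_health|curl -sf https://api.buywhere.ai/health
# dashboard|curl -sf https://grafana.buywhere.ai/d/us-launch
# EOF
```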
T-0 (Launch - 00:00 ET April 23)
- Announce in #us-launch-ops: "US Launch commencing"
- Monitor initial traffic spike for first 5 minutes
- Watch API latency metrics (P50, P95, P99)
- Monitor error rate for first 15 minutes
- Verify first US ingestion runs complete successfully
- Check fleet health after first batch of scrapers
- Confirm auto-scaling triggered if needed
T+15 Minutes
- Verify P50 latency <200ms
- Verify error rate <0.5%
- Confirm all US sources running
- Check whether any new alerts fired
- Update #us-launch-ops with status: "Launch nominal"
T+1 Hour
- Run full health check across all systems
- Verify DB metrics stable
- Confirm backup cron ran successfully
- Watch for any latency degradation
- Check for any stalled ingestion runs
T+4 Hours
- Review system performance baseline
- Confirm all alerts are resolved or acknowledged
- Verify US data quality scores meet threshold (>0.7)
- Update #us-launch-ops with milestone: "First US data refresh complete"
T+8 Hours
- Confirm stable operation for extended period
- Review logs for any warning patterns
- Verify next backup scheduled
- Hand off from the launch team to on-call, if staffed by different people
T+24 Hours (Post-Launch Day 1)
- Confirm all US sources fresh (last run <24h)
- Verify no data quality regressions
- Confirm no P1/P2 incidents occurred
- Update PagerDuty schedule for Day 2
- Post launch summary to #us-launch-ops
T+48 Hours (Post-Launch Day 2)
- Review all monitoring trends
- Confirm auto-scaling behaved correctly
- Verify no long-term issues from launch traffic
- Archive launch monitoring dashboard
- Capture quick retrospective notes
US-Specific Monitoring Setup
Slack Alert Configuration
The #us-alerts channel is configured via webhook integration. To set up:

```shell
# 1. Create a Slack webhook for US alerts
#    Go to: https://api.slack.com/apps/<app-id>/incoming-webhooks

# 2. Add the webhook URL to the environment
SLACK_US_ALERTS_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ

# 3. Verify the webhook configuration
curl -X POST $SLACK_US_ALERTS_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d '{"text": "US alerts webhook test"}'
```
Alert Routing Rules
```yaml
# prometheus_alerts.yml additions for US launch
groups:
  - name: us_launch_alerts
    rules:
      - alert: USDataIngestionStale
        expr: hours_since_last_run{source=~"us_.*"} > 12
        for: 5m
        labels:
          severity: warning
          team: us-ops
        annotations:
          summary: "US source {{ $labels.source }} data is stale"
          channel: "#us-alerts"
      - alert: USFleetErrorConcentration
        # 0.15 is the warning threshold from the Fleet Health table (>15%)
        expr: fleet_error_concentration{region="us-east-1"} > 0.15
        for: 5m
        labels:
          severity: warning
          team: us-ops
        annotations:
          summary: "US fleet error concentration {{ $value | humanizePercentage }}"
          channel: "#us-alerts"
```
Monitoring Dashboard
Access at: https://grafana.buywhere.ai/d/us-launch
Key panels:
- US API Latency (P50/P95/P99)
- US Error Rate
- US Data Freshness by Source
- US Fleet Health
- US Infrastructure Metrics
Rollback Plan
Criteria for Rollback
Initiate rollback if ANY of:
- P1 incident active for >30 minutes
- >50% of US sources failed
- Data corruption affecting US market
- Security incident affecting US user data
- P95 latency >5s for >15 minutes
- Error rate >10% for >10 minutes
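The quantitative criteria above can be condensed into a quick decision helper. This is an illustrative sketch (the function does not exist as a script, and it ignores the sustained-duration qualifiers for brevity):

```shell
# should_rollback <p1_active_min> <sources_failed_pct> <error_rate_pct>
# Returns 0 (yes) if any quantitative rollback criterion is breached.
should_rollback() {
  local p1_min=$1 failed_pct=$2 err_pct=$3
  [ "$p1_min" -gt 30 ]     && return 0   # P1 active for >30 minutes
  [ "$failed_pct" -ge 50 ] && return 0   # >50% of US sources failed
  [ "$err_pct" -gt 10 ]    && return 0   # error rate >10%
  return 1
}
```

The non-quantitative criteria (data corruption, security incidents) still require human judgment.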
Rollback Procedure
```shell
# 1. Announce rollback initiation
#    Post in #us-launch-ops: "ROLLBACK INITIATED - Reason: <brief description>"

# 2. Disable US-specific features (feature flags)
curl -X PATCH $API_URL/admin/features \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"us_market_enabled": false}'

# 3. Route traffic away from US region (if multi-region)
#    Update Route53 to point to staging/sg region only

# 4. Pause US ingestion
curl -X POST $API_URL/admin/ingestion/pause \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -d '{"sources": ["us_*"]}'

# 5. Verify rollback
curl -sf https://api.buywhere.ai/health
curl -sf https://api.buywhere.ai/metrics | grep error_rate

# 6. Post rollback status to #incidents
```
Data Rollback
If data corruption detected:
```shell
# 1. Stop all US ingestion immediately
curl -X POST $API_URL/admin/ingestion/stop \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# 2. Identify last known good backup
./scripts/backup.sh list daily | grep -E "^us_.*good"

# 3. Restore US-specific data from backup
./scripts/backup.sh restore /var/backups/buywhere/daily/us_backup_YYYYMMDD.gz us_catalog

# 4. Verify data integrity
psql -h localhost -U buywhere -d catalog \
  -c "SELECT COUNT(*) FROM products WHERE source LIKE 'us_%';"

# 5. Resume ingestion with verification
curl -X POST $API_URL/admin/ingestion/resume \
  -H "Authorization: Bearer $ADMIN_TOKEN"
```
Incident Response Procedures
P1 (Critical) Response
```shell
# 1. Acknowledge in PagerDuty within 5 minutes
pd incident ack <incident-id>

# 2. Join #incidents-critical and post the initial message:
#    🔴 P1 INCIDENT: <brief description>
#    Impact: <what's affected>
#    Time: <when started>
#    On-call: <your name>
#    Initial assessment: <2 sentence assessment>

# 3. Immediate diagnostics
curl -sf https://api.buywhere.ai/health
curl -sf https://api.buywhere.ai/metrics | grep -E "error_rate|latency"
docker-compose -f docker-compose.prod.yml ps

# 4. If API down:
docker-compose -f docker-compose.prod.yml restart api
curl -sf https://api.buywhere.ai/health

# 5. If DB issue:
./scripts/failover_replica.sh --check-only
# Follow disaster_recovery_runbook.md procedures

# 6. If scraper fleet issue:
./scripts/trigger_us_scrapers.sh --all

# 7. Post an update in #incidents-critical every 15 minutes,
#    and a resolution-progress update every 30 minutes
```
P2 (High) Response
```shell
# 1. Acknowledge within 15 minutes
pd incident ack <incident-id>

# 2. Post in #incidents:
#    🟠 P2 INCIDENT: <brief description>
#    Impact: <what's affected>
#    On-call: <your name>
#    Initial assessment: <2 sentence assessment>

# 3. Investigate root cause
docker-compose logs api --tail=100 | grep ERROR   # check logs
curl -sf https://api.buywhere.ai/metrics          # check metrics
docker-compose ps db                              # check DB

# 4. Apply fix or escalate
#    If a fix is available within 30 min, apply and verify
#    If not, escalate to P1

# 5. Update status every 30 minutes
```
Post-Incident Documentation
Within 24 hours of any P1/P2 incident:
```markdown
# Incident Report: [BUY-XXXX] - [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P1 / P2
**Root Cause:** [Brief description]

## Timeline
- HH:MM - Alert fired / Incident detected
- HH:MM - On-call acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix applied
- HH:MM - Incident resolved

## Impact
- Users affected: [number/estimate]
- Data loss: [yes/no/amount]
- Downtime: [duration]

## Root Cause
[Detailed explanation of what went wrong]

## Resolution
[What was done to fix the issue]

## Action Items
- [ ] Action item 1 (owner: @name)
- [ ] Action item 2 (owner: @name)

## Lessons Learned
[What we learned from this incident]
```
Pre-Launch Verification
Complete this checklist by April 21 (2 days before launch):
Infrastructure
- All ECS tasks healthy
- Database replication lag <30s
- Redis cluster responsive
- PgBouncer pool usage <70%
- Backup verification passing
- Disaster recovery tested
Monitoring
- Grafana dashboards created and accessible
- Alert rules deployed and firing correctly
- PagerDuty schedule configured
- Slack channels created and permissions set
- Webhook integrations tested
Data
- US data sources configured
- Ingestion pipeline tested
- Data quality checks passing (>0.7)
- US feature flags configured
Runbooks
- This runbook reviewed and approved
- All team members have access
- Escalation contacts verified
- On-call training completed
Contact Information
| Role | Name | Contact |
|---|---|---|
| Primary On-Call | @ops-primary-oncall | PagerDuty |
| Secondary On-Call | @ops-secondary-oncall | PagerDuty |
| Ops Lead (Bolt) | @bolt | Slack DM, @bolt |
| Engineering Manager | @eng-manager | Slack DM |
| CTO | @cto | Slack DM (P1 only) |
| Security | — | security@buywhere.ai |
Appendix
Key Endpoints
| Endpoint | Purpose |
|---|---|
| https://api.buywhere.ai/health | API health check |
| https://api.buywhere.ai/metrics | Prometheus metrics |
| https://grafana.buywhere.ai/d/us-launch | US launch dashboard |
Key Scripts
| Script | Location | Purpose |
|---|---|---|
| trigger_us_scrapers.sh | /app/scripts/ | Trigger US data refresh |
| failover_replica.sh | /app/scripts/ | DB failover |
| backup.sh | /app/scripts/ | Backup operations |
| ecs-autoscaling.sh | /app/scripts/ | ECS scaling |
Runbook Maintenance
- Review Frequency: Before each major launch
- Last Reviewed: 2026-04-18
- Version Control: Git (docs/us_launch_runbook.md)
End of Document