BuyWhere On-Call Runbook
Status: Active | Last Updated: 2026-04-19 | Owner: Ops Team | Version: 1.1 | Classification: Internal
Overview
This runbook covers on-call procedures for BuyWhere API services. It includes alert response workflows, escalation paths, and incident management procedures.
Launch Day Note (BUY-3415): For the US launch on April 23, 2026, action any alert carrying the launch_critical: "true" label immediately. See the Launch Day Runbook for full launch procedures.
On-Call Rotation
Primary Contacts
| Role | Agent | Contact | Hours |
|---|---|---|---|
| Primary On-Call | PagerDuty | Auto-page | 24/7 |
| DevOps Lead | Bolt | @bolt | Business hours |
| CTO | Rex | @cto | P1 only |
Escalation Path
P1 (Critical — service down or data at risk):
On-call → Bolt (10 min no response) → CTO (20 min no response)
P2 (High — major feature broken):
On-call → Bolt (30 min no response)
P3 (Medium):
On-call → Bolt (next business day)
LAUNCH DAY (April 23, 2026):
P1 Critical → Rex (immediate) → Vera (5 min no response)
P2 High → Bolt (immediate) → Rex (15 min no response)
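The escalation paths above can be expressed as a small lookup helper, useful for sanity-checking who to page next. This is a minimal sketch: the function name is hypothetical, and the contacts and timings simply mirror the tables above.

```shell
#!/bin/sh
# escalation_chain SEVERITY [launch] - print the escalation chain for a
# given severity, per the paths above. Names and timings mirror this
# runbook; update here if the rotation changes.
escalation_chain() {
  severity="$1"
  mode="${2:-normal}"
  if [ "$mode" = "launch" ]; then
    case "$severity" in
      P1) echo "Rex (immediate) -> Vera (5 min no response)" ;;
      P2) echo "Bolt (immediate) -> Rex (15 min no response)" ;;
      *)  echo "on-call" ;;
    esac
  else
    case "$severity" in
      P1) echo "on-call -> Bolt (10 min) -> CTO (20 min)" ;;
      P2) echo "on-call -> Bolt (30 min)" ;;
      P3) echo "on-call -> Bolt (next business day)" ;;
      *)  echo "on-call" ;;
    esac
  fi
}
```

Example: `escalation_chain P1 launch` prints the launch-day P1 chain.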
Alert Sources
| Source | Channel | Description |
|---|---|---|
| PagerDuty | SMS/Phone/Email | Critical infrastructure alerts |
| Slack | #alerts | Prometheus/Alertmanager webhooks |
| Slack | #us-alerts | US-specific uptime alerts |
| Sentry | https://sentry.io/organizations/buywhere/ | Application errors |
Critical Alerts (Page Immediately)
HighErrorRate
- Description: 5xx error rate exceeds 1% for 5 minutes
- Action: Page on-call immediately
- Runbook: See Deployment Rollback
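The rule behind this alert lives in prometheus_alerts.yml (see Related Documents). A hedged sketch of what it likely looks like follows; the metric name `http_requests_total` and the `status` label are assumptions, not taken from the actual config.

```yaml
groups:
  - name: api-critical
    rules:
      - alert: HighErrorRate
        # 5xx share of all requests above 1% for 5 minutes.
        # Metric and label names are illustrative; match them to the
        # real instrumentation in prometheus_alerts.yml.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
```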
APIServiceDown
- Description: API health endpoint unreachable for 2+ minutes
- Action: Page on-call, begin incident response
- Runbook: Check API pods, restart if needed
DatabasePoolExhausted
- Description: DB connection pool > 90% utilized
- Action: Page on-call, check for connection leaks
- Runbook: Restart API or scale connections
ReplicationLagCritical
- Description: PostgreSQL replication lag > 5 seconds
- Action: Page on-call, check replica health
- Runbook: May require failover
Launch-Day Critical Alerts (BUY-3415)
LaunchAffiliateTrackingFailure
- Description: Error rate on /go endpoint > 1% for 3 minutes during launch
- Action: Page on-call IMMEDIATELY — affiliate revenue impacted
- Severity: P1 during launch window
LaunchSearchQualityDegradation
- Description: >50% of compare requests returning zero matches
- Action: Page on-call immediately — core user experience impacted
- Severity: P1 during launch window
LaunchDatabasePoolPressure
- Description: DB connection pool > 80% utilized during launch
- Action: Monitor closely, prepare to scale connections
- Severity: P2 — escalate before it hits 90%
LaunchExternalUptimeCheckFailed
- Description: External monitoring (blackbox) detects API or website down
- Action: Page on-call immediately — customers affected
- Severity: P1 during launch window
LaunchTrafficSpike
- Description: Request rate > 100 req/s (elevated above baseline)
- Action: Monitor capacity, watch for downstream pressure
- Severity: Info — alert for awareness
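Per the launch note (BUY-3415), launch-day alerts carry the launch_critical: "true" label so they can be routed and actioned immediately. A hedged sketch of one such rule follows; the expression, metric, and label names are assumptions, only the alert name, threshold, and label come from this runbook.

```yaml
- alert: LaunchAffiliateTrackingFailure
  # Error rate on /go above 1% for 3 minutes (see above).
  # Metric and label names are illustrative only.
  expr: |
    sum(rate(http_requests_total{path="/go",status=~"5.."}[3m]))
      / sum(rate(http_requests_total{path="/go"}[3m])) > 0.01
  for: 3m
  labels:
    severity: critical
    launch_critical: "true"
```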
Warning Alerts (Monitor & Respond)
| Alert | Threshold | Action |
|---|---|---|
| HighLatencyP95 | P95 > 1s for 5min | Investigate, monitor |
| HighLatencyP99 | P99 > 2.5s for 5min | Page if sustained |
| DiskSpaceHigh | Disk > 80% | Clean up logs, old files |
| ReplicationLagHigh | Lag > 2s for 5min | Monitor, investigate |
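How these alerts reach PagerDuty versus the Slack channels in the Alert Sources table is configured in alertmanager.yml. A hedged sketch of plausible routing is below; the receiver names are assumptions, and only the launch_critical label and channel split come from this runbook.

```yaml
route:
  receiver: slack-alerts          # default: Slack #alerts
  routes:
    - matchers: [ 'launch_critical="true"' ]
      receiver: pagerduty-oncall  # page immediately during launch
      repeat_interval: 5m
    - matchers: [ 'severity="critical"' ]
      receiver: pagerduty-oncall  # critical alerts page on-call
receivers:
  - name: pagerduty-oncall
  - name: slack-alerts
```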
Alert Response Workflow
Step 1: Acknowledge
- Receive PagerDuty page or Slack alert
- Acknowledge alert in PagerDuty dashboard
- Join the #incidents Slack channel
Step 2: Assess
- Check Grafana dashboard: https://grafana.buywhere.io/d/api-main
- Check Loki logs: https://grafana.buywhere.io/d/logs
- Determine severity (P1/P2/P3)
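When assessing severity, the 1% HighErrorRate threshold is the key number. A minimal helper for computing the error rate from raw counts during triage; the function name is hypothetical.

```shell
#!/bin/sh
# error_rate ERRORS TOTAL - print the 5xx error rate as a percentage,
# for comparison against the 1% HighErrorRate threshold above.
error_rate() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "0.00"; exit }   # avoid divide-by-zero
    printf "%.2f\n", (e / t) * 100
  }'
}
```

Example: `error_rate 42 3000` prints 1.40, which is above the 1% threshold, so treat as P1 per HighErrorRate.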
Step 3: Respond
For P1 (Critical):
# Check API health
curl -sf https://api.buywhere.ai/health | jq '.'
# Check pod status (Kubernetes)
kubectl get pods -n production -l app=buywhere-api
# Check recent deployments
kubectl rollout history deployment/buywhere-api -n production
# If deployment issue, rollback
kubectl rollout undo deployment/buywhere-api -n production
For P2 (High):
- Monitor for 15 minutes
- Prepare rollback if error rate increases
- Update #incidents with status
For P3 (Medium):
- Log investigation in #incidents
- Schedule fix for next business day
Step 4: Resolve
- Confirm metric returns to normal
- Update PagerDuty with resolution notes
- Post all-clear to #incidents
- File incident report if P1/P2
Quick Reference Commands
# API health check
curl -sf https://api.buywhere.ai/health | jq '.'
# Detailed health (includes DB/Redis)
curl -sf https://api.buywhere.io/health/detailed | jq '.'
# Prometheus metrics
curl -sf https://api.buywhere.io/metrics | head -50
# Check recent errors in Loki
kubectl logs -n production -l app=buywhere-api --tail=100 | grep -i error
# Restart API pods
kubectl rollout restart deployment/buywhere-api -n production
# Rollback to previous version
kubectl rollout undo deployment/buywhere-api -n production
Runbooks
| Document | Purpose |
|---|---|
| Deployment Runbook | Deployment, rollback, thresholds |
| Launch Day Runbook | Event-specific launch procedures |
| Backup/Restore Runbook | Database backup and restore |
| Disaster Recovery Runbook | Major incident recovery |
| Scraper Fleet Runbook | Scraper monitoring and recovery |
| Emergency API Scaling | Auto-scaling procedures |
Related Documents
- Prometheus alerts: prometheus_alerts.yml
- Alert routing: alertmanager.yml
- Monitoring setup: docs/monitoring.md
- Uptime monitoring: docs/uptime-monitoring-setup.md
End of Document