BuyWhere On-Call Runbook

Status: Active | Last Updated: 2026-04-19 | Owner: Ops Team | Version: 1.1 | Classification: Internal


Overview

This runbook covers on-call procedures for BuyWhere API services. It includes alert response workflows, escalation paths, and incident management procedures.

Launch Day Note (BUY-3415): For the US launch on April 23, 2026, any alert carrying the launch_critical: "true" label must be actioned immediately. See Launch Day Runbook for full launch procedures.


On-Call Rotation

Primary Contacts

Role             Agent      Contact    Hours
Primary On-Call  PagerDuty  Auto-page  24/7
DevOps Lead      Bolt       @bolt      Business hours
CTO              Rex        @cto       P1 only

Escalation Path

P1 (Critical — service down or data at risk):
  On-call → Bolt (10 min no response) → CTO (20 min no response)

P2 (High — major feature broken):
  On-call → Bolt (30 min no response)

P3 (Medium):
  On-call → Bolt (next business day)

LAUNCH DAY (April 23, 2026):
  P1 Critical → Rex (immediate) → Vera (5 min no response)
  P2 High → Bolt (immediate) → Rex (15 min no response)

Alert Sources

Source     Channel                                    Description
PagerDuty  SMS/Phone/Email                            Critical infrastructure alerts
Slack      #alerts                                    Prometheus/Alertmanager webhooks
Slack      #us-alerts                                 US-specific uptime alerts
Sentry     https://sentry.io/organizations/buywhere/  Application errors

Critical Alerts (Page Immediately)

HighErrorRate

  • Description: 5xx error rate exceeds 1% for 5 minutes
  • Action: Page on-call immediately
  • Runbook: See Deployment Rollback
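During triage, the 1% threshold can be sanity-checked by hand from raw request counts. A minimal sketch with hypothetical numbers (bash has no floating point, so awk does the division):

```shell
# Hypothetical counts over a 5-minute window
total=12000
errors=150   # 5xx responses
awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.4f\n", e / t }'
# 0.0125 -> 1.25%, above the 1% threshold, so this should page
```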

APIServiceDown

  • Description: API health endpoint unreachable for 2+ minutes
  • Action: Page on-call, begin incident response
  • Runbook: Check API pods, restart if needed

DatabasePoolExhausted

  • Description: DB connection pool > 90% utilized
  • Action: Page on-call, check for connection leaks
  • Runbook: Restart API or scale connections

ReplicationLagCritical

  • Description: PostgreSQL replication lag > 5 seconds
  • Action: Page on-call, check replica health
  • Runbook: May require failover
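When assessing replica health, the lag figure can be pulled out of a health payload with jq. The JSON shape below is a hypothetical example, not the documented schema of the detailed health endpoint:

```shell
# Hypothetical payload shape; substitute the real /health/detailed response
payload='{"db":{"replication_lag_seconds":6.2}}'
echo "$payload" | jq '.db.replication_lag_seconds'
# -> 6.2, above the 5-second threshold
```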

Launch-Day Critical Alerts (BUY-3415)

LaunchAffiliateTrackingFailure

  • Description: Error rate on /go endpoint > 1% for 3 minutes during launch
  • Action: Page on-call IMMEDIATELY — affiliate revenue impacted
  • Severity: P1 during launch window

LaunchSearchQualityDegradation

  • Description: >50% of compare requests returning zero matches
  • Action: Page on-call immediately — core user experience impacted
  • Severity: P1 during launch window

LaunchDatabasePoolPressure

  • Description: DB connection pool > 80% utilized during launch
  • Action: Monitor closely, prepare to scale connections
  • Severity: P2 — escalate before it hits 90%
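Pool utilization can be computed the same way from in-use vs. maximum connections; the field names below are assumptions for illustration:

```shell
# Hypothetical pool stats; substitute the real health payload
payload='{"db":{"pool":{"in_use":40,"max":50}}}'
echo "$payload" | jq -r '.db.pool | "\(.in_use * 100 / .max)% utilized"'
# -> 80% utilized: inside the 80% launch threshold, so start preparing to scale
```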

LaunchExternalUptimeCheckFailed

  • Description: External monitoring (blackbox) detects API or website down
  • Action: Page on-call immediately — customers affected
  • Severity: P1 during launch window

LaunchTrafficSpike

  • Description: Request rate > 100 req/s (elevated above baseline)
  • Action: Monitor capacity, watch for downstream pressure
  • Severity: Info — alert for awareness
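To compare against the 100 req/s threshold, the request rate is simply a request count divided by the observation window; the numbers below are hypothetical:

```shell
# Hypothetical: 7200 requests observed over a 60-second window
requests=7200
window=60
awk -v r="$requests" -v w="$window" 'BEGIN { printf "%d req/s\n", r / w }'
# -> 120 req/s, above the 100 req/s baseline alert
```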

Warning Alerts (Monitor & Respond)

Alert               Threshold             Action
HighLatencyP95      P95 > 1s for 5 min    Investigate, monitor
HighLatencyP99      P99 > 2.5s for 5 min  Page if sustained
DiskSpaceHigh       Disk > 80%            Clean up logs, old files
ReplicationLagHigh  Lag > 2s for 5 min    Monitor, investigate
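For DiskSpaceHigh, always preview what a cleanup would remove before deleting. The sketch below demonstrates the pattern in a throwaway directory; on a real host, point find at the actual log path (path, pattern, and the 7-day cutoff are assumptions, adjust per host):

```shell
# Demonstration in a temp dir; replace "$logdir" with e.g. /var/log on a real host
logdir=$(mktemp -d)
touch -d '10 days ago' "$logdir/app.log.1"    # stale rotated log
touch "$logdir/app.log"                        # current log, must be kept
find "$logdir" -name '*.log.*' -mtime +7 -print    # preview what would be deleted
find "$logdir" -name '*.log.*' -mtime +7 -delete   # then actually delete
```

The preview/delete split avoids the classic mistake of a bad glob wiping live logs mid-incident.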

Alert Response Workflow

Step 1: Acknowledge

  1. Receive PagerDuty page or Slack alert
  2. Acknowledge alert in PagerDuty dashboard
  3. Join #incidents Slack channel

Step 2: Assess

  1. Check Grafana dashboard: https://grafana.buywhere.io/d/api-main
  2. Check Loki logs: https://grafana.buywhere.io/d/logs
  3. Determine severity (P1/P2/P3)

Step 3: Respond

For P1 (Critical):

# Check API health
curl -sf https://api.buywhere.io/health | jq '.'

# Check pod status (Kubernetes)
kubectl get pods -n production -l app=buywhere-api

# Check recent deployments
kubectl rollout history deployment/buywhere-api -n production

# If deployment issue, rollback
kubectl rollout undo deployment/buywhere-api -n production

For P2 (High):

  • Monitor for 15 minutes
  • Prepare rollback if error rate increases
  • Update #incidents with status

For P3 (Medium):

  • Log investigation in #incidents
  • Schedule fix for next business day

Step 4: Resolve

  1. Confirm metric returns to normal
  2. Update PagerDuty with resolution notes
  3. Post all-clear to #incidents
  4. File incident report if P1/P2

Quick Reference Commands

# API health check
curl -sf https://api.buywhere.io/health | jq '.'

# Detailed health (includes DB/Redis)
curl -sf https://api.buywhere.io/health/detailed | jq '.'

# Prometheus metrics
curl -sf https://api.buywhere.io/metrics | head -50

# Check recent errors in Loki
kubectl logs -n production -l app=buywhere-api --tail=100 | grep -i error

# Restart API pods
kubectl rollout restart deployment/buywhere-api -n production

# Rollback to previous version
kubectl rollout undo deployment/buywhere-api -n production

Runbooks

Document                   Purpose
Deployment Runbook         Deployment, rollback, thresholds
Launch Day Runbook         Event-specific launch procedures
Backup/Restore Runbook     Database backup and restore
Disaster Recovery Runbook  Major incident recovery
Scraper Fleet Runbook      Scraper monitoring and recovery
Emergency API Scaling      Auto-scaling procedures

Related Documents

  • Prometheus alerts: prometheus_alerts.yml
  • Alert routing: alertmanager.yml
  • Monitoring setup: docs/monitoring.md
  • Uptime monitoring: docs/uptime-monitoring-setup.md

End of Document