BuyWhere On-Call Runbook

Status: Active | Last Updated: 2026-04-19 | Owner: Ops Team | Version: 1.1 | Classification: Internal


Overview

This runbook covers on-call procedures for BuyWhere API services. It includes alert response workflows, escalation paths, and incident management procedures.

Launch Day Note (BUY-3415): For the US launch on April 23, 2026, any alert carrying the launch_critical: "true" label must be actioned immediately. See Launch Day Runbook for full launch procedures.


On-Call Rotation

Primary Contacts

Role             Agent      Contact    Hours
Primary On-Call  PagerDuty  Auto-page  24/7
DevOps Lead      Bolt       @bolt      Business hours
CTO              Rex        @cto       P1 only

Escalation Path

P1 (Critical — service down or data at risk):
  On-call → Bolt (10 min no response) → CTO (20 min no response)

P2 (High — major feature broken):
  On-call → Bolt (30 min no response)

P3 (Medium):
  On-call → Bolt (next business day)

LAUNCH DAY (April 23, 2026):
  P1 Critical → Rex (immediate) → Vera (5 min no response)
  P2 High → Bolt (immediate) → Rex (15 min no response)

Alert Sources

Source     Channel                                    Description
PagerDuty  SMS/Phone/Email                            Critical infrastructure alerts
Slack      #alerts                                    Prometheus/Alertmanager webhooks
Slack      #us-alerts                                 US-specific uptime alerts
Sentry     https://sentry.io/organizations/buywhere/  Application errors

Critical Alerts (Page Immediately)

HighErrorRate

  • Description: 5xx error rate exceeds 1% for 5 minutes
  • Action: Page on-call immediately
  • Runbook: See Deployment Rollback
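During triage, the 1% threshold can be sanity-checked by hand from raw request counts. A minimal sketch with hypothetical numbers (bash has no floating point, so awk does the division):

```shell
# Hypothetical counts over a 5-minute window
total=12000
errors=150   # 5xx responses
awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.4f\n", e / t }'
# 0.0125 -> 1.25%, above the 1% threshold, so this should page
```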

APIServiceDown

  • Description: API health endpoint unreachable for 2+ minutes
  • Action: Page on-call, begin incident response
  • Runbook: Check API pods, restart if needed

DatabasePoolExhausted

  • Description: DB connection pool > 90% utilized
  • Action: Page on-call, check for connection leaks
  • Runbook: Restart API or scale connections

ReplicationLagCritical

  • Description: PostgreSQL replication lag > 5 seconds
  • Action: Page on-call, check replica health
  • Runbook: May require failover
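When assessing replica health, the lag figure can be pulled out of a health payload with jq. The JSON shape below is a hypothetical example, not the documented schema of the detailed health endpoint:

```shell
# Hypothetical payload shape; substitute the real /health/detailed response
payload='{"db":{"replication_lag_seconds":6.2}}'
echo "$payload" | jq '.db.replication_lag_seconds'
# -> 6.2, above the 5-second threshold
```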

Launch-Day Critical Alerts (BUY-3415)

LaunchAffiliateTrackingFailure

  • Description: Error rate on /go endpoint > 1% for 3 minutes during launch
  • Action: Page on-call IMMEDIATELY — affiliate revenue impacted
  • Severity: P1 during launch window

LaunchSearchQualityDegradation

  • Description: >50% of compare requests returning zero matches
  • Action: Page on-call immediately — core user experience impacted
  • Severity: P1 during launch window

LaunchDatabasePoolPressure

  • Description: DB connection pool > 80% utilized during launch
  • Action: Monitor closely, prepare to scale connections
  • Severity: P2 — escalate before it hits 90%
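Pool utilization can be computed the same way from in-use vs. maximum connections; the field names below are assumptions for illustration:

```shell
# Hypothetical pool stats; substitute the real health payload
payload='{"db":{"pool":{"in_use":40,"max":50}}}'
echo "$payload" | jq -r '.db.pool | "\(.in_use * 100 / .max)% utilized"'
# -> 80% utilized: inside the 80% launch threshold, so start preparing to scale
```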

LaunchExternalUptimeCheckFailed

  • Description: External monitoring (blackbox) detects API or website down
  • Action: Page on-call immediately — customers affected
  • Severity: P1 during launch window

LaunchTrafficSpike

  • Description: Request rate > 100 req/s (elevated above baseline)
  • Action: Monitor capacity, watch for downstream pressure
  • Severity: Info — alert for awareness
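To compare against the 100 req/s threshold, the request rate is simply a request count divided by the observation window; the numbers below are hypothetical:

```shell
# Hypothetical: 7200 requests observed over a 60-second window
requests=7200
window=60
awk -v r="$requests" -v w="$window" 'BEGIN { printf "%d req/s\n", r / w }'
# -> 120 req/s, above the 100 req/s baseline alert
```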

Warning Alerts (Monitor & Respond)

Alert               Threshold             Action
HighLatencyP95      P95 > 1s for 5 min    Investigate, monitor
HighLatencyP99      P99 > 2.5s for 5 min  Page if sustained
DiskSpaceHigh       Disk > 80%            Clean up logs, old files
ReplicationLagHigh  Lag > 2s for 5 min    Monitor, investigate
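For DiskSpaceHigh, always preview what a cleanup would remove before deleting. The sketch below demonstrates the pattern in a throwaway directory; on a real host, point find at the actual log path (path, pattern, and the 7-day cutoff are assumptions, adjust per host):

```shell
# Demonstration in a temp dir; replace "$logdir" with e.g. /var/log on a real host
logdir=$(mktemp -d)
touch -d '10 days ago' "$logdir/app.log.1"    # stale rotated log
touch "$logdir/app.log"                        # current log, must be kept
find "$logdir" -name '*.log.*' -mtime +7 -print    # preview what would be deleted
find "$logdir" -name '*.log.*' -mtime +7 -delete   # then actually delete
```

The preview/delete split avoids the classic mistake of a bad glob wiping live logs mid-incident.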

Alert Response Workflow

Step 1: Acknowledge

  1. Receive PagerDuty page or Slack alert
  2. Acknowledge alert in PagerDuty dashboard
  3. Join #incidents Slack channel

Step 2: Assess

  1. Check Grafana dashboard: https://grafana.buywhere.io/d/api-main
  2. Check Loki logs: https://grafana.buywhere.io/d/logs
  3. Determine severity (P1/P2/P3)

Step 3: Respond

For P1 (Critical):

# Check API health
curl -sf https://api.buywhere.io/health | jq '.'

# Check pod status (Kubernetes)
kubectl get pods -n production -l app=buywhere-api

# Check recent deployments
kubectl rollout history deployment/buywhere-api -n production

# If deployment issue, rollback
kubectl rollout undo deployment/buywhere-api -n production

For P2 (High):

  • Monitor for 15 minutes
  • Prepare rollback if error rate increases
  • Update #incidents with status

For P3 (Medium):

  • Log investigation in #incidents
  • Schedule fix for next business day

Step 4: Resolve

  1. Confirm metric returns to normal
  2. Update PagerDuty with resolution notes
  3. Post all-clear to #incidents
  4. File incident report if P1/P2

Quick Reference Commands

# API health check
curl -sf https://api.buywhere.io/health | jq '.'

# Detailed health (includes DB/Redis)
curl -sf https://api.buywhere.io/health/detailed | jq '.'

# Prometheus metrics
curl -sf https://api.buywhere.io/metrics | head -50

# Check recent errors in Loki
kubectl logs -n production -l app=buywhere-api --tail=100 | grep -i error

# Restart API pods
kubectl rollout restart deployment/buywhere-api -n production

# Rollback to previous version
kubectl rollout undo deployment/buywhere-api -n production

Runbooks

Document                   Purpose
Deployment Runbook         Deployment, rollback, thresholds
Launch Day Runbook         Event-specific launch procedures
Backup/Restore Runbook     Database backup and restore
Disaster Recovery Runbook  Major incident recovery
Scraper Fleet Runbook      Scraper monitoring and recovery
Emergency API Scaling      Auto-scaling procedures

Related Documents

  • Prometheus alerts: prometheus_alerts.yml
  • Alert routing: alertmanager.yml
  • Monitoring setup: docs/monitoring.md
  • Uptime monitoring: docs/uptime-monitoring-setup.md

End of Document