
Scraper Fleet Runbook

Status: Active | Last Updated: 2026-04-16 | Owner: DevOps (Pipe agent)


Overview

This runbook covers common failure modes for the BuyWhere scraper worker fleet, diagnosis procedures, and remediation steps.


Quick Reference

Alert Type                                Severity   Action
Restart storm (>5 restarts/10min)         Warning    Investigate logs, check rate limits
Restart storm (>10 restarts/10min)        Critical   Immediate investigation required
Process down >5min                        Critical   Restart or escalate
Stalled scrape (>24h since last run)      Warning    Check scheduler health
High failure rate (>50% products failed)  Warning    Investigate scraping logic
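The restart-storm thresholds above are simple enough to express as a classifier. A minimal sketch (the function name is illustrative, not part of the actual monitor):

```python
from typing import Optional

def restart_storm_severity(restarts_per_10min: int) -> Optional[str]:
    """Map a 10-minute restart count to the alert severity in the table above."""
    if restarts_per_10min > 10:
        return "critical"
    if restarts_per_10min > 5:
        return "warning"
    return None  # within normal churn, no alert
```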

Process Supervision

PM2 (Recommended for development/local)

# Install PM2
npm install -g pm2

# Start the scraper scheduler with PM2
pm2 start ecosystem.scraper.json

# View logs
pm2 logs scraper-scheduler

# View restart count
pm2 list

# Restart a process
pm2 restart scraper-scheduler

# Stop a process
pm2 stop scraper-scheduler

# Delete a process
pm2 delete scraper-scheduler

Systemd (Recommended for production)

# Install the service
sudo cp config/scraper-fleet.service /etc/systemd/system/
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable scraper-fleet
sudo systemctl start scraper-fleet

# Check status
sudo systemctl status scraper-fleet

# View logs
sudo journalctl -u scraper-fleet -f

# Restart
sudo systemctl restart scraper-fleet

Common Failure Modes

1. Restart Storm (Process Keeps Crashing)

Symptoms:

  • PM2 shows high restart count for scraper-scheduler
  • Logs show repeated "Starting scrape" followed by crashes
  • Paperclip alert: restart_storm_warning or restart_storm_critical

Diagnosis:

# Check PM2 restart count
pm2 list

# View recent error logs
pm2 logs scraper-scheduler --err --lines 100

# Check for Python exceptions
grep -i "exception\|error\|traceback" logs/pm2-scraper-scheduler-error.log | tail -50

Common Causes:

  1. Database connection failure (check PostgreSQL)
  2. Missing environment variables (API keys)
  3. Import errors in scraper modules
  4. Memory exhaustion (OOM killer)

Remediation:

# 1. Check database connectivity
psql -h localhost -U buywhere -d catalog -c "SELECT 1"

# 2. Check environment
env | grep -E "DATABASE|REDIS|API"

# 3. If memory issue, restart with a smaller worker count
#    (pm2's --env flag selects an env block from the ecosystem file;
#    to override a variable, set it in the shell and use --update-env)
SCRAPER_WORKERS=1 pm2 restart scraper-scheduler --update-env

# 4. If module import error, check for syntax errors
python -c "import scrapers; from scripts import scraper_scheduler"
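To spot high restart counts programmatically, the JSON from `pm2 jlist` can be scanned. A hedged sketch; note that `restart_time` in PM2's JSON is a cumulative counter, not a per-10-minute rate, so compare snapshots over time to match the storm thresholds:

```python
import json

def high_restart_processes(pm2_jlist_output: str, threshold: int = 5):
    """Return (name, restarts) pairs for processes whose cumulative PM2
    restart count exceeds threshold. Feed it the raw output of `pm2 jlist`."""
    processes = json.loads(pm2_jlist_output)
    return [
        (p["name"], p["pm2_env"]["restart_time"])
        for p in processes
        if p.get("pm2_env", {}).get("restart_time", 0) > threshold
    ]
```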

2. Process Down (Not Running)

Symptoms:

  • PM2 shows stopped or errored status
  • Systemd shows inactive or failed
  • No scraper runs in schedule_log.json

Diagnosis:

# PM2
pm2 list
pm2 logs scraper-scheduler --err

# Systemd
sudo systemctl status scraper-fleet
sudo journalctl -u scraper-fleet -n 50

Remediation:

# PM2
pm2 start ecosystem.scraper.json

# Systemd
sudo systemctl start scraper-fleet

# Verify
sleep 5 && pm2 list  # or systemctl status scraper-fleet

3. Stalled Scheduler (No Scrapes Running)

Symptoms:

  • Scheduler process is running but no scrape activity
  • schedule_log.json shows old timestamps
  • No new products being ingested

Diagnosis:

# Check if scheduler is actually running scrapes
tail -100 logs/pm2-scraper-scheduler-out.log

# Check schedule log
python3 -m json.tool schedule_log.json | grep -A5 "last_run"

# Check lock files (should be cleaned up after each run)
ls -la /tmp/buywhere_scraper_locks/

Remediation:

# If stale locks exist, remove them
rm -f /tmp/buywhere_scraper_locks/*.lock

# Restart the scheduler
pm2 restart scraper-scheduler

# Or run a one-shot manually
python -m scripts.scraper_scheduler --once
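Blindly deleting every lock can race with an in-flight scrape. A safer variant removes only locks older than a cutoff; a sketch (the two-hour default is an assumption, tune it to your longest scrape):

```python
import os
import time

def remove_stale_locks(lock_dir: str, max_age_hours: float = 2.0):
    """Delete *.lock files older than max_age_hours; return the paths removed."""
    cutoff = time.time() - max_age_hours * 3600
    removed = []
    for name in os.listdir(lock_dir):
        if not name.endswith(".lock"):
            continue
        path = os.path.join(lock_dir, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(path)
    return removed
```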

4. High Failure Rate (>50% Products Failed)

Symptoms:

  • Scrapers run but most products fail ingestion
  • rows_failed is high in schedule_log.json
  • API returns 4xx/5xx errors

Diagnosis:

# Check recent ingestion logs
grep -i "failed\|error" logs/catalog_ingest_*.log | tail -50

# Test API connectivity
curl -s http://localhost:8000/health

# Check API key validity
curl -s -H "Authorization: Bearer $API_KEY" http://localhost:8000/v1/products | head -c 200

Remediation:

  1. Verify API key has not expired
  2. Check API rate limits
  3. Check database connection pool settings
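The >50% threshold from the Quick Reference can be checked directly from a schedule log entry's counters. A sketch, assuming each entry carries a success counter alongside the rows_failed field mentioned above (the rows_ok name is an assumption):

```python
def is_high_failure(rows_ok: int, rows_failed: int, threshold: float = 0.5) -> bool:
    """True when the failed fraction of processed rows exceeds threshold."""
    total = rows_ok + rows_failed
    if total == 0:
        return False  # nothing processed: that's a stall, not a failure-rate problem
    return rows_failed / total > threshold
```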

5. Rate Limiting by Target Site

Symptoms:

  • Scrapers get HTTP 429 responses
  • Increasing wait times between retries
  • Slow progress or no new products

Diagnosis:

# Check scraper logs for rate limit messages
grep -i "rate.limit\|429\|wait" logs/*scraper*.log | tail -30

Remediation:

  1. Increase delay between requests (edit scraper module)
  2. Use proxy rotation (configure in scraper)
  3. Reduce concurrent workers
  4. Contact site for API partnership

Health Monitoring

Health Endpoint

Each scraper worker exposes a health endpoint:

# Check health
curl http://localhost:9090/health

# Check liveness (simple alive check)
curl http://localhost:9090/health/live

# Check readiness (can accept work)
curl http://localhost:9090/health/ready
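The two probes answer different questions, so a supervisor should react to them differently. An illustrative decision helper (the names and suggested actions are assumptions, not part of the health server):

```python
def health_action(live: bool, ready: bool) -> str:
    """Suggest a supervisor action from liveness/readiness probe results."""
    if not live:
        return "restart"  # process is dead or wedged
    if not ready:
        return "wait"     # alive but not accepting work yet (e.g. warming up)
    return "none"         # healthy
```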

PM2 Key Metrics

# Monitor in real-time
pm2 monit

# View detailed info
pm2 info scraper-scheduler

# Check memory usage
pm2 list | grep scraper

Alerting

Paperclip Integration

Restart storms and process down events automatically create issues in Paperclip:

  1. Restart Storm Warning (>5 restarts/10min) → Medium priority issue
  2. Restart Storm Critical (>10 restarts/10min) → High priority issue
  3. Process Down (>5min) → High priority issue
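The alert-to-priority mapping above is mechanical and fits in one table. A sketch of how the monitor might pick a Paperclip priority (the keys mirror the alert names listed above; the function itself is illustrative):

```python
ALERT_PRIORITY = {
    "restart_storm_warning": "medium",
    "restart_storm_critical": "high",
    "process_down": "high",
}

def paperclip_priority(alert_type: str) -> str:
    """Look up the issue priority for an alert type; unknown alerts default low."""
    return ALERT_PRIORITY.get(alert_type, "low")
```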

Manual Alert Check

# Check recent Paperclip alerts
curl -s -H "Authorization: Bearer $PAPERCLIP_API_KEY" \
  "$PAPERCLIP_API_URL/api/issues?status=open&labels=alert" | \
  python3 -m json.tool

Escalation

If the issue cannot be resolved within 30 minutes:

  1. Check parent issue BUY-1281 for context
  2. Escalate to VP DevOps (Bolt) via Paperclip
  3. If critical revenue impact, escalate to CTO (Rex)

Files Reference

File                                  Purpose
ecosystem.scraper.json                PM2 configuration
config/scraper-fleet.service          Systemd unit file
scripts/health_check_server.py        Health endpoint server
scripts/monitor_scraper_fleet.py      Restart storm monitor
scripts/scraper_scheduler.py          Main scheduler script
logs/pm2-*.log                        PM2 stdout/stderr logs
/tmp/buywhere_scraper_locks/*.lock    Platform lock files

Appendix: PM2 Ecosystem Config

{
  "apps": [{
    "name": "scraper-scheduler",
    "script": "python",
    "args": "-m scripts.scraper_scheduler --continuous",
    "instances": 1,
    "autorestart": true,
    "max_restarts": 10,
    "min_uptime": "30s",
    "exp_backoff_restart_delay": 1000,
    "error_file": "logs/pm2-scraper-scheduler-error.log",
    "out_file": "logs/pm2-scraper-scheduler-out.log"
  }]
}

Key PM2 settings:

  • max_restarts: 10 — Maximum consecutive unstable restarts (exits before min_uptime) before PM2 gives up
  • min_uptime: 30s — Process must run at least 30s to be considered stable
  • exp_backoff_restart_delay: 1000 — Exponential backoff starts at 1s, doubles each restart
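To see what that backoff schedule implies in practice, here is a sketch of the delays before the first few restarts, assuming the delay doubles each time as described and is capped at 15 s (PM2's documented ceiling; treat the cap as an assumption if your version differs):

```python
def backoff_delays_ms(initial_ms: int = 1000, restarts: int = 6, cap_ms: int = 15000):
    """Delay in ms applied before each of the first `restarts` restarts,
    doubling from initial_ms and capped at cap_ms."""
    delays = []
    delay = initial_ms
    for _ in range(restarts):
        delays.append(min(delay, cap_ms))
        delay *= 2
    return delays
```

With the defaults above this yields 1 s, 2 s, 4 s, 8 s, then the cap, so a persistently crashing process settles into one restart attempt every 15 seconds rather than a tight loop.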