
Scraper Fleet Runbook

Status: Active | Last Updated: 2026-04-16 | Owner: DevOps (Pipe agent)


Overview

This runbook covers common failure modes for the BuyWhere scraper worker fleet, diagnosis procedures, and remediation steps.


Quick Reference

Alert Type                                Severity   Action
Restart storm (>5 restarts/10min)         Warning    Investigate logs, check rate limits
Restart storm (>10 restarts/10min)        Critical   Immediate investigation required
Process down >5min                        Critical   Restart or escalate
Stalled scrape (>24h since last run)      Warning    Check scheduler health
High failure rate (>50% products failed)  Warning    Investigate scraping logic
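The restart-storm thresholds above are simple enough to express as a classifier. A minimal sketch (the function name is illustrative, not part of the actual monitor):

```python
from typing import Optional

def restart_storm_severity(restarts_per_10min: int) -> Optional[str]:
    """Map a 10-minute restart count to the alert severity in the table above."""
    if restarts_per_10min > 10:
        return "critical"
    if restarts_per_10min > 5:
        return "warning"
    return None  # within normal churn, no alert
```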

Process Supervision

PM2 (Recommended for development/local)

# Install PM2
npm install -g pm2

# Start the scraper scheduler with PM2
pm2 start ecosystem.scraper.json

# View logs
pm2 logs scraper-scheduler

# View restart count
pm2 list

# Restart a process
pm2 restart scraper-scheduler

# Stop a process
pm2 stop scraper-scheduler

# Delete a process
pm2 delete scraper-scheduler

Systemd (Recommended for production)

# Install the service
sudo cp config/scraper-fleet.service /etc/systemd/system/
sudo systemctl daemon-reload

# Enable and start
sudo systemctl enable scraper-fleet
sudo systemctl start scraper-fleet

# Check status
sudo systemctl status scraper-fleet

# View logs
sudo journalctl -u scraper-fleet -f

# Restart
sudo systemctl restart scraper-fleet

Common Failure Modes

1. Restart Storm (Process Keeps Crashing)

Symptoms:

  • PM2 shows high restart count for scraper-scheduler
  • Logs show repeated "Starting scrape" followed by crashes
  • Paperclip alert: restart_storm_warning or restart_storm_critical

Diagnosis:

# Check PM2 restart count
pm2 list

# View recent error logs
pm2 logs scraper-scheduler --err --lines 100

# Check for Python exceptions
grep -i "exception\|error\|traceback" logs/pm2-scraper-scheduler-error.log | tail -50

Common Causes:

  1. Database connection failure (check PostgreSQL)
  2. Missing environment variables (API keys)
  3. Import errors in scraper modules
  4. Memory exhaustion (OOM killer)

Remediation:

# 1. Check database connectivity
psql -h localhost -U buywhere -d catalog -c "SELECT 1"

# 2. Check environment
env | grep -E "DATABASE|REDIS|API"

# 3. If memory issue, restart with a smaller worker count
#    (pm2's --env flag selects an env block from the ecosystem file;
#    to override a variable, set it in the shell and use --update-env)
SCRAPER_WORKERS=1 pm2 restart scraper-scheduler --update-env

# 4. If module import error, check for syntax errors
python -c "import scrapers; from scripts import scraper_scheduler"
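To spot high restart counts programmatically, the JSON from `pm2 jlist` can be scanned. A hedged sketch; note that `restart_time` in PM2's JSON is a cumulative counter, not a per-10-minute rate, so compare snapshots over time to match the storm thresholds:

```python
import json

def high_restart_processes(pm2_jlist_output: str, threshold: int = 5):
    """Return (name, restarts) pairs for processes whose cumulative PM2
    restart count exceeds threshold. Feed it the raw output of `pm2 jlist`."""
    processes = json.loads(pm2_jlist_output)
    return [
        (p["name"], p["pm2_env"]["restart_time"])
        for p in processes
        if p.get("pm2_env", {}).get("restart_time", 0) > threshold
    ]
```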

2. Process Down (Not Running)

Symptoms:

  • PM2 shows stopped or errored status
  • Systemd shows inactive or failed
  • No scraper runs in schedule_log.json

Diagnosis:

# PM2
pm2 list
pm2 logs scraper-scheduler --err

# Systemd
sudo systemctl status scraper-fleet
sudo journalctl -u scraper-fleet -n 50

Remediation:

# PM2
pm2 start ecosystem.scraper.json

# Systemd
sudo systemctl start scraper-fleet

# Verify
sleep 5 && pm2 list  # or systemctl status scraper-fleet

3. Stalled Scheduler (No Scrapes Running)

Symptoms:

  • Scheduler process is running but no scrape activity
  • schedule_log.json shows old timestamps
  • No new products being ingested

Diagnosis:

# Check if scheduler is actually running scrapes
tail -100 logs/pm2-scraper-scheduler-out.log

# Check schedule log
python3 -m json.tool schedule_log.json | grep -A5 "last_run"

# Check lock files (should be cleaned up after each run)
ls -la /tmp/buywhere_scraper_locks/

Remediation:

# If stale locks exist, remove them
rm -f /tmp/buywhere_scraper_locks/*.lock

# Restart the scheduler
pm2 restart scraper-scheduler

# Or run a one-shot manually
python -m scripts.scraper_scheduler --once
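Blindly deleting every lock can race with an in-flight scrape. A safer variant removes only locks older than a cutoff; a sketch (the two-hour default is an assumption, tune it to your longest scrape):

```python
import os
import time

def remove_stale_locks(lock_dir: str, max_age_hours: float = 2.0):
    """Delete *.lock files older than max_age_hours; return the paths removed."""
    cutoff = time.time() - max_age_hours * 3600
    removed = []
    for name in os.listdir(lock_dir):
        if not name.endswith(".lock"):
            continue
        path = os.path.join(lock_dir, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(path)
    return removed
```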

4. High Failure Rate (>50% Products Failed)

Symptoms:

  • Scrapers run but most products fail ingestion
  • rows_failed is high in schedule_log.json
  • API returns 4xx/5xx errors

Diagnosis:

# Check recent ingestion logs
grep -i "failed\|error" logs/catalog_ingest_*.log | tail -50

# Test API connectivity
curl -s http://localhost:8000/health

# Check API key validity
curl -s -H "Authorization: Bearer $API_KEY" http://localhost:8000/v1/products | head -c 200

Remediation:

  1. Verify API key has not expired
  2. Check API rate limits
  3. Check database connection pool settings
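The >50% threshold from the Quick Reference can be checked directly from a schedule log entry's counters. A sketch, assuming each entry carries a success counter alongside the rows_failed field mentioned above (the rows_ok name is an assumption):

```python
def is_high_failure(rows_ok: int, rows_failed: int, threshold: float = 0.5) -> bool:
    """True when the failed fraction of processed rows exceeds threshold."""
    total = rows_ok + rows_failed
    if total == 0:
        return False  # nothing processed: that's a stall, not a failure-rate problem
    return rows_failed / total > threshold
```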

5. Rate Limiting by Target Site

Symptoms:

  • Scrapers get HTTP 429 responses
  • Increasing wait times between retries
  • Slow progress or no new products

Diagnosis:

# Check scraper logs for rate limit messages
grep -i "rate.limit\|429\|wait" logs/*scraper*.log | tail -30

Remediation:

  1. Increase delay between requests (edit scraper module)
  2. Use proxy rotation (configure in scraper)
  3. Reduce concurrent workers
  4. Contact site for API partnership

Health Monitoring

Health Endpoint

Each scraper worker exposes a health endpoint:

# Check health
curl http://localhost:9090/health

# Check liveness (simple alive check)
curl http://localhost:9090/health/live

# Check readiness (can accept work)
curl http://localhost:9090/health/ready
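The two probes answer different questions, so a supervisor should react to them differently. An illustrative decision helper (the names and suggested actions are assumptions, not part of the health server):

```python
def health_action(live: bool, ready: bool) -> str:
    """Suggest a supervisor action from liveness/readiness probe results."""
    if not live:
        return "restart"  # process is dead or wedged
    if not ready:
        return "wait"     # alive but not accepting work yet (e.g. warming up)
    return "none"         # healthy
```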

PM2 Key Metrics

# Monitor in real-time
pm2 monit

# View detailed info
pm2 info scraper-scheduler

# Check memory usage
pm2 list | grep scraper

Alerting

Paperclip Integration

Restart storms and process down events automatically create issues in Paperclip:

  1. Restart Storm Warning (>5 restarts/10min) → Medium priority issue
  2. Restart Storm Critical (>10 restarts/10min) → High priority issue
  3. Process Down (>5min) → High priority issue
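The alert-to-priority mapping above is mechanical and fits in one table. A sketch of how the monitor might pick a Paperclip priority (the keys mirror the alert names listed above; the function itself is illustrative):

```python
ALERT_PRIORITY = {
    "restart_storm_warning": "medium",
    "restart_storm_critical": "high",
    "process_down": "high",
}

def paperclip_priority(alert_type: str) -> str:
    """Look up the issue priority for an alert type; unknown alerts default low."""
    return ALERT_PRIORITY.get(alert_type, "low")
```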

Manual Alert Check

# Check recent Paperclip alerts
curl -s -H "Authorization: Bearer $PAPERCLIP_API_KEY" \
  "$PAPERCLIP_API_URL/api/issues?status=open&labels=alert" | \
  python3 -m json.tool

Escalation

If the issue cannot be resolved within 30 minutes:

  1. Check parent issue BUY-1281 for context
  2. Escalate to VP DevOps (Bolt) via Paperclip
  3. If critical revenue impact, escalate to CTO (Rex)

Files Reference

File                                  Purpose
ecosystem.scraper.json                PM2 configuration
config/scraper-fleet.service          Systemd unit file
scripts/health_check_server.py        Health endpoint server
scripts/monitor_scraper_fleet.py      Restart storm monitor
scripts/scraper_scheduler.py          Main scheduler script
logs/pm2-*.log                        PM2 stdout/stderr logs
/tmp/buywhere_scraper_locks/*.lock    Platform lock files

Appendix: PM2 Ecosystem Config

{
  "apps": [{
    "name": "scraper-scheduler",
    "script": "python",
    "args": "-m scripts.scraper_scheduler --continuous",
    "instances": 1,
    "autorestart": true,
    "max_restarts": 10,
    "min_uptime": "30s",
    "exp_backoff_restart_delay": 1000,
    "error_file": "logs/pm2-scraper-scheduler-error.log",
    "out_file": "logs/pm2-scraper-scheduler-out.log"
  }]
}

Key PM2 settings:

  • max_restarts: 10 — Maximum consecutive unstable restarts (exits before min_uptime) before PM2 gives up
  • min_uptime: 30s — Process must run at least 30s to be considered stable
  • exp_backoff_restart_delay: 1000 — Exponential backoff starts at 1s, doubles each restart
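To see what that backoff schedule implies in practice, here is a sketch of the delays before the first few restarts, assuming the delay doubles each time as described and is capped at 15 s (PM2's documented ceiling; treat the cap as an assumption if your version differs):

```python
def backoff_delays_ms(initial_ms: int = 1000, restarts: int = 6, cap_ms: int = 15000):
    """Delay in ms applied before each of the first `restarts` restarts,
    doubling from initial_ms and capped at cap_ms."""
    delays = []
    delay = initial_ms
    for _ in range(restarts):
        delays.append(min(delay, cap_ms))
        delay *= 2
    return delays
```

With the defaults above this yields 1 s, 2 s, 4 s, 8 s, then the cap, so a persistently crashing process settles into one restart attempt every 15 seconds rather than a tight loop.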