Scraper Fleet Runbook
Status: Active | Last Updated: 2026-04-16 | Owner: DevOps (Pipe agent)
Overview
This runbook covers common failure modes for the BuyWhere scraper worker fleet, diagnosis procedures, and remediation steps.
Quick Reference
| Alert Type | Severity | Action |
|---|---|---|
| Restart storm (>5 restarts/10min) | Warning | Investigate logs, check rate limits |
| Restart storm (>10 restarts/10min) | Critical | Immediate investigation required |
| Process down >5min | Critical | Restart or escalate |
| Stalled scrape (>24h since last run) | Warning | Check scheduler health |
| High failure rate (>50% products failed) | Warning | Investigate scraping logic |
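The restart-storm thresholds in the table can be checked with a simple sliding-window count. A minimal sketch of that logic (the fleet's real monitor lives in scripts/monitor_scraper_fleet.py, whose internals are not reproduced here):

```python
import time

def classify_restart_storm(restart_times, now=None, window=600):
    """Classify restart activity over the last `window` seconds (default 10 min).

    restart_times: iterable of Unix timestamps, one per restart.
    Returns "critical" (>10 restarts), "warning" (>5), or "ok".
    """
    now = time.time() if now is None else now
    recent = [t for t in restart_times if now - t <= window]
    if len(recent) > 10:
        return "critical"
    if len(recent) > 5:
        return "warning"
    return "ok"
```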
Process Supervision
PM2 (Recommended for development/local)
# Install PM2
npm install -g pm2
# Start the scraper scheduler with PM2
pm2 start ecosystem.scraper.json
# View logs
pm2 logs scraper-scheduler
# View restart count
pm2 list
# Restart a process
pm2 restart scraper-scheduler
# Stop a process
pm2 stop scraper-scheduler
# Delete a process
pm2 delete scraper-scheduler
Systemd (Recommended for production)
# Install the service
sudo cp config/scraper-fleet.service /etc/systemd/system/
sudo systemctl daemon-reload
# Enable and start
sudo systemctl enable scraper-fleet
sudo systemctl start scraper-fleet
# Check status
sudo systemctl status scraper-fleet
# View logs
sudo journalctl -u scraper-fleet -f
# Restart
sudo systemctl restart scraper-fleet
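The actual config/scraper-fleet.service is not reproduced in this runbook; a unit of roughly this shape would match the commands above (user, paths, and dependencies below are placeholders, not the deployed values):

```ini
[Unit]
Description=BuyWhere scraper fleet scheduler
After=network-online.target

[Service]
# Placeholder user and paths - match your deployment
User=buywhere
WorkingDirectory=/opt/buywhere
ExecStart=/usr/bin/python3 -m scripts.scraper_scheduler --continuous
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```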
Common Failure Modes
1. Restart Storm (Process Keeps Crashing)
Symptoms:
- PM2 shows a high restart count for scraper-scheduler
- Logs show repeated "Starting scrape" followed by crashes
- Paperclip alert: restart_storm_warning or restart_storm_critical
Diagnosis:
# Check PM2 restart count
pm2 list
# View recent error logs
pm2 logs scraper-scheduler --err --lines 100
# Check for Python exceptions
grep -i "exception\|error\|traceback" logs/pm2-scraper-scheduler-error.log | tail -50
Common Causes:
- Database connection failure (check PostgreSQL)
- Missing environment variables (API keys)
- Import errors in scraper modules
- Memory exhaustion (OOM killer)
Remediation:
# 1. Check database connectivity
psql -h localhost -U buywhere -d catalog -c "SELECT 1"
# 2. Check environment
env | grep -E "DATABASE|REDIS|API"
# 3. If memory issue, restart with a smaller worker count
#    (pm2's --env selects a named env block, so pass the variable via the shell)
SCRAPER_WORKERS=1 pm2 restart scraper-scheduler --update-env
# 4. If module import error, check for syntax errors
python -c "import scrapers; from scripts import scraper_scheduler"
2. Process Down (Not Running)
Symptoms:
- PM2 shows stopped or errored status
- Systemd shows inactive or failed
- No scraper runs recorded in schedule_log.json
Diagnosis:
# PM2
pm2 list
pm2 logs scraper-scheduler --err
# Systemd
sudo systemctl status scraper-fleet
sudo journalctl -u scraper-fleet -n 50
Remediation:
# PM2
pm2 start ecosystem.scraper.json
# Systemd
sudo systemctl start scraper-fleet
# Verify
sleep 5 && pm2 list # or systemctl status scraper-fleet
3. Stalled Scheduler (No Scrapes Running)
Symptoms:
- Scheduler process is running but no scrape activity
- schedule_log.json shows old timestamps
- No new products being ingested
Diagnosis:
# Check if scheduler is actually running scrapes
tail -100 logs/pm2-scraper-scheduler-out.log
# Check schedule log
cat schedule_log.json | python3 -m json.tool | grep -A5 "last_run"
# Check lock files (should be cleaned up after each run)
ls -la /tmp/buywhere_scraper_locks/
Remediation:
# If stale locks exist, remove them
rm -f /tmp/buywhere_scraper_locks/*.lock
# Restart the scheduler
pm2 restart scraper-scheduler
# Or run a one-shot manually
python -m scripts.scraper_scheduler --once
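Rather than deleting every lock unconditionally, a safer cleanup removes only locks older than a cutoff, so a scrape that is legitimately in flight keeps its lock. A sketch (the one-hour threshold is an assumption, not a documented value):

```python
import time
from pathlib import Path

def remove_stale_locks(lock_dir="/tmp/buywhere_scraper_locks", max_age=3600):
    """Delete *.lock files not modified in `max_age` seconds; return removed paths."""
    removed = []
    now = time.time()
    for lock in Path(lock_dir).glob("*.lock"):
        if now - lock.stat().st_mtime > max_age:
            lock.unlink()
            removed.append(str(lock))
    return removed
```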
4. High Failure Rate (>50% Products Failed)
Symptoms:
- Scrapers run but most products fail ingestion
- rows_failed is high in schedule_log.json
- API returns 4xx/5xx errors
Diagnosis:
# Check recent ingestion logs
grep -i "failed\|error" logs/catalog_ingest_*.log | tail -50
# Test API connectivity
curl -s http://localhost:8000/health
# Check API key validity
curl -s -H "Authorization: Bearer $API_KEY" http://localhost:8000/v1/products | head -c 200
Remediation:
- Verify API key has not expired
- Check API rate limits
- Check database connection pool settings
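To put a number on "most products fail", the per-run counters in schedule_log.json can be aggregated. A sketch assuming each run entry carries rows_failed and rows_ingested fields (the exact schema may differ from this assumption):

```python
import json

def failure_rate(schedule_log_path):
    """Return failed / (failed + ingested) across all runs, or 0.0 if no rows."""
    with open(schedule_log_path) as f:
        runs = json.load(f)
    failed = sum(r.get("rows_failed", 0) for r in runs)
    ingested = sum(r.get("rows_ingested", 0) for r in runs)
    total = failed + ingested
    return failed / total if total else 0.0
```

A rate above 0.5 corresponds to the >50% warning threshold in the Quick Reference table.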
5. Rate Limiting by Target Site
Symptoms:
- Scrapers get HTTP 429 responses
- Increasing wait times between retries
- Slow progress or no new products
Diagnosis:
# Check scraper logs for rate limit messages
grep -i "rate.limit\|429\|wait" logs/*scraper*.log | tail -30
Remediation:
- Increase delay between requests (edit scraper module)
- Use proxy rotation (configure in scraper)
- Reduce concurrent workers
- Contact site for API partnership
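Increasing the delay between requests usually means exponential backoff with jitter, so retries from multiple workers do not land in lockstep. A generic sketch of that schedule, not the fleet's actual retry code:

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5, jitter=random.random):
    """Yield exponentially growing delays with full jitter, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt) * jitter()
```

In use, a scraper would sleep for each yielded delay before retrying a request that returned HTTP 429 (and should prefer the site's Retry-After header when one is present).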
Health Monitoring
Health Endpoint
Each scraper worker exposes a health endpoint:
# Check health
curl http://localhost:9090/health
# Check liveness (simple alive check)
curl http://localhost:9090/health/live
# Check readiness (can accept work)
curl http://localhost:9090/health/ready
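The actual scripts/health_check_server.py is not shown here; a minimal stand-in answering the three routes above could look like the following (the readiness check is a placeholder — a real worker would verify it can reach its database and queue):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    ROUTES = {"/health", "/health/live", "/health/ready"}

    def do_GET(self):
        if self.path not in self.ROUTES:
            self.send_error(404)
            return
        # Placeholder: always healthy; swap in real liveness/readiness checks
        body = json.dumps({"status": "ok", "path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve on the port the runbook probes:
# HTTPServer(("0.0.0.0", 9090), HealthHandler).serve_forever()
```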
PM2 Key Metrics
# Monitor in real-time
pm2 monit
# View detailed info
pm2 info scraper-scheduler
# Check memory usage
pm2 list | grep scraper
Alerting
Paperclip Integration
Restart storms and process down events automatically create issues in Paperclip:
- Restart Storm Warning (>5 restarts/10min) → Medium priority issue
- Restart Storm Critical (>10 restarts/10min) → High priority issue
- Process Down (>5min) → High priority issue
Manual Alert Check
# Check recent Paperclip alerts
curl -s -H "Authorization: Bearer $PAPERCLIP_API_KEY" \
"$PAPERCLIP_API_URL/api/issues?status=open&labels=alert" | \
python3 -m json.tool
Escalation
If issue cannot be resolved within 30 minutes:
- Check parent issue BUY-1281 for context
- Escalate to VP DevOps (Bolt) via Paperclip
- If critical revenue impact, escalate to CTO (Rex)
Files Reference
| File | Purpose |
|---|---|
| ecosystem.scraper.json | PM2 configuration |
| config/scraper-fleet.service | Systemd unit file |
| scripts/health_check_server.py | Health endpoint server |
| scripts/monitor_scraper_fleet.py | Restart storm monitor |
| scripts/scraper_scheduler.py | Main scheduler script |
| logs/pm2-*.log | PM2 stdout/stderr logs |
| /tmp/buywhere_scraper_locks/*.lock | Platform lock files |
Appendix: PM2 Ecosystem Config
{
"apps": [{
"name": "scraper-scheduler",
"script": "python",
"args": "-m scripts.scraper_scheduler --continuous",
"instances": 1,
"autorestart": true,
"max_restarts": 10,
"min_uptime": "30s",
"exp_backoff_restart_delay": 1000,
"error_file": "logs/pm2-scraper-scheduler-error.log",
"out_file": "logs/pm2-scraper-scheduler-out.log"
}]
}
Key PM2 settings:
- max_restarts: 10 — Maximum restarts before PM2 stops trying
- min_uptime: 30s — Process must run at least 30s to be considered stable
- exp_backoff_restart_delay: 1000 — Exponential backoff starts at 1s, doubling on each restart