BUY-3442: Log Retention and Alert Baseline for Product API
Status
Baseline defined — implementation partial
1. Scope
- Service: Product API (buywhere-api, all /v1/products, /v2/products, /compare endpoints)
- Proxy-free: Excludes sources using ScraperAPI or upstream proxies
- Launch-relevant: Focused on error classes and thresholds that directly impact launch readiness and user-facing product quality
2. Log Retention Windows
| Layer | Current | Baseline | Notes |
|---|---|---|---|
| Hot (Loki ingester) | 720h (30d) | 720h (30d) | No change; adequate for launch |
| Warm (S3 boltdb-shipper index) | 720h (30d) | 30d | Index on S3; same as hot for now |
| Cold (S3 chunks) | 720h (30d) | 90d | BUY-2880 plan specified 90d cold; not yet implemented |
| Local rotating files | 7d standard, 10d PM2 | 7d | Adequate; Loki is source of truth |
| Analytics JSONL (api_analytics.jsonl) | 5 backups × 100MB | 30d | Should align with Loki retention |
Gap: BUY-2880 Phase 4 specified hot:7d, warm:30d, cold:90d but only 30d is configured in Loki. Cold storage tier (90d) is not yet implemented.
Action needed: Update k8s/production/loki-configmap.yaml table_manager.retention_period to 2160h (90d) for cold/archival tier, or document that S3 lifecycle rules handle the 90d requirement.
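As a rough sketch of the first option, the retention stanza in k8s/production/loki-configmap.yaml might look like the fragment below. Only table_manager.retention_period and the 2160h value come from this document; retention_deletes_enabled and the surrounding layout are assumptions about how the ConfigMap is written. Note that this setting applies a single retention period to the store as a whole rather than defining a separate cold tier.

```yaml
# Fragment of the Loki config embedded in loki-configmap.yaml (layout assumed).
table_manager:
  # Assumption: deletes must be enabled for retention_period to take effect.
  retention_deletes_enabled: true
  # 90 days x 24h = 2160h (currently 720h, i.e. 30 days).
  retention_period: 2160h
```

If the S3 lifecycle route is chosen instead, this value would stay at 720h and the 90d requirement would be documented against the bucket policy.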
3. Must-Watch Error Classes
3.1 HTTP Error Class Taxonomy
| Error Class | HTTP Codes | Root Cause | Impact |
|---|---|---|---|
| 5xx_server_error | 500, 502, 503, 504 | Backend failure, DB timeout, OOM | User-facing errors, failed reads |
| 4xx_client_error | 400, 401, 403, 404, 429 | Bad request, auth failure, rate limit, not found | Degraded functionality |
| product_not_found | 404 on /products/* | Missing product, stale ID | Empty compare results |
| ingestion_rejected | 422, 429 on write | Schema mismatch, rate limit | Data quality gaps |
| timeout_error | 504 | Upstream latency, pool exhaustion | Failed user requests |
3.2 Scraper-to-API Failure Classes (proxy-free)
| Failure Class | Detection | Owner | Launch Relevance |
|---|---|---|---|
| parse_failure | returncode != 0 after retries | Data-eng | Direct product gaps |
| data_write_failure | rows_failed > 0 with status=completed | Backend-eng | Missing products in API |
| ingestion_stall | hours_since_last_run > interval * 2 | On-call | Stale compare data |
| scheduler_stall | Lock file staleness | Platform-eng | Fleet-wide stall |
3.3 Compare Experience Errors
| Error Class | Detection | Threshold | Severity |
|---|---|---|---|
| zero_match_rate | % requests with 0 results > 30% | 30% over 30min | Warning |
| compare_cache_hit_rate | cache_hits / cache_requests < 50% | 50% over 15min | Warning |
| compare_ingestion_stale | last_ingestion_timestamp > 2h ago | 2h no ingestion | Critical |
4. Alert Thresholds — Product API
4.1 Existing PromQL Alerts (Validated)
| Alert | Expr | For | Severity | Threshold | Status |
|---|---|---|---|---|---|
| HighErrorRate | rate(http_errors_total[5m]) / rate(http_requests_total[5m]) | 5m | critical | >1% | ✅ Implemented |
| HighLatencyP95 | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) | 5m | warning | >1s | ✅ Implemented |
| HighLatencyP99 | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | 5m | critical | >2.5s | ✅ Implemented |
| APIDown | up{job="buywhere-api"} == 0 | 2m | critical | 0 = down | ✅ Implemented |
| DatabasePoolExhausted | db_connection_pool_checked_out / db_connection_pool_size >= 0.9 | 5m | warning | ≥90% | ✅ Implemented |
| ComparePageIngestionStale | (time() - avg(last_over_time(buywhere_last_ingestion_timestamp[1h]))) > 7200 | 10m | critical | >2h stale | ✅ Implemented |
| CompareMatchQualityDrop | rate(buywhere_compare_zero_matches_total[30m]) / rate(buywhere_compare_requests_total[30m]) > 0.3 | 30m | warning | >30% zero-match | ✅ Implemented |
| CompareCacheHitRateLow | rate(buywhere_cache_hits_total{endpoint="/compare"}[15m]) / rate(buywhere_cache_requests_total{endpoint="/compare"}[15m]) < 0.5 | 15m | warning | <50% | ✅ Implemented |
4.2 Loki Log-Based Alerts (Validated)
| Alert | Expr | For | Severity | Threshold | Status |
|---|---|---|---|---|---|
| HighErrorRate (Loki) | `sum(rate({job=~"buywhere-api\|all-services"} \|= "error" [5m])) by (service) / sum(rate({job=~"buywhere-api\|all-services"}[5m])) by (service) > 0.01` | 5m | critical | >1% | |
| APIServiceDown | `sum(rate({job=~"buywhere-api\|all-services"}[5m])) by (service) == 0` | 2m | critical | 0 logs | |
| NoLogsReceived | `sum(rate({job=~"buywhere-api\|all-services"}[5m])) == 0` | 5m | warning | 0 logs | |
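Because pipes inside a table cell are easy to misread, the sketch below shows how the first rule might be written in the Loki ruler's Prometheus-style rule format. It is not a copy of loki_rules/api_alerts.yaml: the group name and alert name are hypothetical, and it assumes the job selector is the regex alternation buywhere-api|all-services.

```yaml
groups:
  - name: api-log-alerts            # hypothetical group name
    rules:
      - alert: HighErrorRate        # name assumed from the table row
        expr: |
          sum(rate({job=~"buywhere-api|all-services"} |= "error" [5m])) by (service)
            /
          sum(rate({job=~"buywhere-api|all-services"}[5m])) by (service)
            > 0.01
        for: 5m
        labels:
          severity: critical
```

The `|= "error"` stage is a plain substring filter, so any log line containing the word "error" counts toward the numerator.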
4.3 Gap: Missing Product-API-Specific Alerts
| Gap | Severity | Recommendation |
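|---|---|---|
| No alert for product_not_found rate | warning | Add PromQL: `sum(rate(http_requests_total{status="404", path=~"/v[12]/products.*"}[5m])) / sum(rate(http_requests_total{path=~"/v[12]/products.*"}[5m])) > 0.1` (aggregate both sides with sum(); without it the division only pairs series that share the status="404" label, so the ratio is always 1) |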
| No alert for compare zero-match critical threshold | critical | Raise to P1: > 0.5 (50%) for 5min |
| No alert for ingestion write failure rate | warning | Add: scraper_rows_failed_total / scraper_rows_total > 0.05 |
| No structured error budget | — | Define error budget policy (see §6) |
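The three alerting gaps above could be closed with rules along the following lines. This is a sketch under stated assumptions, not the contents of prometheus_alerts.yml: the group and alert names are hypothetical, the `for` durations and the 15m window on the scraper rule are guesses, and it assumes scraper_rows_failed_total and scraper_rows_total are counters. Both sides of each ratio are wrapped in sum() so that the division does not depend on the numerator and denominator carrying identical label sets.

```yaml
groups:
  - name: product-api-gap-alerts          # hypothetical group name
    rules:
      - alert: ProductNotFoundRateHigh    # hypothetical alert name
        expr: |
          sum(rate(http_requests_total{status="404", path=~"/v[12]/products.*"}[5m]))
            /
          sum(rate(http_requests_total{path=~"/v[12]/products.*"}[5m]))
            > 0.1
        for: 5m                           # assumption: not specified above
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of product requests return 404"

      - alert: CompareZeroMatchCritical   # hypothetical alert name
        expr: |
          sum(rate(buywhere_compare_zero_matches_total[5m]))
            /
          sum(rate(buywhere_compare_requests_total[5m]))
            > 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 50% of compare requests return zero matches"

      - alert: IngestionWriteFailureRateHigh   # hypothetical alert name
        expr: |
          sum(rate(scraper_rows_failed_total[15m]))
            /
          sum(rate(scraper_rows_total[15m]))
            > 0.05
        for: 15m                          # assumption: not specified above
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of scraped rows fail to write"
```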
4.4 Alert Routing
Current: Alertmanager routes by severity label only:
critical → PagerDuty + webhook
warning → webhook (Slack #alerts)
Gap: No routing by failure_class or endpoint dimension. alert_routing_plan.md outlines failure-class routing but is unimplemented.
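A minimal sketch of what failure-class routing could look like in alertmanager.yml follows. It assumes alerts already carry a failure_class label, which is itself not yet implemented. Only critical-alerts-pagerduty is a receiver named in this document; every other receiver name is hypothetical, receiver definitions are omitted, and the existing critical fan-out to both PagerDuty and the webhook is simplified to a single receiver.

```yaml
route:
  receiver: slack-alerts-webhook            # hypothetical default receiver
  routes:
    # Failure-class routes come first. continue: true lets the severity
    # routes below still match the same alert.
    - matchers:
        - 'failure_class="parse_failure"'
      receiver: data-eng-webhook            # hypothetical
      continue: true
    - matchers:
        - 'failure_class="data_write_failure"'
      receiver: backend-eng-webhook         # hypothetical
      continue: true
    - matchers:
        - 'failure_class="scheduler_stall"'
      receiver: platform-eng-webhook        # hypothetical
      continue: true
    # Existing severity-based routing.
    - matchers:
        - 'severity="critical"'
      receiver: critical-alerts-pagerduty
    - matchers:
        - 'severity="warning"'
      receiver: slack-alerts-webhook
```

ingestion_stall has no dedicated route here because its owner in §3.2 is on-call, which the severity routes already reach.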
5. Validation Notes
| Check | Result | Notes |
|---|---|---|
| Prometheus alerts fire on 5xx | ✅ | HighErrorRate at 1% threshold confirmed |
| Loki log alerts fire on errors | ✅ | HighErrorRate Loki rule at 1% threshold confirmed |
| Log retention at 30d in Loki | ✅ | retention_period: 720h confirmed in loki-configmap.yaml |
| PM2 log rotation at 100MB / 5 backups | ✅ | Confirmed in request_logging.py (MAX_BYTES = 100 * 1024 * 1024, BACKUP_COUNT = 5) |
| Log files compressed after rotation | ✅ | compress, delaycompress in logrotate configs |
| S3 bucket for Loki: buywhere-loki-centralized | ✅ | Confirmed in loki-configmap.yaml |
| Alertmanager PagerDuty routing | ✅ | critical-alerts-pagerduty receiver configured |
| Blackbox monitoring for API / MCP / Website | ✅ | Multiple up{job="blackbox-..."} alerts confirmed |
6. Remaining Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Cold storage tier (90d) not implemented | High | Compliance gap if logs needed beyond 30d | File issue to extend Loki retention to 2160h; document S3 lifecycle rule as alternative |
| Error budget policy not defined | Medium | No SLO burn rate alerting | Create SLO document; add ErrorBudgetBurning alert at 0.5% (half of 1% budget) |
| failure_class routing not implemented | Medium | All scraper alerts route to same on-call | Implement BUY-2880 follow-up: add failure_class label to Loki/Prometheus rules |
| No structured product_not_found alerting | Low | 404s on products not distinguishable from other 404s | Add path-specific 404 alert (see §4.3) |
| Analytics JSONL retention not enforced | Low | api_analytics.jsonl rotates but not centrally retained | Confirm analytics logs flow to Loki or extend logrotate for 30d |
| Log gap between PM2 rotation and Loki pickup | Low | Brief window where logs exist only on disk | Fluent Bit reads from container logs; minor window acceptable for launch |
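The ErrorBudgetBurning mitigation could start from a rule like the one below. It reuses the http_errors_total and http_requests_total metrics from the existing HighErrorRate alert and the 0.5% threshold named in the table; the group name, the 1h window, the 15m `for` duration, and the warning severity are all assumptions that the SLO document would need to settle.

```yaml
groups:
  - name: product-api-error-budget    # hypothetical group name
    rules:
      - alert: ErrorBudgetBurning
        expr: |
          sum(rate(http_errors_total[1h]))
            /
          sum(rate(http_requests_total[1h]))
            > 0.005
        for: 15m                      # assumption
        labels:
          severity: warning           # assumption
        annotations:
          summary: "Error rate above 0.5%, half of the 1% budget"
```

A longer window than the 5m used by HighErrorRate is chosen so that this alert tracks sustained budget consumption rather than short spikes.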
7. Summary: Baseline Configuration
RETENTION
Loki (hot/warm): 30 days [adequate for launch]
Loki (cold/archival): 90 days [NOT YET IMPLEMENTED — action needed]
Local files: 7 days [adequate; Loki is source of truth]
ERROR CLASSES (must-watch)
5xx server errors → critical at >1% over 5min
P95 latency → warning at >1s over 5min
P99 latency → critical at >2.5s over 5min
API down → critical at 0 for >2min
Compare zero-match → warning at >30% over 30min; critical at >50% over 5min
Compare cache hit rate → warning at <50% over 15min
Ingestion stale → critical at >2h since last run
DB pool exhaustion → warning at ≥90% over 5min
THRESHOLDS (actionable)
All of the above except the critical compare zero-match threshold (>50% over 5min, still a gap per §4.3) are configured in prometheus_alerts.yml and loki_rules/api_alerts.yaml
Gap: product-API-specific 404 rate alert not yet added
Gap: cold storage (90d) not yet implemented
8. Files Reviewed
| File | Purpose |
|---|---|
| prometheus_alerts.yml | Prometheus alerting rules |
| loki_rules/api_alerts.yaml | Loki log-based alerting rules |
| k8s/production/loki-configmap.yaml | Loki retention config |
| k8s/production/loki-alerts-configmap.yaml | K8s Loki alerting rules |
| alertmanager.yml | Alert routing configuration |
| app/request_logging.py | API request logging middleware |
| app/logging_centralized.py | Centralized structured logger |
| config/logrotate.conf | Host-level log rotation |
| config/logrotate.host.conf | PM2 log rotation config |
| alert_routing_plan.md | Scraper alert routing plan (unimplemented) |
| docs/BUY-2880-logging-plan.md | Centralized logging plan (in progress) |