BUY-3442: Log Retention and Alert Baseline for Product API

Status

Baseline defined — implementation partial


1. Scope

  • Service: Product API (buywhere-api, all /v1/products, /v2/products, /compare endpoints)
  • Proxy-free: Excludes sources using ScraperAPI or upstream proxies
  • Launch-relevant: Focused on error classes and thresholds that directly impact launch readiness and user-facing product quality

2. Log Retention Windows

Layer | Current | Baseline | Notes
Hot (Loki ingester) | 720h (30d) | 720h (30d) | No change; adequate for launch
Warm (S3 boltdb-shipper index) | 720h (30d) | 30d | Index on S3; same as hot for now
Cold (S3 chunks) | 720h (30d) | 90d | BUY-2880 plan specified 90d cold; not yet implemented
Local rotating files | 7d standard, 10d PM2 | 7d | Adequate; Loki is source of truth
Analytics JSONL (api_analytics.jsonl) | 5 backups × 100MB | 30d | Should align with Loki retention

Gap: BUY-2880 Phase 4 specified hot:7d, warm:30d, cold:90d but only 30d is configured in Loki. Cold storage tier (90d) is not yet implemented.

Action needed: Set table_manager.retention_period in k8s/production/loki-configmap.yaml to 2160h (90d) for the cold/archival tier, or document that S3 lifecycle rules cover the 90d requirement.
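The 2160h figure follows directly from the 90d baseline. A minimal sketch of that arithmetic and of the gap check, assuming the single 720h retention currently applies to every Loki tier; the names BASELINE_DAYS, days_to_loki_hours and retention_gaps are illustrative and do not come from the repository.

```python
# Baseline retention per tier in days, taken from the table in section 2.
BASELINE_DAYS = {"hot": 30, "warm": 30, "cold": 90}

# Retention currently configured in Loki (720h), assumed to apply to all tiers.
CURRENT_HOURS = 720


def days_to_loki_hours(days: int) -> str:
    """Render a retention window in days as an hour-based duration string."""
    return f"{days * 24}h"


def retention_gaps(current_hours: int, baseline_days: dict[str, int]) -> dict[str, str]:
    """Return each tier whose baseline exceeds the configured retention,
    mapped to the value the baseline would require."""
    return {
        tier: days_to_loki_hours(days)
        for tier, days in baseline_days.items()
        if days * 24 > current_hours
    }
```

Calling retention_gaps(CURRENT_HOURS, BASELINE_DAYS) reports only the cold tier, with a target of 2160h.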


3. Must-Watch Error Classes

3.1 HTTP Error Class Taxonomy

Error Class | HTTP Codes | Root Cause | Impact
5xx_server_error | 500, 502, 503, 504 | Backend failure, DB timeout, OOM | User-facing errors, failed reads
4xx_client_error | 400, 401, 403, 404, 429 | Bad request, auth failure, rate limit, not found | Degraded functionality
product_not_found | 404 on /products/* | Missing product, stale ID | Empty compare results
ingestion_rejected | 422, 429 on write | Schema mismatch, rate limit | Data quality gaps
timeout_error | 504 | Upstream latency, pool exhaustion | Failed user requests
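The classes in this taxonomy overlap (a 504 is both a 5xx_server_error and a timeout_error), so a classifier has to return a set of classes rather than one. The sketch below is one possible reading of the table, not the service's actual logic: it assumes "write" means POST, PUT or PATCH, and that "/products/*" means any path containing "/products/". The function name classify_error is hypothetical.

```python
# Assumption: these HTTP methods count as "write" requests.
WRITE_METHODS = {"POST", "PUT", "PATCH"}


def classify_error(status: int, path: str, method: str = "GET") -> list[str]:
    """Return every error class from the taxonomy that a response falls into."""
    classes = []
    if status in (500, 502, 503, 504):
        classes.append("5xx_server_error")
    if status == 504:
        classes.append("timeout_error")
    if status in (400, 401, 403, 404, 429):
        classes.append("4xx_client_error")
    # Assumption: "/products/*" is matched as a substring of the path.
    if status == 404 and "/products/" in path:
        classes.append("product_not_found")
    if method.upper() in WRITE_METHODS and status in (422, 429):
        classes.append("ingestion_rejected")
    return classes
```

Note that 422 is not listed under 4xx_client_error in the table, so a 422 on a write is reported as ingestion_rejected only.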

3.2 Scraper-to-API Failure Classes (proxy-free)

Failure Class | Detection | Owner | Launch Relevance
parse_failure | returncode != 0 after retries | Data-eng | Direct product gaps
data_write_failure | rows_failed > 0 with status=completed | Backend-eng | Missing products in API
ingestion_stall | hours_since_last_run > interval * 2 | On-call | Stale compare data
scheduler_stall | Lock file staleness | Platform-eng | Fleet-wide stall
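The first three detection rules can be evaluated from a single run record. A minimal sketch, assuming a hypothetical ScraperRun record whose field names mirror the table (returncode, status, rows_failed, hours_since_last_run); interval_hours and retries_exhausted are added assumptions. scheduler_stall is left out because the table gives no staleness threshold for the lock file.

```python
from dataclasses import dataclass


@dataclass
class ScraperRun:
    """Hypothetical run record; field names follow the detection column."""
    returncode: int
    status: str
    rows_failed: int
    hours_since_last_run: float
    interval_hours: float
    retries_exhausted: bool = True  # assumption: set once all retries are used


def detect_failures(run: ScraperRun) -> list[str]:
    """Return the failure classes whose detection rule matches this run."""
    failures = []
    if run.retries_exhausted and run.returncode != 0:
        failures.append("parse_failure")
    # A run can report success while still dropping rows on write.
    if run.status == "completed" and run.rows_failed > 0:
        failures.append("data_write_failure")
    if run.hours_since_last_run > run.interval_hours * 2:
        failures.append("ingestion_stall")
    return failures
```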

3.3 Compare Experience Errors

Error Class | Detection | Threshold | Severity
zero_match_rate | % requests with 0 results > 30% | 30% over 30min | Warning
compare_cache_hit_rate | cache_hits / cache_requests < 50% | 50% over 15min | Warning
compare_ingestion_stale | last_ingestion_timestamp > 2h ago | 2h no ingestion | Critical
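As an illustration of how these three thresholds combine, the sketch below evaluates them against counts that are assumed to be already aggregated over the relevant windows (30min for zero-match, 15min for cache). The function and constant names are hypothetical, and the production alerts are expressed in PromQL rather than Python.

```python
from typing import Optional

ZERO_MATCH_WARN = 0.30             # warning above 30% zero-result requests
CACHE_HIT_WARN = 0.50              # warning below 50% cache hit rate
INGESTION_STALE_SECONDS = 2 * 3600  # critical after 2h without ingestion


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when there was no traffic."""
    return numerator / denominator if denominator else None


def evaluate_compare_health(
    zero_match: int,
    compare_requests: int,
    cache_hits: int,
    cache_requests: int,
    seconds_since_ingestion: float,
) -> dict[str, str]:
    """Map each breached error class to its severity."""
    alerts = {}
    zero_match_rate = ratio(zero_match, compare_requests)
    if zero_match_rate is not None and zero_match_rate > ZERO_MATCH_WARN:
        alerts["zero_match_rate"] = "warning"
    hit_rate = ratio(cache_hits, cache_requests)
    if hit_rate is not None and hit_rate < CACHE_HIT_WARN:
        alerts["compare_cache_hit_rate"] = "warning"
    if seconds_since_ingestion > INGESTION_STALE_SECONDS:
        alerts["compare_ingestion_stale"] = "critical"
    return alerts
```

Returning None for an empty window keeps a quiet period from being reported as a 0% cache hit rate.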

4. Alert Thresholds — Product API

4.1 Existing PromQL Alerts (Validated)

Alert | Expr | For | Severity | Threshold | Status
HighErrorRate | rate(http_errors_total[5m]) / rate(http_requests_total[5m]) | 5m | critical | >1% | ✅ Implemented
HighLatencyP95 | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) | 5m | warning | >1s | ✅ Implemented
HighLatencyP99 | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | 5m | critical | >2.5s | ✅ Implemented
APIDown | up{job="buywhere-api"} == 0 | 2m | critical | 0 = down | ✅ Implemented
DatabasePoolExhausted | db_connection_pool_checked_out / db_connection_pool_size >= 0.9 | 5m | warning | ≥90% | ✅ Implemented
ComparePageIngestionStale | (time() - avg(last_over_time(buywhere_last_ingestion_timestamp[1h]))) > 7200 | 10m | critical | >2h stale | ✅ Implemented
CompareMatchQualityDrop | rate(buywhere_compare_zero_matches_total[30m]) / rate(buywhere_compare_requests_total[30m]) > 0.3 | 30m | warning | >30% zero-match | ✅ Implemented
CompareCacheHitRateLow | rate(buywhere_cache_hits_total{endpoint="/compare"}[15m]) / rate(buywhere_cache_requests_total{endpoint="/compare"}[15m]) < 0.5 | 15m | warning | <50% | ✅ Implemented

4.2 Loki Log-Based Alerts (Validated)

Alert | Expr | For | Severity | Threshold | Status
HighErrorRate (Loki) | `sum(rate({job=~"buywhere-apiall-services"} |= "error" [5m])) by (service) / sum(rate({job=~"buywhere-apiall-services"}[5m])) by (service) > 0.01` | 5m | critical | |
APIServiceDown | `sum(rate({job=~"buywhere-apiall-services"}[5m])) by (service) == 0` | 2m | critical | 0 logs |
NoLogsReceived | `sum(rate({job=~"buywhere-apiall-services"}[5m])) == 0` | 5m | warning | 0 logs |

4.3 Gap: Missing Product-API-Specific Alerts

Gap | Severity | Recommendation
No alert for product_not_found rate | warning | Add PromQL: rate(http_requests_total{status="404", path=~"/v[12]/products.*"}[5m]) / rate(http_requests_total{path=~"/v[12]/products.*"}[5m]) > 0.1
No alert for compare zero-match critical threshold | critical | Raise to P1: > 0.5 (50%) for 5min
No alert for ingestion write failure rate | warning | Add: scraper_rows_failed_total / scraper_rows_total > 0.05
No structured error budget | | Define error budget policy (see §6)
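Prometheus regex matchers are fully anchored, so the proposed path=~"/v[12]/products.*" selector matches only paths that begin with /v1/products or /v2/products. The sketch below mirrors that behaviour with re.fullmatch to show which requests would count toward the proposed 404-rate alert. The sample (path, status) input shape and the function names are assumptions made for illustration.

```python
import re

# Same pattern as the proposed PromQL matcher; fullmatch mimics the
# implicit anchoring that Prometheus applies to regex label matchers.
PRODUCT_PATH = re.compile(r"/v[12]/products.*")
NOT_FOUND_THRESHOLD = 0.1  # proposed alert threshold (10%)


def is_product_path(path: str) -> bool:
    """True when the path would be selected by the proposed matcher."""
    return PRODUCT_PATH.fullmatch(path) is not None


def product_404_ratio(requests: list[tuple[str, int]]) -> float:
    """Share of product-path requests that returned 404.

    `requests` is a list of (path, status) pairs; returns 0.0 when no
    product-path request was seen.
    """
    product = [status for path, status in requests if is_product_path(path)]
    if not product:
        return 0.0
    return sum(1 for status in product if status == 404) / len(product)
```

A 404 on /compare does not affect the ratio, which is the distinction the gap asks for. The trailing .* also accepts paths such as /v1/productsfoo, so a stricter pattern may be worth considering.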

4.4 Alert Routing

Current: Alertmanager routes by severity label only:

  • critical → PagerDuty + webhook
  • warning → webhook (Slack #alerts)

Gap: No routing by failure_class or endpoint dimension. alert_routing_plan.md outlines failure-class routing but is unimplemented.
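To make the gap concrete, the sketch below models the current severity-only routing and adds a hypothetical failure_class lookup on top of it. The owner mapping comes from the table in §3.2; the receiver names, the route function, the fallback to the webhook for unknown severities, and the default owner are all assumptions, and nothing here reflects the contents of alertmanager.yml or alert_routing_plan.md.

```python
# Current behaviour: receivers chosen by severity label only.
SEVERITY_RECEIVERS = {
    "critical": ["pagerduty", "webhook"],
    "warning": ["webhook"],
}

# Hypothetical extension: owning team per failure class (from section 3.2).
FAILURE_CLASS_OWNERS = {
    "parse_failure": "data-eng",
    "data_write_failure": "backend-eng",
    "ingestion_stall": "on-call",
    "scheduler_stall": "platform-eng",
}


def route(labels: dict[str, str]) -> dict:
    """Pick receivers by severity and an owner by failure_class.

    Alerts without a known failure_class fall back to the shared
    on-call rotation, which matches today's behaviour.
    """
    # Assumption: unknown severities are sent to the webhook only.
    receivers = SEVERITY_RECEIVERS.get(labels.get("severity", ""), ["webhook"])
    owner = FAILURE_CLASS_OWNERS.get(labels.get("failure_class", ""), "on-call")
    return {"receivers": receivers, "owner": owner}
```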


5. Validation Notes

Check | Result | Notes
Prometheus alerts fire on 5xx | | HighErrorRate at 1% threshold confirmed
Loki log alerts fire on errors | | HighErrorRate Loki rule at 1% threshold confirmed
Log retention at 30d in Loki | | retention_period: 720h confirmed in loki-configmap.yaml
PM2 log rotation at 100MB / 5 backups | | Confirmed in request_logging.py (MAX_BYTES = 100 * 1024 * 1024, BACKUP_COUNT = 5)
Log files compressed after rotation | | compress, delaycompress in logrotate configs
S3 bucket for Loki: buywhere-loki-centralized | | Confirmed in loki-configmap.yaml
Alertmanager PagerDuty routing | | critical-alerts-pagerduty receiver configured
Blackbox monitoring for API / MCP / Website | | Multiple up{job="blackbox-..."} alerts confirmed
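The rotation constants confirmed in request_logging.py bound how much analytics data can sit on disk at once. A minimal sketch of that bound, assuming the module builds on the standard library's RotatingFileHandler (the review notes do not confirm this); build_rotating_handler and max_disk_bytes are hypothetical helper names.

```python
import logging.handlers

MAX_BYTES = 100 * 1024 * 1024  # 100MB per file, as confirmed in the review
BACKUP_COUNT = 5               # rotated files kept alongside the active one


def build_rotating_handler(path: str) -> logging.handlers.RotatingFileHandler:
    """Create a size-based rotating handler with the reviewed limits.

    delay=True defers opening the file until the first record is written.
    """
    return logging.handlers.RotatingFileHandler(
        path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT, delay=True
    )


def max_disk_bytes(max_bytes: int = MAX_BYTES, backups: int = BACKUP_COUNT) -> int:
    """Upper bound on disk use: the active file plus every backup."""
    return (backups + 1) * max_bytes
```

With these values the bound is 600MB. The cap is by size, not by age, so on its own it cannot guarantee the 30d baseline for api_analytics.jsonl.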

6. Remaining Risks

Risk | Likelihood | Impact | Mitigation
Cold storage tier (90d) not implemented | High | Compliance gap if logs needed beyond 30d | File issue to extend Loki retention to 2160h; document S3 lifecycle rule as alternative
Error budget policy not defined | Medium | No SLO burn rate alerting | Create SLO document; add ErrorBudgetBurning alert at 0.5% (half of 1% budget)
failure_class routing not implemented | Medium | All scraper alerts route to same on-call | Implement BUY-2880 follow-up: add failure_class label to Loki/Prometheus rules
No structured product_not_found alerting | Low | 404s on products not distinguishable from other 404s | Add path-specific 404 alert (see §4.3)
Analytics JSONL retention not enforced | Low | api_analytics.jsonl rotates but not centrally retained | Confirm analytics logs flow to Loki or extend logrotate for 30d
Log gap between PM2 rotation and Loki pickup | Low | Brief window where logs exist only on disk | Fluent Bit reads from container logs; minor window acceptable for launch
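The proposed ErrorBudgetBurning alert fires once the observed error rate passes half of the 1% budget. The sketch below spells out that arithmetic under the assumption that errors and requests are counted over the same window; the function names are hypothetical, and the SLO document may well define the budget differently once it exists.

```python
ERROR_BUDGET = 0.01        # 1% of requests may fail (HighErrorRate threshold)
BURN_ALERT_FRACTION = 0.5  # alert at half of the budget, i.e. 0.5%


def budget_consumed(errors: int, requests: int, budget: float = ERROR_BUDGET) -> float:
    """Fraction of the error budget used by the observed error rate."""
    if requests == 0:
        return 0.0
    return (errors / requests) / budget


def error_budget_burning(errors: int, requests: int) -> bool:
    """True when more than half of the error budget has been consumed."""
    return budget_consumed(errors, requests) > BURN_ALERT_FRACTION
```

For example, 6 errors in 1,000 requests is a 0.6% error rate, which consumes about 60% of the budget and would trip the alert, while 4 errors in 1,000 would not.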

7. Summary: Baseline Configuration

RETENTION
  Loki (hot/warm):     30 days  [adequate for launch]
  Loki (cold/archival): 90 days [NOT YET IMPLEMENTED — action needed]
  Local files:          7 days  [adequate; Loki is source of truth]

ERROR CLASSES (must-watch)
  5xx server errors        → critical at >1% over 5min
  P95 latency             → warning at >1s over 5min
  P99 latency             → critical at >2.5s over 5min
  API down                → critical at 0 for >2min
  Compare zero-match      → warning at >30% over 30min; critical at >50% over 5min
  Compare cache hit rate   → warning at <50% over 15min
  Ingestion stale         → critical at >2h since last run
  DB pool exhaustion      → warning at ≥90% over 5min

THRESHOLDS (actionable)
  Implemented alerts are configured in prometheus_alerts.yml and loki_rules/api_alerts.yaml
  Gap: compare zero-match critical threshold (>50% over 5min) not yet added (see §4.3)
  Gap: product-API-specific 404 rate alert not yet added
  Gap: cold storage (90d) not yet implemented

8. Files Reviewed

File | Purpose
prometheus_alerts.yml | Prometheus alerting rules
loki_rules/api_alerts.yaml | Loki log-based alerting rules
k8s/production/loki-configmap.yaml | Loki retention config
k8s/production/loki-alerts-configmap.yaml | K8s Loki alerting rules
alertmanager.yml | Alert routing configuration
app/request_logging.py | API request logging middleware
app/logging_centralized.py | Centralized structured logger
config/logrotate.conf | Host-level log rotation
config/logrotate.host.conf | PM2 log rotation config
alert_routing_plan.md | Scraper alert routing plan (unimplemented)
docs/BUY-2880-logging-plan.md | Centralized logging plan (in progress)