
BUY-3442: Log Retention and Alert Baseline for Product API

Status

Baseline defined — implementation partial

1. Scope

Service: Product API (buywhere-api, all /v1/products, /v2/products, /compare endpoints)
Proxy-free: Excludes sources using ScraperAPI or upstream proxies
Launch-relevant: Focused on error classes and thresholds that directly impact launch readiness and user-facing product quality

2. Log Retention Windows

Layer	Current	Baseline	Notes
Hot (Loki ingester)	720h (30d)	720h (30d)	No change; adequate for launch
Warm (S3 boltdb-shipper index)	720h (30d)	30d	Index on S3; same as hot for now
Cold (S3 chunks)	720h (30d)	90d	BUY-2880 plan specified 90d cold; not yet implemented
Local rotating files	7d standard, 10d PM2	7d	Adequate; Loki is source of truth
Analytics JSONL (api_analytics.jsonl)	5 backups × 100MB	30d	Should align with Loki retention

Gap: BUY-2880 Phase 4 specified hot:7d, warm:30d, cold:90d but only 30d is configured in Loki. Cold storage tier (90d) is not yet implemented.

Action needed: Update k8s/production/loki-configmap.yaml table_manager.retention_period to 2160h (90d) for cold/archival tier, or document that S3 lifecycle rules handle the 90d requirement.
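The gap above comes down to a comparison between configured and baseline retention per layer. A minimal sketch of that check follows, using the values from the table; the dictionary layout and function names are illustrative and not taken from any existing tooling.

```python
# Retention windows from the table above, in days. "current" is what is
# configured today; "baseline" is the target defined in this document.
RETENTION_DAYS = {
    "hot": {"current": 30, "baseline": 30},
    "warm": {"current": 30, "baseline": 30},
    "cold": {"current": 30, "baseline": 90},
}


def to_loki_hours(days: int) -> str:
    """Format a day count as the hour string used by Loki retention settings."""
    return f"{days * 24}h"


def retention_gaps(layers: dict) -> dict:
    """Return the layers whose current retention is shorter than the baseline,
    mapped to the retention value they would need."""
    return {
        name: to_loki_hours(cfg["baseline"])
        for name, cfg in layers.items()
        if cfg["current"] < cfg["baseline"]
    }


print(retention_gaps(RETENTION_DAYS))  # {'cold': '2160h'}
```

Only the cold layer is reported, which matches the single action item above.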

3. Must-Watch Error Classes

3.1 HTTP Error Class Taxonomy

Error Class	HTTP Codes	Root Cause	Impact
`5xx_server_error`	500, 502, 503, 504	Backend failure, DB timeout, OOM	User-facing errors, failed reads
`4xx_client_error`	400, 401, 403, 404, 429	Bad request, auth failure, rate limit, not found	Degraded functionality
`product_not_found`	404 on `/products/*`	Missing product, stale ID	Empty compare results
`ingestion_rejected`	422, 429 on write	Schema mismatch, rate limit	Data quality gaps
`timeout_error`	504	Upstream latency, pool exhaustion	Failed user requests
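The classes in this taxonomy overlap: a 504 is both a `5xx_server_error` and a `timeout_error`, and a product 404 is both a `4xx_client_error` and a `product_not_found`. The sketch below shows one way to map a response onto every class it belongs to. The path pattern and the set of write methods are assumptions made for illustration; the table only says "404 on `/products/*`" and "on write".

```python
import re

# Assumed shape of product paths, based on the endpoints listed in the scope.
PRODUCT_PATH = re.compile(r"^/v[12]/products/")
# Assumed definition of a "write" request.
WRITE_METHODS = {"POST", "PUT", "PATCH"}


def classify(status: int, path: str, method: str = "GET") -> list[str]:
    """Return every error class from the taxonomy that a response falls into."""
    classes = []
    if status in (500, 502, 503, 504):
        classes.append("5xx_server_error")
    if status == 504:
        classes.append("timeout_error")
    if status in (400, 401, 403, 404, 429):
        classes.append("4xx_client_error")
    if status == 404 and PRODUCT_PATH.match(path):
        classes.append("product_not_found")
    if status in (422, 429) and method in WRITE_METHODS:
        classes.append("ingestion_rejected")
    return classes


print(classify(504, "/compare"))          # ['5xx_server_error', 'timeout_error']
print(classify(404, "/v1/products/123"))  # ['4xx_client_error', 'product_not_found']
```

Returning a list rather than a single label keeps the broad and the specific class countable at the same time.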

3.2 Scraper-to-API Failure Classes (proxy-free)

Failure Class	Detection	Owner	Launch Relevance
`parse_failure`	`returncode != 0` after retries	Data-eng	Direct product gaps
`data_write_failure`	`rows_failed > 0` with `status=completed`	Backend-eng	Missing products in API
`ingestion_stall`	`hours_since_last_run > interval * 2`	On-call	Stale compare data
`scheduler_stall`	Lock file staleness	Platform-eng	Fleet-wide stall
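The detection rules in the table can be read as a small decision function over a scraper run record. The sketch below assumes a hypothetical `ScraperRun` record with the fields named in the Detection column; the real run schema and the order in which the checks are applied are not specified here. `scheduler_stall` is left out because it depends on lock file state rather than on a run record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScraperRun:
    """Hypothetical run record; field names follow the Detection column."""
    returncode: int              # exit code after all retries
    status: str                  # e.g. "completed"
    rows_failed: int
    hours_since_last_run: float
    interval_hours: float        # scheduled interval for this source


def failure_class(run: ScraperRun) -> Optional[str]:
    """Return the first matching failure class, or None for a healthy run."""
    if run.hours_since_last_run > run.interval_hours * 2:
        return "ingestion_stall"
    if run.returncode != 0:
        return "parse_failure"
    if run.status == "completed" and run.rows_failed > 0:
        return "data_write_failure"
    return None


run = ScraperRun(returncode=0, status="completed", rows_failed=3,
                 hours_since_last_run=1.0, interval_hours=6.0)
print(failure_class(run))  # data_write_failure
```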

3.3 Compare Experience Errors

Error Class	Detection	Threshold	Severity
`zero_match_rate`	`% requests with 0 results > 30%`	30% over 30min	Warning
`compare_cache_hit_rate`	`cache_hits / cache_requests < 50%`	50% over 15min	Warning
`compare_ingestion_stale`	`last_ingestion_timestamp > 2h ago`	2h no ingestion	Critical
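The three thresholds above can be expressed as a single evaluation over counts already aggregated for the relevant window (30min for zero-match, 15min for cache). This is a sketch only; the function and parameter names are illustrative, and skipping the cache check when there were no cache requests is an assumption.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Ratio that treats an empty window as 0 instead of dividing by zero."""
    return numerator / denominator if denominator else 0.0


def compare_alerts(zero_match, requests, cache_hits, cache_requests,
                   seconds_since_ingestion):
    """Return (error_class, severity) pairs for every threshold breached."""
    alerts = []
    if ratio(zero_match, requests) > 0.30:
        alerts.append(("zero_match_rate", "warning"))
    # Assumption: with no cache traffic there is no hit rate to judge.
    if cache_requests and ratio(cache_hits, cache_requests) < 0.50:
        alerts.append(("compare_cache_hit_rate", "warning"))
    if seconds_since_ingestion > 2 * 3600:
        alerts.append(("compare_ingestion_stale", "critical"))
    return alerts


print(compare_alerts(31, 100, 40, 100, 7201))
```

All comparisons are strict, so a window sitting exactly on a threshold does not fire.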

4. Alert Thresholds — Product API

4.1 Existing PromQL Alerts (Validated)

Alert	Expr	For	Severity	Threshold	Status
`HighErrorRate`	`rate(http_errors_total[5m]) / rate(http_requests_total[5m])`	5m	critical	>1%	✅ Implemented
`HighLatencyP95`	`histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))`	5m	warning	>1s	✅ Implemented
`HighLatencyP99`	`histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`	5m	critical	>2.5s	✅ Implemented
`APIDown`	`up{job="buywhere-api"} == 0`	2m	critical	0 = down	✅ Implemented
`DatabasePoolExhausted`	`db_connection_pool_checked_out / db_connection_pool_size >= 0.9`	5m	warning	≥90%	✅ Implemented
`ComparePageIngestionStale`	`(time() - avg(last_over_time(buywhere_last_ingestion_timestamp[1h]))) > 7200`	10m	critical	>2h stale	✅ Implemented
`CompareMatchQualityDrop`	`rate(buywhere_compare_zero_matches_total[30m]) / rate(buywhere_compare_requests_total[30m]) > 0.3`	30m	warning	>30% zero-match	✅ Implemented
`CompareCacheHitRateLow`	`rate(buywhere_cache_hits_total{endpoint="/compare"}[15m]) / rate(buywhere_cache_requests_total{endpoint="/compare"}[15m]) < 0.5`	15m	warning	<50%	✅ Implemented
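The two latency alerts rely on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that estimate for classic histograms, handy for sanity-checking a threshold against sample data. The bucket bounds and counts are made up for illustration and do not reflect the real bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative data: 1000 requests across hypothetical buckets (seconds).
sample = [(0.5, 900), (1.0, 960), (2.5, 995), (math.inf, 1000)]
print(round(histogram_quantile(0.95, sample), 3))  # 0.917
print(round(histogram_quantile(0.99, sample), 3))  # 2.286
```

With this sample, P95 stays under the 1s warning threshold and P99 under the 2.5s critical threshold. Because the estimate is interpolated, its accuracy depends on having bucket boundaries close to the thresholds being alerted on.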

4.2 Loki Log-Based Alerts (Validated)

Alert	Expr	For	Severity	Threshold	Status
`HighErrorRate` (Loki)	`sum(rate({job=~"buywhere-api|all-services"} |= "error" [5m])) by (service) / sum(rate({job=~"buywhere-api|all-services"}[5m])) by (service) > 0.01`	5m	critical	>1%
`APIServiceDown`	`sum(rate({job=~"buywhere-api|all-services"}[5m])) by (service) == 0`	2m	critical	0 logs
`NoLogsReceived`	`sum(rate({job=~"buywhere-api|all-services"}[5m])) == 0`	5m	warning	0 logs

4.3 Gap: Missing Product-API-Specific Alerts

Gap	Severity	Recommendation
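No alert for `product_not_found` rate	warning	Add PromQL: `rate(http_requests_total{status="404", path=~"/v[12]/products.*"}[5m]) / rate(http_requests_total{path=~"/v[12]/products.*"}[5m]) > 0.1`
No alert for compare zero-match critical threshold	critical	Add a critical (P1) rule on the same ratio as `CompareMatchQualityDrop`: `> 0.5` (50%) for 5min
No alert for ingestion write failure rate	warning	Add: `scraper_rows_failed_total / scraper_rows_total > 0.05`; if these are counters, wrap both in `rate()` or `increase()` over the alert window so the ratio reflects recent runs rather than lifetime totals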
No structured error budget	—	Define error budget policy (see §6)
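Prometheus anchors regex label matchers at both ends, so a path matcher meant to cover everything under `/v1/products` and `/v2/products` needs a trailing `.*`. The sketch below mirrors that behaviour with `re.fullmatch` and computes the proposed product 404 ratio from per-path counts. The `counts` layout and function names are illustrative.

```python
import re


def matches_label(pattern: str, value: str) -> bool:
    """Mimic a Prometheus =~ matcher, which must match the whole value."""
    return re.fullmatch(pattern, value) is not None


def product_404_ratio(counts: dict, pattern: str = r"/v[12]/products.*") -> float:
    """counts maps (path, status) to a request count for the window."""
    total = sum(n for (path, _status), n in counts.items()
                if matches_label(pattern, path))
    not_found = sum(n for (path, status), n in counts.items()
                    if status == "404" and matches_label(pattern, path))
    return not_found / total if total else 0.0


counts = {
    ("/v1/products/1", "200"): 80,
    ("/v1/products/2", "404"): 15,
    ("/v2/products", "404"): 5,
    ("/compare", "404"): 50,   # excluded: not a product path
}
print(product_404_ratio(counts))  # 0.2
print(matches_label(r"/v[12]/products.", "/v1/products/123"))  # False
```

A ratio of 0.2 would exceed the proposed 0.1 threshold. The last line shows why a pattern ending in a bare `.` is too narrow: it matches exactly one character after `products`.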

4.4 Alert Routing

Current: Alertmanager routes by severity label only:

critical → PagerDuty + webhook
warning → webhook (Slack #alerts)

Gap: No routing by failure_class or endpoint dimension. alert_routing_plan.md outlines failure-class routing but is unimplemented.
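To make the gap concrete, the sketch below models the current severity-only routing and shows where a `failure_class` dimension could plug in. The receiver names are shorthand for the routes listed above, and the owner mapping is a hypothetical reading of the Owner column in §3.2; it is not taken from `alert_routing_plan.md`.

```python
# Current behaviour: routing keyed on the severity label only.
SEVERITY_ROUTES = {
    "critical": ["pagerduty", "webhook"],
    "warning": ["webhook"],
}

# Hypothetical extension: owning team per failure class (from §3.2).
FAILURE_CLASS_OWNERS = {
    "parse_failure": "data-eng",
    "data_write_failure": "backend-eng",
    "ingestion_stall": "on-call",
    "scheduler_stall": "platform-eng",
}


def route(labels: dict) -> dict:
    """Return the receivers for an alert plus the owning team, if known."""
    receivers = list(SEVERITY_ROUTES.get(labels.get("severity"), []))
    owner = FAILURE_CLASS_OWNERS.get(labels.get("failure_class"))
    return {"receivers": receivers, "owner": owner}


print(route({"severity": "warning", "failure_class": "parse_failure"}))
# {'receivers': ['webhook'], 'owner': 'data-eng'}
```

Without a `failure_class` label on the alert, `owner` is `None` and every scraper alert lands on the same receivers, which is the situation described above.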

5. Validation Notes

Check	Result	Notes
Prometheus alerts fire on 5xx	✅	`HighErrorRate` at 1% threshold confirmed
Loki log alerts fire on errors	✅	`HighErrorRate` Loki rule at 1% threshold confirmed
Log retention at 30d in Loki	✅	`retention_period: 720h` confirmed in `loki-configmap.yaml`
Request log rotation (size-based) at 100MB / 5 backups	✅	Confirmed in `request_logging.py` (`MAX_BYTES = 100 * 1024 * 1024`, `BACKUP_COUNT = 5`)
Log files compressed after rotation	✅	`compress`, `delaycompress` in logrotate configs
S3 bucket for Loki: `buywhere-loki-centralized`	✅	Confirmed in loki-configmap.yaml
Alertmanager PagerDuty routing	✅	`critical-alerts-pagerduty` receiver configured
Blackbox monitoring for API / MCP / Website	✅	Multiple `up{job="blackbox-..."}` alerts confirmed
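The `MAX_BYTES` and `BACKUP_COUNT` values confirmed in `request_logging.py` describe size-based rotation. The sketch below shows how such constants are typically wired into Python's standard `RotatingFileHandler`; whether `request_logging.py` uses this handler is an assumption, and the logger name and helper functions are illustrative.

```python
import logging
from logging.handlers import RotatingFileHandler

MAX_BYTES = 100 * 1024 * 1024  # 100MB per file, as confirmed above
BACKUP_COUNT = 5               # rotated files kept alongside the active one


def build_request_logger(path: str) -> logging.Logger:
    """Create a logger that rotates `path` once it reaches MAX_BYTES."""
    logger = logging.getLogger("api_analytics_sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep request lines out of the root logger
    return logger


def max_disk_bytes() -> int:
    """Upper bound on disk use: the active file plus all backups."""
    return MAX_BYTES * (BACKUP_COUNT + 1)


print(max_disk_bytes() // (1024 * 1024))  # 600
```

Size-based rotation caps disk use at roughly 600MB but gives no time guarantee: at high request volume the five backups could cover far less than 30 days, which is why the analytics JSONL retention appears as a risk in §6.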

6. Remaining Risks

Risk	Likelihood	Impact	Mitigation
Cold storage tier (90d) not implemented	High	Compliance gap if logs needed beyond 30d	File issue to extend Loki retention to 2160h; document S3 lifecycle rule as alternative
Error budget policy not defined	Medium	No SLO burn rate alerting	Create SLO document; add `ErrorBudgetBurning` alert at 0.5% (half of 1% budget)
`failure_class` routing not implemented	Medium	All scraper alerts route to same on-call	Implement BUY-2880 follow-up: add `failure_class` label to Loki/Prometheus rules
No structured `product_not_found` alerting	Low	404s on products not distinguishable from other 404s	Add path-specific 404 alert (see §4.3)
Analytics JSONL retention not enforced	Low	api_analytics.jsonl rotates but not centrally retained	Confirm analytics logs flow to Loki or extend logrotate for 30d
Log gap between PM2 rotation and Loki pickup	Low	Brief window where logs exist only on disk	Fluent Bit reads from container logs; minor window acceptable for launch
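The proposed `ErrorBudgetBurning` alert can be reasoned about with simple arithmetic: with a 1% error budget, firing at half the budget means firing at an error ratio of 0.5%. A minimal sketch, assuming the budget is expressed as an allowed error ratio over the same window as the measured ratio; the function names are illustrative.

```python
def burn_rate(error_ratio: float, budget: float = 0.01) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    return error_ratio / budget


def error_budget_burning(error_ratio: float, budget: float = 0.01,
                         alert_fraction: float = 0.5) -> bool:
    """True once the error ratio reaches the chosen fraction of the budget."""
    return error_ratio >= budget * alert_fraction


print(error_budget_burning(0.004))  # False
print(error_budget_burning(0.005))  # True
print(burn_rate(0.02))              # 2.0
```

This leaves a gap between the early warning at 0.5% and the existing `HighErrorRate` critical threshold at 1%.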

7. Summary: Baseline Configuration

RETENTION
  Loki (hot/warm):     30 days  [adequate for launch]
  Loki (cold/archival): 90 days [NOT YET IMPLEMENTED — action needed]
  Local files:          7 days  [adequate; Loki is source of truth]

ERROR CLASSES (must-watch)
  5xx server errors        → critical at >1% over 5min
  P95 latency             → warning at >1s over 5min
  P99 latency             → critical at >2.5s over 5min
  API down                → critical at 0 for >2min
  Compare zero-match      → warning at >30% over 30min; critical at >50% over 5min
  Compare cache hit rate   → warning at <50% over 15min
  Ingestion stale         → critical at >2h since last run
  DB pool exhaustion      → warning at ≥90% over 5min

THRESHOLDS (actionable)
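  All above are configured in prometheus_alerts.yml and loki_rules/api_alerts.yaml,
  except the compare zero-match critical tier (>50% over 5min), which is only proposed (§4.3)
  Gap: product-API-specific 404 rate alert not yet added
  Gap: ingestion write failure rate alert not yet added
  Gap: cold storage (90d) not yet implemented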

8. Files Reviewed

File	Purpose
`prometheus_alerts.yml`	Prometheus alerting rules
`loki_rules/api_alerts.yaml`	Loki log-based alerting rules
`k8s/production/loki-configmap.yaml`	Loki retention config
`k8s/production/loki-alerts-configmap.yaml`	K8s Loki alerting rules
`alertmanager.yml`	Alert routing configuration
`app/request_logging.py`	API request logging middleware
`app/logging_centralized.py`	Centralized structured logger
`config/logrotate.conf`	Host-level log rotation
`config/logrotate.host.conf`	PM2 log rotation config
`alert_routing_plan.md`	Scraper alert routing plan (unimplemented)
`docs/BUY-2880-logging-plan.md`	Centralized logging plan (in progress)