Plan: Centralized Logging Aggregation for All Microservices
Issue
BUY-2880 - Implement centralized logging aggregation for all microservices
Status
In Progress
Problem Statement
BuyWhere has multiple microservices (API, scrapers, MCP, background jobs) that currently log in different formats. This makes it difficult to:
- Aggregate and search logs across services
- Correlate errors across service boundaries
- Set up unified alerting
- Build service-specific dashboards
Current State
What Exists
- `docs/logging-schema.md` - Standardized logging schema specification
- `app/logging_centralized.py` - Centralized logger for API service
- `app/request_logging.py` - Request logging middleware with structured output
- `scrapers/scraper_logging.py` & `base_scraper.py` - Scraper-specific structured logging
- `docker-compose.prod.yml` - Loki + Fluent Bit + Grafana stack
- `k8s/production/` & `k8s/staging/` - Fluent Bit and Loki Kubernetes configurations
- `grafana/provisioning/dashboards/loki-logs.json` - Loki logs dashboard
- `k8s/production/loki-alerts-configmap.yaml` - Loki alerting rules
Gaps Identified
- Inconsistent log formats: Scrapers use the `platform` field; the API uses the `service` field
- Missing service labels: Docker Compose scraper services lack labels for Fluent Bit filtering
- No unified job label: Loki queries reference `job="buywhere-api"` but scrapers use `job="scraper-fleet"`
- Missing log level standardization: Log levels not consistently applied
- Grafana dashboard limited: Current dashboard only shows API logs, not scraper fleet logs
Implementation Plan
Phase 1: Unify Log Format Across All Services
1.1 Update scrapers/base_scraper.py to use centralized logging
- Import `get_logger` from `app.logging_centralized`
- Replace `StructuredLogger` with the centralized logger
- Map scraper-specific fields to schema fields: `platform` → `service`; `error_type` → include in `metadata`
- Add `scraper_name` as the service identifier
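The field mapping above can be sketched as a small adapter. This is illustrative only: `get_logger`'s real interface lives in `app/logging_centralized.py`, and the exact top-level field set comes from `docs/logging-schema.md`, so the function and field names here are assumptions.

```python
from typing import Any

# Fields assumed to be top-level in the unified schema (per docs/logging-schema.md).
TOP_LEVEL_FIELDS = {"timestamp", "level", "service", "message", "trace_id"}

def to_unified_schema(event: dict[str, Any]) -> dict[str, Any]:
    """Map a scraper-style log event onto the unified schema.

    - `platform` becomes `service`
    - `error_type` (and any other extras) moves under `metadata`
    """
    out: dict[str, Any] = {}
    metadata: dict[str, Any] = {}
    for key, value in event.items():
        if key == "platform":
            out["service"] = value      # platform -> service
        elif key in TOP_LEVEL_FIELDS:
            out[key] = value            # already schema-compliant
        else:
            metadata[key] = value       # e.g. error_type, retry_count
    if metadata:
        out["metadata"] = metadata
    return out

scraper_event = {
    "timestamp": "2024-01-01T00:00:00Z",
    "level": "error",
    "platform": "amazon-scraper",
    "message": "request failed",
    "error_type": "TimeoutError",
}
print(to_unified_schema(scraper_event)["service"])  # amazon-scraper
```

In `base_scraper.py` this mapping would run just before handing the record to the centralized logger, so downstream Fluent Bit parsing sees one shape for every service.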
1.2 Update app/logging_centralized.py
- The `log_scraper_progress` function already exists; ensure it is properly integrated
Phase 2: Docker Compose Logging Labels
2.1 Update docker-compose.yml scraper services
Add logging labels to all scraper services:
```yaml
logging:
  driver: json-file
  options:
    max-size: "50m"
    max-file: "5"
    labels: "service,environment"
```
2.2 Add labels to all service definitions
- The api, redis, db, and migrate services need a `service` label
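Putting 2.1 and 2.2 together, one service definition might look like the sketch below. The service name and label values are illustrative; note that the json-file driver's `labels` option only forwards container labels that are actually set on the service.

```yaml
services:
  api:
    labels:
      service: "buywhere-api"      # consumed by Fluent Bit filtering
      environment: "production"
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"
        labels: "service,environment"
```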
Phase 3: Fluent Bit Configuration Updates
3.1 Update parsers.conf for unified parsing
- Add parser for scraper log format
- Ensure the `timestamp`, `level`, `service`, and `message` fields are extracted
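A minimal sketch of the parser entry for `parsers.conf`, assuming all services emit single-line JSON to stdout. The parser name is illustrative, and `Time_Format` must be adjusted to match the timestamp format the services actually emit.

```
[PARSER]
    Name        buywhere_json
    Format      json
    Time_Key    timestamp
    Time_Format %Y-%m-%dT%H:%M:%S.%L%z
    Time_Keep   On
```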
3.2 Add service label extraction
- Use Kubernetes labels and Docker container labels
- Extract the `service` label for Loki label mapping
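With Fluent Bit's Loki output plugin, record fields can be promoted to Loki labels via `$`-prefixed record accessors. A sketch, assuming the unified schema's `service` and `level` fields are present in each record (host, port, and the static `job` value are placeholders):

```
[OUTPUT]
    Name    loki
    Match   *
    Host    loki
    Port    3100
    Labels  job=buywhere, service=$service, level=$level
```

Keeping the label set small (service, level, job) matters: high-cardinality fields like `trace_id` should stay in the log body, not become Loki labels.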
Phase 4: Loki Configuration
4.1 Update Loki schema
- Ensure index naming follows the `logs-{service}-YYYY.MM.DD` convention
4.2 Verify retention policies
- Hot: 7 days
- Warm: 30 days
- Cold: 90 days
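Loki has no built-in hot/warm/cold tiering; retention is typically enforced by the compactor with a global `retention_period` plus per-stream overrides in `limits_config`. A sketch under that assumption, with selectors and periods chosen to illustrate the 7/30/90-day targets:

```yaml
compactor:
  retention_enabled: true

limits_config:
  retention_period: 2160h            # 90-day global cap
  retention_stream:
    - selector: '{level="debug"}'    # high-volume streams expire sooner
      priority: 1
      period: 168h                   # 7 days
    - selector: '{job="scraper-fleet"}'
      priority: 2
      period: 720h                   # 30 days
```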
Phase 5: Grafana Dashboard Updates
5.1 Update loki-logs.json dashboard
- Add service selector dropdown
- Add log level filter
- Add search functionality
- Include scraper-fleet queries alongside API queries
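The service selector, level filter, and search box can all drive a single LogQL expression through Grafana template variables. The `$service`, `$level`, and `$search` variable names are assumptions; the job values come from the gaps identified above.

```
{job=~"buywhere-api|scraper-fleet", service=~"$service"}
  | json
  | level =~ "$level"
  |= "$search"
```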
Phase 6: Verification
6.1 Test log flow
- Generate test logs from each service
- Verify logs appear in Loki
- Verify Grafana can query all services
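A minimal way to generate schema-compliant test logs for step 6.1 (a sketch: the service names and required field set are assumptions based on this plan and `docs/logging-schema.md`):

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed service identifiers; align with the actual `service` labels in use.
SERVICES = ["buywhere-api", "scraper-fleet", "mcp", "background-jobs"]

def make_test_log(service: str) -> str:
    """Return one JSON log line matching the unified schema."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "info",
        "service": service,
        "message": f"logging smoke test for {service}",
        "trace_id": uuid.uuid4().hex,
    }
    return json.dumps(record)

for service in SERVICES:
    line = make_test_log(service)
    parsed = json.loads(line)
    # Every line must carry the fields Fluent Bit is configured to extract.
    assert {"timestamp", "level", "service", "message"} <= parsed.keys()
    print(line)
```

Running this inside each container and then querying Loki for the emitted `trace_id` values confirms the full stdout → Fluent Bit → Loki path.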
6.2 Verify alerting
- Test Loki alert rules fire correctly
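Loki alert rules are evaluated by the ruler and use LogQL metric queries in Prometheus-style rule files. A sketch of the kind of rule `loki-alerts-configmap.yaml` would hold (the alert name, threshold, and severity are illustrative):

```yaml
groups:
  - name: buywhere-log-alerts
    rules:
      - alert: HighErrorLogRate
        expr: sum by (service) (rate({job=~"buywhere-api|scraper-fleet"} | json | level="error" [5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated error log rate in {{ $labels.service }}"
```

To test firing, emit a burst of `level="error"` test logs from one service and confirm the alert transitions to firing for only that `service` label.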
Files to Modify
- `scrapers/base_scraper.py` - Use centralized logging
- `scrapers/scraper_logging.py` - Keep for backwards compatibility; use centralized logger
- `docker-compose.yml` - Add logging labels to all services
- `docker-compose.prod.yml` - Ensure all services have proper labels
- `k8s/production/fluent-bit-configmap.yaml` - Update parsers for unified format
- `k8s/staging/fluent-bit-configmap.yaml` - Same updates
- `grafana/provisioning/dashboards/loki-logs.json` - Add multi-service support
Success Criteria
- All services output JSON logs to stdout with consistent schema
- Fluent Bit collects logs from all containers
- Loki stores logs with proper labels for filtering
- Grafana dashboard shows logs from all services
- Loki alerts fire for errors in any service
- Logs can be correlated by `trace_id` across service boundaries
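As an example of the last criterion: once every service emits `trace_id` in its JSON payload, a single LogQL query retrieves one request's logs across all services (the trace id value is a placeholder):

```
{job=~"buywhere-api|scraper-fleet"} | json | trace_id = "<trace-id>"
```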