Lazada SG Fashion Scraping Pipeline
Overview
This document outlines the pipeline for scraping fashion products from Lazada Singapore (Lazada SG). It covers women's, men's, and kids' fashion, plus bags & luggage and watches & jewellery.
Current Status
🚧 BLOCKED_UPSTREAM - Requires Lazada Open Platform API credentials (see BUY-480)
Target
- Goal: Scrape 100K+ Lazada SG fashion products
- Output: data/normalized/lazada_sg_fashion_normalized.ndjson
- Categories: Women Fashion, Men Fashion, Kids Fashion, Bags & Luggage, Watches & Jewellery
Pipeline Components
1. Scraper
- Module: scrapers.lazada_sg
- Configuration:
  - Target: 300,000 products (shared across all categories)
  - Max pages per category: 100
  - Delay between requests: 0.5 seconds
  - Batch size: 200 products
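The settings above could be captured in a small configuration object. This is only a sketch: the class name, field names, and helper methods are hypothetical and are not taken from scrapers.lazada_sg; only the numeric values come from this document.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LazadaSGScraperConfig:
    """Hypothetical container for the scraper settings listed above."""

    target_products: int = 300_000        # shared across all categories
    max_pages_per_category: int = 100
    request_delay_seconds: float = 0.5
    batch_size: int = 200

    def batches_for_target(self) -> int:
        """Number of batches needed to reach the product target."""
        return math.ceil(self.target_products / self.batch_size)

    def min_delay_per_category_seconds(self) -> float:
        """Lower bound on time spent sleeping while paging one category."""
        return self.max_pages_per_category * self.request_delay_seconds


config = LazadaSGScraperConfig()
print(config.batches_for_target())              # batches at full target
print(config.min_delay_per_category_seconds())  # seconds of delay per category
```

Keeping the values in one frozen object makes it harder for the scheduler and the scraper to drift apart on limits such as the page cap.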
2. Categories Covered
The Lazada SG scraper includes these fashion-related categories:
- women-fashion: Women Fashion
- men-fashion: Men Fashion
- kids-fashion: Kids Fashion
- bags-luggage: Bags & Luggage
- watches-jewellery: Watches & Jewellery
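As an illustration, the category list can be expressed as a slug-to-name mapping, with a generator that enumerates the (category, page) pairs a crawl would visit. The names `FASHION_CATEGORIES` and `iter_category_pages` are assumptions made for this sketch; the actual structure inside the scraper may differ.

```python
from typing import Iterator, Tuple

# Slug -> display name, copied from the category list above.
FASHION_CATEGORIES = {
    "women-fashion": "Women Fashion",
    "men-fashion": "Men Fashion",
    "kids-fashion": "Kids Fashion",
    "bags-luggage": "Bags & Luggage",
    "watches-jewellery": "Watches & Jewellery",
}


def iter_category_pages(max_pages: int = 100) -> Iterator[Tuple[str, int]]:
    """Yield (slug, page_number) pairs, pages numbered from 1 (hypothetical helper)."""
    for slug in FASHION_CATEGORIES:
        for page in range(1, max_pages + 1):
            yield slug, page


# With the documented cap of 100 pages, the plan has 5 * 100 entries.
plan = list(iter_category_pages())
print(len(plan), plan[0], plan[-1])
```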
3. Data Flow
- Scraper extracts raw product data from Lazada SG
- Transforms data to match BuyWhere catalog schema
- Saves to intermediate JSONL: data/lazada/lazada_sg_fashion_raw.ndjson
- Post-scrape pipeline normalizes to: data/normalized/lazada_sg_fashion_normalized.ndjson
- Data becomes available via API endpoint: GET /v1/products?source=lazada_sg&category=fashion
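The raw-to-normalized step in the flow above might look roughly like the following. This is a minimal sketch: the raw field names (`itemId`, `name`, `price`) and the normalized field names are assumptions, since this document does not specify either the Lazada payload or the BuyWhere catalog schema.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def normalize_record(raw: dict) -> dict:
    """Map one raw record to a catalog-style record (field names are hypothetical)."""
    return {
        "source": "lazada_sg",
        "category": "fashion",
        "sku": str(raw["itemId"]),
        "title": raw["name"].strip(),
        "price": float(raw["price"]),
        "currency": "SGD",
    }


def read_ndjson(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of an NDJSON file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def write_ndjson(records: Iterable[dict], path: str) -> int:
    """Write records as one JSON object per line and return the count written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Under these assumptions, the whole step reduces to `write_ndjson(map(normalize_record, read_ndjson(raw_path)), normalized_path)`, which streams records instead of loading the full file into memory.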
4. Schedule
- Frequency: Every 12 hours (as defined in scraper_scheduler.py)
- Status: Currently blocked - requires API credentials
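For reference, a 12-hour cadence can be checked with plain datetime arithmetic. The function names below are hypothetical and do not describe how scraper_scheduler.py is actually implemented.

```python
from datetime import datetime, timedelta, timezone

SCRAPE_INTERVAL = timedelta(hours=12)  # frequency stated above


def next_run(last_run: datetime, interval: timedelta = SCRAPE_INTERVAL) -> datetime:
    """Return the earliest time the next scrape should start."""
    return last_run + interval


def is_due(last_run: datetime, now: datetime,
           interval: timedelta = SCRAPE_INTERVAL) -> bool:
    """True once at least one full interval has passed since the last run."""
    return now >= next_run(last_run, interval)


last = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
print(next_run(last).isoformat())
print(is_due(last, last + timedelta(hours=11)))
```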
Unblocking Requirements
To unblock this pipeline, the following is needed:
- Lazada Open Platform API credentials
- Update scraper to use official API instead of web scraping
- Remove the blocked_upstream: true flag from the scheduler config
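A pre-flight check along these lines could report why the job is still blocked. It is a sketch under stated assumptions: the environment variable names `LAZADA_APP_KEY` and `LAZADA_APP_SECRET` are placeholders chosen for illustration, and the scheduler config is assumed to be available as a plain dict containing the blocked_upstream key.

```python
import os
from typing import List, Mapping

# Hypothetical names for the credentials; adjust to whatever is provisioned.
REQUIRED_ENV_VARS = ("LAZADA_APP_KEY", "LAZADA_APP_SECRET")


def blocking_reasons(job_config: Mapping[str, object],
                     env: Mapping[str, str]) -> List[str]:
    """Return human-readable reasons the job cannot run; empty means unblocked."""
    reasons = []
    if job_config.get("blocked_upstream", False):
        reasons.append("blocked_upstream flag is set")
    for name in REQUIRED_ENV_VARS:
        if not env.get(name):
            reasons.append(f"missing credential: {name}")
    return reasons


if __name__ == "__main__":
    # Example: a job that is still flagged as blocked in the scheduler config.
    for reason in blocking_reasons({"blocked_upstream": True}, os.environ):
        print(reason)
```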
Alternative Approaches (If API Not Available)
- Playwright-based scraping: Similar to Lazada SG Playwright variant
- Third-party affiliate APIs: If available through Lazada affiliate program
- Cached data sources: Periodic dumps from Lazada partners
- User-generated content: Leverage product reviews/social signals
Dependencies
- Working Lazada SG scraper (scrapers/lazada_sg.py)
- Scraper scheduler (scripts/scraper_scheduler.py)
- Post-scrape pipeline (scripts/pipeline.py)
- API key for BuyWhere ingestion endpoint
- Database connection for tracking ingestion runs
Monitoring
- Logs: logs/lazada_sg_scraper.log, logs/lazada_sg_continuous.log
- Metrics: Tracked in scheduler log and database
- Alerts: Failed scrapes trigger alerts via scraper alerting system
Related Issues
- BUY-1727: Scrape Lazada SG fashion & apparel — target 100K products
- BUY-480: Requires Lazada Open Platform API credentials (blocking issue)
- GOAL-b146bdd7: Index 5,000,000 products across Singapore and Southeast Asia
Next Steps
- Check status of BUY-480 for Lazada API credential provisioning
- If credentials become available:
  - Update scraper to use official API
  - Test with small batch
  - Run full scrape for fashion categories
  - Verify output normalization
- If credentials not available:
  - Investigate alternative scraping methods
  - Document limitations and workarounds
  - Consider temporary data sources
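For the small-batch test and the output verification step, a lightweight NDJSON check such as the one below may be enough to catch malformed lines and missing fields before a full run. The required field names are assumptions; substitute the fields the BuyWhere catalog schema actually mandates.

```python
import json
from collections import Counter
from typing import Iterable

# Hypothetical required fields; replace with the real catalog schema.
REQUIRED_FIELDS = ("source", "sku", "title", "price")


def validate_ndjson_lines(lines: Iterable[str]) -> Counter:
    """Count valid, malformed, and incomplete records in NDJSON lines."""
    stats = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        stats["total"] += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            stats["invalid_json"] += 1
            continue
        if not isinstance(record, dict):
            stats["not_object"] += 1
            continue
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            stats["missing_fields"] += 1
        else:
            stats["ok"] += 1
    return stats


sample = [
    '{"source": "lazada_sg", "sku": "1", "title": "Tote Bag", "price": 25.0}',
    '{"source": "lazada_sg", "sku": "2", "title": ""}',
    "not json",
    "",
]
print(dict(validate_ndjson_lines(sample)))
```

Because the function accepts any iterable of strings, an open file handle on data/normalized/lazada_sg_fashion_normalized.ndjson can be passed in directly.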