
Lazada SG Fashion Scraping Pipeline

This document outlines the pipeline for scraping fashion products from Lazada Singapore (Lazada SG). It targets the women's, men's, and kids' fashion categories, along with bags & luggage and watches & jewellery.

🚧 BLOCKED_UPSTREAM - Requires Lazada Open Platform API credentials (see BUY-480)

Goal: Scrape 100K+ Lazada SG fashion products
Output: data/normalized/lazada_sg_fashion_normalized.ndjson
Categories: Women Fashion, Men Fashion, Kids Fashion, Bags & Luggage, Watches & Jewellery

Module: scrapers.lazada_sg
Configuration:
- Target: 300,000 products (shared across all categories)
- Max pages per category: 100
- Delay between requests: 0.5 seconds
- Batch size: 200 products
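
As a rough illustration, the settings above can be collected in a small Python structure. The class name `ScrapeConfig`, its field names, and its helper methods are hypothetical and are not taken from `scrapers.lazada_sg`. The sketch also assumes one request per results page and that the delay applies to every request.

```python
from dataclasses import dataclass

# Category names as listed in this document.
FASHION_CATEGORIES = (
    "Women Fashion",
    "Men Fashion",
    "Kids Fashion",
    "Bags & Luggage",
    "Watches & Jewellery",
)


@dataclass(frozen=True)
class ScrapeConfig:
    """Hypothetical container for the configuration values listed above."""

    target_products: int = 300_000       # shared across all categories
    max_pages_per_category: int = 100
    request_delay_seconds: float = 0.5
    batch_size: int = 200

    def max_page_requests(self, n_categories: int) -> int:
        # Upper bound on page requests, assuming one request per page.
        return self.max_pages_per_category * n_categories

    def min_delay_seconds(self, n_categories: int) -> float:
        # Time spent purely waiting between requests at the page cap.
        return self.max_page_requests(n_categories) * self.request_delay_seconds

    def products_per_page_needed(self, goal: int, n_categories: int) -> float:
        # Average page yield required to reach `goal` within the page cap.
        return goal / self.max_page_requests(n_categories)


config = ScrapeConfig()
n = len(FASHION_CATEGORIES)
print(config.max_page_requests(n))                  # 500
print(config.min_delay_seconds(n))                  # 250.0
print(config.products_per_page_needed(100_000, n))  # 200.0
```

Under these assumptions, five categories at 100 pages each allow at most 500 page requests, so the 100K goal would need about 200 products per page on average. Whether a Lazada results page returns that many items is not stated here and should be checked before relying on the page cap.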

The Lazada SG scraper includes these fashion-related categories: Women Fashion, Men Fashion, Kids Fashion, Bags & Luggage, Watches & Jewellery.

Data flow:
- The scraper extracts raw product data from Lazada SG.
- It transforms the data to match the BuyWhere catalog schema.
- It saves the result to an intermediate NDJSON file: data/lazada/lazada_sg_fashion_raw.ndjson
- The post-scrape pipeline normalizes that file to: data/normalized/lazada_sg_fashion_normalized.ndjson
- The data becomes available via the API endpoint: GET /v1/products?source=lazada_sg&category=fashion
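
The normalization step in this flow could look roughly like the sketch below, which reads NDJSON records line by line and writes normalized NDJSON. The BuyWhere catalog schema is not described in this document, so every field name here (`itemId`, `name`, `sku`, `title`, and so on), the `SGD` currency default, and the function names are assumptions for illustration only. In-memory streams stand in for the real files.

```python
import io
import json
from typing import TextIO

SOURCE = "lazada_sg"  # matches the `source` value in the API endpoint


def normalize_record(raw: dict) -> dict:
    """Map one raw record to a hypothetical normalized shape."""
    return {
        "source": SOURCE,
        "category": "fashion",
        "sku": str(raw["itemId"]),     # assumed raw field name
        "title": raw["name"].strip(),  # assumed raw field name
        "price": float(raw["price"]),
        "currency": "SGD",             # assumed default for Lazada SG
    }


def normalize_stream(raw_lines: TextIO, out: TextIO) -> int:
    """Write one normalized JSON object per line; return the record count."""
    written = 0
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the raw file
        record = normalize_record(json.loads(line))
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written


# Demo with in-memory streams instead of the real data/ paths.
raw = io.StringIO('{"itemId": 101, "name": " Linen Shirt ", "price": "29.90"}\n\n')
out = io.StringIO()
count = normalize_stream(raw, out)
print(count)
print(out.getvalue(), end="")
```

Processing one line at a time keeps memory use flat regardless of file size, which matters at the 100K+ record scale this pipeline targets.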

To unblock this pipeline, the following is needed:

Check status of BUY-480 for Lazada API credential provisioning
If credentials become available:
- Update scraper to use official API
- Test with small batch
- Run full scrape for fashion categories
- Verify output normalization
If credentials are not available:
- Investigate alternative scraping methods
- Document limitations and workarounds
- Consider temporary data sources
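
For the "Verify output normalization" step, a small checker along these lines could be run against the normalized file once a test batch exists. This is a sketch only: the required field names and the use of `sku` as the duplicate key are assumptions, since the actual catalog schema is not given in this document.

```python
import json
from typing import Iterable

# Hypothetical required fields; replace with the real catalog schema.
REQUIRED_FIELDS = ("source", "sku", "title", "price")


def verify_ndjson(lines: Iterable[str], expected_source: str = "lazada_sg") -> dict:
    """Count valid, invalid, and duplicate records in NDJSON lines."""
    report = {"valid": 0, "invalid": 0, "duplicates": 0}
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            report["invalid"] += 1
            continue
        if (
            not isinstance(record, dict)
            or any(field not in record for field in REQUIRED_FIELDS)
            or record["source"] != expected_source
        ):
            report["invalid"] += 1
            continue
        if record["sku"] in seen:
            report["duplicates"] += 1
            continue
        seen.add(record["sku"])
        report["valid"] += 1
    return report


sample = [
    '{"source": "lazada_sg", "sku": "101", "title": "Linen Shirt", "price": 29.9}',
    '{"source": "lazada_sg", "sku": "101", "title": "Linen Shirt", "price": 29.9}',
    '{"source": "lazada_sg", "sku": "102"}',
    "not json",
]
report = verify_ndjson(sample)
print(report)  # {'valid': 1, 'invalid': 2, 'duplicates': 1}
```

Against the real output, the same function could be called with an open file handle for data/normalized/lazada_sg_fashion_normalized.ndjson, and the `valid` count compared with the 100K+ goal.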
