lazada_sg_fashion

Lazada SG Fashion Scraping Pipeline

Overview

This document describes the pipeline for scraping fashion products from Lazada Singapore. It targets the women's, men's, and kids' fashion categories, plus bags & luggage and watches & jewellery.

Current Status

🚧 BLOCKED_UPSTREAM - Requires Lazada Open Platform API credentials (see BUY-480)

Target

  • Goal: Scrape 100K+ Lazada SG fashion products
  • Output: data/normalized/lazada_sg_fashion_normalized.ndjson
  • Categories: Women Fashion, Men Fashion, Kids Fashion, Bags & Luggage, Watches & Jewellery

Pipeline Components

1. Scraper

  • Module: scrapers.lazada_sg
  • Configuration:
    • Scraper target: 300,000 products (shared across all categories; the fashion goal above is 100K+)
    • Max pages per category: 100
    • Delay between requests: 0.5 seconds
    • Batch size: 200 products
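
As a rough illustration, the settings above could be grouped in one immutable config object. This is a minimal sketch: the class name `LazadaSgScraperConfig` and its field names are hypothetical and are not taken from `scrapers.lazada_sg`; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazadaSgScraperConfig:
    """Hypothetical container for the scraper settings listed in this document."""

    target_products: int = 300_000        # shared across all categories
    max_pages_per_category: int = 100
    request_delay_seconds: float = 0.5    # pause between requests
    batch_size: int = 200                 # products per batch

    def batches_for_target(self) -> int:
        """Number of batches needed to hold the full target (ceiling division)."""
        return -(-self.target_products // self.batch_size)


config = LazadaSgScraperConfig()
print(config.batches_for_target())
```

Freezing the dataclass keeps the values from being changed mid-run, which makes logged settings trustworthy when comparing scrapes.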

2. Categories Covered

The Lazada SG scraper includes these fashion-related categories:

  • women-fashion: Women Fashion
  • men-fashion: Men Fashion
  • kids-fashion: Kids Fashion
  • bags-luggage: Bags & Luggage
  • watches-jewellery: Watches & Jewellery
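
The category list combined with the scraper limits gives a quick sanity check on whether the targets are reachable. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes one request fetches one listing page, which this document does not confirm.

```python
# Slug -> display name, copied from the list above.
FASHION_CATEGORIES = {
    "women-fashion": "Women Fashion",
    "men-fashion": "Men Fashion",
    "kids-fashion": "Kids Fashion",
    "bags-luggage": "Bags & Luggage",
    "watches-jewellery": "Watches & Jewellery",
}

MAX_PAGES_PER_CATEGORY = 100   # from the scraper configuration
REQUEST_DELAY_SECONDS = 0.5    # from the scraper configuration


def max_page_requests() -> int:
    """Upper bound on listing pages fetched in one run (assumes one request per page)."""
    return len(FASHION_CATEGORIES) * MAX_PAGES_PER_CATEGORY


def min_delay_seconds() -> float:
    """Time spent purely on inter-request delays if every page is fetched."""
    return max_page_requests() * REQUEST_DELAY_SECONDS


def products_per_page_needed(goal: int) -> int:
    """Average products per page required to reach `goal` within the page cap."""
    return -(-goal // max_page_requests())  # ceiling division


print(max_page_requests(), min_delay_seconds(), products_per_page_needed(100_000))
```

Under these assumptions, reaching 100K products within the page cap needs an average of 200 products per listing page, and the 300,000 figure needs 600. If real pages return fewer items, the page cap rather than the product target will end the run.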

3. Data Flow

  1. The scraper extracts raw product data from Lazada SG
  2. It transforms the data to match the BuyWhere catalog schema
  3. It saves the result to an intermediate NDJSON file: data/lazada/lazada_sg_fashion_raw.ndjson
  4. The post-scrape pipeline normalizes that file to: data/normalized/lazada_sg_fashion_normalized.ndjson
  5. The data becomes available via the API endpoint: GET /v1/products?source=lazada_sg&category=fashion
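
The raw-to-normalized step can be sketched as a streaming NDJSON transform. This is not the code in `scripts/pipeline.py`: the record fields `title` and `price` are hypothetical placeholders, since the BuyWhere catalog schema is not shown here, and in-memory streams stand in for the two files.

```python
import io
import json
from typing import Iterable, Iterator, TextIO
from urllib.parse import urlencode


def read_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-blank line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def normalize(record: dict) -> dict:
    """Map a raw record to a normalized one (field names are assumptions)."""
    price = record.get("price")
    return {
        "source": "lazada_sg",
        "category": "fashion",
        "title": str(record.get("title", "")).strip(),
        "price": float(price) if price is not None else None,
    }


def write_ndjson(records: Iterable[dict], stream: TextIO) -> int:
    """Write one JSON object per line and return the number written."""
    count = 0
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def products_query_path() -> str:
    """Build the request path for the endpoint named in step 5."""
    return "/v1/products?" + urlencode({"source": "lazada_sg", "category": "fashion"})


# In-memory stand-ins for the raw and normalized files.
raw = io.StringIO('{"title": "  Linen Shirt ", "price": "29.90"}\n\n{"title": "Tote Bag"}\n')
out = io.StringIO()
written = write_ndjson((normalize(r) for r in read_ndjson(raw)), out)
print(written, products_query_path())
```

Processing line by line keeps memory use flat, which matters at the 100K+ record scale this pipeline targets.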

4. Schedule

  • Frequency: Every 12 hours (as defined in scraper_scheduler.py)
  • Status: Currently blocked - requires API credentials
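
The interaction between the 12-hour frequency and the blocked status can be sketched as a single due-check. This is an assumption about how such a check might look, not the logic in scraper_scheduler.py: the function name `is_due` and its parameters are hypothetical, and `blocked_upstream` mirrors the scheduler config flag mentioned in this document.

```python
from datetime import datetime, timedelta
from typing import Optional

SCRAPE_INTERVAL = timedelta(hours=12)  # frequency stated above


def is_due(last_run: Optional[datetime], now: datetime, blocked_upstream: bool) -> bool:
    """Return True when the job should start.

    A blocked job never runs, regardless of how long ago it last ran.
    A job that has never run is due immediately.
    """
    if blocked_upstream:
        return False
    if last_run is None:
        return True
    return now - last_run >= SCRAPE_INTERVAL


now = datetime(2024, 1, 2, 12, 0)
print(is_due(datetime(2024, 1, 2, 0, 0), now, blocked_upstream=True))
```

Checking the flag first means that removing it is the only change needed for the job to resume on its normal cadence.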

Unblocking Requirements

To unblock this pipeline, the following are needed:

  1. Lazada Open Platform API credentials
  2. Update scraper to use official API instead of web scraping
  3. Remove blocked_upstream: true flag from scheduler config

Alternative Approaches (If API Not Available)

  1. Playwright-based scraping: Similar to the existing Lazada SG Playwright variant
  2. Third-party affiliate APIs: If available through Lazada affiliate program
  3. Cached data sources: Periodic dumps from Lazada partners
  4. User-generated content: Leverage product reviews/social signals

Dependencies

  • Working Lazada SG scraper (scrapers/lazada_sg.py)
  • Scraper scheduler (scripts/scraper_scheduler.py)
  • Post-scrape pipeline (scripts/pipeline.py)
  • API key for BuyWhere ingestion endpoint
  • Database connection for tracking ingestion runs

Monitoring

  • Logs: logs/lazada_sg_scraper.log, logs/lazada_sg_continuous.log
  • Metrics: Tracked in scheduler log and database
  • Alerts: Failed scrapes trigger alerts via scraper alerting system

Related Issues

  • BUY-1727: Scrape Lazada SG fashion & apparel — target 100K products
  • BUY-480: Requires Lazada Open Platform API credentials (blocking issue)
  • GOAL-b146bdd7: Index 5,000,000 products across Singapore and Southeast Asia

Next Steps

  1. Check status of BUY-480 for Lazada API credential provisioning
  2. If credentials become available:
    • Update scraper to use official API
    • Test with small batch
    • Run full scrape for fashion categories
    • Verify output normalization
  3. If credentials are not available:
    • Investigate alternative scraping methods
    • Document limitations and workarounds
    • Consider temporary data sources
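
For the "test with small batch" and "verify output normalization" steps, a lightweight check over the normalized NDJSON could look like the sketch below. The required field names are hypothetical, since the BuyWhere catalog schema is not given in this document, and the function name `check_ndjson` is illustrative.

```python
import json
from typing import Iterable, Sequence, Tuple

REQUIRED_FIELDS = ("source", "title")  # hypothetical required fields


def check_ndjson(lines: Iterable[str], required: Sequence[str] = REQUIRED_FIELDS) -> Tuple[int, int]:
    """Return (valid, invalid) counts for NDJSON lines.

    A line is invalid if it is not parseable JSON, is not a JSON object,
    or has a required field that is missing, null, or empty. Blank lines are skipped.
    """
    valid = invalid = 0
    for line in lines:
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            invalid += 1
            continue
        if isinstance(record, dict) and all(
            record.get(field) not in (None, "") for field in required
        ):
            valid += 1
        else:
            invalid += 1
    return valid, invalid


sample = [
    '{"source": "lazada_sg", "title": "Linen Shirt"}',
    '{"source": "lazada_sg"}',
    "not json",
    "",
]
print(check_ndjson(sample))
```

Running a check like this on a small batch before the full scrape surfaces schema problems while they are still cheap to fix.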