category_extraction_20260418

Watsons SG Category Extraction & Normalization Pass

Issue: BUY-3133
Date: 2026-04-18
Status: Blocked - Cloudflare protection


Executive Summary

Watsons Singapore (watsons.com.sg) sits behind Cloudflare/Akamai anti-bot protection that blocks all automated access from our infrastructure. Every scraping approach tested (direct HTTP via httpx, Playwright, and undetected-chromedriver) returned HTTP 403 "Access Denied".

Primary Blocker: Getting past the protection requires a residential proxy or specialized bypass infrastructure, neither of which is currently available.


Category Entry Points (Known URLs)

| Category | ID | URL Path | Notes |
| --- | --- | --- | --- |
| Skincare | skincare | /skincare | High-intent, high volume |
| Hair Care | hair-care | /hair-care | |
| Personal Care | personal-care | /personal-care | |
| Vitamins & Supplements | vitamins-supplements | /vitamins-supplements | High-intent per issue |
| Baby Care | baby-care | /baby-care | High-intent per issue |
| Makeup | makeup | /makeup | |
| Fragrances | fragrances | /fragrances | |
| Bath & Body | bath-body | /bath-body | |
| Men's Grooming | mens-grooming | /mens-grooming | |
| Health Care | health-care | /health-care | |

Focus categories per BUY-3133: Skincare, Supplements (Vitamins & Supplements), Oral Care, Baby Care

Note: Oral Care is not a separate top-level category in the list above; it may be nested under Personal Care or Health Care, which has not been confirmed because the site is blocked.


Target API Structure

Based on code analysis, Watsons SG uses a PLP (Product Listing Page) API:

GET /api/catalog/v1/plp/products?path={category_path}&page={page}&pageSize={pageSize}&sort=relevance

Expected API Response Schema (from code analysis)

{
  "products": [...],
  "totalProducts": 0,
  "totalPages": 0
}
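
The pagination implied by the endpoint and the totalPages field can be sketched as follows. This is a minimal illustration, not the code in scrapers/watsons_sg.py: the host, the zero-based first page, and the page size of 32 are assumptions, and plp_url and page_urls are hypothetical helper names.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.watsons.com.sg"  # assumption: API is served from the storefront host
PLP_ENDPOINT = "/api/catalog/v1/plp/products"


def plp_url(category_path: str, page: int, page_size: int = 32) -> str:
    """Build one PLP request URL (hypothetical helper; page size is assumed)."""
    query = urlencode(
        {"path": category_path, "page": page, "pageSize": page_size, "sort": "relevance"}
    )
    return f"{BASE_URL}{PLP_ENDPOINT}?{query}"


def page_urls(category_path: str, total_pages: int, first_page: int = 0) -> list[str]:
    """List the URL of every page, given totalPages from the first response.

    first_page=0 is an assumption; pass 1 if the API turns out to be one-based.
    """
    return [plp_url(category_path, p) for p in range(first_page, first_page + total_pages)]
```

Keeping URL construction separate from fetching means the same helpers can be reused whichever transport (direct HTTP, browser, or proxy) eventually gets through.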

Product Fields (from code analysis)

| Field | Source Key | Notes |
| --- | --- | --- |
| sku | sku or id | Required |
| merchant_id | constant | "watsons_sg" |
| title | name, title, displayName | Required |
| description | description, shortDescription | |
| price | price.value or price.amount | Float |
| currency | constant | "SGD" |
| url | url, productUrl | Full URL |
| image_url | images[0].url | First image |
| category | category_id | Top-level category |
| category_path | [category_name] | Array |
| brand | brand, brandName | |
| is_active | derived | True unless out of stock |
| metadata | various | rating, discount, etc. |
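
The field mapping in the table above can be sketched as a fallback lookup over the candidate source keys. This is an illustration under assumptions, not the actual transform_product() in scrapers/watsons_sg.py: normalize_product and first_present are hypothetical names, the base URL is assumed, and the outOfStock key is invented for the example because the real stock field is unknown.

```python
from typing import Any, Optional

BASE_URL = "https://www.watsons.com.sg"  # assumption: used to absolutize relative URLs


def first_present(raw: dict[str, Any], *keys: str) -> Optional[Any]:
    """Return the first non-empty value among the candidate keys, else None."""
    for key in keys:
        value = raw.get(key)
        if value not in (None, ""):
            return value
    return None


def normalize_product(
    raw: dict[str, Any], category_id: str, category_name: str
) -> Optional[dict[str, Any]]:
    """Map one raw PLP product to the normalized schema (hypothetical sketch)."""
    sku = first_present(raw, "sku", "id")
    title = first_present(raw, "name", "title", "displayName")
    if sku is None or title is None:
        return None  # sku and title are the two fields marked Required

    price_obj = raw.get("price") if isinstance(raw.get("price"), dict) else {}
    price = first_present(price_obj, "value", "amount")
    url = first_present(raw, "url", "productUrl") or ""
    if url.startswith("/"):
        url = BASE_URL + url  # the table asks for a full URL
    images = raw.get("images") or []

    return {
        "sku": str(sku),
        "merchant_id": "watsons_sg",
        "title": title,
        "description": first_present(raw, "description", "shortDescription") or "",
        "price": float(price) if price is not None else None,
        "currency": "SGD",
        "url": url,
        "image_url": images[0].get("url", "") if images else "",
        "category": category_id,
        "category_path": [category_name],
        "brand": first_present(raw, "brand", "brandName") or "",
        # "outOfStock" is a made-up key; the real stock signal is not yet known.
        "is_active": not raw.get("outOfStock", False),
        "metadata": {},
    }
```

Returning None for records that lack sku or title lets the caller count and log rejects instead of emitting partial rows.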

Schema Gaps Identified

  1. No subcategory captured - category_path is a flat [category_name] array with no nested subcategories
  2. Stock status unclear - the API may return only availability text rather than a stock flag
  3. No variant data - product variants (size, color), if any, are not captured
  4. Rating data inconsistent - rating extraction differs across scraper versions

Sample Normalized Output (Design Target)

Based on the transform_product() in scrapers/watsons_sg.py:

{
  "sku": "12345678",
  "merchant_id": "watsons_sg",
  "title": "Laneige Water Sleeping Mask 100ml",
  "description": "Hydrating overnight mask",
  "price": 45.90,
  "currency": "SGD",
  "url": "https://www.watsons.com.sg/skincare/p/12345678",
  "image_url": "https://images.watsons.com/...",
  "category": "skincare",
  "category_path": ["Skincare"],
  "brand": "Laneige",
  "is_active": true,
  "metadata": {
    "rating_score": 4.5,
    "rating_count": 1234,
    "subcategory": "",
    "discount_pct": 20,
    "original_price": 57.00
  }
}
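
A record shaped like the sample above could be checked before ingestion with a small validator. This is a sketch under assumptions: validate_record is a hypothetical helper, and the rules only encode what this document states (sku and title required, constant merchant_id and currency, numeric price, array category_path).

```python
from typing import Any


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a normalized record (empty if none)."""
    problems: list[str] = []

    # sku and title are the fields marked Required in the field table.
    for field in ("sku", "title"):
        if not record.get(field):
            problems.append(f"missing required field: {field}")

    # merchant_id and currency are constants for this merchant.
    if record.get("merchant_id") != "watsons_sg":
        problems.append("merchant_id must be 'watsons_sg'")
    if record.get("currency") != "SGD":
        problems.append("currency must be 'SGD'")

    # price must be a non-negative number (bool is excluded explicitly).
    price = record.get("price")
    if not isinstance(price, (int, float)) or isinstance(price, bool) or price < 0:
        problems.append("price must be a non-negative number")

    if not isinstance(record.get("category_path"), list):
        problems.append("category_path must be an array")

    return problems
```

Collecting every problem rather than raising on the first one makes it easier to report schema gaps across a whole category in a single pass.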

Blocker Summary

| Item | Detail |
| --- | --- |
| Blocker | Cloudflare/Akamai protection on watsons.com.sg |
| Error | HTTP 403 "Access Denied" |
| All Approaches Failed | httpx, Playwright, undetected-chromedriver |
| Root Cause | Cloudflare fingerprinting detects automated requests |
| Solution Required | Residential proxy (e.g., ScraperAPI residential, Bright Data) OR specialized bypass |

Options to Unblock

  1. ScraperAPI Residential Proxy - SCRAPERAPI_KEY already configured in scrapers/watsons_sg_ready.py but not yet tested successfully
  2. Bright Data Residential Proxies - Alternative residential proxy provider
  3. Manual Data Partnership - Direct data feed arrangement with Watsons

Next Steps

  1. Test ScraperAPI residential proxy - scrapers/watsons_sg_ready.py is ready but the SCRAPERAPI_KEY still needs validation
  2. Evaluate Bright Data as alternative proxy provider
  3. Consider data partnership outreach to Watsons Singapore

Files Analyzed

  • scrapers/watsons_sg.py - Main API-based scraper
  • scrapers/watsons_sg_hybrid.py - Playwright + API hybrid
  • scrapers/watsons_sg_undetected.py - undetected-chromedriver approach
  • scrapers/watsons_sg_undetected_v2.py - Enhanced stealth
  • scrapers/watsons_sg_ready.py - ScraperAPI wrapper (ready but untested)
  • src/buywhere/adapters/scraper_watsons_sg.py - Adapter class
  • logs/watsons_sg_scrape_*.log - Previous failed attempts