Watsons SG Category Extraction & Normalization Pass
Issue: BUY-3133
Date: 2026-04-18
Status: Blocked - Cloudflare protection
Executive Summary
Watsons Singapore (watsons.com.sg) is protected by Cloudflare/Akamai anti-bot measures that block all automated access from our infrastructure. All scraping approaches tested — direct HTTP, Playwright, undetected-chromedriver — result in HTTP 403 "Access Denied".
Primary Blocker: Getting past the anti-bot protection requires a residential proxy or specialized bypass infrastructure, neither of which is currently available.
Category Entry Points (Known URLs)
| Category | ID | URL Path | Notes |
|---|---|---|---|
| Skincare | skincare | /skincare | High-intent, high volume |
| Hair Care | hair-care | /hair-care | |
| Personal Care | personal-care | /personal-care | |
| Vitamins & Supplements | vitamins-supplements | /vitamins-supplements | High-intent per issue |
| Baby Care | baby-care | /baby-care | High-intent per issue |
| Makeup | makeup | /makeup | |
| Fragrances | fragrances | /fragrances | |
| Bath & Body | bath-body | /bath-body | |
| Men's Grooming | mens-grooming | /mens-grooming | |
| Health Care | health-care | /health-care | |
Focus categories per BUY-3133: Skincare, Supplements (Vitamins & Supplements), Oral Care, Baby Care
Note: Oral Care is not a separate top-level category - it may be nested under Personal Care or Health Care.
Target API Structure
Based on code analysis, Watsons SG uses a PLP (Product Listing Page) API:
GET /api/catalog/v1/plp/products?path={category_path}&page={page}&pageSize={pageSize}&sort=relevance
Expected API Response Schema (from code analysis)
{
  "products": [...],
  "totalProducts": 0,
  "totalPages": 0
}
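A minimal sketch of how the request URL and page loop could be assembled from the endpoint and response shape above. The host, the function names, the default page size of 24, and the zero-based first page are all assumptions for illustration and have not been verified against the live API. The `fetch` callable is a hypothetical stand-in for whatever HTTP client is used, so the sketch makes no network calls itself.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

BASE_URL = "https://www.watsons.com.sg"  # assumption: API is served from the storefront host
PLP_ENDPOINT = "/api/catalog/v1/plp/products"


def build_plp_url(category_path: str, page: int, page_size: int = 24) -> str:
    """Build the PLP request URL for one page of a category."""
    query = urlencode(
        {"path": category_path, "page": page, "pageSize": page_size, "sort": "relevance"}
    )
    return f"{BASE_URL}{PLP_ENDPOINT}?{query}"


def iter_products(
    category_path: str,
    fetch: Callable[[str], dict],  # hypothetical: takes a URL, returns parsed JSON
    first_page: int = 0,           # assumption: page numbering may start at 1 instead
) -> Iterator[dict]:
    """Yield products page by page until totalPages is exhausted."""
    page = first_page
    while True:
        payload = fetch(build_plp_url(category_path, page))
        yield from payload.get("products", [])
        page += 1
        # Stop once the number of pages fetched reaches the reported total.
        if page - first_page >= payload.get("totalPages", 0):
            break
```

Injecting `fetch` keeps the pagination logic testable offline, which matters while live requests still return 403.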
Product Fields (from code analysis)
| Field | Source Key | Notes |
|---|---|---|
| sku | sku or id | Required |
| merchant_id | constant | "watsons_sg" |
| title | name, title, displayName | Required |
| description | description, shortDescription | |
| price | price.value or price.amount | Float |
| currency | constant | "SGD" |
| url | url, productUrl | Full URL |
| image_url | images[0].url | First image |
| category | category_id | Top-level category |
| category_path | [category_name] | Array |
| brand | brand, brandName | |
| is_active | derived | True unless out of stock |
| metadata | various | rating, discount, etc. |
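The fallback-key mapping in the table above could be sketched roughly as follows. This is an illustration only, not the actual `transform_product()` in `scrapers/watsons_sg.py`: the helper names `first_present`, `extract_price`, and `map_core_fields` are hypothetical, and the handling of a scalar `price` is an assumption that goes beyond what the table states.

```python
from typing import Any, Optional


def first_present(raw: dict, *keys: str) -> Optional[Any]:
    """Return the first non-empty value found under any of the candidate keys."""
    for key in keys:
        value = raw.get(key)
        if value not in (None, ""):
            return value
    return None


def extract_price(raw: dict) -> Optional[float]:
    """Read price.value, falling back to price.amount, and coerce to float."""
    price = raw.get("price")
    if isinstance(price, dict):
        value = price.get("value", price.get("amount"))
    else:
        value = price  # assumption: tolerate a bare number or numeric string
    return float(value) if value is not None else None


def map_core_fields(raw: dict, category_id: str) -> dict:
    """Map a raw PLP product dict onto the normalized field names."""
    images = raw.get("images") or []
    return {
        "sku": first_present(raw, "sku", "id"),
        "merchant_id": "watsons_sg",
        "title": first_present(raw, "name", "title", "displayName"),
        "description": first_present(raw, "description", "shortDescription"),
        "price": extract_price(raw),
        "currency": "SGD",
        "url": first_present(raw, "url", "productUrl"),
        "image_url": images[0].get("url") if images else None,
        "category": category_id,
        "brand": first_present(raw, "brand", "brandName"),
    }
```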
Schema Gaps Identified
- No subcategory field captured - category_path is flat [category_name], no nested subcategories
- Stock status unclear - API may not return stock, only availability text
- No variant data - If products have variants (size, color), not captured
- Rating data inconsistent - Rating extraction varies across scraper versions
Sample Normalized Output (Design Target)
Based on the transform_product() in scrapers/watsons_sg.py:
{
  "sku": "12345678",
  "merchant_id": "watsons_sg",
  "title": "Laneige Water Sleeping Mask 100ml",
  "description": "Hydrating overnight mask",
  "price": 45.90,
  "currency": "SGD",
  "url": "https://www.watsons.com.sg/skincare/p/12345678",
  "image_url": "https://images.watsons.com/...",
  "category": "skincare",
  "category_path": ["Skincare"],
  "brand": "Laneige",
  "is_active": true,
  "metadata": {
    "rating_score": 4.5,
    "rating_count": 1234,
    "subcategory": "",
    "discount_pct": 20,
    "original_price": 57.00
  }
}
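Once live data is available, records shaped like the sample above could be checked before ingestion. The sketch below is a hypothetical validator, not existing project code: it enforces only the two fields marked "Required" in the product fields table, plus basic type checks on `price` and `category_path` that are assumptions drawn from the sample.

```python
REQUIRED_FIELDS = ("sku", "title")  # marked "Required" in the product fields table


def validation_errors(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    price = record.get("price")
    if price is not None and not isinstance(price, (int, float)):
        errors.append("price must be numeric")  # assumption: price is stored as a float
    if not isinstance(record.get("category_path", []), list):
        errors.append("category_path must be a list")
    return errors
```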
Blocker Summary
| Item | Detail |
|---|---|
| Blocker | Cloudflare/Akamai protection on watsons.com.sg |
| Error | HTTP 403 "Access Denied" |
| All Approaches Failed | httpx, Playwright, undetected-chromedriver |
| Root Cause | Cloudflare fingerprinting detects automated requests |
| Solution Required | Residential proxy (e.g., ScraperAPI residential, Bright Data) OR specialized bypass |
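To keep block detection consistent across the different scraper variants, the failure signature in the table above could be wrapped in one small helper. This is a heuristic sketch under the assumption that a block always shows up as either status 403 or an "Access Denied" body; the function name `looks_blocked` is hypothetical, and other block patterns (challenge pages, redirects) are not covered.

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic: True when a response matches the reported block signature."""
    if status_code == 403:
        return True
    # Some block pages may come back with a different status but the same text.
    return "access denied" in body.lower()
```

A caller could use this to stop paginating early and log the attempt as blocked, instead of treating an error page as an empty product list.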
Options to Unblock
- ScraperAPI Residential Proxy - SCRAPERAPI_KEY already configured in scrapers/watsons_sg_ready.py but not yet tested successfully
- Bright Data Residential Proxies - Alternative residential proxy provider
- Manual Data Partnership - Direct data feed arrangement with Watsons
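For the ScraperAPI option, a request is typically routed by wrapping the target URL in a call to the provider's endpoint. The sketch below only builds that URL and sends nothing. The endpoint and the `api_key`, `url`, `country_code`, and `premium` parameter names follow ScraperAPI's query-string interface as commonly documented, but should be confirmed against the current documentation and the account plan. The function name and defaults are hypothetical and not taken from `scrapers/watsons_sg_ready.py`.

```python
from urllib.parse import urlencode

SCRAPERAPI_ENDPOINT = "https://api.scraperapi.com/"  # verify against current ScraperAPI docs


def scraperapi_url(
    target_url: str,
    api_key: str,
    country_code: str = "sg",  # assumption: geotargeting Singapore helps here
    premium: bool = True,      # assumption: premium pool is what provides residential IPs
) -> str:
    """Wrap a target URL so the request is routed through ScraperAPI."""
    params = {"api_key": api_key, "url": target_url, "country_code": country_code}
    if premium:
        params["premium"] = "true"
    return f"{SCRAPERAPI_ENDPOINT}?{urlencode(params)}"
```

The key itself would come from wherever `SCRAPERAPI_KEY` is configured. Fetching one category page through the wrapped URL and checking for a non-403 response would be a cheap way to validate the key before a full run.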
Next Steps
- Test ScraperAPI residential proxy - The watsons_sg_scraperapi.py is ready but needs key validation
- Evaluate Bright Data as alternative proxy provider
- Consider data partnership outreach to Watsons Singapore
Files Analyzed
- scrapers/watsons_sg.py - Main API-based scraper
- scrapers/watsons_sg_hybrid.py - Playwright + API hybrid
- scrapers/watsons_sg_undetected.py - undetected-chromedriver approach
- scrapers/watsons_sg_undetected_v2.py - Enhanced stealth
- scrapers/watsons_sg_ready.py - ScraperAPI wrapper (ready but untested)
- src/buywhere/adapters/scraper_watsons_sg.py - Adapter class
- logs/watsons_sg_scrape_*.log - Previous failed attempts