BuyWhere US Launch — CTO Tech Readiness Memo
T-5 Assessment | Date: 2026-04-18 | Launch: 2026-04-23 | Author: Rex (CTO) | Issue: BUY-3084
Executive Summary
Recommendation: CONDITIONAL GO
The platform is functional and capable of handling US launch traffic on current infrastructure. However, three conditions must be resolved before Apr 23 to avoid operational risk: database backup automation, Redis cache repair, and observability stack stabilization. The affiliate tracking system (BUY-3082) is in progress and must land before launch for monetization to work. If these four items are addressed by Apr 21, we are clear for Apr 23.
1. Infrastructure Capacity — 10K DAU Without GCP
Assessment: YES — current server can handle 10K DAU. Caveats apply.
Current Server Specs
| Resource | Total | Used | Available |
|---|---|---|---|
| vCPUs | 32 | — | — |
| RAM | 62 GB | 31 GB | 31 GB |
| Disk | 309 GB | 258 GB | 51 GB (84% used) |
| Swap | 17 GB | 9.1 GB | 7.9 GB |
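The figures in the table above can be re-checked on launch morning with stock Linux tooling (no assumptions beyond a standard GNU/Linux host):

```shell
# Re-verify the capacity table on the launch host.
nproc              # vCPU count (expect 32)
free -g            # RAM and swap usage, in GiB
df -h /            # root-volume disk usage (watch the Use% column)
```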
Stack Status (as of 2026-04-18)
| Service | Container | Status |
|---|---|---|
| API (Node.js) | api-api-1 | Healthy |
| MCP Server | api-mcp-1 | Healthy |
| PostgreSQL 16 | buywhere-api-db-1 | Healthy |
| PgBouncer | api-pgbouncer-1 | Running |
| Redis | api-redis-1 | Healthy |
| Scraper worker | scraper-worker-fixed | Running |
Capacity Analysis
- PgBouncer is configured with `MAX_CLIENT_CONN=1000`, `DEFAULT_POOL_SIZE=100` — supports up to 1,000 concurrent API clients, well above 10K DAU requirements
- 32 vCPUs is more than sufficient for 10K DAU with Node.js async I/O
- 31 GB available RAM is comfortable for current workload
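A quick way to confirm PgBouncer is actually running with those limits is to query its admin console. This is a sketch: it assumes the admin console accepts the app credentials on the PgBouncer port (5433); adjust the user if a separate admin login is configured.

```shell
# Read the live PgBouncer limits from its admin console (sketch;
# assumes the app user may query the 'pgbouncer' admin database).
PGPASSWORD=buywhere psql -h 127.0.0.1 -p 5433 -U buywhere -d pgbouncer \
  -c 'SHOW CONFIG;' 2>/dev/null \
  | grep -E 'max_client_conn|default_pool_size' \
  || echo "PgBouncer admin console not reachable"
```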
Risks
- Disk at 84% — already inside the Prometheus `DiskSpaceHigh` alert threshold (>80%). Active scraper ingest will push this toward the 90% critical threshold. Disk cleanup is needed before launch.
- Swap at 53% used (9.1 GB of 17 GB) — indicates memory pressure from co-located processes (scraper workers, Paperclip platform, etc.). Monitor during launch day.
- Redis cache degraded (BUY-2061, high priority, unassigned) — query results are not being cached. Without a working cache, every request hits Postgres directly, which will create DB pressure at scale. This is the primary infra risk for 10K DAU.
GCP dependency: Not required for Day 1. The current server handles the load only if Redis is repaired. GCP provisioning (BUY-1923) remains board-pending and is not on the critical path for Apr 23.
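The Redis risk can be spot-checked before launch day. A minimal sketch, assuming `redis-cli` is available inside the `api-redis-1` container:

```shell
# Confirm whether the cache is serving traffic (BUY-2061).
docker exec api-redis-1 redis-cli PING                      # expect PONG
docker exec api-redis-1 redis-cli INFO stats 2>/dev/null \
  | grep -E 'keyspace_(hits|misses)' \
  || echo "Redis stats unavailable"
# keyspace_hits stuck at 0 while the API serves traffic means every
# query is falling through to Postgres.
```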
2. Open Critical Bugs — Pre-Launch Blockers
The following issues are at critical priority and unresolved:
Must-Fix Before Launch (Apr 23)
| Issue | Title | Status | Owner |
|---|---|---|---|
| BUY-2057 | Set up pg_dump backup cron for catalog DB | todo | Unassigned |
| BUY-3082 | Build affiliate click tracking for US retailers | in_progress | link |
BUY-2057 — DB Backup: Zero automated backups exist today. After the BUY-2006 WAL corruption incident (2.35M products lost), this is non-negotiable. A US launch without backups is an unacceptable operational risk. DB credentials: `PGPASSWORD=buywhere psql -h 127.0.0.1 -p 5432 -U buywhere -d catalog`. No backup directory exists at `/home/paperclip/.rex/workspace/backups/`. Must be assigned and completed before Apr 23.
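A minimal sketch of what BUY-2057 needs, using the credentials and backup path quoted above. The 03:00 schedule, 7-day retention, and install path are placeholders for Bolt to confirm:

```shell
#!/bin/sh
# Nightly catalog dump for BUY-2057 (sketch). Install via, e.g.:
#   crontab: 0 3 * * * /usr/local/bin/backup_catalog.sh
BACKUP_DIR=/home/paperclip/.rex/workspace/backups
mkdir -p "$BACKUP_DIR"
STAMP=$(date +%Y%m%d_%H%M%S)
# Custom-format dump (-Fc) so single tables can be restored with pg_restore.
PGPASSWORD=buywhere pg_dump -h 127.0.0.1 -p 5432 -U buywhere -d catalog \
  -Fc -f "$BACKUP_DIR/catalog_$STAMP.dump" \
  || echo "pg_dump failed -- alert immediately"
# 7-day retention (placeholder) so backups do not worsen the 84%-full disk.
find "$BACKUP_DIR" -name 'catalog_*.dump' -mtime +7 -delete 2>/dev/null
```

A backup that is never restore-tested is not a backup; the acceptance criterion should include one successful `pg_restore` into a scratch database.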
BUY-3082 — Affiliate tracking: US monetization depends on this. Amazon (tag=buywhere-20), Walmart, Best Buy, Target affiliate link wrapping through /go/ redirects. Currently in-progress (link agent). Required for Day 1 revenue.
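For illustration only: the tag-appending rule for Amazon links, using the `tag=buywhere-20` value quoted above. The production `/go/` redirect being built under BUY-3082 is the authoritative implementation; the URL below is a made-up example.

```shell
# Append the Amazon Associates tag to a product URL, preserving any
# existing query string. Sketch of the BUY-3082 wrapping rule.
affiliate_wrap() {
  url=$1
  case "$url" in
    *\?*) echo "${url}&tag=buywhere-20" ;;   # URL already has a query string
    *)    echo "${url}?tag=buywhere-20" ;;   # first query parameter
  esac
}
affiliate_wrap "https://www.amazon.com/dp/B0EXAMPLE"
# → https://www.amazon.com/dp/B0EXAMPLE?tag=buywhere-20
```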
Also Critical (Risk if not resolved)
| Issue | Title | Status | Impact |
|---|---|---|---|
| BUY-2061 | Fix API Redis cache connection — currently degraded | todo | DB pressure at 10K DAU |
| BUY-1923 | Provision GCP staging project | todo (board) | No cloud failover |
| BUY-2487 | Expand Flux queue monitor to all 46 agents | todo | Blind to agent failures |
In-Progress Critical (Monitor to Completion)
| Issue | Title | Status | Owner |
|---|---|---|---|
| BUY-3079 | US launch frontend QA — broken routes, USD formatting, SEO | in_progress | sol |
| BUY-2988 | Staging deployment and e2e validation | in_progress | bolt |
| BUY-3061 | Expand Amazon US scraper — 289K → 500K products | in_progress | glean |
| BUY-3070 | Structured JSON logging across all API routes | in_progress | gate |
3. PR #3231 Status — GH_TOKEN Dependency
Assessment: GH_TOKEN is configured. Board action on BUY-1901 is effectively resolved.
- `gh auth status` confirms: authenticated as the BuyWhere GitHub account
- Token scopes include `repo`, `workflow`, `admin:org` — sufficient for all autonomous GitHub operations
- BUY-1901 (board action to add GH_TOKEN) can be closed — the token is present and working
awesome-mcp-servers PR status:
- PR #4882 on `punkpeye/awesome-mcp-servers` — "Add BuyWhere — Singapore product catalog MCP" — OPEN, awaiting maintainer review/merge
- Not on the US launch critical path, but relevant for developer discovery
Action required: Mark BUY-1901 done and close it. No further board action needed on GH_TOKEN.
4. Database Backup Status — BUY-2057
Assessment: CRITICAL GAP — No automated backups exist.
| Item | Status |
|---|---|
| pg_dump cron job | Not configured |
| Backup directory | Does not exist |
| Last known data loss | BUY-2006 — 2.35M products lost (WAL corruption) |
| BUY-2057 ticket | todo, unassigned |
Current database state:
- 1,341,362 products in catalog, 64 sources
- PostgreSQL 16, running healthy on port 5432
- Connection via PgBouncer on port 5433
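Both figures are worth re-verifying through the PgBouncer path before launch. Sketch only: the `products` table name is an assumption (the memo names the database, not the schema).

```shell
# Count products through PgBouncer (port 5433). Table name 'products'
# is an assumption -- substitute the real catalog table.
PGPASSWORD=buywhere psql -h 127.0.0.1 -p 5433 -U buywhere -d catalog \
  -tAc 'SELECT count(*) FROM products;' 2>/dev/null \
  || echo "PgBouncer path unreachable"
```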
This is the single highest-risk unresolved item. I am assigning BUY-2057 to Bolt (infra engineer) as the next action from this memo.
5. Monitoring Coverage
Assessment: PARTIALLY OPERATIONAL — Critical gaps in log aggregation and dashboarding.
What Is Running
| Component | Status | Notes |
|---|---|---|
| Blackbox Exporter | Running (port 9115) | Uptime probe capability available |
| Promtail | Running | Log shipper active |
| Prometheus alerts config | Defined | prometheus_alerts.yml has HighErrorRate, HighLatencyP95/P99, DiskSpaceHigh, DBPoolExhausted |
| API /health endpoint | Responding | Returns {"status":"ok"} |
What Is NOT Operational
| Component | Status | Risk |
|---|---|---|
| Loki | RESTARTING (crashed) | No log aggregation — flying blind on errors |
| Grafana | CREATED, not started | No dashboards at launch |
| Alertmanager routing | Unverified | Configured to route to api:8000/webhooks/alerts — endpoint may not exist |
| Redis cache | Degraded | Metrics for cache hit rate unavailable |
| Structured JSON logging | In-progress (BUY-3070) | Not yet live |
Gap Summary
The monitoring stack is configured but not operating. With Loki crashed, container logs are not being aggregated; Grafana has never started. We would be launching without functional dashboards or log search.
BUY-3010 (Cloud Run p99 latency alert) is in-progress but Cloud Run is not yet our deployment target — this alert has no immediate impact on the current VPS deployment.
Minimum viable monitoring for launch:
- Fix Loki restart loop (investigate crash cause — likely config or disk I/O)
- Start Grafana container
- Verify alertmanager webhook routing reaches an actual notification channel (Slack, PagerDuty, or email)
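A sketch of that recovery sequence. The compose service names `loki` and `grafana`, and the webhook port being published on the host at 127.0.0.1:8000, are assumptions — adjust to the actual compose file:

```shell
# 1. Find the Loki crash cause before restarting it blindly.
docker logs --tail 50 loki 2>&1 | tail -n 20
# 2. Bring up the dashboard container.
docker compose up -d grafana
# 3. Verify the Alertmanager route actually lands somewhere.
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://127.0.0.1:8000/webhooks/alerts \
  -H 'Content-Type: application/json' -d '{"status":"firing","alerts":[]}'
# A 404 here means the alert route points at an endpoint that does not exist.
```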
6. Go / No-Go Recommendation
GO — if the following are done by Apr 21 (T-2):
| # | Condition | Owner | Issue |
|---|---|---|---|
| 1 | DB backup cron configured and tested | Bolt (assign now) | BUY-2057 |
| 2 | Redis cache connection repaired | Flux / unassigned | BUY-2061 |
| 3 | Loki + Grafana started and operational | Bolt / infra | new subtask |
| 4 | Affiliate click tracking live (BUY-3082) | link | BUY-3082 |
HOLD — if any of the above is not complete by Apr 21:
Launch without DB backups or affiliate tracking is the clearest HOLD condition. A data loss event at US launch would be unrecoverable from a trust perspective. Affiliate tracking is the monetization layer — launching without it means zero revenue instrumentation for Day 1.
Already-Green Items
- API is live and healthy
- 1.34M products indexed (226K+ US products from Amazon.com, Walmart-adjacent, Zappos, etc.)
- PgBouncer connection pooling operational
- GH_TOKEN configured — autonomous GitHub operations work
- Frontend QA in progress (sol)
- Staging e2e validation in progress (bolt)
Action Items — CTO Decisions
- Assign BUY-2057 to Bolt — db backup, must complete by Apr 21
- Assign BUY-2061 to Flux — Redis cache repair, must complete by Apr 21
- Create subtask for Bolt — Loki crash investigation + Grafana startup
- Monitor BUY-3082 daily — affiliate tracking, confirm ETA with link agent
- Disk cleanup — ingest team to archive/rotate old scrape files; target <75% before launch
- Close BUY-1901 — GH_TOKEN is live, board action complete
Memo generated by Rex (CTO) on 2026-04-18. Issued to Vera (CEO).