SSL Certificate Renewal Runbook
Overview
This runbook covers the automated SSL certificate renewal system for BuyWhere production infrastructure.
Domain Inventory
| Domain | Type | Certificate | Renewal Mechanism | Monitored |
|---|---|---|---|---|
| api.buywhere.ai | Production API | Let's Encrypt | GitHub Actions (certbot) | Yes |
| docs.buywhere.ai | Documentation | Same cert as api.buywhere.ai | Same renewal | Yes |
| api-staging.buywhere.io | Staging API | Let's Encrypt | cert-manager (Kubernetes) | Yes |
| buywhere.ai | Marketing site | External | N/A | Yes (blackbox only) |
Architecture
Production (EC2)
- Certificate Authority: Let's Encrypt
- Client: certbot (webroot plugin)
- Automation: GitHub Actions (.github/workflows/ssl-renewal.yml)
- Schedule: Every 12 hours (0 */12 * * *)
- Webroot: /var/www/html
- Certificate path: /etc/letsencrypt/live/api.buywhere.ai/
Staging (Kubernetes)
- Certificate Authority: Let's Encrypt
- Client: cert-manager
- Issuer: ClusterIssuer (letsencrypt-prod)
- Certificate duration: 90 days
- Renew before: 30 days
Automated Renewal Process
Production EC2
- GitHub Actions workflow triggers every 12 hours
- The workflow connects over SSH to the production EC2 instance
- Runs ssl_renewal.sh, which:
  - Checks the certificate expiry date
  - If fewer than 30 days remain, runs certbot renew
  - Reloads nginx on success
- Prometheus blackbox exporter monitors certificate expiry
- Alerts fire if certificate expires within 30 days
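The check-then-renew steps above can be sketched in shell roughly as follows. This is a minimal illustration, not the contents of the real ssl_renewal.sh: the function names, the CERT_FILE and THRESHOLD_DAYS variables, and the exact certbot invocation are assumptions, and cert_expiry_epoch relies on GNU date.

```shell
# Hypothetical sketch of the expiry check; the real ssl_renewal.sh may differ.
CERT_FILE="${CERT_FILE:-/etc/letsencrypt/live/api.buywhere.ai/fullchain.pem}"
THRESHOLD_DAYS="${THRESHOLD_DAYS:-30}"

# Whole days between now ($2) and expiry ($1), both in epoch seconds.
# Shell division truncates toward zero, so a partial day counts as 0.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# Succeeds (exit 0) when fewer than THRESHOLD_DAYS days remain.
needs_renewal() {
  [ "$1" -lt "$THRESHOLD_DAYS" ]
}

# Expiry of a PEM certificate as epoch seconds (assumes GNU date for -d).
cert_expiry_epoch() {
  not_after="$(openssl x509 -in "$1" -noout -enddate)" || return 1
  date -d "${not_after#notAfter=}" +%s
}

# Renew and reload nginx only when the certificate is inside the window.
renew_if_due() {
  expiry="$(cert_expiry_epoch "$CERT_FILE")" || return 1
  remaining="$(days_left "$expiry" "$(date +%s)")"
  echo "Certificate expires in ${remaining} days"
  if needs_renewal "$remaining"; then
    certbot renew && systemctl reload nginx
  fi
}
```

Sourcing this file and calling renew_if_due as root would perform one pass. Keeping the arithmetic in days_left and needs_renewal separate from the certbot call makes the threshold logic easy to test without touching a real certificate.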
Staging Kubernetes
- cert-manager automatically checks certificate expiry
- Renews 30 days before expiration
- No manual intervention required
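To confirm that cert-manager is tracking the staging certificate, the Certificate resource can be inspected with kubectl. A minimal sketch: the function names are hypothetical, and the Certificate name and namespace are passed as arguments because this runbook does not record them.

```shell
# Hypothetical helpers; pass the real Certificate name ($1) and namespace ($2).

# Show the READY status of a Certificate resource.
cert_status() {
  kubectl get certificate "$1" -n "$2"
}

# Print the time at which cert-manager plans to renew the certificate.
cert_renewal_time() {
  kubectl get certificate "$1" -n "$2" -o jsonpath='{.status.renewalTime}'
}

# Print the expiry time of the currently issued certificate.
cert_not_after() {
  kubectl get certificate "$1" -n "$2" -o jsonpath='{.status.notAfter}'
}
```

With a 90-day duration and a 30-day renew-before window, the reported renewal time should fall about 60 days after issuance.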
Monitoring & Alerting
Metrics
ssl_cert_expiry_seconds{instance="api.buywhere.ai"} - Prometheus blackbox
ssl_cert_expiry_seconds{instance="api.buywhere.ai"} - node_exporter textfile
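The node_exporter textfile metric could be produced by a small script that writes a .prom file. The sketch below is an assumption about how that might look, not the script running in production: the function names and the output path are hypothetical, and the destination directory must match whatever node_exporter's --collector.textfile.directory flag points to on the host.

```shell
# Print the metric in Prometheus text exposition format.
# $1 = instance label, $2 = seconds until expiry (negative once expired).
format_expiry_metric() {
  printf '# HELP ssl_cert_expiry_seconds Seconds until the certificate expires.\n'
  printf '# TYPE ssl_cert_expiry_seconds gauge\n'
  printf 'ssl_cert_expiry_seconds{instance="%s"} %s\n' "$1" "$2"
}

# Write to a temp file and rename it, so node_exporter never reads a partial file.
# $3 = destination .prom file inside the textfile collector directory.
write_expiry_metric() {
  tmp="$3.$$"
  format_expiry_metric "$1" "$2" > "$tmp" && mv "$tmp" "$3"
}
```

The temporary name does not end in .prom, so the textfile collector ignores it until the rename completes.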
Alerts
| Alert | Expression | Severity |
|---|---|---|
| SSLCertExpiringSoon | ssl_cert_expiry_seconds{job="blackbox"} < 2592000 | warning |
| SSLCertExpired | ssl_cert_expiry_seconds{job="blackbox"} < 0 | critical |
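The warning threshold of 2592000 is 30 days expressed in seconds (30 * 24 * 3600). The sketch below derives that number and runs the same expression against the Prometheus HTTP query API. PROM_URL and the function names are assumptions, not values taken from this runbook.

```shell
# Base URL of the Prometheus server (hypothetical default).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Convert days to seconds.
days_to_seconds() {
  echo $(( $1 * 24 * 3600 ))
}

# Build the alert expression for a threshold given in days.
expiry_query() {
  printf 'ssl_cert_expiry_seconds{job="blackbox"} < %s\n' "$(days_to_seconds "$1")"
}

# Ask Prometheus which series currently match the expression.
query_expiring() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(expiry_query "$1")"
}
```

Calling query_expiring 30 returns JSON whose result list is empty when no monitored certificate is inside the 30-day window.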
Manual Renewal Procedure
Production EC2
If automated renewal fails:
# SSH to production instance
ssh ubuntu@<production-ip>
# Check certificate status
sudo /home/ubuntu/scripts/ssl_renewal.sh check
# Force renewal if needed
sudo /home/ubuntu/scripts/ssl_renewal.sh renew
# Verify
sudo /home/ubuntu/scripts/ssl_renewal.sh check
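As an extra check that does not depend on the script, the certificate actually being served can be read from any machine with openssl. A sketch; the function name is hypothetical.

```shell
# Print the validity dates of the certificate served on port 443.
# $1 = hostname, for example api.buywhere.ai
served_cert_dates() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -dates
}
```

If the notAfter date printed here is older than the one on disk, nginx has most likely not been reloaded since the renewal.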
Emergency Manual Renewal (certbot)
# SSH to production instance
ssh ubuntu@<production-ip>
# Ensure webroot exists
sudo mkdir -p /var/www/html
# Run certbot directly
sudo certbot renew --force-renewal --webroot -w /var/www/html
# Reload nginx
sudo systemctl reload nginx
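Because --force-renewal requests a new certificate every time it runs, repeated attempts can run into Let's Encrypt rate limits. A more cautious sequence is to rehearse with certbot's --dry-run flag first, which exercises the challenge without saving a certificate, and to force the real renewal only if the rehearsal passes. The wrapper name below is hypothetical.

```shell
# Rehearse the webroot challenge, then force the real renewal only if it passes.
safe_force_renew() {
  sudo certbot renew --dry-run --webroot -w /var/www/html || return 1
  sudo certbot renew --force-renewal --webroot -w /var/www/html || return 1
  sudo systemctl reload nginx
}
```

If the dry run fails, the function stops before touching the live certificate, so the webroot or nginx configuration can be fixed first.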
acme.sh vs certbot Comparison
| Feature | certbot | acme.sh |
|---|---|---|
| Installation | apt/pip | curl/bash |
| Root required | Yes (usually) | No (can run as user) |
| Renewal method | webroot, dns, nginx | all certbot methods + DNS API |
| Size | Larger (full Python stack) | Minimal (single shell script) |
| Maintenance | EFF maintained | Active community |
| Current setup | Yes (production) | No |
Recommendation: Keep certbot for now. It is well maintained, has extensive documentation, and integrates well with our existing GitHub Actions workflow. Migrating to acme.sh would provide little benefit for our use case.
Troubleshooting
Certificate not found
# Check certificate directory
ls -la /etc/letsencrypt/live/
# Verify certificate contents
sudo openssl x509 -in /etc/letsencrypt/live/api.buywhere.ai/fullchain.pem -noout -dates
certbot not installed
# Install certbot
sudo apt update
sudo apt install certbot python3-certbot-nginx
# Or use snap (certbot requires classic confinement)
sudo snap install --classic certbot
Webroot challenge failing
# Verify webroot is accessible
ls -la /var/www/html/
# Test nginx configuration
sudo nginx -t
# Check nginx is running
sudo systemctl status nginx
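One way to test the webroot path end to end is to place a throwaway file under .well-known/acme-challenge and fetch it over plain HTTP, which is the path Let's Encrypt requests during an HTTP-01 challenge. A sketch; the function name and the token file name are made up for illustration, and the caller needs write access to the webroot.

```shell
# $1 = hostname, $2 = webroot directory (defaults to /var/www/html)
check_webroot() {
  webroot="${2:-/var/www/html}"
  token="runbook-test-$$"
  dir="$webroot/.well-known/acme-challenge"
  mkdir -p "$dir" && echo "ok" > "$dir/$token" || return 1
  # -f fails on HTTP errors such as 403 or 404; -L follows redirects.
  body="$(curl -fsSL "http://$1/.well-known/acme-challenge/$token")"
  status=$?
  rm -f "$dir/$token"
  [ "$status" -eq 0 ] && [ "$body" = "ok" ]
}
```

A non-zero exit status means the file written to the webroot was not what came back over HTTP, which usually points at the nginx location block or a mismatched webroot path.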
GitHub Actions renewal failing
- Check workflow logs at: https://github.com/<org>/buywhere-api/actions/workflows/ssl-renewal.yml
- Verify AWS credentials are valid
- Verify SSH key for EC2 access is not expired
- Check production instance is reachable
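Some of these checks can also be run from a terminal. The sketch below assumes the GitHub CLI (gh) is installed, authenticated, and run from a clone of the repository; the function names are hypothetical.

```shell
# List the most recent runs of the renewal workflow.
recent_renewal_runs() {
  gh run list --workflow ssl-renewal.yml --limit 5
}

# Show the logs of failed steps for one run. $1 = run ID from the list above.
failed_run_log() {
  gh run view "$1" --log-failed
}

# Confirm the instance accepts the SSH key without prompting. $1 = host or IP.
check_ssh() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "ubuntu@$1" true
}
```

BatchMode makes ssh fail immediately instead of asking for a password, so a rejected key shows up as a non-zero exit status rather than a hung prompt.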
Emergency Contacts
| Role | Contact | Escalation |
|---|---|---|
| Primary Ops | Paperclip/Ops Agent | PagerDuty |
| Secondary Ops | Bolt (Engineering Lead) | Slack #incidents |
| AWS Infrastructure | AWS Support | AWS Console |
Related Documentation