Last month, a 47-minute outage cost one of our clients $180,000 in lost revenue. That’s $3,830 per minute.
For a SaaS company doing $25M ARR, the math is straightforward:
- Annual revenue: $25,000,000
- Revenue per minute: ~$47.50
- But during peak hours? Multiply by 3-5x
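The per-minute figure is just annual revenue divided by the minutes in a year. Here is a minimal Python sketch of that arithmetic. The function names and the `peak_multiplier` parameter are illustrative assumptions, and it assumes revenue is spread evenly across the year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def revenue_per_minute(annual_revenue: float) -> float:
    """Average revenue per minute, assuming revenue is spread evenly."""
    return annual_revenue / MINUTES_PER_YEAR


def outage_cost(annual_revenue: float, minutes_down: float,
                peak_multiplier: float = 1.0) -> float:
    """Direct revenue lost during an outage.

    peak_multiplier is a hypothetical knob for outages that land in busy
    hours (the 3-5x range mentioned above).
    """
    return revenue_per_minute(annual_revenue) * minutes_down * peak_multiplier


if __name__ == "__main__":
    arr = 25_000_000
    print(f"Average: ${revenue_per_minute(arr):.2f} per minute")
    print(f"60 minutes down at 4x peak: ${outage_cost(arr, 60, 4.0):,.0f}")
```

This covers direct revenue only. The hidden costs in the next section come on top of it.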
The Hidden Costs
Direct revenue loss is just the beginning. Downtime also costs you:
Customer trust: It takes months to earn, minutes to lose. One bad outage can trigger a wave of churn.
Engineering time: Your team drops everything to firefight. That feature they were shipping? Delayed.
Reputation: Every outage is a story customers tell. “Their platform went down during our biggest sales day.”
Contractual penalties: Many B2B contracts include SLAs with financial penalties for breaches.
What Does “High Availability” Actually Mean?
Let’s translate the nines:
| Uptime | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.2 hours |
| 99.9% | 8.76 hours | 43.2 minutes |
| 99.95% | 4.38 hours | 21.6 minutes |
| 99.99% | 52.6 minutes | 4.3 minutes |
For most businesses, 99.9% is the minimum. E-commerce or fintech? Aim for 99.95% or higher.
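If you want the downtime budget for a target that isn't in the table, the calculation is short. This is a sketch that assumes a 365-day year and a 30-day month, which is what the table's figures correspond to. The function name is made up for illustration.

```python
def downtime_budget(availability_pct: float) -> dict:
    """Allowed downtime for an availability target given as a percentage.

    Assumes a 365-day year and a 30-day month.
    """
    down_fraction = 1 - availability_pct / 100
    return {
        "hours_per_year": down_fraction * 365 * 24,
        "minutes_per_month": down_fraction * 30 * 24 * 60,
    }


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99):
        budget = downtime_budget(target)
        print(f"{target}%: {budget['hours_per_year']:.2f} h/year, "
              f"{budget['minutes_per_month']:.1f} min/month")
```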
The Architecture of High Availability
1. Eliminate Single Points of Failure
If one component failing takes down your whole system, that’s your single point of failure (SPOF).
Common SPOFs:
- Single database instance
- One load balancer
- Sole availability zone
- Critical service with no redundancy
The fix: Redundancy at every layer.
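# Terraform: Multi-AZ RDS with automatic failover
resource "aws_db_instance" "main" {
  identifier              = "myapp-db"
  multi_az                = true # standby replica in a second AZ
  backup_retention_period = 7    # days of automated backups
  # Failover to the standby is automatic and typically takes 60-120 seconds.
  # Required arguments (engine, instance_class, storage, credentials) are omitted here.
}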
2. Design for Failure
Assume everything will fail. Because it will.
- Network: Implement retries with exponential backoff
- Services: Use circuit breakers to prevent cascade failures
- Data: Replicate across regions
- Dependencies: Degrade gracefully when third-party APIs are down
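To make the first two bullets concrete, here is a rough Python sketch of retries with exponential backoff and a simple circuit breaker. Every name in it (`retry_with_backoff`, `CircuitBreaker`, the thresholds and delays) is an illustrative assumption rather than any particular library's API. In production you would likely reach for a maintained library instead.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call func(); on failure, sleep a random time up to
    min(max_delay, base_delay * 2**attempt) and try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Opens after failure_threshold consecutive failures and allows a
    trial call once reset_timeout seconds have passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Graceful degradation then becomes a matter of catching `CircuitOpenError` at the call site and returning a fallback, such as cached data or an empty result, instead of an error page.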
3. Distribute Across Availability Zones
AWS, GCP, and Azure all offer multiple isolated datacenters (AZs) within each region. Use them:
# Kubernetes: Pod anti-affinity
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: myapp
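This spreads your pods across AZs. One datacenter loses power? Your app keeps running. One caveat: a required rule allows at most one matching pod per zone, so replicas beyond the zone count will stay Pending. If you run more replicas than zones, use preferredDuringSchedulingIgnoredDuringExecution instead.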
4. Implement Health Checks and Auto-Healing
Monitoring detects problems. Auto-healing fixes them.
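# ECS container health check (HCL-style shorthand; in a raw task definition
# the keys are healthCheck and startPeriod)
health_check {
  command      = ["CMD-SHELL", "curl -f http://localhost/health || exit 1"]
  interval     = 30 # seconds between checks
  timeout      = 5  # seconds before a single check counts as failed
  retries      = 3  # consecutive failures before the container is unhealthy
  start_period = 60 # grace period in seconds after startup
}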
Unhealthy containers get replaced automatically. No manual intervention needed.
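The check above only works if the app actually serves a `/health` endpoint that returns an error status when something is wrong, since `curl -f` fails on HTTP 400 and above. Here is a minimal sketch using only the Python standard library. `check_dependencies` is a hypothetical placeholder for your real probes, and the handler and function names are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> dict:
    """Hypothetical placeholder: swap in real probes (database ping, cache ping)."""
    return {"database": True, "cache": True}


def health_response(checks: dict) -> tuple:
    """Return (status_code, json_body): 200 if every check passed, else 503."""
    healthy = all(checks.values())
    body = json.dumps({"status": "ok" if healthy else "unhealthy",
                       "checks": checks})
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health server; the port is an arbitrary example value."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keep the checks cheap and fast. If the endpoint takes longer than the configured timeout to answer, the orchestrator will treat a slow check the same as a failed one.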
5. Plan for Disasters
Backups: Automated, tested, and stored in multiple regions.
Runbooks: Step-by-step recovery procedures for common failures.
Chaos engineering: Intentionally break things to verify your systems recover gracefully.
We run “chaos days” where we randomly terminate instances, partition networks, or introduce latency. A system that survives deliberate chaos is far better prepared for what production throws at it.
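You can try the same idea at small scale before touching real infrastructure. The sketch below wraps a function so that it randomly fails or slows down, which is handy for checking that retries and fallbacks behave in a test environment. `with_chaos` and its parameters are invented for this example and are not part of any chaos engineering tool.

```python
import random
import time


class InjectedFault(Exception):
    """Marks a failure that was injected on purpose."""


def with_chaos(func, failure_rate=0.1, max_latency=0.0, rng=None,
               sleep=time.sleep):
    """Wrap func so each call may be delayed and may raise InjectedFault.

    failure_rate is the probability of an injected failure per call;
    max_latency is the upper bound in seconds of the random delay.
    """
    rng = rng or random.Random()
    name = getattr(func, "__name__", "call")

    def wrapper(*args, **kwargs):
        if max_latency > 0:
            sleep(rng.uniform(0, max_latency))  # injected latency
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {name}")
        return func(*args, **kwargs)

    return wrapper
```

Pass a seeded `random.Random` as `rng` when you want a failing run to be reproducible.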
The Business Case
Let’s say your company does $10M ARR at 99.5% uptime, which works out to roughly 44 hours of downtime per year.
Improving to 99.95% (4.4 hours/year) costs roughly $3K-5K/month in infrastructure and engineering time.
Downtime cost at 99.5%: ~$200K/year
Remaining downtime cost at 99.95%: ~$20K/year (one tenth of the downtime)
Improvement cost: ~$50K/year
Net benefit: ~$130K/year
Plus intangible benefits: happier customers, better reputation, less stressed team.
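To run the same numbers for your own company, a small calculator helps. This Python sketch assumes downtime cost scales linearly with hours of downtime, so the downtime that remains at the target tier is counted too. The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def net_benefit(current_pct: float, target_pct: float,
                current_downtime_cost: float, improvement_cost: float) -> float:
    """Yearly downtime savings from reaching target_pct, minus the yearly
    cost of the improvement. Assumes cost is proportional to downtime hours."""
    cost_per_hour = current_downtime_cost / downtime_hours(current_pct)
    remaining_cost = cost_per_hour * downtime_hours(target_pct)
    return current_downtime_cost - remaining_cost - improvement_cost


if __name__ == "__main__":
    result = net_benefit(99.5, 99.95, 200_000, 50_000)
    print(f"Net benefit: ${result:,.0f} per year")
```

With the figures from this section (99.5% to 99.95%, ~$200K downtime cost, ~$50K improvement cost) this comes to roughly $130K per year once the remaining ~4.4 hours of downtime are counted.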
Quick Wins
You don’t need to rebuild everything. Start here:
- Enable Multi-AZ for databases (1 hour, immediate impact)
- Set up automated backups (2 hours)
- Implement health checks (1 day)
- Create a runbook for your most common incident (2 hours)
- Review your monitoring alerts (half day)
Each of these delivers measurable improvement.
When to Invest
Not every system needs five nines. A blog can survive being down for a few minutes. Your payment processor cannot.
Prioritize HA for:
- Revenue-generating systems
- Customer-facing applications
- Data pipelines that feed critical decisions
- Compliance-required systems
Lower priority:
- Internal tools with <10 users
- Prototype/MVP systems
- Read-only content sites
The Bottom Line
High availability isn’t about perfection—it’s about understanding what downtime costs your business and building systems that match that risk profile.
Calculate your cost-per-minute of downtime. If it’s high, invest in HA. If it’s low, maybe 99% is fine.
The teams who get this right sleep better at night.