Cloud Cost Optimization: $840K Annual Savings
SaaS / Technology • 2 months initial optimization, ongoing refinement
Results Achieved
- Annual savings: $840,000 (38% reduction)
- Per-customer cost reduced 45%
- Performance improved despite cost cuts
- Established FinOps practice with monthly cost reviews
- Forecasting accuracy improved from ±40% to ±5%
The Problem
A B2B SaaS startup had grown from $500K to $15M ARR in 18 months. Great news—except their AWS bill grew even faster:
| Month | Revenue | AWS Cost | % of Revenue |
|---|---|---|---|
| Jan | $1.0M | $45K | 4.5% |
| Jul | $1.2M | $75K | 6.3% |
| Dec | $1.4M | $115K | 8.2% |
At this trajectory, cloud costs would hit 12% of revenue within a year—unsustainable for a SaaS business (target: 3-5%).
Worse, nobody knew why costs were increasing. The team was “too busy shipping features” to investigate.
The board mandated: Cut cloud costs by 30% within 90 days.
Discovery: Where’s the Money Going?
We conducted a week-long audit using AWS Cost Explorer, CloudWatch, and custom analysis scripts.
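Most of that week went into slicing spend by service and by tag. Here is a minimal sketch of the kind of Cost Explorer query such an audit starts from, in Python with boto3 (the date range is illustrative):

```python
import boto3

ce = boto3.client('ce')  # Cost Explorer

# Last three months of unblended cost, grouped by AWS service
response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2024-01-01', 'End': '2024-04-01'},  # illustrative range
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
)

for period in response['ResultsByTime']:
    print(period['TimePeriod']['Start'])
    groups = sorted(period['Groups'],
                    key=lambda g: float(g['Metrics']['UnblendedCost']['Amount']),
                    reverse=True)
    for group in groups[:10]:  # top ten services by spend
        service = group['Keys'][0]
        amount = float(group['Metrics']['UnblendedCost']['Amount'])
        print(f"  {service}: ${amount:,.0f}")
```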
Finding #1: Over-Provisioned Instances
The Issue:
- 120 EC2 instances, mostly m5.4xlarge (16 vCPU, 64GB RAM)
- Average CPU utilization: 12%
- Average memory utilization: 28%
Why it happened: Engineers provisioned for peak load (Black Friday), then never scaled down.
The fix: Rightsized 80% of instances to m5.xlarge or m5.2xlarge.
```bash
# Analysis script to find underutilized instances (7-day average CPU below 20%)
aws ec2 describe-instances \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType]' \
  --output text | while read -r instance type; do
  cpu_avg=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$instance" \
    --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --period 3600 \
    --statistics Average \
    --query 'Datapoints[*].Average' \
    --output text | awk '{sum+=$1; count++} END {if (count) print sum/count}')
  # Skip instances with no datapoints (e.g. stopped during the window)
  if [[ -n "$cpu_avg" ]] && (( $(echo "$cpu_avg < 20" | bc -l) )); then
    echo "Instance $instance ($type) - CPU: $cpu_avg% - RIGHTSIZE CANDIDATE"
  fi
done
```
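The script only flags candidates; the resize itself is a stop/modify/start cycle per instance. A minimal sketch in Python with boto3 (the instance ID and target type are placeholders, and the smaller size should be load-tested first):

```python
import boto3

ec2 = boto3.client('ec2')

def rightsize_instance(instance_id, new_type):
    """Stop an instance, change its type, and start it again."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])

    # The instance type can only be changed while the instance is stopped
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  InstanceType={'Value': new_type})

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])

# Example: shrink an over-provisioned m5.4xlarge to m5.xlarge
rightsize_instance('i-0123456789abcdef0', 'm5.xlarge')
```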
Savings: $18K/month
Finding #2: No Reserved Instances or Savings Plans
The Issue: All compute was running on-demand, despite stable, predictable workloads.
The fix:
- Purchased 1-year Compute Savings Plans (covers EC2, Fargate, Lambda)
- Committed to 70% of baseline usage (sized roughly as sketched below)
- Reserved 30% for on-demand scaling
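Sizing that commitment is straightforward arithmetic: measure the stable hourly on-demand baseline, commit to a fraction of it, and let everything above the line stay on-demand. A back-of-the-envelope sketch with purely illustrative numbers (actual discounts vary by term and payment option):

```python
# Illustrative commitment sizing; these figures are examples, not the client's actuals
baseline_hourly_on_demand = 100.00  # stable compute spend in $/hour, measured over ~30 days
commit_fraction = 0.70              # share of the baseline covered by the Savings Plan
discount = 0.40                     # approximate 1-year Compute Savings Plan discount

covered_on_demand_spend = baseline_hourly_on_demand * commit_fraction
hourly_commitment = covered_on_demand_spend * (1 - discount)  # what you commit to pay
hourly_savings = covered_on_demand_spend * discount

print(f"Commit to ${hourly_commitment:.2f}/hour; "
      f"save roughly ${hourly_savings * 730:,.0f}/month on covered usage")
```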
Savings: $32K/month (40% discount on committed usage)
Finding #3: Massive S3 Costs
The Issue:
- S3 bill: $28K/month
- 480TB of data, 95% never accessed after 30 days
- Everything in S3 Standard storage class
The fix: Implemented lifecycle policies:
```json
{
  "Rules": [
    {
      "Id": "MoveToInfrequentAccess",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "uploads/"
      },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```
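Attaching the policy is a single API call per bucket. A minimal sketch with boto3, assuming the rules above are saved as lifecycle.json; the bucket name is a placeholder:

```python
import json

import boto3

s3 = boto3.client('s3')

# Load the lifecycle rules shown above and attach them to the bucket
with open('lifecycle.json') as f:
    lifecycle = json.load(f)

s3.put_bucket_lifecycle_configuration(
    Bucket='example-uploads-bucket',  # placeholder bucket name
    LifecycleConfiguration=lifecycle,
)
```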
Savings: $19K/month (68% reduction in S3 storage costs)
Finding #4: RDS Over-Provisioning
The Issue:
- Production database: db.r5.8xlarge (32 vCPU, 256GB RAM)
- Actual usage: 15% CPU, 40% memory
- Multi-AZ (good!), but oversized
The fix:
- Downsized to db.r5.2xlarge (still multi-AZ)
- Purchased Reserved Instance (1-year, all upfront)
- Set up CloudWatch alarms to monitor performance
Savings: $11K/month
Performance impact: None. P95 query latency actually improved (better cache hit ratios with right-sized instance).
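To catch any regression from the downsizing early, the monitoring can be as simple as threshold alarms on the new instance. A minimal sketch with boto3; the DB identifier and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Page the team if the downsized database runs hot for 15 minutes straight
cloudwatch.put_metric_alarm(
    AlarmName='prod-db-cpu-high',
    Namespace='AWS/RDS',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'prod-db'}],  # placeholder identifier
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:ops-alerts'],     # placeholder SNS topic
)
```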
Finding #5: Data Transfer Costs
The Issue:
- $8K/month in data transfer
- API responses averaging 2MB (mostly redundant data)
- Images served from S3 without CloudFront
The fix:
- Optimized API responses (removed unnecessary fields)
- Implemented CloudFront CDN for static assets
- Enabled gzip compression
Savings: $5K/month
Finding #6: Zombie Resources
The Issue:
- 34 EBS volumes not attached to instances ($2.1K/month)
- 18 Elastic IPs not associated with instances ($4.3K/month)
- 12 load balancers with zero traffic ($2.9K/month)
- Snapshots retained indefinitely (8TB, $1.6K/month)
The fix:
```bash
# Delete unattached EBS volumes (review the list before piping it into delete)
aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[*].VolumeId' --output text | \
  xargs -n 1 aws ec2 delete-volume --volume-id

# Release Elastic IPs that have no association (unassociated EIPs are billed hourly)
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].AllocationId' --output text | \
  xargs -n 1 aws ec2 release-address --allocation-id

# Delete idle load balancers
# (manual review first to avoid breaking things)

# Implement snapshot retention policy (keep 7 daily, 4 weekly, 12 monthly)
```
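That last item can be automated as well. The sketch below is a simplified age-based cleanup in Python, deleting self-owned snapshots older than a cutoff, rather than the full 7-daily/4-weekly/12-monthly rotation; the one-year cutoff is illustrative:

```python
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client('ec2')
cutoff = datetime.now(timezone.utc) - timedelta(days=365)  # illustrative retention window

# Walk every snapshot owned by this account and drop anything older than the cutoff
paginator = ec2.get_paginator('describe_snapshots')
for page in paginator.paginate(OwnerIds=['self']):
    for snapshot in page['Snapshots']:
        if snapshot['StartTime'] < cutoff:
            try:
                ec2.delete_snapshot(SnapshotId=snapshot['SnapshotId'])
            except ClientError:
                # Skip snapshots that are still referenced, e.g. by an AMI
                pass
```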
Savings: $11K/month
Finding #7: Development/Staging Running 24/7
The Issue:
- Dev and staging environments identical to production
- Running 24/7, even though devs only worked 9-5 weekdays
- Cost: $22K/month
The fix: Auto-shutdown non-prod environments:
```python
# Lambda function to stop dev/staging at 7 PM weekdays (invoked on a schedule)
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    rds = boto3.client('rds')

    # Stop EC2 instances tagged Environment=dev or Environment=staging
    # (get_non_prod_instances / get_non_prod_databases look up resources by that tag)
    instance_ids = get_non_prod_instances(ec2)
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)

    # Stop RDS instances
    for db in get_non_prod_databases(rds):
        rds.stop_db_instance(DBInstanceIdentifier=db)

    return {'statusCode': 200, 'body': 'Non-prod environments stopped'}
```
Developers start environments when needed via Slack bot.
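The wake-up path is the mirror image of the shutdown Lambda; a minimal sketch of the start handler the Slack bot could invoke, using the same tag-lookup helpers as above:

```python
import boto3

def start_handler(event, context):
    ec2 = boto3.client('ec2')
    rds = boto3.client('rds')

    # Start everything tagged Environment=dev or Environment=staging
    instance_ids = get_non_prod_instances(ec2)
    if instance_ids:
        ec2.start_instances(InstanceIds=instance_ids)

    for db in get_non_prod_databases(rds):
        rds.start_db_instance(DBInstanceIdentifier=db)

    return {'statusCode': 200, 'body': 'Non-prod environments started'}
```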
Savings: $14K/month (64% reduction in non-prod costs)
The Results
Cost Breakdown
| Category | Before | After | Monthly Savings |
|---|---|---|---|
| EC2 Compute | $42K | $24K | $18K |
| Savings Plans | — | — | $32K |
| S3 Storage | $28K | $9K | $19K |
| RDS | $18K | $7K | $11K |
| Data Transfer | $8K | $3K | $5K |
| Zombie Resources | $11K | — | $11K |
| Dev/Staging | $22K | $8K | $14K |
| Total | $129K | $59K | $70K/month |
Annual savings: $840,000
Business Impact
- Gross margin improved from 68% to 75%
- Runway extended by 4 months without additional fundraising
- Per-customer cost dropped 45% (better unit economics)
- Performance improved: Lower latency (CloudFront), faster queries (rightsized RDS)
Process Improvements
Implemented FinOps practice:
- Monthly cost review meeting (30 minutes)
- Cost attribution by team/product (tagging strategy)
- Budget alerts (notify when exceeding forecast by 10%; see the sketch after this list)
- Quarterly rightsizing reviews
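For the budget alerts, AWS Budgets covers the basic case. A minimal sketch with boto3; the account ID, budget amount, and email address are placeholders:

```python
import boto3

budgets = boto3.client('budgets')

# Alert when forecasted monthly spend exceeds the budget by 10%
budgets.create_budget(
    AccountId='123456789012',  # placeholder account ID
    Budget={
        'BudgetName': 'monthly-cloud-spend',
        'BudgetLimit': {'Amount': '60000', 'Unit': 'USD'},  # illustrative figure
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST',
    },
    NotificationsWithSubscribers=[{
        'Notification': {
            'NotificationType': 'FORECASTED',
            'ComparisonOperator': 'GREATER_THAN',
            'Threshold': 110.0,
            'ThresholdType': 'PERCENTAGE',
        },
        'Subscribers': [{'SubscriptionType': 'EMAIL', 'Address': 'finops@example.com'}],
    }],
)
```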
Cost forecasting:
- Before: ±40% accuracy (basically guessing)
- After: ±5% accuracy (reliable planning)
Developer awareness: Added cost visibility to CI/CD:
```yaml
# GitHub Action: Estimate infrastructure cost changes
name: Cost Estimation
on: [pull_request]
jobs:
  cost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Infracost
        uses: infracost/actions/comment@v1
        with:
          path: terraform/
```
Lessons Learned
1. Start with low-hanging fruit
Zombie resources and lifecycle policies are easy wins. We achieved 20% savings in the first week.
2. Measure everything
Can’t optimize what you don’t measure. Tag resources, track utilization, monitor trends.
3. Automate cost controls
Auto-shutdown dev environments, lifecycle policies, budget alerts. Don’t rely on manual discipline.
4. Rightsizing > Spot Instances (for most workloads)
Spot instances save more but add complexity. Rightsizing is simpler and safer for production.
5. Reserved Instances are free money
If you have predictable workloads, not buying RIs/Savings Plans is leaving money on the table.
6. Cost optimization is ongoing
It’s not a one-time project. Resources drift, usage patterns change. Review quarterly.
Common Objections (and Responses)
“Optimization takes engineering time away from features.”
True. But so do outages from over-complicated infrastructure. Plus, lower costs = longer runway = more time to build.
“We’ll optimize once we’re bigger.”
Cost inefficiencies compound. A startup wasting 30% at $100K/month wastes $360K/year. At $1M/month? $3.6M/year.
“Cloud costs are just the price of doing business.”
No. Many successful SaaS companies run at 3-5% of revenue. If you’re at 10%+, you have problems.
“Our architecture is already optimized.”
Every team we’ve worked with thought this. Every time we found 20-40% savings.
Quick Wins You Can Implement Today
- Find zombie resources: Unattached EBS volumes, idle load balancers, unused Elastic IPs
- Enable S3 Intelligent-Tiering: Automatic cost optimization with zero config
- Set up budget alerts: Get notified when spending exceeds forecast
- Review EC2 instance types: Check CloudWatch CPU/memory utilization
- Auto-shutdown dev/staging: Save 60% on non-prod environments
Each of these takes <1 hour to implement.
The Bottom Line
Cloud costs don’t have to spiral out of control. With systematic analysis and automation, most companies can cut 20-40% without impacting performance.
This client went from a cost crisis to best-in-class unit economics in 90 days. Their cloud costs now scale linearly with revenue—exactly what a healthy SaaS business should see.
Have a Similar Problem?
Let's talk. We'll figure out if we can help and give you a clear plan.
Book a Free Call