From 12-Hour Deployments to 8 Minutes: E-Commerce Platform Transformation
E-Commerce / Retail • 3 months
Results Achieved
- Deployment time reduced from 12 hours to 8 minutes
- Uptime improved from 97.2% to 99.98%
- Deploy frequency increased from 2x/month to 25x/week
- Incident count reduced by 82%
- Rollback time decreased from 4 hours to 3 minutes
Technology Stack
Ruby on Rails • PostgreSQL (RDS) • Docker • AWS ECS Fargate • ECR • ALB • S3 + CloudFront • Terraform • GitHub Actions • Prometheus & Grafana • CloudWatch
The Challenge
A growing e-commerce platform was hitting scaling walls. Their deployment process suffered from:
- Manual coordination across 5 teams
- A 12-hour deployment window (Friday nights)
- Database migrations run by hand
- No path to zero-downtime deployments
- Rollbacks that took 4+ hours
Worse, they were experiencing 2-3 production incidents per month, with an average MTTR (mean time to recovery) of 90 minutes.
As the business grew, this became untenable. Black Friday was approaching, and they couldn’t risk a deployment failure during peak season.
Discovery Phase
We spent two weeks understanding their system:
Architecture:
- Monolithic Rails application on EC2
- Single PostgreSQL RDS instance
- Static assets on S3 + CloudFront
- Manual infrastructure provisioning
Pain Points:
- No automated testing in CI
- Manual SSH deploys with shell scripts
- No infrastructure as code
- Single point of failure (one database, one app server pool)
- No observability (logs scattered, no metrics)
Team:
- 12 engineers
- 1 DevOps engineer (overwhelmed)
- No formal incident response process
The Solution
Phase 1: Foundation (Weeks 1-4)
Goal: Establish baseline reliability and observability.
Actions:
- Migrated infrastructure to Terraform
- Enabled Multi-AZ RDS for automatic failover
- Set up structured logging with CloudWatch (sketched below)
- Implemented basic monitoring (Prometheus + Grafana)
- Created runbooks for common incidents
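One concrete piece of that work: structured logging pays off because CloudWatch Logs Insights can filter on individual fields when each log line is a single JSON object. A minimal sketch of the idea in the Rails app (file name and field names are illustrative, not the client's exact setup):
# config/initializers/structured_logging.rb
# Emit one JSON object per log line on stdout so CloudWatch Logs Insights
# can query individual fields instead of grepping free-form text.
require "json"
require "time"

class JsonLogFormatter < Logger::Formatter
  def call(severity, time, _progname, msg)
    JSON.dump(
      level: severity,
      time: time.utc.iso8601(3),
      message: msg.is_a?(String) ? msg : msg.inspect
    ) + "\n"
  end
end

Rails.application.configure do
  config.logger = ActiveSupport::Logger.new($stdout)
  config.logger.formatter = JsonLogFormatter.new
end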
Results:
- Infrastructure now version-controlled
- Database failover tested and working (<60s downtime)
- Clear visibility into system health
Phase 2: Containerization (Weeks 5-8)
Goal: Enable fast, consistent deployments.
Actions:
- Dockerized the Rails application
- Set up ECR (Elastic Container Registry)
- Deployed to ECS Fargate with blue/green deployments
- Configured Application Load Balancer with health checks
- Automated database migrations as part of deployment
Docker optimization:
# Multi-stage build for smaller images
FROM ruby:3.2-alpine AS builder
WORKDIR /app
# Build tools are only needed in this stage, for gems with native extensions (pg, etc.)
RUN apk add --no-cache build-base postgresql-dev
COPY Gemfile* ./
RUN bundle config set --local without 'development test' && \
    bundle install

FROM ruby:3.2-alpine
WORKDIR /app
# curl is used by the container health check; postgresql-libs and tzdata are runtime dependencies
RUN apk add --no-cache curl postgresql-libs tzdata
COPY --from=builder /usr/local/bundle /usr/local/bundle
COPY . .
# Precompile assets
RUN RAILS_ENV=production bundle exec rake assets:precompile
CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
ECS Task Definition:
{
"family": "ecommerce-app",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [{
"name": "app",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest",
"portMappings": [{
"containerPort": 3000,
"protocol": "tcp"
}],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3
},
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/ecommerce",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "app"
}
}
}]
}
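Both the container health check above and the ALB target group hit GET /health. The endpoint itself is trivial but load-bearing; a minimal version of the idea (route and controller are ours for illustration, not the client's exact code):
# config/routes.rb
Rails.application.routes.draw do
  get "/health", to: "health#show"
  # ...application routes...
end

# app/controllers/health_controller.rb
class HealthController < ActionController::Base
  # Returns 200 only when the app can reach its database, so an unhealthy
  # task is replaced by ECS and taken out of rotation by the load balancer.
  def show
    ActiveRecord::Base.connection.execute("SELECT 1")
    render json: { status: "ok" }
  rescue StandardError
    render json: { status: "error" }, status: :service_unavailable
  end
end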
Results:
- Consistent environments (dev matches prod exactly)
- Deployments became reproducible
- Rollbacks took <5 minutes
Phase 3: CI/CD Pipeline (Weeks 9-12)
Goal: Automate everything.
GitHub Actions Workflow:
name: Deploy to Production

on:
  push:
    branches: [main]

# Required so the jobs below can assume the AWS role via OIDC
permissions:
  id-token: write
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: "3.2"
          bundler-cache: true
      - name: Run tests
        run: bundle exec rspec
      - name: Run linters
        run: bundle exec rubocop

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: us-east-1
      - name: Login to ECR
        run: aws ecr get-login-password | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
      - name: Build and push image
        run: |
          docker build -t ecommerce:${{ github.sha }} .
          docker tag ecommerce:${{ github.sha }} 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}
          docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}
          # The service's task definition pulls :latest, so publish that tag as well
          docker tag ecommerce:${{ github.sha }} 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest
          docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: us-east-1
      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service ecommerce-app \
            --force-new-deployment
      - name: Wait for deployment
        run: aws ecs wait services-stable --cluster production --services ecommerce-app
Database Migration Strategy:
- Run migrations as a separate ECS task before deploying app
- Use backward-compatible migrations (additive changes only; see the sketch below)
- Drop columns only in a later release, once no running code references them
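In practice, "additive only" means every migration must be safe to run while the previous application version is still serving traffic, since old and new tasks overlap during a rollout. A sketch of the three-step pattern, assuming a Rails 7.x app (table and column names are invented):
# Release N: add the new column -- old code simply never reads it
class AddFulfillmentStatusToOrders < ActiveRecord::Migration[7.0]
  def change
    add_column :orders, :fulfillment_status, :string, default: "pending", null: false
  end
end

# Release N+1: stop using the old column before it is ever dropped
class Order < ApplicationRecord
  self.ignored_columns += ["legacy_status"]   # Rails stops selecting and caching this column
end

# Release N+2: only now, with no deployed code referencing it, remove the column
class RemoveLegacyStatusFromOrders < ActiveRecord::Migration[7.0]
  def change
    remove_column :orders, :legacy_status, :string
  end
end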
Results:
- Push to main → automated deploy in 8 minutes
- Zero-downtime deployments
- Automatic rollback on health check failures
The Results
Deployment Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment Time | 12 hours | 8 minutes | 98.9% faster |
| Deploy Frequency | 2x/month | 25x/week | 50x increase |
| Rollback Time | 4 hours | 3 minutes | 98.8% faster |
| Failed Deployments | 18% | 0.6% | 97% reduction |
Reliability Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Uptime | 97.2% | 99.98% | ~245 hours → ~1.75 hours downtime/year |
| MTTR | 90 minutes | 8 minutes | 91% faster recovery |
| Incidents/Month | 2.8 | 0.5 | 82% reduction |
Business Impact
- Black Friday: Zero downtime during peak traffic (3x normal load)
- Engineering Velocity: 40% more features shipped per quarter
- Team Morale: Friday night deployments eliminated
Key Lessons
- Incremental > Big Bang: We didn’t rebuild everything at once. Each phase delivered value.
- Observability First: You can’t improve what you can’t measure. Logging and monitoring enabled everything else.
- Automate the Boring Stuff: Manual deployments are error-prone and drain team morale.
- Test Your Rollbacks: We practiced rollback procedures monthly. When production issues happened, muscle memory kicked in.
- Blue/Green Deployments Are Worth It: The ability to roll back instantly by rerouting traffic saved us multiple times.
What They Said
“Before BugaOps, deployments were stressful. Now they’re boring—in the best way. We deploy 5 times a day without thinking about it.”
— VP of Engineering
“Our Black Friday was flawless. Zero downtime, zero incidents. That’s never happened before.”
— CTO
Next Steps
Post-project, we’ve continued to support them with:
- Cost optimization (reduced AWS bill by 22% through rightsizing)
- Advanced observability (distributed tracing with OpenTelemetry; sketched below)
- Disaster recovery planning and testing
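On the tracing side, the Ruby OpenTelemetry SDK does most of the work through auto-instrumentation; the setup is roughly the following sketch (service name is an assumption, and the OTLP exporter endpoint comes from the standard OTEL_* environment variables):
# config/initializers/opentelemetry.rb
# Gemfile: opentelemetry-sdk, opentelemetry-exporter-otlp, opentelemetry-instrumentation-all
require "opentelemetry/sdk"
require "opentelemetry/instrumentation/all"

OpenTelemetry::SDK.configure do |c|
  c.service_name = "ecommerce-app"   # assumed service name
  c.use_all                          # auto-instruments Rails, ActiveRecord, Net::HTTP, and more
end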
Their engineering team is now fully autonomous, making infrastructure changes confidently with Terraform and deploying multiple times per day.
Have a Similar Problem?
Let's talk. We'll figure out if we can help and give you a clear plan.
Book a Free Call