From 12-Hour Deployments to 8 Minutes: E-Commerce Platform Transformation
E-Commerce / Retail • 3 months
Results Achieved
- Deployment time reduced from 12 hours to 8 minutes
- Uptime improved from 97.2% to 99.98%
- Deploy frequency increased from 2x/month to 25x/week
- Incident count reduced by 82%
- Rollback time decreased from 4 hours to 3 minutes
Technology Stack
Ruby on Rails • PostgreSQL (RDS) • Docker • AWS ECS Fargate • ECR • ALB • S3 + CloudFront • Terraform • GitHub Actions • Prometheus & Grafana • CloudWatch
The Challenge
A growing e-commerce platform was hitting scaling walls. Their deployment process suffered from:
- Manual coordination across 5 teams
- A 12-hour deployment window (Friday nights)
- Database migrations run by hand
- No path to zero-downtime deployments
- Rollbacks that took 4+ hours
Worse, they were experiencing 2-3 production incidents per month, with an average MTTR (mean time to recovery) of 90 minutes.
As the business grew, this became untenable. Black Friday was approaching, and they couldn’t risk a deployment failure during peak season.
Discovery Phase
We spent two weeks understanding their system:
Architecture:
- Monolithic Rails application on EC2
- Single PostgreSQL RDS instance
- Static assets on S3 + CloudFront
- Manual infrastructure provisioning
Pain Points:
- No automated testing in CI
- Manual SSH deploys with shell scripts
- No infrastructure as code
- Single point of failure (one database, one app server pool)
- No observability (logs scattered, no metrics)
Team:
- 12 engineers
- 1 DevOps engineer (overwhelmed)
- No formal incident response process
The Solution
Phase 1: Foundation (Weeks 1-4)
Goal: Establish baseline reliability and observability.
Actions:
- Migrated infrastructure to Terraform
- Enabled Multi-AZ RDS for automatic failover
- Set up structured logging with CloudWatch (sketched below)
- Implemented basic monitoring (Prometheus + Grafana)
- Created runbooks for common incidents
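One concrete piece of that work: structured logging pays off because CloudWatch Logs Insights can filter on individual fields when each log line is a single JSON object. A minimal sketch of the idea in the Rails app (file name and field names are illustrative, not the client's exact setup):
# config/initializers/structured_logging.rb
# Emit one JSON object per log line on stdout so CloudWatch Logs Insights
# can query individual fields instead of grepping free-form text.
require "json"
require "time"

class JsonLogFormatter < Logger::Formatter
  def call(severity, time, _progname, msg)
    JSON.dump(
      level: severity,
      time: time.utc.iso8601(3),
      message: msg.is_a?(String) ? msg : msg.inspect
    ) + "\n"
  end
end

Rails.application.configure do
  config.logger = ActiveSupport::Logger.new($stdout)
  config.logger.formatter = JsonLogFormatter.new
end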
Results:
- Infrastructure now version-controlled
- Database failover tested and working (<60s downtime)
- Clear visibility into system health
Phase 2: Containerization (Weeks 5-8)
Goal: Enable fast, consistent deployments.
Actions:
- Dockerized the Rails application
- Set up ECR (Elastic Container Registry)
- Deployed to ECS Fargate with blue/green deployments
- Configured Application Load Balancer with health checks
- Automated database migrations as part of deployment
Docker optimization:
# Multi-stage build for smaller images
FROM ruby:3.2-alpine AS builder
WORKDIR /app
# Build tools are only needed in this stage, for gems with native extensions (pg, etc.)
RUN apk add --no-cache build-base postgresql-dev
COPY Gemfile* ./
RUN bundle config set --local without 'development test' && \
    bundle install

FROM ruby:3.2-alpine
WORKDIR /app
# curl is used by the container health check; postgresql-libs and tzdata are runtime dependencies
RUN apk add --no-cache curl postgresql-libs tzdata
COPY --from=builder /usr/local/bundle /usr/local/bundle
COPY . .
# Precompile assets
RUN RAILS_ENV=production bundle exec rake assets:precompile
CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
ECS Task Definition:
{
"family": "ecommerce-app",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [{
"name": "app",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest",
"portMappings": [{
"containerPort": 3000,
"protocol": "tcp"
}],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3
},
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/ecommerce",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "app"
}
}
}]
}
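Both the container health check above and the ALB target group hit GET /health. The endpoint itself is trivial but load-bearing; a minimal version of the idea (route and controller are ours for illustration, not the client's exact code):
# config/routes.rb
Rails.application.routes.draw do
  get "/health", to: "health#show"
  # ...application routes...
end

# app/controllers/health_controller.rb
class HealthController < ActionController::Base
  # Returns 200 only when the app can reach its database, so an unhealthy
  # task is replaced by ECS and taken out of rotation by the load balancer.
  def show
    ActiveRecord::Base.connection.execute("SELECT 1")
    render json: { status: "ok" }
  rescue StandardError
    render json: { status: "error" }, status: :service_unavailable
  end
end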
Results:
- Consistent environments (dev matches prod exactly)
- Deployments became reproducible
- Rollbacks took <5 minutes
Phase 3: CI/CD Pipeline (Weeks 9-12)
Goal: Automate everything.
GitHub Actions Workflow:
name: Deploy to Production

on:
  push:
    branches: [main]

# Required so the jobs below can assume the AWS role via OIDC
permissions:
  id-token: write
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: "3.2"
          bundler-cache: true
      - name: Run tests
        run: bundle exec rspec
      - name: Run linters
        run: bundle exec rubocop

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: us-east-1
      - name: Login to ECR
        run: aws ecr get-login-password | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
      - name: Build and push image
        run: |
          docker build -t ecommerce:${{ github.sha }} .
          docker tag ecommerce:${{ github.sha }} 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}
          docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}
          # The service's task definition pulls :latest, so publish that tag as well
          docker tag ecommerce:${{ github.sha }} 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest
          docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: us-east-1
      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster production \
            --service ecommerce-app \
            --force-new-deployment
      - name: Wait for deployment
        run: aws ecs wait services-stable --cluster production --services ecommerce-app
Database Migration Strategy:
- Run migrations as a separate ECS task before deploying app
- Use backward-compatible migrations (additive changes only; see the sketch below)
- Drop columns only in a later release, once no running code references them
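In practice, "additive only" means every migration must be safe to run while the previous application version is still serving traffic, since old and new tasks overlap during a rollout. A sketch of the three-step pattern, assuming a Rails 7.x app (table and column names are invented):
# Release N: add the new column -- old code simply never reads it
class AddFulfillmentStatusToOrders < ActiveRecord::Migration[7.0]
  def change
    add_column :orders, :fulfillment_status, :string, default: "pending", null: false
  end
end

# Release N+1: stop using the old column before it is ever dropped
class Order < ApplicationRecord
  self.ignored_columns += ["legacy_status"]   # Rails stops selecting and caching this column
end

# Release N+2: only now, with no deployed code referencing it, remove the column
class RemoveLegacyStatusFromOrders < ActiveRecord::Migration[7.0]
  def change
    remove_column :orders, :legacy_status, :string
  end
end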
Results:
- Push to main → automated deploy in 8 minutes
- Zero-downtime deployments
- Automatic rollback on health check failures
The Results
Deployment Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment Time | 12 hours | 8 minutes | 98.9% faster |
| Deploy Frequency | 2x/month | 25x/week | 50x increase |
| Rollback Time | 4 hours | 3 minutes | 98.8% faster |
| Failed Deployments | 18% | 0.6% | 97% reduction |
Reliability Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Uptime | 97.2% | 99.98% | ~245 hours → ~1.75 hours downtime/year |
| MTTR | 90 minutes | 8 minutes | 91% faster recovery |
| Incidents/Month | 2.8 | 0.5 | 82% reduction |
Business Impact
- Black Friday: Zero downtime during peak traffic (3x normal load)
- Engineering Velocity: 40% more features shipped per quarter
- Team Morale: Friday night deployments eliminated
Key Lessons
- Incremental > Big Bang: We didn’t rebuild everything at once. Each phase delivered value.
- Observability First: You can’t improve what you can’t measure. Logging and monitoring enabled everything else.
- Automate the Boring Stuff: Manual deployments are error-prone and drain team morale.
- Test Your Rollbacks: We practiced rollback procedures monthly. When production issues happened, muscle memory kicked in.
- Blue/Green Deployments Are Worth It: The ability to roll back instantly by rerouting traffic saved us multiple times.
What They Said
“Before BugaOps, deployments were stressful. Now they’re boring—in the best way. We deploy 5 times a day without thinking about it.”
— VP of Engineering
“Our Black Friday was flawless. Zero downtime, zero incidents. That’s never happened before.”
— CTO
Next Steps
Post-project, we’ve continued to support them with:
- Cost optimization (reduced AWS bill by 22% through rightsizing)
- Advanced observability (distributed tracing with OpenTelemetry; sketched below)
- Disaster recovery planning and testing
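On the tracing side, the Ruby OpenTelemetry SDK does most of the work through auto-instrumentation; the setup is roughly the following sketch (service name is an assumption, and the OTLP exporter endpoint comes from the standard OTEL_* environment variables):
# config/initializers/opentelemetry.rb
# Gemfile: opentelemetry-sdk, opentelemetry-exporter-otlp, opentelemetry-instrumentation-all
require "opentelemetry/sdk"
require "opentelemetry/instrumentation/all"

OpenTelemetry::SDK.configure do |c|
  c.service_name = "ecommerce-app"   # assumed service name
  c.use_all                          # auto-instruments Rails, ActiveRecord, Net::HTTP, and more
end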
Their engineering team is now fully autonomous, making infrastructure changes confidently with Terraform and deploying multiple times per day.
Have a Similar Problem?
Let's talk. We'll figure out if we can help and give you a clear plan.
Book a Free Call