From 12-Hour Deployments to 8 Minutes: E-Commerce Platform Transformation

E-Commerce / Retail • 3 months

Results Achieved

  • Deployment time reduced from 12 hours to 8 minutes
  • Uptime improved from 97.2% to 99.98%
  • Deploy frequency increased from 2x/month to 25x/week
  • Incident count reduced by 82%
  • Rollback time decreased from 4 hours to 3 minutes

Technology Stack

AWS ECS Fargate, GitHub Actions, Terraform, PostgreSQL RDS, CloudFront, Route 53

The Challenge

A growing e-commerce platform was hitting scaling walls. Its deployment process had become a bottleneck:

  • Manual coordination between 5 teams
  • A 12-hour deployment window (Friday nights)
  • Manually run database migrations
  • No way to deploy without downtime
  • Rollbacks that took 4+ hours

Worse, they were experiencing 2-3 production incidents per month, with an average MTTR (mean time to recovery) of 90 minutes.

As the business grew, this became untenable. Black Friday was approaching, and they couldn’t risk a deployment failure during peak season.

Discovery Phase

We spent two weeks understanding their system:

Architecture:

  • Monolithic Rails application on EC2
  • Single PostgreSQL RDS instance
  • Static assets on S3 + CloudFront
  • Manual infrastructure provisioning

Pain Points:

  1. No automated testing in CI
  2. Manual SSH deploys with shell scripts
  3. No infrastructure as code
  4. Single point of failure (one database, one app server pool)
  5. No observability (logs scattered, no metrics)

Team:

  • 12 engineers
  • 1 DevOps engineer (overwhelmed)
  • No formal incident response process

The Solution

Phase 1: Foundation (Weeks 1-4)

Goal: Establish baseline reliability and observability.

Actions:

  1. Migrated infrastructure to Terraform
  2. Enabled Multi-AZ RDS for automatic failover
  3. Set up structured logging with CloudWatch
  4. Implemented basic monitoring (Prometheus + Grafana)
  5. Created runbooks for common incidents

Results:

  • Infrastructure now version-controlled
  • Database failover tested and working (<60s downtime)
  • Clear visibility into system health

Phase 2: Containerization (Weeks 5-8)

Goal: Enable fast, consistent deployments.

Actions:

  1. Dockerized the Rails application
  2. Set up ECR (Elastic Container Registry)
  3. Deployed to ECS Fargate with blue/green deployments
  4. Configured Application Load Balancer with health checks
  5. Automated database migrations as part of deployment

Docker optimization:

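# Multi-stage build for smaller images
FROM ruby:3.2-alpine AS builder

# Compiler toolchain and PostgreSQL headers for native gems such as pg
RUN apk add --no-cache build-base postgresql-dev

WORKDIR /app
COPY Gemfile* ./
# Bundler 2 deprecates the --without flag in favor of a config setting
RUN bundle config set --local without 'development test' && \
    bundle install

FROM ruby:3.2-alpine

# Runtime-only packages: libpq for the pg gem, tzdata for Rails,
# curl for the container health check
RUN apk add --no-cache libpq tzdata curl

WORKDIR /app
COPY --from=builder /usr/local/bundle /usr/local/bundle
COPY . .

# Precompile assets
RUN RAILS_ENV=production bundle exec rake assets:precompile

CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]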

ECS Task Definition:

{
  "family": "ecommerce-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [{
    "name": "app",
    "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:latest",
    "portMappings": [{
      "containerPort": 3000,
      "protocol": "tcp"
    }],
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/ecommerce",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "app"
      }
    }
  }]
}
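
The health check settings above (interval 30, timeout 5, retries 3) follow the Docker health check model, in which a container is marked unhealthy after `retries` consecutive failed probes. The sketch below is a simplified model of that rule in Python, not ECS's actual implementation; the function name and the list-of-booleans input are hypothetical, and it ignores timeouts and any start period.

```python
def first_unhealthy_probe(probe_results, retries=3):
    """Return the 1-based index of the probe at which a container would be
    marked unhealthy, or None if it never is.

    probe_results is a sequence of booleans (True = probe passed).
    This is a simplified, hypothetical model of the `retries` rule.
    """
    consecutive_failures = 0
    for index, passed in enumerate(probe_results, start=1):
        # A passing probe resets the streak; a failing probe extends it.
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= retries:
            return index
    return None


# Two failures followed by a success do not trip the threshold;
# three failures in a row do.
flaky = [True, False, False, True, False, False, False]
print(first_unhealthy_probe(flaky))
```

With a 30-second interval and 3 retries, a failing container would be detected on the order of 90 seconds after its first failed probe, which is worth keeping in mind when estimating how quickly a bad deployment is caught.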

Results:

  • Consistent environments (dev and prod run the same container image)
  • Deployments became reproducible
  • Rollbacks took <5 minutes

Phase 3: CI/CD Pipeline (Weeks 9-12)

Goal: Automate testing, image builds, and deployments end to end.

GitHub Actions Workflow:

name: Deploy to Production

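on:
  push:
    branches: [main]

# Required so configure-aws-credentials can assume the IAM role via OIDC
permissions:
  id-token: write
  contents: read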

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.2'  # quoted so YAML treats it as a string, not a float
          bundler-cache: true
      - name: Run tests
        run: bundle exec rspec
      - name: Run linters
        run: bundle exec rubocop

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: us-east-1
      - name: Login to ECR
        run: aws ecr get-login-password | docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
      - name: Build and push image
        run: |
          docker build -t ecommerce:${{ github.sha }} .
          docker tag ecommerce:${{ github.sha }} 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}
          docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}

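  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      # Each job runs on a fresh runner, so credentials must be configured again
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-actions
          aws-region: us-east-1
      - name: Register task definition for this image
        run: |
          # Task definition revisions are integers, not commit SHAs, so register
          # a new revision that points at the image built for this commit.
          IMAGE="123456789.dkr.ecr.us-east-1.amazonaws.com/ecommerce:${{ github.sha }}"
          aws ecs describe-task-definition \
            --task-definition ecommerce-app \
            --query taskDefinition \
            | jq --arg IMAGE "$IMAGE" \
                '.containerDefinitions[0].image = $IMAGE
                 | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
                       .compatibilities, .registeredAt, .registeredBy)' \
            > taskdef.json
          aws ecs register-task-definition --cli-input-json file://taskdef.json
      - name: Deploy to ECS
        run: |
          # A bare family name resolves to the latest ACTIVE revision
          aws ecs update-service \
            --cluster production \
            --service ecommerce-app \
            --task-definition ecommerce-app
      - name: Wait for deployment
        run: aws ecs wait services-stable --cluster production --services ecommerce-app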

Database Migration Strategy:

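  • Run migrations as a separate ECS task before deploying the app
  • Use backward-compatible migrations (additive changes only), so the old and new app versions can both run against the migrated schema
  • Never drop a column in the release that stops using it; drop it in a later release, once no running code reads it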
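
To illustrate why additive migrations are safe to run ahead of the app rollout, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `customers` table, the `phone` column, and the function names are all hypothetical. The point is that code from the old release keeps working after the migration, because it never references the new column.

```python
import sqlite3


def old_release_insert(conn, email):
    """Code from the currently deployed release: unaware of new columns."""
    conn.execute("INSERT INTO customers (email) VALUES (?)", (email,))


def new_release_insert(conn, email, phone):
    """Code from the release being rolled out: writes the new column too."""
    conn.execute(
        "INSERT INTO customers (email, phone) VALUES (?, ?)", (email, phone)
    )


def expand_migration(conn):
    """Additive change: a nullable column that old code can safely ignore."""
    conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
)

old_release_insert(conn, "a@example.com")               # before the migration
expand_migration(conn)                                  # migration runs first
old_release_insert(conn, "b@example.com")               # old tasks still serving
new_release_insert(conn, "c@example.com", "555-0100")   # new tasks coming up

rows = conn.execute("SELECT email, phone FROM customers ORDER BY id").fetchall()
print(rows)
```

Dropping or renaming a column in the same step would break `old_release_insert`-style code that is still serving traffic during the rollout, which is why destructive changes are deferred to a later release.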

Results:

  • Push to main → automated deploy in 8 minutes
  • Zero-downtime deployments
  • Automatic rollback on health check failures

The Results

Deployment Metrics

Metric              Before     After      Improvement
Deployment Time     12 hours   8 minutes  98.9% faster
Deploy Frequency    2x/month   25x/week   50x increase
Rollback Time       4 hours    3 minutes  98.8% faster
Failed Deployments  18%        0.6%       97% reduction
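
The percentage figures in the Improvement column can be reproduced with a simple relative-reduction formula. The sketch below uses a hypothetical helper name and converts durations to minutes before comparing them.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Durations are converted to minutes so both sides use the same unit.
deploy_time = percent_reduction(12 * 60, 8)     # 12 hours -> 8 minutes
rollback_time = percent_reduction(4 * 60, 3)    # 4 hours -> 3 minutes
failed_deploys = percent_reduction(18, 0.6)     # 18% -> 0.6% failure rate

for label, value in [
    ("Deployment time", deploy_time),
    ("Rollback time", rollback_time),
    ("Failed deployments", failed_deploys),
]:
    print(f"{label}: {value:.2f}% reduction")
```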

Reliability Metrics

Metric           Before      After      Improvement
Uptime           97.2%       99.98%     ~245 hours → ~1.75 hours downtime/year
MTTR             90 minutes  8 minutes  91% faster recovery
Incidents/Month  2.8         0.5        82% reduction
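
Uptime percentages are easier to reason about as hours of downtime per year. A minimal sketch of the conversion, assuming a 365-day year and a hypothetical function name:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(uptime_percent):
    """Convert an uptime percentage into expected hours of downtime per year."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


before = annual_downtime_hours(97.2)
after = annual_downtime_hours(99.98)
print(f"{before:.1f} h/year -> {after:.2f} h/year")
```

Under these assumptions, 97.2% uptime corresponds to roughly 245 hours of downtime per year and 99.98% to roughly 1.75 hours.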

Business Impact

  • Black Friday: Zero downtime during peak traffic (3x normal load)
  • Engineering Velocity: 40% more features shipped per quarter
  • Team Morale: Friday night deployments eliminated

Key Lessons

  1. Incremental > Big Bang: We didn’t rebuild everything at once. Each phase delivered value.

  2. Observability First: You can’t improve what you can’t measure. Logging and monitoring enabled everything else.

  3. Automate the Boring Stuff: Manual deployments are error-prone and drain team morale.

  4. Test Your Rollbacks: We practiced rollback procedures monthly. When production issues happened, muscle memory kicked in.

  5. Blue/Green Deployments Are Worth It: The ability to roll back instantly by rerouting traffic saved us multiple times.

What They Said

“Before BugaOps, deployments were stressful. Now they’re boring—in the best way. We deploy 5 times a day without thinking about it.”

— VP of Engineering

“Our Black Friday was flawless. Zero downtime, zero incidents. That’s never happened before.”

— CTO

Next Steps

Post-project, we’ve continued to support them with:

  • Cost optimization (reduced AWS bill by 22% through rightsizing)
  • Advanced observability (distributed tracing with OpenTelemetry)
  • Disaster recovery planning and testing

Their engineering team is now fully autonomous, making infrastructure changes confidently with Terraform and deploying multiple times per day.
