July 18, 2024  ·  5 min read

Observability 101: Metrics, Logs, and Traces That Actually Help

Move beyond basic monitoring to observability that helps you understand and debug production systems.

Tags: Observability, Monitoring, Debugging

“The service is slow” is a symptom. Observability helps you find the cause.

Traditional monitoring tells you what is broken. Observability tells you why.

The Three Pillars

Metrics: The “What”

Metrics are numbers measured over time: CPU usage, request latency, error rates.

Good metrics are:

  • Aggregatable: You can sum, average, or take percentiles of them
  • Cardinality-conscious: Don’t use labels that create millions of unique combinations
  • Actionable: Each metric should inform a decision
# Prometheus metrics in Python
from prometheus_client import Counter, Histogram, Gauge

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

The Golden Signals:

  1. Latency: How long requests take
  2. Traffic: How many requests you’re handling
  3. Errors: How many requests are failing
  4. Saturation: How “full” your service is

Track these four for every service.
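
A minimal sketch of wiring the four signals to the Prometheus metrics defined above, with in-flight connections standing in for saturation (the track_request wrapper is illustrative, not tied to any framework):

import time

# Assumes request_count, request_duration, and active_connections
# from the earlier snippet are in scope.
def track_request(method, endpoint, handler):
    active_connections.inc()              # saturation: requests in flight
    start = time.monotonic()
    status = "500"                        # assume failure until the handler returns
    try:
        response = handler()              # your real request handler
        status = str(getattr(response, "status_code", 200))
        return response
    finally:
        elapsed = time.monotonic() - start
        request_duration.labels(method, endpoint).observe(elapsed)   # latency
        request_count.labels(method, endpoint, status).inc()         # traffic + errors
        active_connections.dec()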

Logs: The “Details”

Logs are discrete events with context.

Structured logging beats plain text:

# Bad
logger.info(f"User {user_id} logged in from {ip_address}")

# Good
logger.info(
    "User logged in",
    extra={
        "event": "user_login",
        "user_id": user_id,
        "ip_address": ip_address,
        "session_id": session_id,
        "timestamp": datetime.utcnow().isoformat()
    }
)

Structured logs are queryable. You can ask “Show me all failed login attempts for user 1234 in the last hour.”

Log levels matter:

  • DEBUG: Detailed internal state (disabled in prod)
  • INFO: Normal operations (login, logout, job completed)
  • WARN: Something unexpected but handled (retry succeeded, deprecated API used)
  • ERROR: Something failed that shouldn’t (database connection lost, payment failed)
  • CRITICAL: System-level failure (out of memory, data corruption)

Don’t log everything at INFO. We’ve seen systems generating 10GB of logs per hour because someone logged every database query.
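
One way to keep that in check is to make the level configurable, so DEBUG stays on locally and off in production. A minimal sketch with the standard library (the LOG_LEVEL variable name is an assumption, not a standard):

import logging
import os

# Read the level from the environment; default to INFO
logging.basicConfig(
    level=os.environ.get("LOG_LEVEL", "INFO").upper(),
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

logger = logging.getLogger(__name__)
logger.debug("Only emitted when LOG_LEVEL=DEBUG")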

Traces: The “Journey”

Traces follow a single request across multiple services.

Scenario: A user loads a page, which:

  1. Hits the API gateway (12ms)
  2. Calls the auth service (45ms)
  3. Queries the database (230ms)
  4. Calls a third-party API (890ms)
  5. Returns a response

Total: 1,177ms. But where is the time spent?

Traces show you: 75% of the time is in that third-party API call.

# OpenTelemetry tracing
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        with tracer.start_as_current_span("validate_payment"):
            validate_payment(order_id)
        
        with tracer.start_as_current_span("update_inventory"):
            update_inventory(order_id)
        
        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation_email(order_id)

Now you see exactly which operation is slow.
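
One caveat: the snippet assumes a configured SDK; until a tracer provider is set, those spans are no-ops. A minimal sketch of that setup, using the console exporter for illustration (in production you would typically swap in an OTLP exporter pointed at your collector):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Install an SDK tracer provider and ship finished spans to an exporter
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)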

Putting It Together

Example: Debugging High Latency

  1. Metrics show P95 latency spiked from 200ms to 3 seconds
  2. Traces reveal that 90% of the slow requests spend their time in the database query
  3. Logs show “Query timeout after 2900ms” errors
  4. Metrics (database) show CPU at 95%

Root cause: Slow query overwhelming the database.

Without observability, you’re guessing. With it, you have a clear path to the problem.

Practical Implementation

Start with the Basics

Week 1:

  • Add structured logging
  • Instrument HTTP requests (latency, status codes)
  • Set up basic dashboards

Week 2:

  • Add database query metrics
  • Implement health check endpoints (see the sketch after this list)
  • Configure alerting on error rate spikes
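
For the health check item, a minimal sketch (Flask is only an example framework; the /healthz path and the database check are illustrative):

from flask import Flask, jsonify

app = Flask(__name__)

def database_is_reachable() -> bool:
    # Placeholder: run a cheap query such as SELECT 1 against your database
    return True

@app.route("/healthz")
def healthz():
    healthy = database_is_reachable()
    body = {"status": "ok" if healthy else "degraded"}
    return jsonify(body), (200 if healthy else 503)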

Week 3:

  • Implement distributed tracing
  • Add custom business metrics
  • Create runbooks for common issues

Pick Your Stack

Budget-friendly:

  • Prometheus + Grafana (metrics + dashboards)
  • Loki (logs)
  • Jaeger (traces)

Cloud-native:

  • CloudWatch (AWS)
  • Cloud Monitoring (GCP)
  • Azure Monitor

Observability platforms:

  • Datadog
  • New Relic
  • Honeycomb
  • Grafana Cloud

Choose based on your team’s size, budget, and complexity. Don’t over-engineer.

Alerting: The Hard Part

Alerts should be actionable and rare.

Bad alert: “CPU > 80% for 5 minutes”

What do you do? Maybe nothing—maybe 80% is normal during batch jobs.

Good alert: “P95 latency > 1s for 10 minutes AND error rate > 5% for 5 minutes”

This indicates user impact. It requires action.

Alert Fatigue is Real

If your team ignores pages, your alerts are wrong. Common mistakes:

  • Too sensitive: Alerts fire constantly for non-issues
  • Too late: By the time it alerts, customers have already complained
  • Not actionable: “Something is wrong” with no guidance on what to do

Fix this:

  1. Every alert must have a runbook
  2. Review alerts monthly—delete ones that never lead to action
  3. Use warning vs. critical tiers

Observability for Business Metrics

Don’t just monitor infrastructure—track business KPIs:

# Track business events
revenue_counter = Counter('revenue_total', 'Total revenue in cents', ['currency'])
signup_counter = Counter('user_signups_total', 'Total user signups', ['plan_type'])
checkout_duration = Histogram('checkout_duration_seconds', 'Time to complete checkout')

# Use them
revenue_counter.labels(currency='USD').inc(4999)  # $49.99
signup_counter.labels(plan_type='premium').inc()

Now you can alert on “signups dropped 50% in the last hour”—a business problem, not just a technical one.

Common Pitfalls

Over-instrumenting: Don’t track everything. Focus on what matters.

Sampling away errors: Sampling is fine for traces, but not for error logs: you want every error.

Ignoring cardinality: Metrics with user IDs or session IDs explode your monitoring costs.
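
For example, with Prometheus labels (a hypothetical illustration):

from prometheus_client import Counter

# Bad: one time series per user, so cardinality grows without bound
logins_by_user = Counter('user_logins_by_user_total', 'Logins', ['user_id'])

# Better: bounded label values; keep the user ID in logs or trace attributes
logins = Counter('user_logins_total', 'Logins', ['method', 'status'])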

No retention strategy: Logs and traces are expensive. Keep high-resolution data for 7-30 days, then aggregate or delete.

The Payoff

With good observability:

  • MTTR (mean time to recovery) drops from hours to minutes
  • You catch issues before customers report them
  • Debugging shifts from “guess and test” to “look and fix”
  • You make data-driven infrastructure decisions

One client reduced incident response time from 90 minutes to 12 minutes after implementing structured observability.

That’s the difference between a 3-hour outage and a 15-minute blip.

Start Simple

You don’t need a perfect observability setup. Start with:

  1. Structured logging
  2. The four golden signals
  3. One dashboard everyone looks at

Build from there. Observability is a practice, not a project.

Questions?

Need help applying this?

We implement these strategies for clients every day. Want to talk about your infrastructure?

Start a conversation