Observability 101: Metrics, Logs, and Traces That Actually Help

July 18, 2024 • 5 min read


Topics: Observability, Monitoring, Debugging

What You'll Learn

This article breaks down metrics, logs, and traces into practical, actionable steps you can implement today.

“The service is slow” is a symptom. Observability helps you find the cause.

Traditional monitoring tells you what is broken. Observability tells you why.

The Three Pillars

Metrics: The “What”

Metrics are numbers over time. CPU usage, request latency, error rates.

Good metrics are:

  • Aggregatable: You can sum, average, or percentile them
  • Cardinality-conscious: Don’t track metrics with millions of unique label combinations
  • Actionable: Each metric should inform a decision

# Prometheus metrics in Python
from prometheus_client import Counter, Histogram, Gauge

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

The Golden Signals:

  1. Latency: How long requests take
  2. Traffic: How many requests you’re handling
  3. Errors: How many requests are failing
  4. Saturation: How “full” your service is

Track these four for every service.
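
To make this concrete, here is a minimal sketch that wires the Prometheus metrics defined above into a request handler, covering latency, traffic, and errors, with the in-flight gauge as a rough saturation proxy. The wrapper and the assumption that a handler returns a (response, status) pair are illustrative, not a real framework API; only the prometheus_client calls are the library's actual interface.

import time

def instrumented(method, endpoint, handler):
    """Wrap a request handler with the golden-signal metrics defined above."""
    def wrapper(*args, **kwargs):
        active_connections.inc()              # saturation proxy: requests in flight
        start = time.monotonic()
        status = "500"                        # assume failure unless the handler returns
        try:
            response, status = handler(*args, **kwargs)   # hypothetical (response, status) contract
            return response
        finally:
            request_duration.labels(method, endpoint).observe(time.monotonic() - start)  # latency
            request_count.labels(method, endpoint, str(status)).inc()  # traffic + errors via status
            active_connections.dec()
    return wrapper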

Logs: The “Details”

Logs are discrete events with context.

Structured logging beats plain text:

# Bad
logger.info(f"User {user_id} logged in from {ip_address}")

# Good
logger.info(
    "User logged in",
    extra={
        "event": "user_login",
        "user_id": user_id,
        "ip_address": ip_address,
        "session_id": session_id,
        "timestamp": datetime.utcnow().isoformat()
    }
)

Structured logs are queryable. You can ask “Show me all failed login attempts for user 1234 in the last hour.”
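
For those extra fields to end up somewhere queryable, the logger needs a formatter that actually emits them. A minimal stdlib-only sketch is below; the whitelist of field names matches the example above and is an assumption, and in practice a library such as python-json-logger does this more thoroughly.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("event", "user_id", "ip_address", "session_id", "timestamp"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)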

Log levels matter:

  • DEBUG: Detailed internal state (disabled in prod)
  • INFO: Normal operations (login, logout, job completed)
  • WARN: Something unexpected but handled (retry succeeded, deprecated API used)
  • ERROR: Something failed that shouldn’t (database connection lost, payment failed)
  • CRITICAL: System-level failure (out of memory, data corruption)

Don’t log everything at INFO. We’ve seen systems generating 10GB of logs per hour because someone logged every database query.
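
A practical lever for that problem, as a sketch: keep your own code at INFO and raise the threshold on chatty libraries. The logger names here (SQLAlchemy's query logger, urllib3) are just examples; check what your dependencies actually use.

import logging

# Sensible production default for application code.
logging.basicConfig(level=logging.INFO)

# Quiet libraries that log every query or HTTP call.
logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)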

Traces: The “Journey”

Traces follow a single request across multiple services.

Scenario: A user loads a page, which:

  1. Hits the API gateway (12ms)
  2. Calls the auth service (45ms)
  3. Queries the database (230ms)
  4. Calls a third-party API (890ms)
  5. Returns a response

Total: 1,177ms. But where is the time spent?

Traces show you: 75% of the time is in that third-party API call.

# OpenTelemetry tracing
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        with tracer.start_as_current_span("validate_payment"):
            validate_payment(order_id)
        
        with tracer.start_as_current_span("update_inventory"):
            update_inventory(order_id)
        
        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation_email(order_id)

Now you see exactly which operation is slow.
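
One setup detail worth noting: spans like these only go anywhere once a tracer provider and exporter are registered at startup. A minimal sketch using the OpenTelemetry SDK's console exporter follows; in production you would typically swap in an OTLP exporter pointed at your collector or tracing backend.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register once at startup. ConsoleSpanExporter prints spans to stdout,
# which is enough to verify instrumentation before wiring up a real backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)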

Putting It Together

Example: Debugging High Latency

  1. Metrics show P95 latency spiked from 200ms to 3 seconds
  2. Traces reveal that 90% of the slow requests spend their time in the database query
  3. Logs show “Query timeout after 2900ms” errors
  4. Metrics (database) show CPU at 95%

Root cause: Slow query overwhelming the database.

Without observability, you’re guessing. With it, you have a clear path to the problem.

Practical Implementation

Start with the Basics

Week 1:

  • Add structured logging
  • Instrument HTTP requests (latency, status codes)
  • Set up basic dashboards

Week 2:

  • Add database query metrics
  • Implement health check endpoints (see the sketch after this checklist)
  • Configure alerting on error rate spikes

Week 3:

  • Implement distributed tracing
  • Add custom business metrics
  • Create runbooks for common issues
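
A health check endpoint can stay very small. Here is a sketch using Flask; the check_database() helper is hypothetical, so substitute whatever cheap dependency checks matter for your service.

from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Hypothetical dependency check; replace with a cheap real query (e.g. SELECT 1).
    return True

@app.route("/healthz")
def healthz():
    db_ok = check_database()
    code = 200 if db_ok else 503
    return jsonify({"status": "ok" if db_ok else "degraded", "database": db_ok}), code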

Pick Your Stack

Budget-friendly:

  • Prometheus + Grafana (metrics + dashboards)
  • Loki (logs)
  • Jaeger (traces)

Cloud-native:

  • CloudWatch (AWS)
  • Cloud Monitoring (GCP)
  • Azure Monitor

Observability platforms:

  • Datadog
  • New Relic
  • Honeycomb
  • Grafana Cloud

Choose based on your team’s size, budget, and complexity. Don’t over-engineer.

Alerting: The Hard Part

Alerts should be actionable and rare.

Bad alert: “CPU > 80% for 5 minutes”

What do you do? Maybe nothing—maybe 80% is normal during batch jobs.

Good alert: “P95 latency > 1s for 10 minutes AND error rate > 5% for 5 minutes”

This indicates user impact. It requires action.

Alert Fatigue is Real

If your team ignores pages, your alerts are wrong. Common mistakes:

  • Too sensitive: Alerts fire constantly for non-issues
  • Too late: By the time it alerts, customers have already complained
  • Not actionable: “Something is wrong” with no guidance on what to do

Fix this:

  1. Every alert must have a runbook
  2. Review alerts monthly—delete ones that never lead to action
  3. Use warning vs. critical tiers

Observability for Business Metrics

Don’t just monitor infrastructure—track business KPIs:

# Track business events
revenue_counter = Counter('revenue_total', 'Total revenue in cents', ['currency'])
signup_counter = Counter('user_signups_total', 'Total user signups', ['plan_type'])
checkout_duration = Histogram('checkout_duration_seconds', 'Time to complete checkout')

# Use them
revenue_counter.labels(currency='USD').inc(4999)  # $49.99
signup_counter.labels(plan_type='premium').inc()

Now you can alert on “signups dropped 50% in the last hour”—a business problem, not just a technical one.

Common Pitfalls

Over-instrumenting: Don’t track everything. Focus on what matters.

Under-sampling: Sampling is fine for traces, but not for error logs—you want every error.
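
For trace sampling specifically, here is a sketch of head sampling with the OpenTelemetry SDK; the 10% ratio is an arbitrary example, not a recommendation.

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces, but follow the parent's sampling decision
# so a single request isn't half-recorded across services.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))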

Ignoring cardinality: Metrics with user IDs or session IDs explode your monitoring costs.

No retention strategy: Logs and traces are expensive. Keep high-resolution data for 7-30 days, then aggregate or delete.

The Payoff

With good observability:

  • MTTR (mean time to recovery) drops from hours to minutes
  • You catch issues before customers report them
  • Debugging shifts from “guess and test” to “look and fix”
  • You make data-driven infrastructure decisions

One client reduced incident response time from 90 minutes to 12 minutes after implementing structured observability.

That’s the difference between a 3-hour outage and a 15-minute blip.

Start Simple

You don’t need a perfect observability setup. Start with:

  1. Structured logging
  2. The four golden signals
  3. One dashboard everyone looks at

Build from there. Observability is a practice, not a project.

Questions About This?

We implement these strategies for clients every day. Want to discuss how they apply to your infrastructure?

Let's Talk
