“The service is slow” is a symptom. Observability helps you find the cause.
Traditional monitoring tells you what is broken. Observability tells you why.
The Three Pillars
Metrics: The “What”
Metrics are numbers over time. CPU usage, request latency, error rates.
Good metrics are:
- Aggregatable: You can sum, average, or percentile them
- Cardinality-conscious: Don’t track metrics with millions of unique combinations
- Actionable: Each metric should inform a decision
# Prometheus metrics in Python
from prometheus_client import Counter, Histogram, Gauge

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)
The Golden Signals:
- Latency: How long requests take
- Traffic: How many requests you’re handling
- Errors: How many requests are failing
- Saturation: How “full” your service is
Track these four for every service.
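One way to capture all four from a single request path is to wrap the handler, reusing the metrics defined above. The handle_request wrapper and its arguments are an illustrative sketch, not any particular framework's API:
import time

def handle_request(method, endpoint, handler):
    """Wrap one request so it feeds latency, traffic, errors, and saturation."""
    active_connections.inc()      # saturation: in-flight requests
    start = time.time()
    status = "500"                # assume failure unless the handler returns normally
    try:
        response = handler()
        status = "200"
        return response
    finally:
        request_count.labels(method, endpoint, status).inc()                    # traffic and errors
        request_duration.labels(method, endpoint).observe(time.time() - start)  # latency
        active_connections.dec()
Errors fall out of the same counter as traffic, so you can alert on the ratio of error-status requests to total requests rather than maintaining a separate metric.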
Logs: The “Details”
Logs are discrete events with context.
Structured logging beats plain text:
# Bad
logger.info(f"User {user_id} logged in from {ip_address}")

# Good
logger.info(
    "User logged in",
    extra={
        "event": "user_login",
        "user_id": user_id,
        "ip_address": ip_address,
        "session_id": session_id,
        "timestamp": datetime.utcnow().isoformat()
    }
)
Structured logs are queryable. You can ask “Show me all failed login attempts for user 1234 in the last hour.”
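Getting those extra fields into queryable form means emitting machine-parseable output. A minimal stdlib-only sketch of a JSON formatter (the class and its field handling are one possible approach, not a specific logging library's API):
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record, plus any fields passed via extra=, as one JSON object per line."""

    STANDARD_FIELDS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything beyond the standard record attributes came in via extra=.
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in self.STANDARD_FIELDS}
        )
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
Each record then lands as one JSON object per line, so a log store can filter on fields like user_id or event directly.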
Log levels matter:
- DEBUG: Detailed internal state (disabled in prod)
- INFO: Normal operations (login, logout, job completed)
- WARN: Something unexpected but handled (retry succeeded, deprecated API used)
- ERROR: Something failed that shouldn’t (database connection lost, payment failed)
- CRITICAL: System-level failure (out of memory, data corruption)
Don’t log everything at INFO. We’ve seen systems generating 10GB of logs per hour because someone logged every database query.
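Keeping DEBUG out of production is mostly configuration. A minimal sketch, assuming a LOG_LEVEL environment variable:
import logging
import os

# Default to INFO in production; export LOG_LEVEL=DEBUG locally when you need the detail.
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))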
Traces: The “Journey”
Traces follow a single request across multiple services.
Scenario: A user loads a page, which:
- Hits the API gateway (12ms)
- Calls the auth service (45ms)
- Queries the database (230ms)
- Calls a third-party API (890ms)
- Returns a response
Total: 1,177ms. But where is the time spent?
Traces show you: 75% of the time is in that third-party API call.
# OpenTelemetry tracing
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_payment"):
            validate_payment(order_id)

        with tracer.start_as_current_span("update_inventory"):
            update_inventory(order_id)

        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation_email(order_id)
Now you see exactly which operation is slow.
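The snippet above only creates spans; for them to be exported anywhere, the SDK needs a tracer provider configured at startup. A minimal sketch assuming the opentelemetry-sdk package, with a console exporter standing in for a real backend such as Jaeger:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Send finished spans somewhere; in production this would point at a collector or Jaeger.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)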
Putting It Together
Example: Debugging High Latency
- Metrics show P95 latency spiked from 200ms to 3 seconds
- Traces reveal 90% of requests are slow on the database query
- Logs show “Query timeout after 2900ms” errors
- Database metrics show CPU at 95%
Root cause: Slow query overwhelming the database.
Without observability, you’re guessing. With it, you have a clear path to the problem.
Practical Implementation
Start with the Basics
Week 1:
- Add structured logging
- Instrument HTTP requests (latency, status codes)
- Set up basic dashboards
Week 2:
- Add database query metrics
- Implement health check endpoints (see the sketch after this list)
- Configure alerting on error rate spikes
Week 3:
- Implement distributed tracing
- Add custom business metrics
- Create runbooks for common issues
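For the health check endpoints in Week 2, the shape matters more than the framework: report an overall status plus per-dependency results so an alert can point at the failing piece. A framework-agnostic sketch; check_database and check_cache are hypothetical placeholders for your service's real dependencies:
def health_check():
    """Run each dependency check; overall status is 'ok' only if every check passes."""
    checks = {
        "database": check_database(),   # hypothetical: True if a trivial query succeeds
        "cache": check_cache(),         # hypothetical: True if a ping round-trips
    }
    healthy = all(checks.values())
    body = {"status": "ok" if healthy else "degraded", "checks": checks}
    return body, 200 if healthy else 503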
Pick Your Stack
Budget-friendly:
- Prometheus + Grafana (metrics + dashboards)
- Loki (logs)
- Jaeger (traces)
Cloud-native:
- CloudWatch (AWS)
- Cloud Monitoring (GCP)
- Azure Monitor
Observability platforms:
- Datadog
- New Relic
- Honeycomb
- Grafana Cloud
Choose based on your team’s size, budget, and complexity. Don’t over-engineer.
Alerting: The Hard Part
Alerts should be actionable and rare.
Bad alert: “CPU > 80% for 5 minutes”
What do you do? Maybe nothing—maybe 80% is normal during batch jobs.
Good alert: “P95 latency > 1s for 10 minutes AND error rate > 5% for 5 minutes”
This indicates user impact. It requires action.
Alert Fatigue is Real
If your team ignores pages, your alerts are wrong. Common mistakes:
- Too sensitive: Alerts fire constantly for non-issues
- Too late: By the time it alerts, customers have already complained
- Not actionable: “Something is wrong” with no guidance on what to do
Fix this:
- Every alert must have a runbook
- Review alerts monthly—delete ones that never lead to action
- Use warning vs. critical tiers
Observability for Business Metrics
Don’t just monitor infrastructure—track business KPIs:
# Track business events
revenue_counter = Counter('revenue_total', 'Total revenue in cents', ['currency'])
signup_counter = Counter('user_signups_total', 'Total user signups', ['plan_type'])
checkout_duration = Histogram('checkout_duration_seconds', 'Time to complete checkout')
# Use them
revenue_counter.labels(currency='USD').inc(4999) # $49.99
signup_counter.labels(plan_type='premium').inc()
Now you can alert on “signups dropped 50% in the last hour”—a business problem, not just a technical one.
Common Pitfalls
Over-instrumenting: Don’t track everything. Focus on what matters.
Under-sampling: Sampling is fine for traces, but not for error logs—you want every error.
Ignoring cardinality: Metrics with user IDs or session IDs explode your monitoring costs (see the example below).
No retention strategy: Logs and traces are expensive. Keep high-resolution data for 7-30 days, then aggregate or delete.
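To make the cardinality point concrete, here is the same idea labeled both ways; the metric and label names are illustrative:
from prometheus_client import Counter

# Bad: one time series per user, so cardinality grows without bound
requests_by_user = Counter('requests_by_user_total', 'Requests per user', ['user_id'])

# Better: a small, fixed set of label values; find individual users in logs or traces instead
requests_by_plan = Counter('requests_by_plan_total', 'Requests per plan tier', ['plan_type', 'endpoint'])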
The Payoff
With good observability:
- MTTR (mean time to recovery) drops from hours to minutes
- You catch issues before customers report them
- Debugging shifts from “guess and test” to “look and fix”
- You make data-driven infrastructure decisions
One client reduced incident response time from 90 minutes to 12 minutes after implementing structured observability.
That’s the difference between a 3-hour outage and a 15-minute blip.
Start Simple
You don’t need a perfect observability setup. Start with:
- Structured logging
- The four golden signals
- One dashboard everyone looks at
Build from there. Observability is a practice, not a project.