“The service is slow” is a symptom. Observability helps you find the cause.
Traditional monitoring tells you what is broken. Observability tells you why.
The Three Pillars
Metrics: The “What”
Metrics are numbers over time. CPU usage, request latency, error rates.
Good metrics are:
- Aggregatable: You can sum, average, or percentile them
- Cardinality-conscious: Don’t track metrics with millions of unique combinations
- Actionable: Each metric should inform a decision
# Prometheus metrics in Python
from prometheus_client import Counter, Histogram, Gauge

request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)
The Golden Signals:
- Latency: How long requests take
- Traffic: How many requests you’re handling
- Errors: How many requests are failing
- Saturation: How “full” your service is
Track these four for every service.
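One way to capture all four from a single request path is to wrap the handler, reusing the metrics defined above. The handle_request wrapper and its arguments are an illustrative sketch, not any particular framework's API:
import time

def handle_request(method, endpoint, handler):
    """Wrap one request so it feeds latency, traffic, errors, and saturation."""
    active_connections.inc()      # saturation: in-flight requests
    start = time.time()
    status = "500"                # assume failure unless the handler returns normally
    try:
        response = handler()
        status = "200"
        return response
    finally:
        request_count.labels(method, endpoint, status).inc()                    # traffic and errors
        request_duration.labels(method, endpoint).observe(time.time() - start)  # latency
        active_connections.dec()
Errors fall out of the same counter as traffic, so you can alert on the ratio of error-status requests to total requests rather than maintaining a separate metric.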
Logs: The “Details”
Logs are discrete events with context.
Structured logging beats plain text:
# Bad
logger.info(f"User {user_id} logged in from {ip_address}")

# Good
logger.info(
    "User logged in",
    extra={
        "event": "user_login",
        "user_id": user_id,
        "ip_address": ip_address,
        "session_id": session_id,
        "timestamp": datetime.utcnow().isoformat()
    }
)
Structured logs are queryable. You can ask “Show me all failed login attempts for user 1234 in the last hour.”
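Getting those extra fields into queryable form means emitting machine-parseable output. A minimal stdlib-only sketch of a JSON formatter (the class and its field handling are one possible approach, not a specific logging library's API):
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record, plus any fields passed via extra=, as one JSON object per line."""

    STANDARD_FIELDS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything beyond the standard record attributes came in via extra=.
        payload.update(
            {k: v for k, v in record.__dict__.items() if k not in self.STANDARD_FIELDS}
        )
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
Each record then lands as one JSON object per line, so a log store can filter on fields like user_id or event directly.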
Log levels matter:
- DEBUG: Detailed internal state (disabled in prod)
- INFO: Normal operations (login, logout, job completed)
- WARN: Something unexpected but handled (retry succeeded, deprecated API used)
- ERROR: Something failed that shouldn’t (database connection lost, payment failed)
- CRITICAL: System-level failure (out of memory, data corruption)
Don’t log everything at INFO. We’ve seen systems generating 10GB of logs per hour because someone logged every database query.
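Keeping DEBUG out of production is mostly configuration. A minimal sketch, assuming a LOG_LEVEL environment variable:
import logging
import os

# Default to INFO in production; export LOG_LEVEL=DEBUG locally when you need the detail.
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))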
Traces: The “Journey”
Traces follow a single request across multiple services.
Scenario: A user loads a page, which:
- Hits the API gateway (12ms)
- Calls the auth service (45ms)
- Queries the database (230ms)
- Calls a third-party API (890ms)
- Returns a response
Total: 1,177ms. But where is the time spent?
Traces show you: 75% of the time is in that third-party API call.
# OpenTelemetry tracing
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_payment"):
            validate_payment(order_id)

        with tracer.start_as_current_span("update_inventory"):
            update_inventory(order_id)

        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation_email(order_id)
Now you see exactly which operation is slow.
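The snippet above only creates spans; for them to be exported anywhere, the SDK needs a tracer provider configured at startup. A minimal sketch assuming the opentelemetry-sdk package, with a console exporter standing in for a real backend such as Jaeger:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Send finished spans somewhere; in production this would point at a collector or Jaeger.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)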
Putting It Together
Example: Debugging High Latency
- Metrics show P95 latency spiked from 200ms to 3 seconds
- Traces reveal 90% of requests are slow on the database query
- Logs show “Query timeout after 2900ms” errors
- Database metrics show CPU at 95%
Root cause: Slow query overwhelming the database.
Without observability, you’re guessing. With it, you have a clear path to the problem.
Practical Implementation
Start with the Basics
Week 1:
- Add structured logging
- Instrument HTTP requests (latency, status codes)
- Set up basic dashboards
Week 2:
- Add database query metrics
- Implement health check endpoints (see the sketch after this list)
- Configure alerting on error rate spikes
Week 3:
- Implement distributed tracing
- Add custom business metrics
- Create runbooks for common issues
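For the health check endpoints in Week 2, the shape matters more than the framework: report an overall status plus per-dependency results so an alert can point at the failing piece. A framework-agnostic sketch; check_database and check_cache are hypothetical placeholders for your service's real dependencies:
def health_check():
    """Run each dependency check; overall status is 'ok' only if every check passes."""
    checks = {
        "database": check_database(),   # hypothetical: True if a trivial query succeeds
        "cache": check_cache(),         # hypothetical: True if a ping round-trips
    }
    healthy = all(checks.values())
    body = {"status": "ok" if healthy else "degraded", "checks": checks}
    return body, 200 if healthy else 503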
Pick Your Stack
Budget-friendly:
- Prometheus + Grafana (metrics + dashboards)
- Loki (logs)
- Jaeger (traces)
Cloud-native:
- CloudWatch (AWS)
- Cloud Monitoring (GCP)
- Azure Monitor
Observability platforms:
- Datadog
- New Relic
- Honeycomb
- Grafana Cloud
Choose based on your team’s size, budget, and complexity. Don’t over-engineer.
Alerting: The Hard Part
Alerts should be actionable and rare.
Bad alert: “CPU > 80% for 5 minutes”
What do you do? Maybe nothing—maybe 80% is normal during batch jobs.
Good alert: “P95 latency > 1s for 10 minutes AND error rate > 5% for 5 minutes”
This indicates user impact. It requires action.
Alert Fatigue is Real
If your team ignores pages, your alerts are wrong. Common mistakes:
- Too sensitive: Alerts fire constantly for non-issues
- Too late: By the time it alerts, customers have already complained
- Not actionable: “Something is wrong” with no guidance on what to do
Fix this:
- Every alert must have a runbook
- Review alerts monthly—delete ones that never lead to action
- Use warning vs. critical tiers
Observability for Business Metrics
Don’t just monitor infrastructure—track business KPIs:
# Track business events
revenue_counter = Counter('revenue_total', 'Total revenue in cents', ['currency'])
signup_counter = Counter('user_signups_total', 'Total user signups', ['plan_type'])
checkout_duration = Histogram('checkout_duration_seconds', 'Time to complete checkout')
# Use them
revenue_counter.labels(currency='USD').inc(4999) # $49.99
signup_counter.labels(plan_type='premium').inc()
Now you can alert on “signups dropped 50% in the last hour”—a business problem, not just a technical one.
Common Pitfalls
Over-instrumenting: Don’t track everything. Focus on what matters.
Under-sampling: Sampling is fine for traces, but not for error logs—you want every error.
Ignoring cardinality: Metrics with user IDs or session IDs explode your monitoring costs (see the example below).
No retention strategy: Logs and traces are expensive. Keep high-resolution data for 7-30 days, then aggregate or delete.
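To make the cardinality point concrete, here is the same idea labeled both ways; the metric and label names are illustrative:
from prometheus_client import Counter

# Bad: one time series per user, so cardinality grows without bound
requests_by_user = Counter('requests_by_user_total', 'Requests per user', ['user_id'])

# Better: a small, fixed set of label values; find individual users in logs or traces instead
requests_by_plan = Counter('requests_by_plan_total', 'Requests per plan tier', ['plan_type', 'endpoint'])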
The Payoff
With good observability:
- MTTR (mean time to recovery) drops from hours to minutes
- You catch issues before customers report them
- Debugging shifts from “guess and test” to “look and fix”
- You make data-driven infrastructure decisions
One client reduced incident response time from 90 minutes to 12 minutes after implementing structured observability.
That’s the difference between a 3-hour outage and a 15-minute blip.
Start Simple
You don’t need a perfect observability setup. Start with:
- Structured logging
- The four golden signals
- One dashboard everyone looks at
Build from there. Observability is a practice, not a project.