Advanced DevOps

Monitoring & Observability

Logging, metrics, alerting, and debugging production systems.

Understanding Production

Observability is the ability to understand what is happening inside your systems from the signals they emit: logs, metrics, and traces.

Three Pillars

  • Logs: Event records with context
  • Metrics: Numerical measurements over time
  • Traces: Request flows across services
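
As a rough illustration of "event records with context", the sketch below uses Python's standard logging module to emit one JSON object per log line. The field names trace_id and service, the logger name "checkout", and the helper names make_logger and new_trace_id are assumptions made for this example. Real systems usually take the trace ID from their tracing library instead of generating it by hand.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical context fields, supplied through `extra=`.
            "trace_id": getattr(record, "trace_id", None),
            "service": getattr(record, "service", None),
        }
        return json.dumps(payload)


def make_logger(name="checkout"):
    """Build a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def new_trace_id():
    """Generate a random 32-character hex ID to tie log lines to one request."""
    return uuid.uuid4().hex


if __name__ == "__main__":
    log = make_logger()
    trace_id = new_trace_id()
    # Both lines share a trace_id, so they can be joined to the same request.
    log.info("payment started", extra={"trace_id": trace_id, "service": "checkout"})
    log.info("payment authorized", extra={"trace_id": trace_id, "service": "checkout"})
```

Putting the same trace_id on every line for a request is what lets a log search line up with the trace for that request.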

Key Metrics

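  • Latency: How long requests take, reported as percentiles (p50, p95, p99) rather than averages
  • Traffic: Demand on the system, such as requests per second
  • Errors: Percentage of requests that fail
  • Saturation: How full a constrained resource is (CPU, memory, disk, queue depth)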
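
The four metrics above can be computed from raw request samples roughly as follows. This is a minimal sketch: the Request type, the summarize and saturation_pct helpers, the nearest-rank percentile method, and the choice to count only 5xx statuses as errors are all assumptions for illustration. Monitoring systems typically estimate percentiles from histograms instead of sorting every sample.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize latency, traffic, and errors for one time window."""
    latencies = [r.latency_ms for r in requests]
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * errors / len(requests),
    }


def saturation_pct(used, capacity):
    """Saturation as the share of a resource's capacity currently in use."""
    return 100 * used / capacity


if __name__ == "__main__":
    samples = [Request(latency_ms=i, status=500 if i <= 5 else 200) for i in range(1, 101)]
    print(summarize(samples, window_seconds=10))
    print(saturation_pct(used=6, capacity=8))
```

Percentiles matter because an average hides the slow tail: a service can have a healthy mean latency while its p99 requests take many times longer.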

Alerting Best Practices

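  • Alert on symptoms users experience (errors, slow responses), not on every possible cause
  • Avoid alert fatigue: only page for conditions that need human action
  • Have a runbook for each alert so responders know where to start
  • Use severity levels to separate what pages someone now from what can wait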
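
One way to picture these practices together is a small rule evaluator. The sketch below is illustrative only: AlertRule, evaluate, the "page" severity label, and the runbook path are hypothetical names and not the format of any particular alerting tool. It alerts on a symptom (user-visible error rate), requires the breach to persist across several windows to reduce noise, and attaches a severity and runbook to the result.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold_pct: float  # error rate that counts as a breach
    for_windows: int      # consecutive breaching windows required before firing
    severity: str         # hypothetical labels, e.g. "page" or "ticket"
    runbook: str          # hypothetical path to the responder's first steps


def evaluate(rule: AlertRule, error_rates_pct: Sequence[float]) -> Optional[dict]:
    """Fire only if the most recent `for_windows` samples all breach the threshold."""
    if rule.for_windows < 1:
        raise ValueError("for_windows must be at least 1")
    recent = list(error_rates_pct)[-rule.for_windows:]
    if len(recent) < rule.for_windows:
        return None  # not enough data yet to judge a sustained breach
    if all(rate > rule.threshold_pct for rate in recent):
        return {"alert": rule.name, "severity": rule.severity, "runbook": rule.runbook}
    return None


if __name__ == "__main__":
    rule = AlertRule(
        name="HighErrorRate",
        threshold_pct=5.0,
        for_windows=3,
        severity="page",
        runbook="runbooks/high-error-rate.md",  # hypothetical location
    )
    print(evaluate(rule, [1.0, 9.0, 1.0, 6.0]))  # a single spike does not fire
    print(evaluate(rule, [1.0, 6.0, 7.0, 8.0]))  # a sustained breach fires
```

Requiring several consecutive breaching windows trades a little detection delay for fewer pages caused by brief spikes.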

Tools Stack

  • Metrics: Prometheus, Datadog
  • Logs: Loki, Elasticsearch
  • Traces: Jaeger (trace storage and UI), OpenTelemetry (vendor-neutral instrumentation)
  • Dashboards: Grafana