Monitoring and Logging

Introduction

Modern infrastructure requires comprehensive monitoring and logging to ensure reliability, performance, and security. This guide covers the essential tools and practices for building such a setup: Prometheus for metrics, Grafana for visualization, the ELK Stack (Filebeat, Logstash, Elasticsearch, Kibana) for log aggregation, and Alertmanager for alerting.

Core Components:

  • Metrics collection
  • Log aggregation
  • Visualization
  • Alerting
  • Anomaly detection
  • Performance analysis

Prometheus

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app:8080']
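
The 'application' job above assumes the service exposes Prometheus metrics at /metrics on port 8080. As a rough sketch, assuming a Python service and the official prometheus_client library (the metric and handler names mirror the PromQL examples below but are otherwise illustrative), the instrumentation could look like this:

# app_metrics.py -- illustrative instrumentation for the 'application' scrape job
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["handler", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds", ["handler"]
)

def handle_request(handler):
    """Simulate one request and record its outcome."""
    start = time.time()
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(handler=handler, status=status).inc()
    LATENCY.labels(handler=handler).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8080)  # serves /metrics on :8080 for the scrape job above
    while True:
        handle_request("/api/orders")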

PromQL Examples

# CPU Usage
rate(node_cpu_seconds_total{mode="user"}[5m])

# Memory Usage
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) 
  / node_memory_MemTotal_bytes * 100

# HTTP Request Rate
rate(http_requests_total[5m])

# 95th Percentile Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) 
  by (le, handler))

# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
  / sum(rate(http_requests_total[5m])) * 100
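
The same expressions can be evaluated outside of dashboards through Prometheus' HTTP API, which is handy for ad-hoc checks and scripted reports. A minimal sketch in Python (the prometheus:9090 address is an assumption; adjust it to your deployment):

# query_prometheus.py -- run the error-rate expression via the HTTP API
import requests

QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m])) * 100"
)

resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    # each instant-vector sample looks like {"metric": {...}, "value": [ts, "value"]}
    print(sample["metric"], sample["value"][1])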

Grafana

Dashboard Configuration

{
  "dashboard": {
    "id": null,
    "title": "Application Dashboard",
    "tags": ["application", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "options": {
          "thresholds": [
            { "value": 1, "color": "green" },
            { "value": 5, "color": "orange" },
            { "value": 10, "color": "red" }
          ]
        }
      }
    ]
  }
}
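
Dashboard JSON like the above can be kept in version control and pushed through Grafana's HTTP API rather than edited by hand. A minimal sketch, assuming the document is saved as application-dashboard.json and a Grafana API token is available in the GRAFANA_API_TOKEN environment variable:

# push_dashboard.py -- upload the dashboard JSON via Grafana's HTTP API
import json
import os

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://grafana:3000")
API_TOKEN = os.environ["GRAFANA_API_TOKEN"]  # create a token in your Grafana instance

with open("application-dashboard.json") as fh:
    payload = json.load(fh)  # the {"dashboard": {...}} document shown above

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": payload["dashboard"], "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # includes the dashboard uid and url on success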

ELK Stack

Filebeat Configuration

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/access.log
  fields:
    type: nginx-access

- type: log
  enabled: true
  paths:
    - /var/log/application/*.log
  fields:
    type: application
  multiline:
    pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:'
    negate: false
    match: after

# Ship events to Logstash so the pipeline below receives them
output.logstash:
  hosts: ["logstash:5044"]

setup.kibana:
  host: "kibana:5601"

Logstash Pipeline

# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][type] == "nginx-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
    }
  }

  if [fields][type] == "application" {
    json {
      source => "message"
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{[fields][type]}-%{+YYYY.MM.dd}"
  }
}
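
The application branch of this pipeline assumes the service writes one JSON object per line with an ISO8601 timestamp field, so the json and date filters above can parse it. A minimal sketch of such a logger using only the Python standard library (field names other than timestamp are illustrative):

# json_logging.py -- emit one JSON object per line with an ISO8601 timestamp
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.FileHandler("/var/log/application/app.log")  # path matches the Filebeat input
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")  # -> {"timestamp": "...", "level": "INFO", "logger": "app", ...}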

Alerting

Alert Rules

# alerting_rules.yml
groups:
- name: application
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High HTTP error rate
      description: Error rate is above 5% for 5 minutes

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High latency detected
      description: 95th percentile latency is above 2 seconds

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: High CPU usage
      description: CPU usage is above 80% for 10 minutes

Alert Manager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pager-duty'
    continue: true

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    send_resolved: true
    title: '{{ template "slack.default.title" . }}'
    text: '{{ template "slack.default.text" . }}'

- name: 'pager-duty'
  pagerduty_configs:
  - service_key: 'your-pagerduty-service-key'
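
Routing and receivers can be verified before any real rule fires by pushing a synthetic alert straight to Alertmanager's v2 API. A rough sketch (the alertmanager:9093 address matches the Prometheus configuration earlier; the label values are made up for the test):

# send_test_alert.py -- exercise the critical route with a synthetic alert
import requests

test_alert = [{
    "labels": {
        "alertname": "HighErrorRate",
        "severity": "critical",
        "service": "test",
    },
    "annotations": {
        "summary": "Synthetic alert used to exercise the critical route",
    },
}]

resp = requests.post(
    "http://alertmanager:9093/api/v2/alerts",
    json=test_alert,
    timeout=10,
)
resp.raise_for_status()  # HTTP 200 means Alertmanager accepted the alert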

Best Practices

Monitoring:

  • Use the USE method (Utilization, Saturation, Errors)
  • Implement the RED method (Rate, Errors, Duration)
  • Set appropriate thresholds
  • Monitor business metrics
  • Use service level indicators (SLIs)
  • Track service level objectives (SLOs)

Logging:

  • Structured logging format
  • Consistent log levels
  • Include context information
  • Implement log rotation
  • Use correlation IDs (see the sketch after this list)
  • Set retention policies
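
A minimal sketch of the correlation-ID practice using only the Python standard library; the X-Request-ID header mentioned in the comment is a common convention, not something required by the tools above:

# correlation_ids.py -- attach a per-request ID to every log record
import contextvars
import logging
import uuid

request_id = contextvars.ContextVar("request_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id.get()
        return True

logging.basicConfig(format="%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
logger = logging.getLogger("app")
logger.addFilter(CorrelationFilter())
logger.setLevel(logging.INFO)

def handle_request(incoming_id=None):
    # reuse the caller's ID (e.g. an X-Request-ID header) or mint a new one
    request_id.set(incoming_id or uuid.uuid4().hex)
    logger.info("request started")
    logger.info("request finished")

handle_request()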

Alerting:

  • Alert on symptoms, not causes
  • Minimize alert fatigue
  • Define clear escalation paths
  • Document alert response procedures
  • Regular alert review and tuning
  • Test alerting system