Monitoring and Logging

Introduction

Modern infrastructure requires comprehensive monitoring and logging to ensure reliability, performance, and security. This guide covers the essential tools and practices for building such a setup: Prometheus for metrics, Grafana for visualization, the ELK Stack (Filebeat, Logstash, Elasticsearch, Kibana) for log aggregation, and Alertmanager for alerting.

Core Components:

  • Metrics collection
  • Log aggregation
  • Visualization
  • Alerting
  • Anomaly detection
  • Performance analysis

Prometheus

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app:8080']
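
The 'application' job above assumes the service exposes Prometheus metrics at /metrics on port 8080. As a rough sketch, assuming a Python service and the official prometheus_client library (the metric and handler names mirror the PromQL examples below but are otherwise illustrative), the instrumentation could look like this:

# app_metrics.py -- illustrative instrumentation for the 'application' scrape job
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["handler", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds", ["handler"]
)

def handle_request(handler):
    """Simulate one request and record its outcome."""
    start = time.time()
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(handler=handler, status=status).inc()
    LATENCY.labels(handler=handler).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8080)  # serves /metrics on :8080 for the scrape job above
    while True:
        handle_request("/api/orders")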

PromQL Examples

# CPU Usage
rate(node_cpu_seconds_total{mode="user"}[5m])

# Memory Usage
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) 
  / node_memory_MemTotal_bytes * 100

# HTTP Request Rate
rate(http_requests_total[5m])

# 95th Percentile Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) 
  by (le, handler))

# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
  / sum(rate(http_requests_total[5m])) * 100
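
The same expressions can be evaluated outside of dashboards through Prometheus' HTTP API, which is handy for ad-hoc checks and scripted reports. A minimal sketch in Python (the prometheus:9090 address is an assumption; adjust it to your deployment):

# query_prometheus.py -- run the error-rate expression via the HTTP API
import requests

QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m])) * 100"
)

resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    # each instant-vector sample looks like {"metric": {...}, "value": [ts, "value"]}
    print(sample["metric"], sample["value"][1])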

Grafana

Dashboard Configuration

{
  "dashboard": {
    "id": null,
    "title": "Application Dashboard",
    "tags": ["application", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "options": {
          "thresholds": [
            { "value": 1, "color": "green" },
            { "value": 5, "color": "orange" },
            { "value": 10, "color": "red" }
          ]
        }
      }
    ]
  }
}
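
Dashboard JSON like the above can be kept in version control and pushed through Grafana's HTTP API rather than edited by hand. A minimal sketch, assuming the document is saved as application-dashboard.json and a Grafana API token is available in the GRAFANA_API_TOKEN environment variable:

# push_dashboard.py -- upload the dashboard JSON via Grafana's HTTP API
import json
import os

import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://grafana:3000")
API_TOKEN = os.environ["GRAFANA_API_TOKEN"]  # create a token in your Grafana instance

with open("application-dashboard.json") as fh:
    payload = json.load(fh)  # the {"dashboard": {...}} document shown above

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": payload["dashboard"], "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # includes the dashboard uid and url on success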

ELK Stack

Filebeat Configuration

# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/nginx/access.log
  fields:
    type: nginx-access

- type: log
  enabled: true
  paths:
    - /var/log/application/*.log
  fields:
    type: application
  multiline:
    pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:'
    negate: false
    match: after

# Ship events to Logstash so the pipeline below receives them
output.logstash:
  hosts: ["logstash:5044"]

setup.kibana:
  host: "kibana:5601"

Logstash Pipeline

# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][type] == "nginx-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
    }
  }

  if [fields][type] == "application" {
    json {
      source => "message"
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{[fields][type]}-%{+YYYY.MM.dd}"
  }
}
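
The application branch of this pipeline assumes the service writes one JSON object per line with an ISO8601 timestamp field, so the json and date filters above can parse it. A minimal sketch of such a logger using only the Python standard library (field names other than timestamp are illustrative):

# json_logging.py -- emit one JSON object per line with an ISO8601 timestamp
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.FileHandler("/var/log/application/app.log")  # path matches the Filebeat input
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")  # -> {"timestamp": "...", "level": "INFO", "logger": "app", ...}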

Alerting

Alert Rules

# alerting_rules.yml
groups:
- name: application
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High HTTP error rate
      description: Error rate is above 5% for 5 minutes

  - alert: HighLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High latency detected
      description: 95th percentile latency is above 2 seconds

  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: High CPU usage
      description: CPU usage is above 80% for 10 minutes

Alert Manager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pager-duty'
    continue: true

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    send_resolved: true
    title: '{{ template "slack.default.title" . }}'
    text: '{{ template "slack.default.text" . }}'

- name: 'pager-duty'
  pagerduty_configs:
  - service_key: 'your-pagerduty-service-key'
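
Routing and receivers can be verified before any real rule fires by pushing a synthetic alert straight to Alertmanager's v2 API. A rough sketch (the alertmanager:9093 address matches the Prometheus configuration earlier; the label values are made up for the test):

# send_test_alert.py -- exercise the critical route with a synthetic alert
import requests

test_alert = [{
    "labels": {
        "alertname": "HighErrorRate",
        "severity": "critical",
        "service": "test",
    },
    "annotations": {
        "summary": "Synthetic alert used to exercise the critical route",
    },
}]

resp = requests.post(
    "http://alertmanager:9093/api/v2/alerts",
    json=test_alert,
    timeout=10,
)
resp.raise_for_status()  # HTTP 200 means Alertmanager accepted the alert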

Best Practices

Monitoring:

  • Use the USE method (Utilization, Saturation, Errors)
  • Implement the RED method (Rate, Errors, Duration)
  • Set appropriate thresholds
  • Monitor business metrics
  • Use service level indicators (SLIs)
  • Track service level objectives (SLOs)

Logging:

  • Structured logging format
  • Consistent log levels
  • Include context information
  • Implement log rotation
  • Use correlation IDs (see the sketch after this list)
  • Set retention policies
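
A minimal sketch of the correlation-ID practice using only the Python standard library; the X-Request-ID header mentioned in the comment is a common convention, not something required by the tools above:

# correlation_ids.py -- attach a per-request ID to every log record
import contextvars
import logging
import uuid

request_id = contextvars.ContextVar("request_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id.get()
        return True

logging.basicConfig(format="%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
logger = logging.getLogger("app")
logger.addFilter(CorrelationFilter())
logger.setLevel(logging.INFO)

def handle_request(incoming_id=None):
    # reuse the caller's ID (e.g. an X-Request-ID header) or mint a new one
    request_id.set(incoming_id or uuid.uuid4().hex)
    logger.info("request started")
    logger.info("request finished")

handle_request()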

Alerting:

  • Alert on symptoms, not causes
  • Minimize alert fatigue
  • Define clear escalation paths
  • Document alert response procedures
  • Regular alert review and tuning
  • Test alerting system