Monitoring and Logging
Introduction
Modern infrastructure requires comprehensive monitoring and logging to ensure reliability, performance, and security. This guide covers the essential tools and practices for building an effective observability stack: Prometheus for metrics, Grafana for visualization, the ELK stack (Elasticsearch, Logstash, Kibana, plus Filebeat) for log aggregation, and Alertmanager for alerting.
Core Components (see the deployment sketch after this list):
- Metrics collection
- Log aggregation
- Visualization
- Alerting
- Anomaly detection
- Performance analysis
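A minimal Docker Compose sketch of how the metrics-side components are commonly wired together is shown below; the image names, ports, and volume paths are illustrative assumptions rather than a prescribed deployment.

# docker-compose.yml (sketch; images, ports, and paths are assumptions)
version: "3.8"
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"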
Prometheus
Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app:8080']
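The 'application' job above assumes a service exposing Prometheus metrics at app:8080/metrics. A minimal sketch of such an endpoint, using the official prometheus_client library for Python, is shown below; the metric names match the PromQL examples that follow, while the handler name, simulated error rate, and port are assumptions.

# app_metrics.py - sketch; metric and label names mirror the PromQL examples
# in this guide, everything else is illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["handler", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["handler"],
)

def handle_request(handler: str) -> None:
    """Simulate a request and record its status and duration."""
    start = time.time()
    time.sleep(random.uniform(0.01, 0.2))          # pretend to do work
    status = "500" if random.random() < 0.02 else "200"
    REQUESTS.labels(handler=handler, status=status).inc()
    LATENCY.labels(handler=handler).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8080)                        # serves /metrics on :8080
    while True:
        handle_request("/api/orders")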
PromQL Examples
# CPU Usage
rate(node_cpu_seconds_total{mode="user"}[5m])
# Memory Usage (%), based on MemAvailable rather than MemFree
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# HTTP Request Rate
rate(http_requests_total[5m])
# 95th Percentile Latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m]))
by (le, handler))
# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
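The rule_files glob in prometheus.yml ("rules/*.yml") can also load recording rules that precompute expensive expressions such as the error-rate and latency queries above. The sketch below follows the common level:metric:operation naming convention; the exact rule names and the 30s interval are assumptions.

# rules/recording_rules.yml (sketch; rule names and interval are assumptions)
groups:
  - name: application_aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))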
Grafana
Dashboard Configuration
{
  "dashboard": {
    "id": null,
    "title": "Application Dashboard",
    "tags": ["application", "production"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "options": {
          "thresholds": [
            { "value": 1, "color": "green" },
            { "value": 5, "color": "orange" },
            { "value": 10, "color": "red" }
          ]
        }
      }
    ]
  }
}
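Instead of importing dashboard JSON by hand, Grafana can load definitions like the one above from disk at startup through its file-based provisioning. The sketch below assumes a hypothetical provider name and dashboard directory.

# /etc/grafana/provisioning/dashboards/dashboards.yml (sketch; name and path are assumptions)
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards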
ELK Stack
Filebeat Configuration
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      type: nginx-access
    # Only needed if nginx is configured to emit JSON access logs
    json.keys_under_root: true

  - type: log
    enabled: true
    paths:
      - /var/log/application/*.log
    fields:
      type: application
    # Join Java-style stack trace continuation lines onto the preceding event
    multiline:
      pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:'
      negate: false
      match: after

# Ship events to Logstash for parsing (matches the pipeline below)
output.logstash:
  hosts: ["logstash:5044"]

# Alternatively, index directly into Elasticsearch and skip Logstash:
# output.elasticsearch:
#   hosts: ["elasticsearch:9200"]
#   index: "logs-%{[fields.type]}-%{+yyyy.MM.dd}"

setup.kibana:
  host: "kibana:5601"
Logstash Pipeline
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][type] == "nginx-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
    geoip {
      source => "clientip"
    }
  }

  if [fields][type] == "application" {
    json {
      source => "message"
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][beat]}-%{[fields][type]}-%{+YYYY.MM.dd}"
  }
}
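The json filter above expects each application log line to be a single JSON object containing an ISO8601 timestamp field. A hypothetical event might look like the following; every field name other than timestamp is an assumption about how the application logs.

{
  "timestamp": "2024-05-14T09:21:37.482Z",
  "level": "ERROR",
  "service": "orders",
  "correlation_id": "b3f2c1d4-8a7e-4c2b-9f6d-1e5a7c9b0d23",
  "message": "payment provider timeout",
  "duration_ms": 5021
}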
Alerting
Alert Rules
# alerting_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High HTTP error rate
          description: Error rate is above 5% for 5 minutes

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: 95th percentile latency is above 2 seconds

      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage
          description: CPU usage is above 80% for 10 minutes
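Alert expressions can be unit-tested with promtool (promtool test rules <file>). The sketch below exercises the HighErrorRate rule; the synthetic series are assumptions chosen to produce a 50% error rate, well above the 5% threshold.

# alerting_rules_test.yml (sketch; run with: promtool test rules alerting_rules_test.yml)
rule_files:
  - alerting_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Half of all requests fail, so the error rate sits at 50%
      - series: 'http_requests_total{status="200"}'
        values: '0+10x15'
      - series: 'http_requests_total{status="500"}'
        values: '0+10x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: High HTTP error rate
              description: Error rate is above 5% for 5 minutes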
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pager-duty'
      continue: true

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

  - name: 'pager-duty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-service-key'
Best Practices
Monitoring:
- Use the USE method (Utilization, Saturation, Errors)
- Implement the RED method (Rate, Errors, Duration)
- Set appropriate thresholds
- Monitor business metrics
- Use service level indicators (SLIs)
- Track service level objectives (SLOs); see the example SLI query after this list
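As a concrete example of an SLI, availability can be expressed as the fraction of non-5xx requests over a rolling window and compared against an SLO target; the 30-day window and 99.9% target (0.1% error budget) below are assumptions.

# Availability SLI: share of successful requests over a 30-day window
sum(rate(http_requests_total{status!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))

# Fraction of the 0.1% error budget remaining (assumes a 99.9% SLO)
1 - (
  (
    sum(rate(http_requests_total{status=~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
  ) / 0.001
)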
Logging:
- Use a structured logging format (e.g. JSON)
- Apply log levels consistently
- Include contextual information (service, environment, request details)
- Implement log rotation
- Use correlation IDs to trace requests across services (see the sketch after this list)
- Set retention policies
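A minimal sketch of structured JSON logging with a correlation ID, using only the Python standard library, is shown below; the field names are assumptions, and a real service would usually propagate the ID from an incoming request header rather than generating it locally.

# structured_logging.py - sketch; field names and the header-propagation
# detail are assumptions.
import json
import logging
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real service the correlation ID would come from the incoming request
# (e.g. an X-Request-ID header); here one is generated per "request".
correlation_id = str(uuid.uuid4())
logger.info("payment authorized", extra={"correlation_id": correlation_id})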
Alerting:
- Alert on symptoms, not causes
- Minimize alert fatigue
- Define clear escalation paths
- Document alert response procedures
- Regular alert review and tuning
- Test the alerting system regularly