Building a Prometheus + Grafana Monitoring Stack: A Complete Setup from Metrics Collection to Alert Notification

Why Prometheus + Grafana?

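Prometheus is a CNCF graduated project that collects metrics with a pull model: it scrapes an HTTP endpoint on each target at a fixed interval. Paired with Grafana's visualization capabilities, it has become the de facto standard for cloud-native monitoring.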
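To make the pull model concrete, here is a minimal sketch of the plain-text payload a target returns when Prometheus scrapes it. It uses no client library. The helper `renderCounter` and the sample values are hypothetical, and escaping of special characters in label values is left out for brevity.

```javascript
// Sketch of the text exposition format for a single counter metric.
// `renderCounter` and the sample data are illustrative assumptions.
function renderCounter(name, help, samples) {
    const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
    for (const { labels, value } of samples) {
        // Labels are rendered as key="value" pairs inside braces.
        const pairs = Object.entries(labels)
            .map(([key, val]) => `${key}="${val}"`)
            .join(',');
        lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
    }
    return lines.join('\n') + '\n';
}

const body = renderCounter('http_requests_total', 'Total HTTP requests', [
    { labels: { method: 'GET', status: '200' }, value: 42 },
    { labels: { method: 'POST', status: '500' }, value: 3 },
]);
console.log(body);
```

In practice a client library such as prom-client produces this text for you. The point is only that a scrape is an ordinary HTTP GET returning one line per time series.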

1. Quick deployment

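# docker-compose.yml
version: '3.8'
services:
  prometheus:
    # Pin a specific version tag instead of latest for reproducible deployments
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command: --config.file=/etc/prometheus/prometheus.yml

  # Referenced by prometheus.yml as alertmanager:9093. Starts with the image's
  # bundled configuration; mount your own alertmanager.yml to define receivers.
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      # Default password for local testing only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin

  # Without host mounts for /proc, /sys and /, this reports the container's view
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
  grafana_data: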

2. Prometheus configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

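alerting:
  alertmanagers:
    - static_configs:
        # Requires an Alertmanager instance reachable under this hostname
        - targets: ['alertmanager:9093']

# Relative paths are resolved against the directory containing this file
rule_files:
  - "alerts/*.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # 'app' is a placeholder hostname for the instrumented application
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:8080']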
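The targets above list only a host and port. Since the jobs set neither `scheme` nor `metrics_path`, Prometheus falls back to its defaults of `http` and `/metrics`. The following sketch shows how the scrape URL is formed under that assumption. `scrapeUrl` is a hypothetical helper, not part of any Prometheus library.

```javascript
// Sketch: how a static target plus the job defaults becomes a scrape URL.
// `scrapeUrl` is an illustrative helper; the defaults mirror Prometheus' own.
function scrapeUrl(target, { scheme = 'http', metricsPath = '/metrics' } = {}) {
    return `${scheme}://${target}${metricsPath}`;
}

const urls = ['node-exporter:9100', 'app:8080'].map((target) => scrapeUrl(target));
console.log(urls);
```

Requesting these URLs with curl from inside the Compose network is a quick way to check that a target is reachable before looking for it on the Prometheus targets page.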

3. Application instrumentation

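// Prometheus metrics for a Node.js (Express) application
const express = require('express');
const prometheus = require('prom-client');

const app = express();

// Counter: total number of HTTP requests handled
const httpRequestsTotal = new prometheus.Counter({
    name: 'http_requests_total',
    help: 'Total HTTP requests',
    labelNames: ['method', 'route', 'status'],
});

// Histogram: request duration in seconds
const httpDuration = new prometheus.Histogram({
    name: 'http_request_duration_seconds',
    help: 'HTTP request duration',
    labelNames: ['method', 'route'],
    buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

// Middleware: record each request once the response has been sent
app.use((req, res, next) => {
    const end = httpDuration.startTimer();
    res.on('finish', () => {
        // Label with the matched route pattern (e.g. /users/:id) instead of the
        // raw path, so that every distinct URL does not create a new time series.
        const route = req.route ? req.route.path : 'unmatched';
        httpRequestsTotal.inc({ method: req.method, route, status: res.statusCode });
        end({ method: req.method, route });
    });
    next();
});

// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
    res.set('Content-Type', prometheus.register.contentType);
    res.end(await prometheus.register.metrics());
});

// Port matches the 'myapp' scrape target (app:8080) in prometheus.yml
app.listen(8080);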
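The histogram's `buckets` option deserves a closer look, because Prometheus histogram buckets are cumulative: each observation is counted in every bucket whose upper bound is greater than or equal to the observed value. The sketch below reproduces that bookkeeping in plain JavaScript with the bucket bounds used above. `observeAll` and the sample durations are hypothetical; prom-client handles this internally.

```javascript
// Sketch of cumulative histogram buckets, using the article's bucket bounds.
// `observeAll` is an illustrative helper, not a prom-client API.
const buckets = [0.01, 0.05, 0.1, 0.5, 1, 2, 5];

function observeAll(durations) {
    // One count per upper bound, plus the implicit +Inf bucket.
    const counts = new Map([...buckets, Infinity].map((le) => [le, 0]));
    for (const seconds of durations) {
        for (const le of counts.keys()) {
            // An observation counts toward every bucket with bound >= value.
            if (seconds <= le) counts.set(le, counts.get(le) + 1);
        }
    }
    return counts;
}

const counts = observeAll([0.004, 0.03, 0.2, 0.7, 7]);
for (const [le, count] of counts) {
    console.log(`le="${le === Infinity ? '+Inf' : le}" ${count}`);
}
```

Note that the 7-second request only shows up in the `+Inf` bucket. If many requests land there, the largest bound is too small for quantile estimates to be meaningful.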

4. Alert rules

groups:
- name: node_alerts
  rules:
  # Non-idle CPU share, averaged over all cores of an instance, above 80% for 5 minutes
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage above 80%"

  # Less than 10% of the root filesystem available for 1 minute
  - alert: DiskSpaceLow
    expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Less than 10% disk space remaining"
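The HighCPUUsage expression is easier to follow with numbers plugged in. The sketch below reproduces its arithmetic for a single instance with two cores. The sample values are made up, and `idleRate` and `cpuUsagePercent` are hypothetical helpers. The real `rate()` function also handles counter resets and extrapolates to the window edges, which is omitted here.

```javascript
// Sketch of the arithmetic behind the HighCPUUsage expression.
// Sample values and helper names are illustrative assumptions.

// Simplified rate(): per-second increase of a counter between two samples.
function idleRate(first, last) {
    return (last.value - first.value) / (last.timestamp - first.timestamp);
}

// 100 - avg(idle rate across cores) * 100
function cpuUsagePercent(perCoreSamples) {
    const rates = perCoreSamples.map(([first, last]) => idleRate(first, last));
    const avgIdle = rates.reduce((sum, rate) => sum + rate, 0) / rates.length;
    return 100 - avgIdle * 100;
}

// Two cores, sampled 300 seconds apart (the [5m] window).
const usage = cpuUsagePercent([
    [{ timestamp: 0, value: 1000 }, { timestamp: 300, value: 1030 }], // 30 s idle
    [{ timestamp: 0, value: 2000 }, { timestamp: 300, value: 2060 }], // 60 s idle
]);
console.log(usage.toFixed(1), usage > 80 ? 'would be pending' : 'ok');
```

With these samples the cores were idle for 10% and 20% of the window, so the instance averages 15% idle and the condition `> 80` holds. The alert only fires once the condition has held for the full `for: 5m` duration.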

5. Importing Grafana dashboards

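Recommended dashboard IDs: Node Exporter Full (1860), Docker Monitoring (193), NGINX (11190). First add Prometheus as a data source in Grafana (http://prometheus:9090 from inside the Compose network), then enter a dashboard ID on Grafana's dashboard import page to get a ready-made, professional-looking dashboard. A dashboard only shows data when the matching exporter is being scraped; with the setup in this article, that is the case for Node Exporter Full.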

Summary

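Prometheus + Grafana is one of the most cost-effective open-source monitoring stacks. Working through deployment, application instrumentation, and alerting gives you a solid foundation for improving system observability.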