Prometheus: How Scraping and Storage Work

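Basic monitoring (Chapter 9) tells you that a server is down. An observability stack tells you why it went down, where it is slow, and which request failed. Prometheus is the de facto standard for metrics in modern DevOps observability. It uses a pull model: the server periodically scrapes metrics from each target over HTTP, rather than waiting for targets to push them.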
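To make the pull model concrete, here is a minimal sketch using only the Python standard library. It starts a tiny target that serves a `/metrics` page in the Prometheus text exposition format, then fetches and parses that page the way a scraper would. The metric name `demo_http_requests_total`, the counter values, and the simplified parser (which ignores timestamps and escaped label values) are assumptions for illustration; a real application would normally use an official client library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by HTTP method.
REQUESTS_TOTAL = {"GET": 3, "POST": 1}

# Bypass any configured HTTP proxy; the demo only talks to 127.0.0.1.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_http_requests_total Total HTTP requests handled.",
        "# TYPE demo_http_requests_total counter",
    ]
    for method, value in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'demo_http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """The target side: answers GET /metrics with the current values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def scrape(url: str) -> dict:
    """The scraper side: fetch a metrics page and parse it into {series: value}."""
    with _opener.open(url, timeout=5) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def demo() -> dict:
    """Start the target on a free local port, scrape it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns one entry per labeled series. The target never initiates a connection: it only answers when the scraper asks, which is the essence of the pull model.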

Prometheus architecture overview

graph TB
    subgraph Targets ["Monitored targets"]
        APP["Application /metrics endpoint\n(Node.js / Python / Go)"]
        NODE["node_exporter\n(host CPU / memory / disk)"]
        NGINX["nginx-prometheus-exporter\n(request count / error rate)"]
        PG["postgres_exporter\n(connections / query latency)"]
    end
    subgraph Prometheus ["Prometheus Server"]
        SCRAPE["Scrape Job\nperiodically scrapes /metrics"]
        TSDB["TSDB time-series database\nlocal disk storage, 15d"]
        RULE["Recording Rules\nprecompute expensive queries"]
        ALERT["Alerting Rules\nfire alerts when conditions hold"]
    end
    AM["Alertmanager\ngroups / silences / routes alerts"]
    SLACK["Slack / Email\n/ PagerDuty"]
    GRAFANA["Grafana\ndashboards"]
    Targets -->|HTTP pull| SCRAPE
    SCRAPE --> TSDB
    TSDB --> RULE
    TSDB --> ALERT
    TSDB --> GRAFANA
    ALERT --> AM
    AM --> SLACK

Installing the Prometheus stack (recommended approaches)

# Method A: Docker Compose (quick single-host setup)
cat << 'EOF' > docker-compose-monitoring.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
  grafana:
    image: grafana/grafana:10.3.0
    environment:
      # Placeholder password: change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
  node_exporter:
    image: prom/node-exporter:v1.7.0
    # Host network and PID namespace so the exporter sees the real host
    network_mode: host
    pid: host
    volumes:
      - '/:/host:ro,rslave'
    command:
      - '--path.rootfs=/host'
volumes:
  prometheus_data:
  grafana_data:
EOF
docker compose -f docker-compose-monitoring.yml up -d

# Method B: Kubernetes (kube-prometheus-stack Helm chart)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword=changeme
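After the Compose stack starts, it helps to confirm that each service actually answers before moving on. The sketch below polls an HTTP endpoint until it returns 200. It assumes the ports published in Method A and uses Prometheus's `/-/ready` endpoint, Grafana's `/api/health` endpoint, and node_exporter's `/metrics` page; the function names and the `DEFAULT_CHECKS` mapping are hypothetical helpers, not part of any of these tools.

```python
import http.client
import time
import urllib.request

# Bypass any configured HTTP proxy; the services checked here are local.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Assumed URLs, based on the ports published in the Compose file (Method A).
DEFAULT_CHECKS = {
    "prometheus": "http://localhost:9090/-/ready",
    "grafana": "http://localhost:3000/api/health",
    "node_exporter": "http://localhost:9100/metrics",
}


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False


def wait_until_ready(url: str, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `url` up to `attempts` times, sleeping `delay` seconds in between."""
    for attempt in range(attempts):
        if is_ready(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def check_stack(checks: dict, attempts: int = 10, delay: float = 1.0) -> dict:
    """Return {service name: True/False} for every URL in `checks`."""
    return {
        name: wait_until_ready(url, attempts=attempts, delay=delay)
        for name, url in checks.items()
    }
```

Running `check_stack(DEFAULT_CHECKS)` on the Docker host returns a per-service readiness map. For the Helm installation the services are not published on localhost by default, so the URLs would need to point at port-forwards or cluster addresses instead.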

Basic prometheus.yml configuration

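# prometheus.yml
global:
  scrape_interval: 15s     # scrape metrics every 15 seconds
  evaluation_interval: 15s # evaluate recording and alerting rules every 15 seconds
rule_files:
  - "alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  # Monitor Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Host metrics
  # Note: if node_exporter runs with network_mode: host, the name "node_exporter"
  # is not resolvable over the Compose network; use the host's address instead.
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          env: 'production'
          host: 'web-01'
  # Application metrics (the application must expose a /metrics endpoint)
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:8080']
    metrics_path: '/metrics'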
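The `job_name`, `targets`, and `labels` entries in this configuration end up as labels on every stored series: Prometheus adds a `job` label, an `instance` label (the target address by default), and any static labels from `static_configs`. The following sketch models that merge under the default `honor_labels: false` behavior, where a scraped label that collides with a target label is kept under an `exported_` prefix. The function `attach_target_labels` is a hypothetical illustration of the outcome, not Prometheus's actual relabeling code.

```python
def attach_target_labels(sample_labels, job, target, static_labels=None):
    """Merge target labels into one scraped sample's labels.

    Models the default `honor_labels: false` behavior: target labels win,
    and a conflicting scraped label is renamed to `exported_<name>`.
    """
    # Static labels may override the default `instance` value.
    target_labels = {"job": job, "instance": target, **(static_labels or {})}
    merged = {}
    for name, value in sample_labels.items():
        key = f"exported_{name}" if name in target_labels else name
        merged[key] = value
    merged.update(target_labels)
    return merged


def format_series(metric_name, labels):
    """Format a metric name and labels the way they appear in queries."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric_name}{{{pairs}}}"


# Labels as the 'node' job above would attach them to a node_exporter sample.
node_cpu_labels = attach_target_labels(
    {"cpu": "0", "mode": "idle"},
    job="node",
    target="node_exporter:9100",
    static_labels={"env": "production", "host": "web-01"},
)
```

With the `node` job above, a sample scraped with labels `cpu` and `mode` is stored with six labels in total, which is why queries can filter on `job="node"` or `env="production"` even though the exporter itself never emits those labels.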

The four metric types

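| Type | Purpose | Examples |
| --- | --- | --- |
| Counter | Cumulative count that only increases (it resets to zero on restart) | Total HTTP requests, error count |
| Gauge | Instantaneous value that can go up or down | CPU usage, current connections |
| Histogram | Bucketed distribution of observations such as durations | Request latency (P50/P95/P99) |
| Summary | Like Histogram, but quantiles are computed on the client | Used less often; Histogram is more flexible |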
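Two of these types are easier to understand with arithmetic. The sketch below shows how a Counter's growth can be computed across a process restart, and how a quantile can be estimated from a Histogram's cumulative buckets by linear interpolation, the same idea behind PromQL's `histogram_quantile()`. Both functions are simplified assumptions for illustration: `counter_increase` omits the extrapolation that PromQL's `increase()` performs, and the bucket data is made up.

```python
import math


def counter_increase(values):
    """Total increase over a list of counter samples, tolerating resets.

    A drop in value is treated as a restart from zero, so the new value
    itself counts as the increase for that step.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, whose count is the total number
    of observations. Values are assumed to be spread evenly inside a bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound (or NaN if there is no finite bucket).
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in seconds: (upper bound, cumulative count).
LATENCY_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
```

For example, `counter_increase([100, 130, 160, 10, 40])` counts 100 requests despite the reset between 160 and 10, and `histogram_quantile(0.95, LATENCY_BUCKETS)` lands inside the 0.5 to 1.0 second bucket at roughly 0.78 seconds. Because the estimate is computed from bucket counts, counts from many instances can be summed before taking the quantile, which a client-side Summary does not allow.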

Checklist for this section

Next section: Grafana dashboard design in practice