Alert Rules and On-Call Operations

A good alerting system is the foundation of a team's trust. Alert too little and failures slip by unnoticed; alert too much and nobody takes the pages seriously anymore. This section covers how to write Prometheus alerting rules, how to route them with Alertmanager, and the operational practices that keep an on-call rotation healthy.

The Full Path from Alert to Response

graph LR
    PROM["Prometheus\nevaluates alert rules"] -->|alert fires| AM["Alertmanager\ngroup / dedup / route"]
    AM -->|Critical| PAGE["PagerDuty / phone\npage the on-call engineer now"]
    AM -->|Warning| SLACK["Slack #ops-alerts\nhandle during business hours"]
    AM -->|Info| EMAIL["Email archive\nno immediate response needed"]
    PAGE --> RUNBOOK["Runbook"]
    RUNBOOK --> FIX["Fix the issue"]
    FIX --> POSTMORTEM["Postmortem"]
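Before trusting this pipeline with real incidents, it helps to exercise it end to end. A minimal sketch using amtool to inject a synthetic alert directly into Alertmanager, assuming Alertmanager listens on localhost:9093 and the severity/team labels match the routes configured later in this section:

# Fire a fake critical alert and watch it reach PagerDuty / Slack
amtool alert add PipelineTest \
  severity=critical team=backend instance=test-host \
  --annotation=summary="Synthetic alert for pipeline testing" \
  --alertmanager.url=http://localhost:9093

If the page arrives, the Alertmanager → receiver chain is intact; remember to resolve or silence the test alert afterwards.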

Prometheus Alert Rule File

# alerts.yml — place in the Prometheus configuration directory
groups:
  - name: application
    interval: 30s
    rules:
      # ===== Service down =====
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been unscrapeable for 2 minutes; investigate immediately"
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # ===== High error rate =====
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate above 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      # ===== High P99 latency =====
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2 seconds"

  - name: infrastructure
    rules:
      # ===== Disk almost full =====
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_free_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Free disk space < 15%"
          description: "{{ $labels.instance }} has only {{ $value | humanize }}% disk space left"

      # ===== Memory pressure =====
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90%"

Alertmanager Routing Configuration

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

route:
  group_by: ['alertname', 'team']
  group_wait: 30s        # wait 30s before the first notification so same-group alerts get batched
  group_interval: 5m     # send an updated summary for the group at most every 5 minutes
  repeat_interval: 4h    # re-send every 4 hours while the alert stays unresolved
  receiver: 'slack-default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      # routing_key is the Events API v2 integration key, matching the v2 enqueue URL above
      - routing_key: 'YOUR_PD_INTEGRATION_KEY'
        description: '{{ .CommonAnnotations.summary }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#ops-alerts'
        title: '⚠️ {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'slack-default'
    slack_configs:
      - channel: '#ops-general'
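Routing trees are easy to get subtly wrong. amtool can lint this file and dry-run the routing logic; a sketch assuming the config is saved as ./alertmanager.yml:

# Validate the configuration file
amtool check-config alertmanager.yml

# Dry-run the routing tree: which receiver gets a critical backend alert?
amtool config routes test --config.file=alertmanager.yml severity=critical team=backend

The second command prints the receiver name (here it should be pagerduty-critical), so you can confirm a matcher change before deploying it.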

On-Call Operations Best Practices

| Practice | Description |
| --- | --- |
| Runbook before alert | Every alert must ship with a documented troubleshooting procedure |
| Weekly alert review | Track which alerts were noisiest this week, then disable or downgrade them |
| Clear severity levels | Critical = wake someone up now; Warning = handle during business hours; Info = send no notification |
| Silence discipline | Set silences ahead of maintenance windows instead of turning alerts off (see the example after this table) |
| Postmortem | Publish a blameless postmortem within 48 hours of every Severity 1 incident |
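The silence discipline above can be scripted. A hedged example of pre-creating a two-hour silence for a maintenance window with amtool; the job=myapp matcher reuses the label from the rules above, and the ticket reference is purely illustrative:

amtool silence add job=myapp \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --author="alice" \
  --comment="Planned database maintenance, ticket OPS-123"  # hypothetical ticket ID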

A Minimal On-Call Rotation Script

#!/bin/bash
# oncall-check.sh — run daily to announce who is on call this week
ONCALL_LIST=("alice" "bob" "charlie")
# Force base 10: %V pads with a leading zero ("08"), which bash would
# otherwise reject as an invalid octal number in arithmetic context
WEEK_OF_YEAR=$((10#$(date +%V)))
INDEX=$(( WEEK_OF_YEAR % ${#ONCALL_LIST[@]} ))
CURRENT="${ONCALL_LIST[$INDEX]}"
echo "On-call engineer this week: $CURRENT"
echo "Slack: @$CURRENT"

Checklist for This Section