
Grafana Dashboard Design in Practice

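Grafana turns Prometheus numbers into intuition: one glance tells you where the system is in trouble. A good dashboard is not a pile of metrics. It is built around one core question: can users use the system normally?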

Dashboard Design Layers

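graph TB
    L1["Layer 1: Business\nOrders per minute / Signup conversion rate\nSearch success rate / Payment completion rate"] --> L2
    L2["Layer 2: Application\nHTTP error rate / P99 latency\nRequest rate / Queue length"] --> L3
    L3["Layer 3: Infrastructure\nCPU / Memory / Disk I/O\nNetwork bandwidth / Connection count"] --> L4
    L4["Layer 4: External dependencies\nDatabase query latency\nThird-party API response time\nCDN hit rate"]
    style L1 fill:#C8E6C9
    style L2 fill:#B3E5FC
    style L3 fill:#FFF9C4
    style L4 fill:#FFCCBC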

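Design principle: start at Layer 1 and drill down only when something goes wrong, rather than cramming all four layers onto the first screen.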

Core PromQL Query Cheat Sheet

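# ===== CPU =====
# CPU utilization per instance (averaged across all cores, excluding idle).
# irate reacts quickly but is spiky; use rate() for a smoother graph.
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# ===== Memory =====
# Memory utilization (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# ===== HTTP metrics (Nginx / application Prometheus client) =====
# Label names such as "status" depend on how the service is instrumented.
# Request rate (per second, one series per label set; wrap in sum() for a total)
rate(http_requests_total[5m])
# Error ratio (share of 4xx + 5xx responses, 0 to 1; multiply by 100 for a percentage)
sum(rate(http_requests_total{status=~"[45].."}[5m]))
/
sum(rate(http_requests_total[5m]))
# P99 latency (histogram metric)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# ===== Disk =====
# Disk utilization (%) of the root filesystem. node_filesystem_avail_bytes is the space
# usable by non-root users; node_filesystem_free_bytes also counts reserved blocks.
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# ===== Database (PostgreSQL exporter) =====
# Active connections
pg_stat_activity_count{state="active"}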
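
The P99 query is easy to misread as an exact value. It is an estimate: histogram_quantile finds the bucket that contains the target rank and interpolates linearly inside it. The following is a minimal Python sketch of that idea, not Prometheus's actual implementation. The bucket bounds and counts are made-up sample data, and the sketch assumes classic cumulative buckets with non-negative bounds and at least one finite bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls into the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical sample: 1000 requests, bucket bounds in seconds.
sample_buckets = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample_buckets)
```

With this sample data the estimate lands at roughly 0.83 s, inside the 0.5 s to 1.0 s bucket. The practical takeaway is that bucket boundaries limit the precision of any percentile panel, so choose buckets that are dense around your latency target.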

Recommended Dashboard Layout (Standard Row Structure)

┌────────────────────────────────────────────────────────────┐
│  Row 1: Service summary (status row)                       │
│  [Error rate] [P99 latency] [Requests/s] [System health]   │
│  (4 Stat panels; color thresholds: green/orange/red)       │
├────────────────────────────────────────────────────────────┤
│  Row 2: Request traffic                                    │
│  [Time Series: request rate vs. error rate, dual Y-axis]   │
├────────────────────────────────────────────────────────────┤
│  Row 3: Latency distribution                               │
│  [Time Series: P50 / P90 / P99 latency, overlaid]          │
├────────────────────────────────────────────────────────────┤
│  Row 4: Infrastructure                                     │
│  [CPU %] [Memory %] [Disk %] [Network in/out]              │
└────────────────────────────────────────────────────────────┘
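
The layout above maps onto Grafana's dashboard JSON, where each panel carries a gridPos on a 24-column grid. The sketch below generates only the skeleton (titles, panel types, positions). The helper names, panel heights, and dashboard title are illustrative assumptions, and queries, data sources, and collapsible row headers are left out on purpose.

```python
import json

GRID_WIDTH = 24  # Grafana's dashboard grid is 24 columns wide


def row_of_panels(titles, panel_type, y, height):
    """Spread panels evenly across one horizontal band of the grid."""
    width = GRID_WIDTH // len(titles)
    return [
        {
            "type": panel_type,
            "title": title,
            "gridPos": {"x": i * width, "y": y, "w": width, "h": height},
        }
        for i, title in enumerate(titles)
    ]


def build_layout():
    """Return a dashboard skeleton matching the four-row layout."""
    panels = []
    panels += row_of_panels(
        ["Error rate", "P99 latency", "Requests/s", "System health"],
        "stat", y=0, height=4,
    )
    panels += row_of_panels(["Request rate vs. error rate"], "timeseries", y=4, height=8)
    panels += row_of_panels(["P50 / P90 / P99 latency"], "timeseries", y=12, height=8)
    panels += row_of_panels(
        ["CPU %", "Memory %", "Disk %", "Network in/out"],
        "timeseries", y=20, height=6,
    )
    return {"title": "Service overview (sketch)", "panels": panels}


layout = build_layout()
layout_json = json.dumps(layout, indent=2)
```

Generating the positions instead of dragging panels by hand keeps the summary row compact and makes it harder for extra panels to creep onto the first screen.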

Quickly Importing Community Dashboards

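# No need to start from scratch: the official Grafana community hosts hundreds of prebuilt dashboards
# Commonly used dashboard IDs (in Grafana → Import → enter the ID):
# 1860  → Node Exporter Full (complete host monitoring)
# 6417  → Kubernetes Cluster (K8s cluster overview)
# 9628  → PostgreSQL Database
# 12708 → Nginx Ingress Controller
# 3662  → Prometheus 2.0 Overview (monitoring the monitoring itself)
# Import from the command line (via the API); replace {...} with the dashboard JSON
curl -X POST http://admin:changeme@localhost:3000/api/dashboards/import \
-H 'Content-Type: application/json' \
-d '{"dashboard": {...}, "overwrite": true}'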
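
For scripted uploads, the same kind of request can be assembled in Python. This sketch makes two deliberate swaps from the curl command above: it targets the documented POST /api/dashboards/db endpoint rather than /api/dashboards/import, and it sends a Bearer token instead of embedding admin credentials in the URL. The function name, base URL, and token are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_upload_request(base_url, token, dashboard, overwrite=True):
    """Build (but do not send) a request that creates or updates a dashboard."""
    # Per the Grafana HTTP API docs, a null id means "create a new dashboard";
    # with overwrite=True an existing dashboard with the same uid should be replaced.
    payload = {"dashboard": {**dashboard, "id": None}, "overwrite": overwrite}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical usage; uncomment the last two lines to actually send it.
req = build_upload_request(
    "http://localhost:3000", "<service-account-token>", {"title": "Demo", "uid": "demo"}
)
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Keeping the token out of the URL also keeps it out of shell history and proxy logs.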

Dashboard as Code (Keeping Dashboards from Being Changed Ad Hoc)

# Export the dashboard JSON (so it can be version-controlled in Git)
curl -s "http://admin:changeme@localhost:3000/api/dashboards/uid/myapp" \
| jq '.dashboard' > dashboards/myapp-dashboard.json
# Recommended tool: Grizzly (Grafana as Code)
# https://grafana.github.io/grizzly/
grr apply dashboards/
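
Exported JSON tends to produce noisy Git diffs, because fields such as the numeric id and the version counter change even when the panels do not. One possible approach, sketched below, is to normalize the file before committing. The list of volatile keys is an assumption to adjust for your Grafana version, and normalize_dashboard is a hypothetical helper, not part of Grafana or Grizzly.

```python
import json

# Fields assumed to change between saves or instances without affecting content.
VOLATILE_KEYS = ("id", "version", "iteration")


def normalize_dashboard(dashboard):
    """Return stable, diff-friendly JSON text for an exported dashboard dict."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}
    # Sorted keys and fixed indentation keep the output byte-identical
    # whenever the dashboard content itself is unchanged.
    return json.dumps(cleaned, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


# Two exports of the same dashboard that differ only in bookkeeping fields.
export_a = {"uid": "myapp", "title": "My App", "version": 12, "id": 3, "panels": []}
export_b = {"id": 41, "panels": [], "title": "My App", "uid": "myapp", "version": 13}
same = normalize_dashboard(export_a) == normalize_dashboard(export_b)
```

Running a step like this in a pre-commit hook or CI job means a diff in review reflects an actual change to the dashboard.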

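Checklist for This Section

[ ] Import dashboard ID 1860 (Node Exporter Full) and browse all of its panels
[ ] Create a custom dashboard with two panels: CPU utilization and memory utilization
[ ] Set panel color thresholds (CPU > 80% turns orange, > 90% turns red)
[ ] Export the dashboard JSON and commit it to Git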
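
For the threshold item in the checklist, the setting ends up under fieldConfig.defaults.thresholds in the panel JSON. The sketch below builds that fragment for the 80/90 values. cpu_thresholds and color_for are hypothetical helpers, and color_for only mimics how a step is chosen (the last step whose value the reading reaches) so the configuration can be sanity-checked outside Grafana.

```python
def cpu_thresholds(warn=80, crit=90):
    """Threshold fragment for a panel's fieldConfig.defaults.thresholds."""
    return {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},  # base step, applies below `warn`
            {"color": "orange", "value": warn},
            {"color": "red", "value": crit},
        ],
    }


def color_for(value, thresholds):
    """Pick the color of the last step whose value the reading has reached."""
    color = thresholds["steps"][0]["color"]
    for step in thresholds["steps"][1:]:
        if value >= step["value"]:
            color = step["color"]
    return color


panel_fragment = {"fieldConfig": {"defaults": {"unit": "percent", "thresholds": cpu_thresholds()}}}
colors = [color_for(v, cpu_thresholds()) for v in (50, 85, 95)]
```

Here the three sample readings map to green, orange, and red respectively.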

Next section: Alerting Rules and On-Call Operations