Deploying kube-prometheus-stack

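Core question: once a cluster goes live, how do you quickly stand up a complete monitoring setup that covers metrics collection, visualization, and alerting in a single stack?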


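The three pillars of observability

    graph LR
        METRICS["📊 Metrics<br/>Prometheus + Grafana<br/>Time-series data, trend analysis"]
        LOGS["📋 Logs<br/>Loki / ELK<br/>Event details, incident investigation"]
        TRACES["🔍 Traces<br/>Jaeger / Tempo<br/>Request paths, performance analysis"]
        METRICS --> OBS["Observability"]
        LOGS --> OBS
        TRACES --> OBS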

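What kube-prometheus-stack includes

A single Helm chart installs a complete, production-grade monitoring stack:

| Component           | Role                                           | Version |
|---------------------|------------------------------------------------|---------|
| Prometheus          | Metrics collection and storage                 | 2.48.x  |
| Alertmanager        | Alert routing and silencing                    | 0.26.x  |
| Grafana             | Visualization dashboards                       | 10.x    |
| kube-state-metrics  | Kubernetes object state metrics                | 2.10.x  |
| node-exporter       | Node hardware and OS metrics                   | 1.7.x   |
| Prometheus Operator | Declarative management of Prometheus config    | 0.70.x  |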

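Installing kube-prometheus-stack

    # Add the chart repository
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    # Dump the chart's default values for reference (written to a separate file
    # so they do not overwrite the production kps-values.yaml)
    helm show values prometheus-community/kube-prometheus-stack > kps-default-values.yaml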

    # kps-values.yaml (production configuration)
    grafana:
      enabled: true
      adminPassword: "ChangeMe_In_Production"  # placeholder; use a Secret in practice
      persistence:
        enabled: true
        size: 10Gi
        storageClassName: gp3
      ingress:
        enabled: true
        ingressClassName: nginx
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt-prod
        hosts:
          - grafana.example.com
        tls:
          - secretName: grafana-tls
            hosts: [grafana.example.com]
      sidecar:
        datasources:
          enabled: true

    prometheus:
      prometheusSpec:
        retention: 30d             # keep 30 days of data
        retentionSize: "50GB"      # delete the oldest data once usage exceeds 50GB
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes: [ReadWriteOnce]
              resources:
                requests:
                  storage: 100Gi
        # Pick up ServiceMonitors from every namespace
        serviceMonitorSelectorNilUsesHelmValues: false
        serviceMonitorNamespaceSelector: {}
        serviceMonitorSelector: {}
        # Resource requests and limits
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 2
            memory: 4Gi

    alertmanager:
      alertmanagerSpec:
        retention: 120h
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: gp3
              accessModes: [ReadWriteOnce]
              resources:
                requests:
                  storage: 5Gi

    # Node-level metrics (DaemonSet)
    nodeExporter:
      enabled: true

    # Kubernetes object state metrics
    kubeStateMetrics:
      enabled: true

    # Install (or upgrade) the release
    helm upgrade --install kube-prometheus-stack \
      prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --create-namespace \
      --version "55.0.0" \
      -f kps-values.yaml \
      --wait
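
The comment next to `adminPassword` says a Secret should be used in practice. One way to do that, sketched here under assumptions, is the Grafana subchart's `admin.existingSecret` setting. The Secret name `grafana-admin-credentials` is hypothetical, and the key names only need to match whatever the Secret actually contains.

```yaml
# kps-values.yaml fragment: read the Grafana admin login from an existing
# Secret instead of setting grafana.adminPassword in plain text.
#
# The Secret must exist in the release namespace before installing, e.g.:
#   kubectl create namespace monitoring
#   kubectl -n monitoring create secret generic grafana-admin-credentials \
#     --from-literal=admin-user=admin \
#     --from-literal=admin-password='<generated password>'
grafana:
  admin:
    existingSecret: grafana-admin-credentials   # hypothetical Secret name
    userKey: admin-user                         # key holding the username
    passwordKey: admin-password                 # key holding the password
```

With this fragment in place, the `adminPassword` line can be dropped from the values file so the password never lands in version control.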

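ServiceMonitor: declarative scrape configuration

A ServiceMonitor tells Prometheus which Service's /metrics endpoint to scrape:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: my-api-monitor
      namespace: production
      labels:
        # Needed only when Prometheus uses the chart's default selector
        # (serviceMonitorSelectorNilUsesHelmValues: true). The values file in
        # this guide sets it to false with an empty serviceMonitorSelector,
        # so every ServiceMonitor is selected regardless of labels.
        release: kube-prometheus-stack
    spec:
      selector:
        matchLabels:
          app: api-server                # selects Services by label
      namespaceSelector:
        matchNames:
          - production
      endpoints:
        - port: http                     # name of the port in the Service
          path: /metrics
          interval: 30s
          scrapeTimeout: 10s
          scheme: http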
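
For the ServiceMonitor above to find anything, a Service with matching labels and a named port has to exist. The manifest below is a minimal sketch of such a Service; the name `api-server`, the pod selector, and the port numbers are assumptions for illustration only.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-server            # assumed name
  namespace: production       # listed in the ServiceMonitor's namespaceSelector
  labels:
    app: api-server           # matched by spec.selector.matchLabels in the ServiceMonitor
spec:
  selector:
    app: api-server           # assumed pod label
  ports:
    - name: http              # referenced by endpoints[].port in the ServiceMonitor
      port: 80                # assumed Service port
      targetPort: 3000        # assumed container port that serves /metrics
```

Note that the ServiceMonitor's selector matches labels on the Service object itself, not on the pods, and that `port` refers to the port's name rather than its number.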

Exposing Prometheus metrics from the application (Node.js example)

    // Integrate the Prometheus client into a Node.js application
    const client = require('prom-client');
    const express = require('express');

    const app = express();
    const register = new client.Registry();
    client.collectDefaultMetrics({ register });

    // Custom metrics
    const httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
      registers: [register]
    });
    const httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
      registers: [register]
    });

    // Middleware: record duration and count once the response has finished
    app.use((req, res, next) => {
      const end = httpRequestDuration.startTimer();
      res.on('finish', () => {
        // Prefer the matched route pattern (e.g. /users/:id) over the raw
        // path so that label cardinality stays bounded.
        const route = req.route ? req.route.path : req.path;
        const labels = { method: req.method, route, status_code: res.statusCode };
        end(labels);
        httpRequestsTotal.inc(labels);
      });
      next();
    });

    // Metrics endpoint
    app.get('/metrics', async (req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });

    // Example port; it must match the Service's targetPort
    app.listen(3000);

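Key built-in dashboards

After installation, Grafana comes preloaded with dashboards along the following lines. The names are abbreviated here; exact titles depend on the chart version (for example, "Kubernetes / Compute Resources / Cluster").

| Dashboard               | Contents                                   |
|-------------------------|--------------------------------------------|
| Kubernetes / Cluster    | Cluster-wide CPU, memory, and Pod overview |
| Kubernetes / Namespaces | Resource usage per namespace               |
| Kubernetes / Pods       | CPU, memory, and network for a single Pod  |
| Kubernetes / Nodes      | Node CPU, memory, disk, and network        |
| Node Exporter Full      | Full set of node metrics                   |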

Checking Prometheus status

    # Port-forward to reach the Prometheus UI
    # (each port-forward blocks, so run it in its own terminal)
    kubectl port-forward svc/kube-prometheus-stack-prometheus \
      9090:9090 -n monitoring

    # Port-forward to reach the Alertmanager UI
    kubectl port-forward svc/kube-prometheus-stack-alertmanager \
      9093:9093 -n monitoring

    # View Prometheus targets (which ServiceMonitors are being scraped):
    # open http://localhost:9090/targets in a browser

    # View the generated Prometheus configuration
    kubectl get secret prometheus-kube-prometheus-stack-prometheus \
      -n monitoring -o json | \
      jq -r '.data["prometheus.yaml.gz"]' | \
      base64 -d | gunzip

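Long-term storage: Thanos / VictoriaMetrics

Prometheus local storage is limited, so production environments are usually better served by connecting it to a long-term storage backend: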

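    # kps-values.yaml (append the Thanos sidecar)
    prometheus:
      prometheusSpec:
        thanos:
          objectStorageConfig:
            # Chart 52.x and later (including the 55.0.0 pinned in the install
            # command) reference the Secret under existingSecret; older chart
            # versions take key/name directly under objectStorageConfig.
            existingSecret:
              name: thanos-objstore-config
              key: objstore.yml

    # objstore.yml: contents of the thanos-objstore-config Secret, which the
    # Thanos sidecar uses to upload data to S3. Create the Secret with:
    #   kubectl -n monitoring create secret generic thanos-objstore-config \
    #     --from-file=objstore.yml
    type: S3
    config:
      bucket: my-prometheus-data
      endpoint: s3.ap-southeast-1.amazonaws.com
      region: ap-southeast-1
      # Authenticates through an IAM role, so no access key is needed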
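
The heading also names VictoriaMetrics. A common way to feed it is Prometheus remote write, configured through `prometheusSpec.remoteWrite`. The fragment below is a sketch that assumes a single-node VictoriaMetrics instance reachable at the hypothetical in-cluster address `victoria-metrics.monitoring.svc`; port 8428 and the `/api/v1/write` path are the defaults for a single-node deployment.

```yaml
# kps-values.yaml fragment (an alternative to the Thanos sidecar)
prometheus:
  prometheusSpec:
    remoteWrite:
      # Hypothetical Service address; adjust it to the actual deployment.
      - url: http://victoria-metrics.monitoring.svc:8428/api/v1/write
```

With remote write, Prometheus still keeps its own local data according to `retention`, while VictoriaMetrics holds the long-term copy.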

Next section: PromQL basic queries