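Pod Lifecycle and Probes
Core question: the Pod is the smallest schedulable unit in Kubernetes. How does a Pod go from creation to running, and how do probes tell Kubernetes whether your application is healthy?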
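What is a Pod
A Pod is the smallest deployable unit in Kubernetes. It holds one or more tightly coupled containers that share:
- A network namespace (containers in the same Pod reach each other over localhost)
- Storage volumes (they can mount the same Volume)
- A lifecycle (they are scheduled onto the same node together and terminated together)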
# Minimal Pod (in practice you rarely create a bare Pod; you use a Deployment instead)
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app
    version: v1.0
spec:
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
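To make the sharing concrete, here is a minimal sketch of a two-container Pod. The Pod name `shared-demo`, the container names, and the `writer` sidecar are hypothetical and exist only for illustration: both containers mount the same `emptyDir` volume, and the sidecar reaches nginx over `localhost`.

```yaml
# Sketch only: names, images, and paths are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: shared-demo
spec:
  volumes:
    - name: shared-data
      emptyDir: {}
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html
    - name: writer
      image: busybox:1.36
      # Writes a page into the shared volume, then polls the web container over localhost.
      command: ["sh", "-c", "echo hello > /data/index.html; while true; do wget -qO- http://localhost:80/ >/dev/null; sleep 10; done"]
      volumeMounts:
        - name: shared-data
          mountPath: /data
```

Because the two containers share one network namespace, they cannot both bind the same port.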
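Pod lifecycle phases
stateDiagram-v2
  [*] --> Pending: kubectl apply
  Pending --> Running: images pulled + containers started
  Pending --> Pending: image pull failure / insufficient resources (retried)
  Running --> Succeeded: all containers exit normally (exit 0)
  Running --> Failed: a container exits abnormally and is not restarted
  Running --> Unknown: node unreachable
  Succeeded --> [*]
  Failed --> [*]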
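| Phase | Description |
|---|---|
| Pending | The Pod has been accepted but is still waiting to be scheduled or for its images to be pulled |
| Running | The Pod is bound to a node and at least one container is running (or is starting or restarting) |
| Succeeded | All containers terminated with exit code 0 and will not be restarted (common for Jobs) |
| Failed | All containers have terminated, and at least one exited non-zero or was killed by the system and will not be restarted |
| Unknown | The Pod's state cannot be obtained, usually because of a node communication problem |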
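Which terminal phase a Pod reaches depends on its `restartPolicy`. As a sketch (the name `one-shot` and the command are made up for illustration), a Pod with `restartPolicy: Never` whose only container exits 0 should end in Succeeded, while a non-zero exit code should leave it in Failed:

```yaml
# Sketch only: the Pod name and command are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: one-shot
spec:
  restartPolicy: Never   # the default is Always, which restarts the container instead
  containers:
    - name: task
      image: busybox:1.36
      command: ["sh", "-c", "echo done; exit 0"]   # change to "exit 1" to observe Failed
```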
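Container lifecycle hooks
spec:
  containers:
    - name: app
      image: myapp:1.0
      lifecycle:
        # postStart runs right after the container is created, asynchronously with the
        # ENTRYPOINT. The container is not marked Running until the hook finishes,
        # and a failing hook kills the container.
        postStart:
          exec:
            command: ["/bin/sh", "-c", "echo 'started' > /tmp/started"]
        # preStop runs before the container receives SIGTERM. It is synchronous and
        # delays termination, bounded by terminationGracePeriodSeconds.
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]  # wait time for graceful shutdown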
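A hook can also call an HTTP endpoint instead of running a command. The sketch below assumes the application exposes a hypothetical `/shutdown` endpoint on port 3000; neither the path nor the port comes from the example above.

```yaml
# Sketch only: /shutdown and port 3000 are assumptions about the application.
spec:
  containers:
    - name: app
      image: myapp:1.0
      lifecycle:
        preStop:
          httpGet:
            path: /shutdown   # hypothetical endpoint that starts draining
            port: 3000        # assumed application port
```

The `exec` form works with any image that ships a shell. The `httpGet` form suits minimal images without one, as long as the application implements the endpoint.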
Three kinds of probes
Probes are the mechanism Kubernetes uses to health-check an application:
graph LR
  L[livenessProbe] -->|fails| RESTART[Restart container]
  R[readinessProbe] -->|fails| REMOVEEP[Remove from Service endpoints]
  S[startupProbe] -->|not yet passed| BLOCK[Hold back the other probes]
  S -->|passed| L
  S -->|passed| R
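livenessProbe: the liveness probe
Checks whether the container is still functioning (for example, detects a deadlock). On failure, the kubelet restarts the container:
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
    httpHeaders:
      - name: X-Custom-Header
        value: liveness-check
  initialDelaySeconds: 30  # wait 30 s after the container starts before the first probe
  periodSeconds: 10        # probe every 10 s
  timeoutSeconds: 5        # per-probe timeout
  failureThreshold: 3      # restart only after 3 consecutive failures (avoids flapping)
  successThreshold: 1      # one success marks it healthy again (must be 1 for liveness)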
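readinessProbe: the readiness probe
Checks whether the container is ready to receive traffic. On failure, the Pod is removed from the Service endpoints (the container is not restarted):
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1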
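A readiness result only matters when something routes traffic by it. As a sketch, assuming Pods labeled `app: myapp` that listen on port 3000 (the Service name and the exposed port 80 are illustrative choices), a Service like this sends traffic only to Pods whose readiness probe currently passes:

```yaml
# Sketch only: the Service name and port 80 are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp         # must match the Pod's labels
  ports:
    - name: http
      port: 80         # port exposed by the Service
      targetPort: 3000 # container port that the readiness probe also checks
```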
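startupProbe: the startup probe (essential for slow-starting apps)
Gives slow-starting applications (for example JVM warm-up or database connection pool initialization) extra time:
startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  failureThreshold: 30  # tolerate up to 30 failures
  periodSeconds: 10     # probe every 10 s, so at most 30 x 10 = 300 s to start
Liveness and readiness probes start running only after the startup probe has succeeded. If it never succeeds within that window, the kubelet kills the container and applies the Pod's restartPolicy.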
Three probe mechanisms
# 1. HTTP GET (the most common)
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
    scheme: HTTP
# 2. TCP socket (databases, message queues, and similar)
livenessProbe:
  tcpSocket:
    port: 5432
# 3. Exec (runs a command; exit 0 means healthy)
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "redis-cli ping | grep PONG"
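Besides the three mechanisms above, recent Kubernetes versions also offer a built-in gRPC mechanism. The fragment below is a sketch: port 9090 is an assumption, and it only works if the application implements the standard gRPC Health Checking Protocol.

```yaml
# Sketch only: the port is an assumption about the application.
livenessProbe:
  grpc:
    port: 9090       # must be a numeric port serving the gRPC health service
  periodSeconds: 10
```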
A complete Pod spec following best practices
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp
    version: v1.2.3
spec:
  # Init containers (run to completion before the main containers start)
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nc -z db-service 5432; do echo waiting; sleep 2; done']
  containers:
    - name: app
      image: myapp:1.2.3
      imagePullPolicy: IfNotPresent  # Always | IfNotPresent | Never
      ports:
        - name: http
          containerPort: 3000
          protocol: TCP
      # Resource requests and limits (always set these)
      resources:
        requests:
          cpu: "100m"  # 0.1 CPU
          memory: "128Mi"
        limits:
          cpu: "500m"  # 0.5 CPU
          memory: "512Mi"
      # Environment variables
      env:
        - name: NODE_ENV
          value: "production"
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: db_host
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: db_password
      # Probes
      startupProbe:
        httpGet:
          path: /healthz
          port: http
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: http
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3
      # Graceful shutdown
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]
      # Volume mounts
      volumeMounts:
        - name: config
          mountPath: /etc/app
          readOnly: true
        - name: tmp
          mountPath: /tmp
  # Termination grace period (must be longer than the preStop duration)
  terminationGracePeriodSeconds: 30
  volumes:
    - name: config
      configMap:
        name: app-config
    - name: tmp
      emptyDir: {}
  # Node selection
  nodeSelector:
    node-role: worker
  # Anti-affinity (spread replicas across different nodes)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: [myapp]
            topologyKey: kubernetes.io/hostname
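Since bare Pods are rarely created directly, a spec like the one above normally lives under a Deployment's `template`. The following is a trimmed sketch, not a full translation of that spec: the replica count is an assumption, and most fields (resources, env, volumes, affinity) are omitted for brevity.

```yaml
# Sketch only: replicas is an assumed value; most Pod fields are omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp            # must match spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: app
          image: myapp:1.2.3
          ports:
            - name: http
              containerPort: 3000
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            periodSeconds: 20
```

During a rolling update the Deployment waits for new Pods to become ready before removing old ones, so the readiness probe also gates the rollout.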
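Troubleshooting common errors
# List Pods and their status
kubectl get pods -n production
# Show Pod events (the reason for a failure)
kubectl describe pod myapp-xxxx -n production
# Show container logs
kubectl logs myapp-xxxx -n production
kubectl logs myapp-xxxx -n production --previous  # logs from the previous, crashed container
# Open a shell in the container for debugging
kubectl exec -it myapp-xxxx -n production -- /bin/sh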
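| Status | Common cause | How to diagnose |
|---|---|---|
| ImagePullBackOff | The image does not exist or the registry credentials are wrong | Check Events with kubectl describe pod |
| CrashLoopBackOff | The container crashes after starting and Kubernetes keeps restarting it | Find the crash cause with kubectl logs --previous |
| Stuck in Pending | Insufficient resources (CPU/memory/PVC) or a nodeSelector that matches no node | Check Events with kubectl describe pod; check capacity with kubectl describe node |
| OOMKilled | Memory usage exceeded limits.memory and the OOM killer terminated the container | Raise limits.memory or reduce memory usage |
| Readiness failing | The app fails its readiness check, so no traffic reaches it | Verify that the /ready endpoint returns 200 |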