Pod Lifecycle and Probes

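Core question: the Pod is the smallest schedulable unit in Kubernetes. How does a Pod get from creation to running, and how do probes tell Kubernetes whether your application is healthy?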


What Is a Pod

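A Pod is the smallest deployable unit in Kubernetes. It holds one or more tightly coupled containers that share:
- A network namespace (containers in the same Pod talk to each other over localhost)
- Storage volumes (they can mount the same Volume)
- A lifecycle (they are scheduled, started, and terminated together)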

# Minimal Pod (in practice you rarely create Pods directly; you go through a Deployment)
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app
    version: v1.0
spec:
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
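To make the shared network namespace and shared volumes concrete, here is a minimal sketch of a two-container Pod. The Pod name `web-with-sidecar`, the volume name `shared-logs`, and the sidecar itself are hypothetical and exist only for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar          # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      volumeMounts:
        - name: shared-logs       # same volume as the sidecar below
          mountPath: /var/log/nginx
    - name: log-tailer            # hypothetical sidecar container
      image: busybox:1.36
      # Follows the access log that nginx writes into the shared volume
      command: ["sh", "-c", "tail -F /var/log/nginx/access.log"]
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/nginx
  volumes:
    - name: shared-logs
      emptyDir: {}                # lives as long as the Pod does
```

Because both containers sit in one network namespace, the sidecar could also reach nginx at http://localhost:80 without going through a Service.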

Pod Lifecycle Phases

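stateDiagram-v2
    [*] --> Pending: kubectl apply
    Pending --> Running: image pulled + containers started
    Pending --> Failed: startup fails permanently (e.g. init container fails and is not restarted)
    Running --> Succeeded: containers exit normally (exit 0)
    Running --> Failed: container exits abnormally
    Running --> Unknown: node unreachable
    Succeeded --> [*]
    Failed --> [*]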
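| Phase | Description |
| --- | --- |
| Pending | The Pod has been accepted and is waiting to be scheduled or for its images to be pulled. Image pull errors and insufficient resources usually keep the Pod in this phase. |
| Running | The Pod is bound to a node and at least one container is running. |
| Succeeded | All containers exited with code 0 and will not be restarted (common for Jobs). |
| Failed | All containers have terminated, and at least one exited with a non-zero code and will not be restarted. |
| Unknown | The Pod's state cannot be obtained (node communication problem). |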
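The terminal phases are easiest to observe with a Pod that runs once and exits. The following is a minimal sketch; the Pod name `one-shot` and the command are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: one-shot                  # hypothetical name
spec:
  restartPolicy: Never            # no restarts, so the Pod can reach a terminal phase
  containers:
    - name: task
      image: busybox:1.36
      # exit 0 leads to Succeeded; change it to "exit 1" to observe Failed
      command: ["sh", "-c", "echo working; sleep 5; exit 0"]
```

After applying it, `kubectl get pod one-shot -o jsonpath='{.status.phase}'` prints the current phase, which should move from Pending through Running to Succeeded.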

Container Lifecycle Hooks

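spec:
  containers:
    - name: app
      image: myapp:1.0
      lifecycle:
        postStart:              # runs right after the container is created, asynchronously with the entrypoint
          exec:
            command: ["/bin/sh", "-c", "echo 'started' > /tmp/started"]
        preStop:                # runs before the container is stopped (synchronous; SIGTERM is sent only after it returns)
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]  # wait time for a graceful exit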

Three Kinds of Probes

Probes are the mechanism Kubernetes uses to health-check your application:

graph LR
    L[livenessProbe<br/>liveness] -->|fails| RESTART[Restart the container]
    R[readinessProbe<br/>readiness] -->|fails| REMOVEEP[Remove from Service]
    S[startupProbe<br/>startup] -->|not yet passed| BLOCK[Hold back the other probes]
    S -->|passes| L
    S -->|passes| R

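livenessProbe: the liveness probe

Checks whether the container is still working properly (for example, to detect a deadlock). On failure, the container is restarted: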

livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
    httpHeaders:
      - name: X-Custom-Header
        value: liveness-check
  initialDelaySeconds: 30    # wait 30 s after the container starts before probing
  periodSeconds: 10          # probe every 10 s
  timeoutSeconds: 5          # per-probe timeout
  failureThreshold: 3        # restart only after 3 consecutive failures (avoids flapping)
  successThreshold: 1        # one success counts as recovered (must be 1 for liveness)

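readinessProbe: the readiness probe

Checks whether the container is ready to receive traffic. On failure, the Pod is removed from the Service's Endpoints; the container is not restarted: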

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1

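startupProbe: the startup probe (essential for slow-starting applications)

Gives slow-starting applications (JVM warm-up, database connection pool initialization, and the like) extra time: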

startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  failureThreshold: 30       # tolerate up to 30 failures
  periodSeconds: 10          # probe every 10 s, so at most 30 x 10 = 300 s of waiting

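The liveness and readiness probes start running only after the startup probe has succeeded. If the startup probe does not succeed within failureThreshold × periodSeconds, the kubelet kills the container and applies the Pod's restartPolicy.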
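As a worked example of sizing these values, the sketch below assumes an application whose worst-case startup takes about 120 seconds. That figure is an assumption for illustration; measure your own application before choosing numbers.

```yaml
# Assumption: worst-case startup time is about 120 s.
startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 5
  failureThreshold: 30     # 30 x 5 s = 150 s startup budget (120 s plus headroom)
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 10
  failureThreshold: 3      # once started: roughly 3 x 10 s = 30 s to detect a hang
```

Splitting the work this way keeps the liveness probe tight for steady-state failures, while the long tolerance applies only during startup.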


Three Common Probe Mechanisms

# 1. HTTP GET (the most common)
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
    scheme: HTTP

# 2. TCP socket (databases, message queues, and so on)
livenessProbe:
  tcpSocket:
    port: 5432

# 3. Exec (runs a command; exit 0 = healthy)
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "redis-cli ping | grep PONG"

A Complete Pod Spec with Best Practices

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp
    version: v1.2.3
spec:
  # Init containers (run to completion before the main containers start)
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nc -z db-service 5432; do echo waiting; sleep 2; done']
  containers:
    - name: app
      image: myapp:1.2.3
      imagePullPolicy: IfNotPresent   # Always | IfNotPresent | Never
      ports:
        - name: http
          containerPort: 3000
          protocol: TCP
      # Resource requests and limits (always set these!)
      resources:
        requests:
          cpu: "100m"       # 0.1 CPU
          memory: "128Mi"
        limits:
          cpu: "500m"       # 0.5 CPU
          memory: "512Mi"
      # Environment variables
      env:
        - name: NODE_ENV
          value: "production"
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: db_host
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: db_password
      # Probes
      startupProbe:
        httpGet:
          path: /healthz
          port: http
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: http
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3
      # Graceful shutdown
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]
      # Volume mounts
      volumeMounts:
        - name: config
          mountPath: /etc/app
          readOnly: true
        - name: tmp
          mountPath: /tmp
  # Termination grace period (must be longer than the preStop duration)
  terminationGracePeriodSeconds: 30
  volumes:
    - name: config
      configMap:
        name: app-config
    - name: tmp
      emptyDir: {}
  # Node selection
  nodeSelector:
    node-role: worker
  # Anti-affinity (spread replicas across different nodes)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: [myapp]
            topologyKey: kubernetes.io/hostname
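The anti-affinity rule only has an effect when several replicas exist, and replicas normally come from a Deployment rather than from a bare Pod. The following is a minimal sketch of how a Pod spec like this one is typically wrapped; the replica count is an assumption, and the template is shortened for readability.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3                     # assumed replica count
  selector:
    matchLabels:
      app: myapp                  # must match the template labels below
  template:
    metadata:
      labels:
        app: myapp
        version: v1.2.3
    spec:
      # The fields under the Pod's `spec:` (probes, resources, affinity, ...) go here
      containers:
        - name: app
          image: myapp:1.2.3
          ports:
            - name: http
              containerPort: 3000
```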

Troubleshooting Common Errors

# List Pods and their status
kubectl get pods -n production
# Show Pod events (the reason for a failure)
kubectl describe pod myapp-xxxx -n production
# Show container logs
kubectl logs myapp-xxxx -n production
kubectl logs myapp-xxxx -n production --previous  # logs from the previous (crashed) container
# Open a shell in the container for debugging
kubectl exec -it myapp-xxxx -n production -- /bin/sh
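| Status | Common cause | How to diagnose |
| --- | --- | --- |
| ImagePullBackOff | The image does not exist or the registry credentials are wrong | Check Events with kubectl describe pod |
| CrashLoopBackOff | The container crashes after starting and Kubernetes keeps restarting it | Find the crash reason with kubectl logs --previous |
| Stuck in Pending | Insufficient resources (CPU, memory, PVC) or a node selector that matches no node | Check Events with kubectl describe pod; check node resources with kubectl describe node |
| OOMKilled | Memory usage exceeded limits.memory and the OOM killer terminated the container | Raise limits.memory or reduce memory usage |
| Readiness failing | The application fails its readiness check, so no traffic reaches it | Check that the /ready endpoint returns 200 |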
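When ImagePullBackOff comes from missing credentials for a private registry, one common remedy is to reference a pull secret from the Pod. This is a minimal sketch: the Secret name `regcred` and the registry host `registry.example.com` are hypothetical, and the Secret is assumed to already exist in the same namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: production
spec:
  imagePullSecrets:
    - name: regcred               # hypothetical Secret of type kubernetes.io/dockerconfigjson
  containers:
    - name: app
      image: registry.example.com/myapp:1.2.3   # hypothetical private registry path
```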
