
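Pod Lifecycle and Probes

Key question: the Pod is the smallest schedulable unit in Kubernetes. How does a Pod get from creation to running, and how do probes tell Kubernetes whether your application is healthy?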

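What Is a Pod

A Pod is the smallest deployable unit in Kubernetes. It contains one or more tightly coupled containers that share:
- A network namespace (containers in the same Pod talk to each other over localhost)
- Storage volumes (they can mount the same Volume)
- A lifecycle (they are started together and terminated together)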

# Minimal Pod (in practice you almost never create Pods directly; use a Deployment instead)
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app
    version: v1.0
spec:
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
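The shared network namespace and shared volume described above can be sketched with a two-container Pod. This manifest is illustrative and not from the original article: the Pod name `shared-demo`, the container names, and the `writer` loop are assumptions chosen for the demo.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-demo              # hypothetical name
spec:
  volumes:
    - name: shared-data          # one emptyDir volume mounted by both containers
      emptyDir: {}
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html   # nginx serves files from here
    - name: writer
      image: busybox:1.36
      # Writes a timestamp into the shared volume every 5 seconds
      command: ["sh", "-c", "while true; do date > /data/index.html; sleep 5; done"]
      volumeMounts:
        - name: shared-data
          mountPath: /data
```

Once both containers are running, `kubectl exec shared-demo -c writer -- wget -qO- http://localhost:80` should return the timestamp that `writer` produced, because the two containers share one network namespace and the same volume.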

Pod Lifecycle Phases

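stateDiagram-v2
    [*] --> Pending: kubectl apply
    Pending --> Running: images pulled + containers started
    Pending --> Failed: init container fails with restartPolicy Never
    Running --> Succeeded: all containers exit normally (exit 0)
    Running --> Failed: a container exits abnormally
    Running --> Unknown: node unreachable
    Succeeded --> [*]
    Failed --> [*]

Image pull failures and insufficient resources normally do not move a Pod to Failed: it stays in Pending (showing ImagePullBackOff or a scheduling event) while Kubernetes keeps retrying.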

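Phase	Description
`Pending`	The Pod has been accepted by the cluster but is still waiting to be scheduled or for its images to be pulled
`Running`	The Pod is bound to a node and at least one container is running (or is starting or restarting)
`Succeeded`	All containers terminated with exit code 0 and will not be restarted (typical for Jobs)
`Failed`	All containers have terminated, and at least one exited with a non-zero code and will not be restarted
`Unknown`	The Pod's state cannot be obtained (usually a communication problem with its node)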
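To watch the `Succeeded` and `Failed` phases in practice, a run-to-completion Pod is a convenient sketch. The name `phase-demo` and the command are assumptions for illustration; `restartPolicy: Never` is what lets the Pod reach a terminal phase instead of being restarted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: phase-demo               # hypothetical name
spec:
  restartPolicy: Never           # do not restart, so the Pod can end in Succeeded or Failed
  containers:
    - name: task
      image: busybox:1.36
      # Exits 0 after 5 seconds -> phase Succeeded; change to "exit 1" to see Failed
      command: ["sh", "-c", "echo working; sleep 5; exit 0"]
```

The current phase can be read with `kubectl get pod phase-demo -o jsonpath='{.status.phase}'`.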

Container Lifecycle Hooks

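spec:
  containers:
    - name: app
      image: myapp:1.0
      lifecycle:
        postStart:              # runs right after the container is created, asynchronously with the entrypoint
          exec:
            command: ["/bin/sh", "-c", "echo 'started' > /tmp/started"]
        preStop:                # runs before the container is stopped (synchronous: termination waits for it)
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]  # pause so in-flight requests can drain before shutdown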
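Because `preStop` is synchronous, its run time counts against the Pod's termination grace period: the countdown starts when deletion begins, before the hook runs. Below is a hedged sketch of how the two settings might be budgeted together. The split between hook time and shutdown time in the comments is an illustrative assumption, not a measured value.

```yaml
spec:
  terminationGracePeriodSeconds: 30   # total budget for preStop + SIGTERM handling
  containers:
    - name: app
      image: myapp:1.0
      lifecycle:
        preStop:
          exec:
            # Consumes about 15 s of the 30 s budget; SIGTERM is sent only after this returns
            command: ["/bin/sh", "-c", "sleep 15"]
      # The remaining ~15 s is what the application has to exit after SIGTERM;
      # when the grace period expires, the container is killed with SIGKILL.
```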

Three Kinds of Probes

Probes are the mechanism Kubernetes uses to health-check your application:

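graph LR
    L[livenessProbe<br/>liveness probe] -->|fails| RESTART[restart the container]
    R[readinessProbe<br/>readiness probe] -->|fails| REMOVEEP[remove from Service endpoints]
    S[startupProbe<br/>startup probe] -->|not yet passed| BLOCK[other probes are held back]
    S -->|passes| L
    S -->|passes| R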

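livenessProbe: Liveness Probe

Checks whether the container is still working properly (for example, to detect a deadlock). When the probe fails, the kubelet restarts the container: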

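livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
    httpHeaders:
      - name: X-Custom-Header
        value: liveness-check
  initialDelaySeconds: 30    # wait 30 s after the container starts before the first probe
  periodSeconds: 10          # probe every 10 s
  timeoutSeconds: 5          # per-probe timeout
  failureThreshold: 3        # restart only after 3 consecutive failures (avoids flapping)
  successThreshold: 1        # one success marks the container healthy again (must be 1 for liveness)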

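readinessProbe: Readiness Probe

Checks whether the container is ready to receive traffic. When the probe fails, the Pod is removed from the Service's endpoints (the container is not restarted):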

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1
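The "Service endpoints" mentioned above belong to a Service whose selector matches the Pod's labels. A minimal sketch follows, assuming the Pod carries the label `app: myapp` and listens on port 3000; the Service name and port 80 are illustrative choices.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp                    # hypothetical Service name
spec:
  selector:
    app: myapp                   # Pods with this label are endpoint candidates
  ports:
    - name: http
      port: 80                   # port clients connect to on the Service
      targetPort: 3000           # container port the readiness probe also checks
```

Only Pods that match the selector and currently pass their readiness probe receive traffic through this Service. A Pod that fails the probe keeps running but is taken out of rotation until it passes again.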

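startupProbe: Startup Probe (Essential for Slow-Starting Applications)

Gives slow-starting applications (for example, JVM warm-up or database connection pool initialization) extra time to come up: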

startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  failureThreshold: 30       # allow up to 30 failures
  periodSeconds: 10          # probe every 10 s → at most 300 s to start

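The liveness and readiness probes do not start running until the startup probe has succeeded. If the startup probe never succeeds within its budget, the kubelet kills the container, and it is restarted according to the Pod's restartPolicy.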
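One consequence of this ordering is that the liveness probe no longer needs a long `initialDelaySeconds` to survive a slow start. The fragment below is a sketch of that pairing. It reuses the `/healthz` endpoint and port from the examples above, and the thresholds are illustrative rather than recommended values.

```yaml
containers:
  - name: app
    image: myapp:1.0
    startupProbe:
      httpGet:
        path: /healthz
        port: 3000
      failureThreshold: 30     # up to 30 x 10 s = 300 s to finish starting
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 3000
      periodSeconds: 10        # no initialDelaySeconds: the startup probe already covers startup
      failureThreshold: 3      # once started, a hang is detected in roughly 3 x 10 s = 30 s
```

Splitting the two concerns keeps the startup allowance generous while still letting the kubelet react quickly to a hang after the application is up.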

Three Probe Mechanisms

# 1. HTTP GET (most common)
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
    scheme: HTTP
# 2. TCP socket (databases, message queues, etc.)
livenessProbe:
  tcpSocket:
    port: 5432
# 3. Exec (runs a command; exit 0 = healthy)
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "redis-cli ping | grep PONG"

A Complete Pod Spec: Best Practices

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp
    version: v1.2.3
spec:
  # Init containers (run to completion before the main containers start)
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      command: ['sh', '-c', 'until nc -z db-service 5432; do echo waiting; sleep 2; done']
  containers:
    - name: app
      image: myapp:1.2.3
      imagePullPolicy: IfNotPresent   # Always | IfNotPresent | Never
      ports:
        - name: http
          containerPort: 3000
          protocol: TCP
      # Resource requests and limits (always set these!)
      resources:
        requests:
          cpu: "100m"       # 0.1 CPU
          memory: "128Mi"
        limits:
          cpu: "500m"       # 0.5 CPU
          memory: "512Mi"
      # Environment variables
      env:
        - name: NODE_ENV
          value: "production"
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: db_host
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: db_password
      # Probes
      startupProbe:
        httpGet:
          path: /healthz
          port: http
        failureThreshold: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: http
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3
      # Graceful shutdown
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 15"]
      # Volume mounts
      volumeMounts:
        - name: config
          mountPath: /etc/app
          readOnly: true
        - name: tmp
          mountPath: /tmp
  # Termination grace period (must be longer than the preStop duration)
  terminationGracePeriodSeconds: 30
  volumes:
    - name: config
      configMap:
        name: app-config
    - name: tmp
      emptyDir: {}
  # Node selection
  nodeSelector:
    node-role: worker
  # Anti-affinity (spread replicas across different nodes)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values: [myapp]
            topologyKey: kubernetes.io/hostname
Troubleshooting Common Errors

# List Pod status
kubectl get pods -n production
# Show Pod events (the reason for a failure)
kubectl describe pod myapp-xxxx -n production
# View container logs
kubectl logs myapp-xxxx -n production
kubectl logs myapp-xxxx -n production --previous  # logs from the previous (crashed) container
# Open a shell inside the container to debug
kubectl exec -it myapp-xxxx -n production -- /bin/sh

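Status	Common causes	How to diagnose
`ImagePullBackOff`	The image does not exist or the registry credentials are wrong	Check Events with `kubectl describe pod`
`CrashLoopBackOff`	The container crashes after starting and Kubernetes keeps restarting it	Find the crash cause with `kubectl logs --previous`
`Pending` (stuck)	Insufficient resources (CPU/memory/PVC) or a node selector that matches no node	Check Events with `kubectl describe pod`; check node resources with `kubectl describe node`
`OOMKilled`	Memory usage exceeded `limits.memory`, so the OOM killer terminated the container	Raise `limits.memory` or reduce memory usage
Readiness failure	The application fails its readiness check, so no traffic is routed to it	Verify that the `/ready` endpoint returns 200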

Next section: Deployment, StatefulSet, and DaemonSet