
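Multi-Cluster GitOps Strategy

Core question: when you run several Kubernetes clusters (dev/staging/production), how do you manage them all from a single ArgoCD installation?

Multi-cluster management models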

graph TB
    subgraph "Hub & Spoke model (recommended)"
        HUB["Central ArgoCD<br/>(management cluster)"]
        DEV["Dev cluster"]
        STG["Staging cluster"]
        PRD["Production cluster"]
        HUB -->|manages| DEV
        HUB -->|manages| STG
        HUB -->|manages| PRD
    end

All ArgoCD Applications are defined in the central management cluster, and ArgoCD applies the configuration to each target cluster.

Registering target clusters

# Log in to ArgoCD
argocd login argocd.example.com --username admin
# Add clusters. The positional argument is the kubeconfig context name;
# ArgoCD creates a ServiceAccount and RBAC in the target cluster automatically.
argocd cluster add staging-context \
  --kubeconfig ~/.kube/config \
  --name staging
argocd cluster add production-context \
  --kubeconfig ~/.kube/config \
  --name production
# List the registered clusters (sample output, abbreviated)
argocd cluster list
# SERVER                            NAME         STATUS   MESSAGE
# https://kubernetes.default.svc    in-cluster   Unknown  Cluster has no application
# https://staging.k8s.example.com   staging      OK
# https://prod.k8s.example.com      production   OK
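
Registration can also be expressed declaratively, which keeps the cluster list itself under Git control. The following is a minimal sketch, not the only way to do it: `render_cluster_secret` is a hypothetical helper, the token is a placeholder, and the `environment` label is an illustrative choice. ArgoCD treats a Secret in the `argocd` namespace labeled `argocd.argoproj.io/secret-type: cluster` as a cluster definition.

```shell
#!/bin/sh
# Hypothetical helper: print a declarative ArgoCD cluster Secret.
# Arguments: cluster name, API server URL, environment label, bearer token.
render_cluster_secret() {
  name=$1 server=$2 environment=$3 token=$4
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cluster-${name}
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    environment: ${environment}
type: Opaque
stringData:
  name: ${name}
  server: ${server}
  config: |
    {
      "bearerToken": "${token}",
      "tlsClientConfig": { "insecure": false }
    }
EOF
}

# Example: render the manifest for the staging cluster.
# "<token>" is a placeholder; a real token belongs in a secret store, not in Git.
render_cluster_secret staging https://staging.k8s.example.com staging "<token>"
```

The output can be piped to `kubectl apply -f -` against the management cluster. Labels set here are what a cluster generator selector would match on.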

App of Apps pattern

A single "root Application" manages the creation of every other Application, which makes multi-application management fully declarative:

infra-repo/
└── clusters/
    ├── staging/
    │   ├── root-app.yaml          # root App (the only one created by hand)
    │   ├── api-app.yaml           # child App: api
    │   ├── web-app.yaml           # child App: web
    │   └── monitoring-app.yaml    # child App: monitoring
    └── production/
        ├── root-app.yaml
        ├── api-app.yaml
        └── web-app.yaml

# clusters/production/root-app.yaml (root Application)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-root
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/infra-repo.git
    targetRevision: main
    path: clusters/production    # the whole directory is the set of Apps
  destination:
    server: https://kubernetes.default.svc   # deploy to the management cluster (where ArgoCD runs)
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# clusters/production/api-app.yaml (child Application)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-production
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/infra-repo.git
    targetRevision: main
    path: apps/api/overlays/production
  destination:
    server: https://prod.k8s.example.com   # target: the production cluster
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
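
The child Application refers to a project named `production`, which has to exist before the sync can succeed. The sketch below prints one possible AppProject for it. The helper name `render_app_project` and the specific restrictions (one repository, one destination, Namespace as the only permitted cluster-scoped kind) are assumptions for illustration, not something the repository layout above prescribes.

```shell
#!/bin/sh
# Hypothetical helper: print an AppProject that confines Applications
# to one Git repository and one destination cluster/namespace.
# Arguments: project name, repo URL, destination server, destination namespace.
render_app_project() {
  project=$1 repo=$2 server=$3 namespace=$4
  cat <<EOF
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ${project}
  namespace: argocd
spec:
  description: Applications allowed to deploy to ${namespace} on ${server}
  sourceRepos:
    - ${repo}
  destinations:
    - server: ${server}
      namespace: ${namespace}
  # Permit Namespace as the only cluster-scoped kind these apps may manage.
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
EOF
}

render_app_project production \
  https://github.com/myorg/infra-repo.git \
  https://prod.k8s.example.com \
  production
```

Committing the rendered manifest next to the child Applications lets the root App create the project together with the apps that use it.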

ApplicationSet: generating Applications for multiple environments

An ApplicationSet generates one Application per cluster or environment from a single template:

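apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-all-envs
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    # Option 1: generate per registered cluster. Use it instead of the list
    # below, not alongside it: it exposes .name and .server rather than
    # .env/.cluster/.namespace, so the template would need adjusting.
    # - clusters:
    #     selector:
    #       matchLabels:
    #         environment: production   # clusters carrying this label
    # Option 2: generate from a list
    - list:
        elements:
          - env: dev
            cluster: https://dev.k8s.example.com
            namespace: dev
          - env: staging
            cluster: https://staging.k8s.example.com
            namespace: staging
          - env: production
            cluster: https://prod.k8s.example.com
            namespace: production
  template:
    metadata:
      name: "api-{{ .env }}"         # generator parameter
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/infra-repo.git
        targetRevision: main
        path: "apps/api/overlays/{{ .env }}"
      destination:
        server: "{{ .cluster }}"
        namespace: "{{ .namespace }}"
      syncPolicy:
        automated:
          prune: true
  # selfHeal is a boolean field and cannot hold a template string, so it is
  # set through templatePatch (ArgoCD 2.10+): self-heal only in dev.
  templatePatch: |
    spec:
      syncPolicy:
        automated:
          selfHeal: {{ eq .env "dev" }}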
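
To make the expansion concrete, the sketch below mimics what the list generator does with the three elements above: one Application per element, with the `env`, `cluster`, and `namespace` parameters substituted into the name, path, and destination. This is an illustration written for this section, not ArgoCD code; `expand_api_apps` is a hypothetical name.

```shell
#!/bin/sh
# Illustration only: mimic how a list generator expands one template
# into one Application per element.
# Each input line is "env cluster namespace", matching the list elements.
expand_api_apps() {
  while read -r env cluster namespace; do
    [ -n "$env" ] || continue   # skip blank lines
    printf 'name=api-%s path=apps/api/overlays/%s server=%s namespace=%s\n' \
      "$env" "$env" "$cluster" "$namespace"
  done
}

expand_api_apps <<'EOF'
dev https://dev.k8s.example.com dev
staging https://staging.k8s.example.com staging
production https://prod.k8s.example.com production
EOF
```

Running it prints three lines, one per generated Application, which is a quick way to sanity-check names and overlay paths before committing a new list element.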

Multi-cluster image promotion pipeline

sequenceDiagram
    participant CI as GitHub Actions
    participant INFRA as infra-repo
    participant ARGO as ArgoCD
    participant DEV as Dev cluster
    participant STG as Staging cluster
    participant PRD as Production cluster
    CI->>INFRA: Update dev image tag
    ARGO->>DEV: Auto-sync (selfHeal=true)
    DEV-->>ARGO: Deployed, health checks pass
    ARGO-->>CI: Webhook notification
    Note over CI: Run integration tests
    CI->>INFRA: Update staging image tag
    ARGO->>STG: Auto-sync
    STG-->>CI: Tests pass
    Note over CI: Wait for manual approval (GitHub Environments)
    CI->>INFRA: Update production image tag
    ARGO-->>PRD: Manually triggered sync (automated sync disabled)
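
Each "update image tag" step in the pipeline is just a commit to infra-repo. A minimal sketch of that edit follows, assuming the overlays are Kustomize overlays whose `kustomization.yaml` contains a single `newTag:` line. The helper name `set_image_tag` and the image name `myorg/api` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: set the image tag in a Kustomize overlay.
# Assumes the file has exactly one "newTag:" line and that the tag
# contains no "|" or "&" characters (true for semver tags and Git SHAs).
# Arguments: path to kustomization.yaml, new tag.
set_image_tag() {
  file=$1 tag=$2
  tmp=$(mktemp) || return 1
  # Go through a temp file because "sed -i" differs between GNU and BSD sed.
  sed "s|^\([[:space:]]*newTag:\).*|\1 \"${tag}\"|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}

# Demo against a throwaway file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
images:
  - name: myorg/api
    newTag: "1.0.0"
EOF
set_image_tag "$demo" "1.1.0"
cat "$demo"
rm -f "$demo"
```

In CI the call would target a path such as `apps/api/overlays/dev/kustomization.yaml` and be followed by `git commit` and `git push`. Where the Kustomize CLI is available, `kustomize edit set image` is a sturdier alternative to `sed`.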

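Disaster recovery: rebuilding a cluster

One of the biggest advantages of GitOps: after a cluster's state is lost, you only need to reinstall ArgoCD and point it at the Git repository, and every application defined there is restored automatically: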

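# 1. Create a new EKS cluster (Terraform)
terraform apply
# 2. Install ArgoCD
helm repo add argo https://argoproj.github.io/argo-helm
helm upgrade --install argocd argo/argo-cd -n argocd --create-namespace
# Re-register the target clusters first if their credentials were lost with the old cluster
# 3. Apply a single YAML file (the root Application)
kubectl apply -f clusters/production/root-app.yaml
# 4. ArgoCD syncs all child Applications and recreates their resources
argocd app wait production-root --health --timeout 600
# The cluster can be back in roughly 10-20 minutes (no separate runbook needed: Git is the runbook)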
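
The wait command above targets the root Application only, so a check across every Application is a useful last step before declaring recovery complete. The sketch below assumes the status has already been reduced to `name health sync` lines; `all_apps_ready` is a hypothetical helper, and the sample statuses are made up for the demo.

```shell
#!/bin/sh
# Hypothetical helper: read "name health sync" lines on stdin and succeed
# only if at least one app is listed and every app is Healthy and Synced.
all_apps_ready() {
  awk '
    NF == 0 { next }                       # ignore blank lines
    { total++ }
    $2 != "Healthy" || $3 != "Synced" { print "not ready: " $1; bad++ }
    END { if (total == 0 || bad > 0) exit 1 }
  '
}

# Demo with made-up statuses: one app is still rolling out.
printf '%s\n' \
  'api-production Healthy Synced' \
  'web-production Progressing Synced' | all_apps_ready || echo "still recovering"
```

With the real CLI, and assuming `jq` is installed, the input could come from `argocd app list -o json | jq -r '.[] | "\(.metadata.name) \(.status.health.status) \(.status.sync.status)"'`, repeated in a loop until the helper succeeds.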

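Common errors

Error	Cause	Fix
`cluster not registered`	The new cluster has not been registered in ArgoCD	Register it with `argocd cluster add`
App of Apps deletes itself	The root App pruned resources in the argocd namespace	Turn off `prune` in the root App's `syncPolicy.automated`, or annotate the resources to keep with `argocd.argoproj.io/sync-options: Prune=false`
`Unable to connect to cluster`	The token in the kubeconfig has expired	Re-run `argocd cluster add`, or rotate the ServiceAccount token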

Next chapter: Observability: Prometheus, Grafana, and Alerting