# Deployment Options and Platform Selection

Choosing the right deployment option directly affects cost, latency, and operational complexity.
## Deployment Options at a Glance
```mermaid
graph TB
    A[LLM Deployment Options] --> B[Cloud API Services]
    A --> C[Self-Hosted Deployment]
    A --> D[Hybrid Approach]
    B --> B1[OpenAI API]
    B --> B2[Anthropic Claude]
    B --> B3[Azure OpenAI]
    B --> B4[Google Vertex AI]
    C --> C1[vLLM]
    C --> C2[TGI]
    C --> C3[Ollama]
    D --> D1[Cloud API as primary]
    D --> D2[Self-hosted as fallback]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style D fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
```
## Option Comparison
| Dimension | Cloud API | Self-Hosted GPU | Self-Hosted CPU |
|---|---|---|---|
| Startup cost | $0 | $5K-50K | $1K-5K |
| Running cost | Pay-as-you-go | Fixed + electricity | Fixed |
| Latency | 200-1000 ms | 50-500 ms | 500-5000 ms |
| Model choice | Limited to vendor offerings | Any open-source model | Small models only |
| Ops complexity | Low | High | Medium |
| Data privacy | Data leaves your environment | Full control | Full control |
| Best-fit stage | Early / mid stage | At scale | Edge scenarios |
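To sanity-check the startup-cost column against the running-cost column, it helps to estimate the break-even traffic at which a self-hosted GPU becomes cheaper than a pay-as-you-go API. A minimal sketch, using purely illustrative numbers (the token price, amortized GPU cost, and throughput below are assumptions, not quotes):

```python
# Back-of-the-envelope break-even estimate: cloud API vs. self-hosted GPU.
# All figures are illustrative assumptions, not real prices.
API_COST_PER_1K_TOKENS = 0.002        # blended input/output price, USD
GPU_MONTHLY_COST = 1800.0             # amortized hardware + power + ops, USD/month
GPU_TOKENS_PER_MONTH = 2_000_000_000  # sustained throughput of one GPU server


def monthly_api_cost(tokens_per_month: float) -> float:
    """Cost of serving the same traffic through a pay-as-you-go API."""
    return tokens_per_month / 1000 * API_COST_PER_1K_TOKENS


# Break-even: traffic at which API spend equals the fixed GPU cost.
break_even_tokens = GPU_MONTHLY_COST / API_COST_PER_1K_TOKENS * 1000
print(f"Break-even at ~{break_even_tokens / 1e6:.0f}M tokens/month")
print(f"API cost at GPU capacity: ${monthly_api_cost(GPU_TOKENS_PER_MONTH):,.0f}/month")
```

Below the break-even volume, pay-as-you-go usually wins on total cost; above it, the fixed-cost GPU server starts to pay for itself, provided you can keep it busy.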
## Cloud API Integration Best Practices
"""
生产级 API 客户端封装
"""
import asyncio
import time
from dataclasses import dataclass
@dataclass
class APICallResult:
"""API 调用结果"""
content: str
model: str
tokens_in: int
tokens_out: int
latency_ms: float
cost_usd: float
class ProductionLLMClient:
"""生产级 LLM 客户端"""
# 各模型定价 (每 1K tokens)
PRICING = {
"gpt-4o": {"input": 0.0025, "output": 0.01},
"gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
"claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
"claude-3-5-haiku": {"input": 0.0008, "output": 0.004},
}
def __init__(self, default_model: str = "gpt-4o-mini"):
self.default_model = default_model
self.total_cost = 0.0
self.total_requests = 0
async def chat(
self,
messages: list[dict],
model: str = None,
max_tokens: int = 1024,
temperature: float = 0.7,
stream: bool = False,
) -> APICallResult:
"""发送聊天请求"""
model = model or self.default_model
start = time.time()
# --- 实际中替换为真实 API 调用 ---
# 这里展示关键逻辑结构
await asyncio.sleep(0.1) # 模拟网络延迟
content = f"Response from {model}"
tokens_in = sum(len(m["content"]) // 4 for m in messages)
tokens_out = len(content) // 4
latency = (time.time() - start) * 1000
# 计算成本
pricing = self.PRICING.get(model, {"input": 0.01, "output": 0.03})
cost = (
tokens_in / 1000 * pricing["input"]
+ tokens_out / 1000 * pricing["output"]
)
self.total_cost += cost
self.total_requests += 1
return APICallResult(
content=content,
model=model,
tokens_in=tokens_in,
tokens_out=tokens_out,
latency_ms=round(latency, 1),
cost_usd=round(cost, 6),
)
def get_stats(self) -> dict:
return {
"total_requests": self.total_requests,
"total_cost_usd": round(self.total_cost, 4),
"avg_cost_per_request": round(
self.total_cost / max(self.total_requests, 1), 6
),
}
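A quick usage sketch for the client above (the message content is illustrative; the mock `chat` call stands in for a real provider request):

```python
async def main() -> None:
    client = ProductionLLMClient(default_model="gpt-4o-mini")
    result = await client.chat(
        messages=[{"role": "user", "content": "Summarize our deployment options."}],
        max_tokens=256,
    )
    print(result.content, f"({result.latency_ms} ms, ${result.cost_usd})")
    print(client.get_stats())


asyncio.run(main())
```

Tracking per-request cost and latency in the client itself makes it easy to expose these numbers to monitoring later, instead of reconstructing them from provider invoices.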
## Self-Hosted Deployment with vLLM
"""
vLLM 部署配置(生产级)
启动命令:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 32768 \
--enable-prefix-caching \
--port 8000
"""
VLLM_CONFIG = {
"模型选择": {
"小型 (1-3B)": "适合边缘设备、嵌入式场景",
"中型 (7-13B)": "性价比最佳,单卡可跑",
"大型 (30-70B)": "效果好,需多卡",
},
"关键参数": {
"tensor-parallel-size": "GPU 并行数,70B 模型至少 2-4 卡",
"gpu-memory-utilization": "GPU 显存利用率,建议 0.85-0.95",
"max-model-len": "最大上下文长度,影响显存占用",
"max-num-batched-tokens": "批处理 token 上限,影响吞吐",
"enable-prefix-caching": "前缀缓存,提升重复前缀的性能",
},
"硬件推荐": {
"7B 模型": "1x A100 40GB 或 1x RTX 4090",
"13B 模型": "1x A100 80GB 或 2x RTX 4090",
"70B 模型": "4x A100 80GB",
},
}
# Docker Compose 部署配置
DOCKER_COMPOSE = """
# docker-compose.yml
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ports:
- "8000:8000"
environment:
- NVIDIA_VISIBLE_DEVICES=all
- HF_TOKEN=${HF_TOKEN}
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--tensor-parallel-size 1
--gpu-memory-utilization 0.9
--max-model-len 4096
--enable-prefix-caching
volumes:
- ./model-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
count: 1
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
depends_on:
- vllm
"""
print("vLLM Docker Compose 配置已生成")
print("\n关键参数说明:")
for param, desc in VLLM_CONFIG["关键参数"].items():
print(f" --{param}: {desc}")
## Containerized Deployment
"""
生产级 Dockerfile
"""
DOCKERFILE = '''
# 多阶段构建
FROM python:3.12-slim as builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 生产镜像
FROM python:3.12-slim
# 安全: 非 root 用户
RUN useradd -m -u 1000 appuser
WORKDIR /app
# 复制依赖
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
# 复制代码
COPY . .
# 安全: 切换用户
USER appuser
# 健康检查
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \\
CMD curl -f http://localhost:8000/health || exit 1
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
'''
# Kubernetes 部署配置
K8S_DEPLOYMENT = """
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-service
spec:
replicas: 3
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: llm-service
template:
metadata:
labels:
app: llm-service
spec:
containers:
- name: llm-service
image: registry/llm-service:v1.0
ports:
- containerPort: 8000
resources:
requests:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 10
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secrets
key: openai-api-key
"""
print("Dockerfile 和 K8s 配置模板已准备")
## Chapter Summary
| Scenario | Recommended approach | Rationale |
|---|---|---|
| Startup / small team | Cloud API (GPT-4o-mini) | Zero ops, pay-as-you-go |
| Data-sensitive workloads | Self-hosted vLLM | Data never leaves your environment |
| High concurrency | Cloud API + caching | Elastic scaling |
| Edge deployment | Ollama + small model | Low-cost, works offline |
| Mixed workloads | Cloud API primary + local fallback (sketched below) | Balances cost and availability |
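For the hybrid row above, the core pattern is a thin router that prefers the cloud API and falls back to a local endpoint when the call fails or times out. A minimal sketch reusing the `ProductionLLMClient` mock from earlier (the two model names and the timeout value are illustrative assumptions):

```python
import asyncio


async def chat_with_fallback(
    primary: ProductionLLMClient,
    fallback: ProductionLLMClient,
    messages: list[dict],
    timeout_s: float = 5.0,
) -> APICallResult:
    """Prefer the cloud API; fall back to the local endpoint on error or timeout."""
    try:
        return await asyncio.wait_for(primary.chat(messages), timeout=timeout_s)
    except Exception:
        # In production: log the failure and only fall back on retryable errors.
        return await fallback.chat(messages)


result = asyncio.run(
    chat_with_fallback(
        ProductionLLMClient("gpt-4o-mini"),
        ProductionLLMClient("llama-3.1-8b-local"),  # hypothetical local model name
        [{"role": "user", "content": "Hello"}],
    )
)
print(result.model, result.content)
```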
Next chapter: performance optimization techniques.