
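LLM Production Trends

LLM production engineering is evolving quickly. This chapter surveys the key trends so that teams can plan ahead.

Trend Overview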

graph TB
    A[LLM Production Trends] --> B[Inference Optimization]
    A --> C[Model Evolution]
    A --> D[Deployment Patterns]
    A --> E[Engineering Practice]
    B --> B1[Speculative Decoding]
    B --> B2[Sparse Attention]
    C --> C1[Widespread MoE Architectures]
    C --> C2[Small-Model Resurgence]
    D --> D1["Edge Inference<br/>On-Device LLM"]
    D --> D2[Serverless LLM]
    E --> E1[Maturing LLMOps]
    E --> E2[AI Gateway Standardization]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px

Inference Optimization Trends

Speculative Decoding

"""
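Conceptual sketch of speculative decoding
"""
from dataclasses import dataclass


@dataclass
class SpeculativeDecodingConfig:
    """Speculative decoding configuration."""
    draft_model: str          # small model (e.g. 7B)
    target_model: str         # large model (e.g. 70B)
    num_speculative_tokens: int = 5    # tokens drafted per step
    acceptance_threshold: float = 0.8  # acceptance threshold


def speculative_decode_step(draft_model, target_model, context: str, k: int = 5):
    """
    One speculative decoding step:
    1. The small model drafts k candidate tokens.
    2. The large model verifies them in parallel.
    3. Accept the matching prefix; on the first rejection, resample from the large model.
    Speedup: roughly 2-3x, depending on how often the draft agrees with the target.
    """
    # Step 1: the draft model quickly generates k tokens
    draft_tokens = []
    for _ in range(k):
        token = "draft_token"  # placeholder for draft_model.generate(context, max_new_tokens=1)
        draft_tokens.append(token)
    # Step 2: the target model verifies in parallel (a single forward pass),
    # scoring every candidate token at once
    accepted = []
    for token in draft_tokens:
        # Placeholder: a real implementation accepts the token only if the
        # target model agrees, and stops accepting at the first rejection.
        accepted.append(token)
    return accepted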
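
The step function above leaves the draft and verification calls as placeholders. The following is a minimal runnable sketch of the greedy variant, under the assumption that a "model" is just a function from a token list to its next token. The `NextToken` alias and the toy models `toy_target` and `toy_draft` are hypothetical names introduced for illustration, not any library's API. A real system scores all k positions in one batched forward pass, and the sampling variant uses rejection sampling instead of an exact-match check.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token sequence to its greedy next token.
NextToken = Callable[[List[str]], str]


def speculative_step(draft: NextToken, target: NextToken,
                     context: List[str], k: int = 5) -> List[str]:
    """One greedy speculative decoding step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each position in order. This loop only
    #    mimics the outcome of a single batched verification pass.
    accepted: List[str] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # 3. First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # Every proposal matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


# Toy models (assumptions for the demo): the target recites a fixed sentence,
# and the draft disagrees with it at position 3 only.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def toy_target(seq: List[str]) -> str:
    return SENTENCE[len(seq) % len(SENTENCE)]


def toy_draft(seq: List[str]) -> str:
    return "cat" if len(seq) == 3 else toy_target(seq)


first = speculative_step(toy_draft, toy_target, [], k=5)
second = speculative_step(toy_draft, toy_target, first, k=5)
print(first)   # three accepted draft tokens plus the target's correction
print(second)  # five accepted draft tokens plus one bonus token
```

Because every emitted token is either confirmed or produced by the target, the greedy variant yields the same output as decoding with the target alone. The gain comes from emitting several tokens per target pass when the draft agrees.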

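Technique Comparison

Technique	Speedup	Memory overhead	Code changes	Quality impact
Speculative decoding	2-3x	+20% (draft model)	Medium	Lossless
KV cache quantization	1.3-1.5x	-50% cache	Low	Minimal loss
PagedAttention	1.5-2x	-30% fragmentation	Low	Lossless
Flash Attention	2-3x	-20% peak	Low	Lossless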
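
To make the "-50% cache" figure for KV cache quantization concrete, here is a small back-of-the-envelope sketch. It uses the standard size formula (keys plus values, per layer, per KV head, per position). The model shape below is a hypothetical configuration chosen for round numbers, not a measurement of any specific model, and the sketch ignores the small extra storage for quantization scales.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int) -> int:
    """Approximate KV cache size; the factor 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape and workload (assumptions for illustration).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128,
             seq_len=8192, batch_size=8)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(**shape, bytes_per_elem=1)

GIB = 2 ** 30
print(f"FP16 cache: {fp16_bytes / GIB:.1f} GiB")
print(f"INT8 cache: {int8_bytes / GIB:.1f} GiB")
print(f"Reduction:  {1 - int8_bytes / fp16_bytes:.0%}")
```

Going from 16-bit to 8-bit elements halves the cache, which is where the 50% figure comes from. 4-bit schemes would cut it further at a higher risk of quality loss.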

The Small-Model Resurgence

graph LR
    A[Small-Model Trend] --> B[1-3B Parameters]
    B --> C["On-Phone Inference<br/>under 2 GB memory"]
    B --> D["IoT Devices<br/>Raspberry Pi"]
    B --> E["In-Browser Inference<br/>WebLLM / ONNX"]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px

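Model	Parameters	Target device	Latency	Primary use
Phi-3-mini	3.8B	Phone/PC	50 ms/token	Text generation
Gemma-2-2B	2B	Phone	30 ms/token	Lightweight assistant
Qwen2.5-1.5B	1.5B	IoT	80 ms/token	Instruction following
SmolLM-135M	135M	Browser	10 ms/token	Simple completion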
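
The "under 2 GB" budget in the diagram only holds with quantized weights, and the latency column is easier to compare as tokens per second. The sketch below works through that arithmetic with the parameter counts and latencies from the table. The helper names are hypothetical, and the memory estimate covers weights only (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency into decoding throughput."""
    return 1000.0 / ms_per_token


# (parameters in billions, latency in ms/token), taken from the table above.
MODELS = {
    "Phi-3-mini": (3.8, 50),
    "Gemma-2-2B": (2.0, 30),
    "Qwen2.5-1.5B": (1.5, 80),
    "SmolLM-135M": (0.135, 10),
}

for name, (params_b, ms) in MODELS.items():
    fp16 = weight_memory_gb(params_b, 16)
    int4 = weight_memory_gb(params_b, 4)
    rate = tokens_per_second(ms)
    print(f"{name}: FP16 {fp16:.2f} GB, 4-bit {int4:.2f} GB, {rate:.1f} tokens/s")
```

At 16 bits per weight, a 3.8B model needs about 7.6 GB for weights alone. At 4 bits it drops to about 1.9 GB, which is what makes phone-class memory budgets plausible.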

Serverless LLM

"""
Serverless LLM abstraction layer
"""
import math
from dataclasses import dataclass
from enum import Enum


class ScalingMode(Enum):
    SCALE_TO_ZERO = "scale_to_zero"     # scale to 0 when there are no requests
    MIN_INSTANCES = "min_instances"     # keep a minimum number of instances
    AUTO_SCALE = "auto_scale"           # scale automatically with queue depth


@dataclass
class ServerlessConfig:
    """Serverless configuration."""
    model: str
    scaling_mode: ScalingMode = ScalingMode.AUTO_SCALE
    min_replicas: int = 0
    max_replicas: int = 10
    target_queue_depth: int = 5         # target queue depth per instance
    cold_start_timeout_s: float = 30.0  # cold-start timeout in seconds


class ServerlessLLM:
    """Serverless LLM service."""
    # Reference cold-start times in seconds, keyed by model size
    COLD_START_ESTIMATES = {
        "7b": 15,
        "13b": 25,
        "70b": 60,
    }

    def __init__(self, config: ServerlessConfig):
        self.config = config
        self._instances = 0

    def estimate_cold_start(self) -> float:
        """Estimate the cold-start time from the model name."""
        # Note: plain substring matching, so "7b" also matches names such as "27b".
        for size_key, seconds in self.COLD_START_ESTIMATES.items():
            if size_key in self.config.model.lower():
                return float(seconds)
        return 30.0  # default

    def calculate_replicas(self, queue_depth: int) -> int:
        """Compute the number of instances needed."""
        if queue_depth == 0 and self.config.scaling_mode == ScalingMode.SCALE_TO_ZERO:
            return 0
        # Round up so that a partially filled queue still gets its own replica
        needed = max(1, math.ceil(queue_depth / self.config.target_queue_depth))
        return min(
            max(needed, self.config.min_replicas),
            self.config.max_replicas,
        )
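
One consequence of the numbers above is worth spelling out: the 70B cold-start estimate (60 s) exceeds the default `cold_start_timeout_s` of 30 s, so scaling such a model to zero would make the first request time out. The sketch below turns that observation into a simple heuristic. The `suggested_min_replicas` helper and its decision rule are assumptions for illustration, not part of the class above.

```python
# Cold-start estimates in seconds, copied from COLD_START_ESTIMATES above.
COLD_START_ESTIMATES_S = {"7b": 15, "13b": 25, "70b": 60}


def suggested_min_replicas(size_key: str, timeout_s: float = 30.0) -> int:
    """Return 0 if scale-to-zero fits the timeout, else 1 (keep one warm replica).

    Hypothetical heuristic: if a cold start alone would exceed the timeout,
    the first request after an idle period cannot succeed.
    """
    cold_start_s = COLD_START_ESTIMATES_S[size_key]
    return 0 if cold_start_s <= timeout_s else 1


for size in COLD_START_ESTIMATES_S:
    floor = suggested_min_replicas(size)
    mode = "scale to zero is fine" if floor == 0 else "keep one warm replica"
    print(f"{size}: cold start {COLD_START_ESTIMATES_S[size]} s -> {mode}")
```

In practice the threshold would also need headroom for the inference itself, so a model whose cold start sits just under the timeout may still warrant a warm replica.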

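LLMOps Toolchain Maturity

Area	Tools	Maturity	Notes
Inference frameworks	vLLM, TGI	★★★★★	Production-grade
Orchestration	LangChain, LlamaIndex	★★★★☆	Iterating rapidly
Evaluation	LangSmith, Braintrust	★★★☆☆	Standards not yet unified
Monitoring	Langfuse, Helicone	★★★☆☆	Ecosystem still consolidating
Safety	Guardrails, NeMo	★★★☆☆	Still being hardened
Fine-tuning	Axolotl, Unsloth	★★★★☆	Mature tooling
Data management	Argilla, Label Studio	★★★★☆	Labeling and evaluation combined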

Outlook

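graph TB
    A[2024-2025] --> B["Inference Efficiency<br/>Speculative decoding / MoE go mainstream"]
    A --> C["On-Device Deployment<br/>Phone / browser LLMs"]
    A --> D["Agent Infrastructure<br/>MCP / tool standardization"]
    E[2025-2026] --> F["Serverless GPU<br/>Cold starts in seconds"]
    E --> G["Multimodal Standardization<br/>Unified inference interface"]
    E --> H["Autonomous Agents<br/>Long-term memory + planning"]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style E fill:#e8f5e9,stroke:#388e3c,stroke-width:2px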

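Chapter Summary

Trend	Timeline	Production impact
Speculative decoding	Available now	Up to 2-3x faster decoding, depending on draft/target agreement
Small models	Available now	Lower cost in edge scenarios
Serverless LLM	Maturing in 2025	Pay per use, minimal operations work
LLMOps	Evolving rapidly	Standardized toolchain
Agent infrastructure	2025-2026	Convergence on protocols such as MCP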

Further reading: LLM Evaluation and Testing Guide · AI Agent Practical Guide