# Model Benchmarking and Selection

By 2024-2025 the LLM market has settled into a multipolar competitive landscape. Choosing the right model means balancing performance, cost, latency, and compliance requirements.
## The Model Ecosystem at a Glance

```mermaid
graph TB
    A[LLM Ecosystem] --> B[Closed-source commercial]
    A --> C[Open source]
    A --> D[Chinese domestic]
    B --> B1[GPT-4o / o1]
    B --> B2[Claude 3.5 / Opus]
    B --> B3[Gemini 1.5 Pro]
    C --> C1[Llama 3.1]
    C --> C2[Mistral Large]
    C --> C3[Qwen2.5]
    D --> D1[DeepSeek-V3]
    D --> D2[GLM-4]
    D --> D3["文心一言 4.0 (ERNIE Bot)"]
    style B fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style C fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style D fill:#fff9c4,stroke:#f9a825,stroke-width:2px
```
## A Model Evaluation Framework

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    REASONING = "reasoning"
    CODING = "coding"
    CHINESE = "chinese"
    MATH = "math"
    CREATIVE = "creative writing"
    INSTRUCTION = "instruction following"
    MULTIMODAL = "multimodal"


@dataclass
class ModelProfile:
    name: str
    provider: str
    params_b: float                # parameter count (billions)
    context_window: int            # context window (tokens)
    input_price: float             # $ / 1M input tokens
    output_price: float            # $ / 1M output tokens
    scores: dict[Capability, int]  # 1-10 rating per capability

    @property
    def avg_score(self) -> float:
        if not self.scores:
            return 0.0
        return sum(self.scores.values()) / len(self.scores)

    @property
    def cost_per_1k_avg(self) -> float:
        """Average cost per 1K tokens (assuming a 1:1 input/output ratio)."""
        return (self.input_price + self.output_price) / 2 / 1000

    @property
    def value_score(self) -> float:
        """Value for money = capability / cost."""
        cost = self.cost_per_1k_avg
        if cost == 0:
            return float("inf")
        return self.avg_score / cost


# Profiles of mainstream models (late-2024 data; parameter counts
# for closed-source models are estimates)
MODELS = [
    ModelProfile(
        "GPT-4o", "OpenAI", 200, 128000,
        2.5, 10.0,
        {
            Capability.REASONING: 9,
            Capability.CODING: 9,
            Capability.CHINESE: 8,
            Capability.MATH: 9,
            Capability.CREATIVE: 8,
            Capability.INSTRUCTION: 9,
            Capability.MULTIMODAL: 9,
        },
    ),
    ModelProfile(
        "Claude-3.5-Sonnet", "Anthropic", 175, 200000,
        3.0, 15.0,
        {
            Capability.REASONING: 9,
            Capability.CODING: 10,
            Capability.CHINESE: 8,
            Capability.MATH: 8,
            Capability.CREATIVE: 9,
            Capability.INSTRUCTION: 10,
            Capability.MULTIMODAL: 8,
        },
    ),
    ModelProfile(
        "DeepSeek-V3", "DeepSeek", 671, 65536,
        0.14, 0.28,
        {
            Capability.REASONING: 8,
            Capability.CODING: 9,
            Capability.CHINESE: 9,
            Capability.MATH: 9,
            Capability.CREATIVE: 7,
            Capability.INSTRUCTION: 8,
            Capability.MULTIMODAL: 0,
        },
    ),
    ModelProfile(
        "Llama-3.1-70B", "Meta", 70, 128000,
        0.0, 0.0,  # open weights, free when self-hosted
        {
            Capability.REASONING: 7,
            Capability.CODING: 7,
            Capability.CHINESE: 5,
            Capability.MATH: 7,
            Capability.CREATIVE: 6,
            Capability.INSTRUCTION: 7,
            Capability.MULTIMODAL: 0,
        },
    ),
    ModelProfile(
        "Qwen2.5-72B", "Alibaba", 72, 131072,
        0.0, 0.0,  # open weights, free when self-hosted
        {
            Capability.REASONING: 8,
            Capability.CODING: 8,
            Capability.CHINESE: 10,
            Capability.MATH: 8,
            Capability.CREATIVE: 7,
            Capability.INSTRUCTION: 8,
            Capability.MULTIMODAL: 0,
        },
    ),
]


def recommend_model(
    primary_capability: Capability,
    budget_per_1m: float | None = None,
) -> list[tuple[ModelProfile, int]]:
    """Recommend models ranked by the required core capability."""
    candidates = []
    for m in MODELS:
        score = m.scores.get(primary_capability, 0)
        if score == 0:  # model lacks this capability entirely
            continue
        if budget_per_1m is not None:
            avg_cost = (m.input_price + m.output_price) / 2
            if avg_cost > budget_per_1m:
                continue
        candidates.append((m, score))
    candidates.sort(key=lambda x: x[1], reverse=True)
    return candidates


# Example: the strongest models for Chinese-language tasks
for model, score in recommend_model(Capability.CHINESE):
    print(f"{model.name}: Chinese={score}, avg={model.avg_score:.1f}")
```
## Head-to-Head Comparison of Mainstream Models

| Model | Provider | Params | Context | Chinese | Coding | Reasoning | Price tier |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | ~200B | 128K | ★★★★ | ★★★★★ | ★★★★★ | Mid-high |
| Claude-3.5-Sonnet | Anthropic | ~175B | 200K | ★★★★ | ★★★★★ | ★★★★★ | Mid-high |
| Gemini 1.5 Pro | Google | undisclosed | 1M | ★★★★ | ★★★★ | ★★★★ | Mid |
| DeepSeek-V3 | DeepSeek | 671B (MoE) | 64K | ★★★★★ | ★★★★★ | ★★★★ | Very low |
| Llama 3.1 70B | Meta | 70B | 128K | ★★★ | ★★★★ | ★★★★ | Free |
| Qwen2.5-72B | Alibaba | 72B | 128K | ★★★★★ | ★★★★ | ★★★★ | Free |
| GLM-4 | Zhipu AI | undisclosed | 128K | ★★★★★ | ★★★★ | ★★★★ | Low |
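The gap between the price tiers is easier to feel with concrete numbers. A minimal back-of-envelope sketch, using the list prices above ($ per 1M tokens); the 50M-input / 10M-output monthly volume is a hypothetical workload, not from the text:

```python
# List prices from the comparison above, $ per 1M tokens: (input, output)
PRICING = {
    "GPT-4o": (2.5, 10.0),
    "Claude-3.5-Sonnet": (3.0, 15.0),
    "DeepSeek-V3": (0.14, 0.28),
}


def monthly_cost(input_m: float, output_m: float, prices: tuple[float, float]) -> float:
    """Dollar cost for input_m million input tokens + output_m million output tokens."""
    in_price, out_price = prices
    return input_m * in_price + output_m * out_price


# Hypothetical workload: 50M input + 10M output tokens per month
for name, prices in PRICING.items():
    print(f"{name}: ${monthly_cost(50, 10, prices):,.2f}/month")
# GPT-4o: 50*2.5 + 10*10 = $225; DeepSeek-V3: 50*0.14 + 10*0.28 = $9.80
```

At this volume DeepSeek-V3 comes in at roughly 1/20 the cost of GPT-4o, which is what the "very low" tier means in practice.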
## Selection Decision Tree

```mermaid
graph TD
    A[Choose an LLM] --> B{Can data leave the country?}
    B -->|No| C{Self-host?}
    B -->|Yes| D{Budget}
    C -->|Yes| E[Qwen2.5 / Llama 3.1]
    C -->|No| F[DeepSeek-V3 / GLM-4]
    D -->|Ample| G{Core need}
    D -->|Tight| H[DeepSeek-V3]
    G -->|Coding| I[Claude-3.5-Sonnet]
    G -->|General| J[GPT-4o]
    G -->|Long documents| K[Gemini 1.5 Pro]
    style A fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    style E fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style H fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style I fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
```
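The decision tree above can be sketched as a plain function. This is a minimal sketch: the inputs are collapsed to booleans and a single `core_need` string, and the function name and signature are illustrative, not a real API:

```python
def choose_llm(
    data_can_leave_country: bool,
    can_self_host: bool = False,
    budget_ample: bool = False,
    core_need: str = "general",  # "coding" | "general" | "long_docs"
) -> str:
    """Walk the selection decision tree and return a model suggestion."""
    if not data_can_leave_country:
        # Data must stay in-country: self-host open weights, or use a domestic API
        return "Qwen2.5 / Llama 3.1" if can_self_host else "DeepSeek-V3 / GLM-4"
    if not budget_ample:
        return "DeepSeek-V3"
    # Ample budget: branch on the core need
    return {
        "coding": "Claude-3.5-Sonnet",
        "long_docs": "Gemini 1.5 Pro",
    }.get(core_need, "GPT-4o")


print(choose_llm(True, budget_ample=True, core_need="coding"))  # Claude-3.5-Sonnet
print(choose_llm(False, can_self_host=True))                    # Qwen2.5 / Llama 3.1
```

Note the ordering: compliance is checked before budget or capability, because data-residency rules eliminate options outright rather than merely penalizing them.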
## Selection Recommendations

| Scenario | First choice | Alternative | Rationale |
|---|---|---|---|
| Chinese content creation | Qwen2.5-72B | DeepSeek-V3 | Most natural Chinese output |
| Code generation | Claude-3.5-Sonnet | GPT-4o | Best code quality and instruction following |
| Long-document processing | Gemini 1.5 Pro | Claude-3.5-Sonnet | Million-token context |
| High volume on a low budget | DeepSeek-V3 | GPT-4o-mini | Best value for money |
| Data compliance (China) | Self-hosted Qwen2.5 | GLM-4 | Data never leaves the country |
| Multimodal | GPT-4o | Gemini 1.5 Pro | Mature image-text understanding |
## Chapter Summary

- There is no universal model: choose by scenario rather than chasing the "strongest".
- DeepSeek-V3 is the value-for-money king: its MoE architecture keeps inference costs extremely low.
- For Chinese-language work, Qwen comes first: its Chinese ability is far ahead of the pack.
- For coding, Claude comes first: it leads in code quality and instruction following.
- Mix models with a router: small models for simple tasks, large models for complex ones.
- Watch compliance requirements: data-residency restrictions narrow the field before anything else.

Further reading: the open-source model ecosystem
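The routing point in the summary can be made concrete with a crude heuristic: route a request to a cheap model unless it looks complex, then escalate to a strong one. A production router would typically use a trained classifier; the model names, markers, and length threshold below are all illustrative:

```python
def route(prompt: str) -> str:
    """Pick a model tier with a crude complexity heuristic (illustrative only)."""
    hard_markers = ("prove", "refactor", "step by step", "architecture", "debug")
    # Long prompts or "hard" keywords escalate to the strong (expensive) model
    looks_hard = len(prompt) > 400 or any(m in prompt.lower() for m in hard_markers)
    return "claude-3-5-sonnet" if looks_hard else "deepseek-chat"


print(route("What is the capital of France?"))   # deepseek-chat
print(route("Refactor this module and walk me through it step by step"))
# -> claude-3-5-sonnet
```

Even a rough router like this captures the economics: if most traffic is simple, the blended cost per request approaches the cheap model's price while hard requests still get the strong model's quality.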