Model Comparison and Selection

By 2024-2025 the LLM market has settled into a multipolar competitive landscape. Choosing the right model means balancing capability, cost, latency, and compliance requirements.

The Model Ecosystem at a Glance

```mermaid
graph TB
    A[LLM model ecosystem] --> B[Closed-source commercial]
    A --> C[Open weights]
    A --> D[Chinese domestic]
    B --> B1[GPT-4o / o1]
    B --> B2[Claude 3.5 / Opus]
    B --> B3[Gemini 1.5 Pro]
    C --> C1[Llama 3.1]
    C --> C2[Mistral Large]
    C --> C3[Qwen2.5]
    D --> D1[DeepSeek-V3]
    D --> D2[GLM-4]
    D --> D3[ERNIE 4.0]
    style B fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style C fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style D fill:#fff9c4,stroke:#f9a825,stroke-width:2px
```

A Model Evaluation Framework

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    REASONING = "reasoning"
    CODING = "coding"
    CHINESE = "Chinese"
    MATH = "math"
    CREATIVE = "creative writing"
    INSTRUCTION = "instruction following"
    MULTIMODAL = "multimodal"


@dataclass
class ModelProfile:
    name: str
    provider: str
    params_b: float          # parameter count (billions)
    context_window: int      # context window (tokens)
    input_price: float       # $/1M tokens
    output_price: float      # $/1M tokens
    scores: dict[Capability, int]  # ratings on a 1-10 scale

    @property
    def avg_score(self) -> float:
        if not self.scores:
            return 0.0
        return sum(self.scores.values()) / len(self.scores)

    @property
    def cost_per_1k_avg(self) -> float:
        """Average cost per 1K tokens (assuming a 1:1 input/output mix)."""
        return (self.input_price + self.output_price) / 2 / 1000

    @property
    def value_score(self) -> float:
        """Value for money = capability / cost."""
        cost = self.cost_per_1k_avg
        if cost == 0:
            return float("inf")
        return self.avg_score / cost


# Profiles of mainstream models (late-2024 data)
MODELS = [
    ModelProfile(
        "GPT-4o", "OpenAI", 200, 128000,
        2.5, 10.0,
        {
            Capability.REASONING: 9,
            Capability.CODING: 9,
            Capability.CHINESE: 8,
            Capability.MATH: 9,
            Capability.CREATIVE: 8,
            Capability.INSTRUCTION: 9,
            Capability.MULTIMODAL: 9,
        },
    ),
    ModelProfile(
        "Claude-3.5-Sonnet", "Anthropic", 175, 200000,
        3.0, 15.0,
        {
            Capability.REASONING: 9,
            Capability.CODING: 10,
            Capability.CHINESE: 8,
            Capability.MATH: 8,
            Capability.CREATIVE: 9,
            Capability.INSTRUCTION: 10,
            Capability.MULTIMODAL: 8,
        },
    ),
    ModelProfile(
        "DeepSeek-V3", "DeepSeek", 671, 65536,
        0.14, 0.28,
        {
            Capability.REASONING: 8,
            Capability.CODING: 9,
            Capability.CHINESE: 9,
            Capability.MATH: 9,
            Capability.CREATIVE: 7,
            Capability.INSTRUCTION: 8,
            Capability.MULTIMODAL: 0,
        },
    ),
    ModelProfile(
        "Llama-3.1-70B", "Meta", 70, 128000,
        0.0, 0.0,  # open weights, free (self-hosted)
        {
            Capability.REASONING: 7,
            Capability.CODING: 7,
            Capability.CHINESE: 5,
            Capability.MATH: 7,
            Capability.CREATIVE: 6,
            Capability.INSTRUCTION: 7,
            Capability.MULTIMODAL: 0,
        },
    ),
    ModelProfile(
        "Qwen2.5-72B", "Alibaba", 72, 131072,
        0.0, 0.0,  # open weights, free
        {
            Capability.REASONING: 8,
            Capability.CODING: 8,
            Capability.CHINESE: 10,
            Capability.MATH: 8,
            Capability.CREATIVE: 7,
            Capability.INSTRUCTION: 8,
            Capability.MULTIMODAL: 0,
        },
    ),
]


def recommend_model(
    primary_capability: Capability,
    budget_per_1m: float | None = None,
) -> list[tuple[ModelProfile, int]]:
    """Recommend models for the capability that matters most."""
    candidates = []
    for m in MODELS:
        score = m.scores.get(primary_capability, 0)
        if score == 0:
            continue
        if budget_per_1m is not None:
            avg_cost = (m.input_price + m.output_price) / 2
            if avg_cost > budget_per_1m:
                continue
        candidates.append((m, score))
    candidates.sort(key=lambda x: x[1], reverse=True)
    return candidates


# Example: recommend the strongest models for Chinese
for model, score in recommend_model(Capability.CHINESE):
    print(f"{model.name}: Chinese={score}, avg={model.avg_score:.1f}")
```
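The `value_score` property is easy to sanity-check by hand. A minimal standalone sketch, using only the GPT-4o and DeepSeek-V3 prices and score sums from the listing above:

```python
# Standalone re-check of the value_score idea. 61 and 50 are the sums of
# the seven capability scores for GPT-4o and DeepSeek-V3 in the listing.
def value(avg_score: float, input_price: float, output_price: float) -> float:
    """Capability per dollar: average score divided by mean $ per 1K tokens."""
    cost_per_1k = (input_price + output_price) / 2 / 1000
    return float("inf") if cost_per_1k == 0 else avg_score / cost_per_1k

gpt4o = value(61 / 7, 2.5, 10.0)
deepseek = value(50 / 7, 0.14, 0.28)
print(f"GPT-4o value: {gpt4o:.0f}, DeepSeek-V3 value: {deepseek:.0f}")
```

Despite a slightly lower average score, DeepSeek-V3's value score comes out more than 20x higher than GPT-4o's, which is exactly the "price/performance king" pattern the comparison table below reflects.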

Head-to-Head Comparison of Mainstream Models

| Model | Provider | Params | Context | Chinese | Coding | Reasoning | Pricing tier |
|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | ~200B | 128K | ★★★★ | ★★★★★ | ★★★★★ | mid-high |
| Claude-3.5-Sonnet | Anthropic | ~175B | 200K | ★★★★ | ★★★★★ | ★★★★★ | mid-high |
| Gemini 1.5 Pro | Google | undisclosed | 1M | ★★★★ | ★★★★ | ★★★★ | |
| DeepSeek-V3 | DeepSeek | 671B (MoE) | 64K | ★★★★★ | ★★★★★ | ★★★★ | very low |
| Llama 3.1 70B | Meta | 70B | 128K | ★★★ | ★★★★ | ★★★★ | free |
| Qwen2.5-72B | Alibaba | 72B | 128K | ★★★★★ | ★★★★ | ★★★★ | free |
| GLM-4 | Zhipu AI | undisclosed | 128K | ★★★★★ | ★★★★ | ★★★★ | |

A Selection Decision Tree

```mermaid
graph TD
    A[Choose an LLM] --> B{Can data leave the country?}
    B -->|No| C{Self-hosting?}
    B -->|Yes| D{Budget}
    C -->|Yes| E[Qwen2.5 / Llama 3.1]
    C -->|No| F[DeepSeek-V3 / GLM-4]
    D -->|Ample| G{Core need}
    D -->|Tight| H[DeepSeek-V3]
    G -->|Coding| I[Claude-3.5-Sonnet]
    G -->|General| J[GPT-4o]
    G -->|Long documents| K[Gemini 1.5 Pro]
    style A fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    style E fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style H fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style I fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
```
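The decision tree above can be sketched as a plain function. The flag names and return strings below are illustrative choices mirroring the diagram's branches, not an API:

```python
# A sketch of the selection decision tree as a function. Each branch
# follows the diagram: compliance first, then budget, then core need.
def choose_model(
    data_may_leave: bool,
    self_host: bool = False,
    budget_ample: bool = False,
    core_need: str = "general",  # "coding" | "general" | "long-docs"
) -> str:
    if not data_may_leave:
        # Data must stay in-country: open weights or domestic API
        return "Qwen2.5 / Llama 3.1" if self_host else "DeepSeek-V3 / GLM-4"
    if not budget_ample:
        return "DeepSeek-V3"
    return {
        "coding": "Claude-3.5-Sonnet",
        "general": "GPT-4o",
        "long-docs": "Gemini 1.5 Pro",
    }.get(core_need, "GPT-4o")

print(choose_model(data_may_leave=False, self_host=True))  # Qwen2.5 / Llama 3.1
print(choose_model(data_may_leave=True, budget_ample=True, core_need="coding"))
```

Encoding the tree as code makes the policy testable and keeps the selection criteria reviewable in version control as models and prices change.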

Selection Recommendations

| Scenario | First choice | Alternative | Rationale |
|---|---|---|---|
| Chinese content creation | Qwen2.5-72B | DeepSeek-V3 | Most natural Chinese output |
| Code generation | Claude-3.5-Sonnet | GPT-4o | Best code quality and instruction following |
| Long-document processing | Gemini 1.5 Pro | Claude-3.5-Sonnet | Million-token context window |
| High volume, low cost | DeepSeek-V3 | GPT-4o-mini | Price/performance king |
| Data compliance (China) | Self-hosted Qwen2.5 | GLM-4 | Data stays in-country |
| Multimodal | GPT-4o | Gemini 1.5 Pro | Mature image-text understanding |
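To see how much the pricing tiers matter in practice, here is a rough monthly-bill comparison using the listed per-token prices. The 600M-input / 400M-output monthly volume is an assumed workload, not data from the text:

```python
# Rough monthly-bill comparison from the per-token prices listed above.
# The 600M-input / 400M-output monthly traffic split is an assumption.
PRICES = {  # $ per 1M tokens: (input, output)
    "GPT-4o": (2.5, 10.0),
    "DeepSeek-V3": (0.14, 0.28),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Total $ for input_m million input and output_m million output tokens."""
    inp, out = PRICES[model]
    return input_m * inp + output_m * out

for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 600, 400):,.0f}/month")
```

At this volume the same workload costs roughly $5,500/month on GPT-4o versus about $196/month on DeepSeek-V3, which is why the high-volume row above points to the low-cost tier first.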

Chapter Summary

Further reading: the open-source model ecosystem