# Fine-Tuning Evaluation Benchmarks and Metrics

A fine-tuned model needs systematic evaluation; watching the loss curve alone is not enough. This chapter builds a complete evaluation framework.

## Evaluation Dimensions

```mermaid
graph TB
    A[Fine-tuning evaluation] --> B[Task performance]
    A --> C[General capability]
    A --> D[Safety alignment]
    A --> E[Runtime efficiency]
    B --> B1[Target-task accuracy<br/>F1 / BLEU / ROUGE]
    C --> C1[Benchmarks<br/>MMLU / HellaSwag]
    D --> D1[Harmful-output rate<br/>refusal rate / compliance]
    E --> E1[Inference latency<br/>throughput / cost]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
```

## Core Metrics

| Metric | Task type | Computation | Baseline |
| --- | --- | --- | --- |
| Accuracy | Classification | Correct predictions / total | > base model |
| F1 Score | NER / classification | Harmonic mean of precision and recall | > 0.85 |
| BLEU | Translation / generation | n-gram overlap | > 30 |
| ROUGE-L | Summarization | Longest common subsequence | > 40 |
| Win Rate | Preference alignment | Human / LLM-judge win rate | > 50% |
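
In practice you would rarely hand-roll BLEU or ROUGE. A minimal sketch using the third-party `sacrebleu` and `rouge-score` packages (an assumption for illustration; neither is a dependency of the framework below):

```python
# Sketch: the BLEU / ROUGE-L columns above, computed with off-the-shelf
# libraries. Assumes `pip install sacrebleu rouge-score`.
import sacrebleu
from rouge_score import rouge_scorer

hyps = ["the cat sat on the mat"]
refs = [["the cat is sitting on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score)  # 0-100 scale, matching the "> 30" baseline above

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(refs[0][0], hyps[0])  # (reference, prediction)
print(rouge["rougeL"].fmeasure)  # in [0, 1]; "> 40" means > 0.40 here
```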

## Evaluation Framework Implementation

"""
微调模型评估框架
"""
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
class MetricType(Enum):
ACCURACY = "accuracy"
F1 = "f1"
BLEU = "bleu"
ROUGE = "rouge"
WIN_RATE = "win_rate"
LATENCY = "latency"
@dataclass
class EvalSample:
"""评估样本"""
prompt: str
expected: str
actual: str = ""
metadata: dict = field(default_factory=dict)
@dataclass
class EvalResult:
"""评估结果"""
metric: MetricType
score: float
samples: int
details: dict = field(default_factory=dict)
class FineTuneEvaluator:
"""微调评估器"""
def __init__(self):
self._results: list[EvalResult] = []
def eval_accuracy(self, samples: list[EvalSample]) -> EvalResult:
"""准确率评估"""
correct = sum(
1 for s in samples
if s.actual.strip().lower() == s.expected.strip().lower()
)
score = correct / len(samples) if samples else 0
result = EvalResult(
metric=MetricType.ACCURACY,
score=score,
samples=len(samples),
details={"correct": correct, "total": len(samples)},
)
self._results.append(result)
return result
def eval_rouge_l(self, samples: list[EvalSample]) -> EvalResult:
"""ROUGE-L 评估(简化版)"""
scores = []
for s in samples:
lcs_len = self._lcs_length(s.expected, s.actual)
precision = lcs_len / len(s.actual) if s.actual else 0
recall = lcs_len / len(s.expected) if s.expected else 0
f1 = (
2 * precision * recall / (precision + recall)
if (precision + recall) > 0
else 0
)
scores.append(f1)
avg_score = sum(scores) / len(scores) if scores else 0
result = EvalResult(
metric=MetricType.ROUGE,
score=avg_score,
samples=len(samples),
)
self._results.append(result)
return result
def eval_win_rate(
self,
prompts: list[str],
baseline_responses: list[str],
finetuned_responses: list[str],
judge_fn=None,
) -> EvalResult:
"""Win Rate 评估"""
wins = 0
ties = 0
losses = 0
for prompt, base_resp, ft_resp in zip(
prompts, baseline_responses, finetuned_responses
):
if judge_fn:
verdict = judge_fn(prompt, base_resp, ft_resp)
else:
# 简单长度启发式(实际应用 LLM-as-Judge)
verdict = 1 if len(ft_resp) > len(base_resp) else -1
if verdict > 0:
wins += 1
elif verdict == 0:
ties += 1
else:
losses += 1
total = wins + ties + losses
win_rate = wins / total if total else 0
result = EvalResult(
metric=MetricType.WIN_RATE,
score=win_rate,
samples=total,
details={"wins": wins, "ties": ties, "losses": losses},
)
self._results.append(result)
return result
def generate_report(self) -> dict:
"""生成评估报告"""
return {
"metrics": [
{
"metric": r.metric.value,
"score": round(r.score, 4),
"samples": r.samples,
"details": r.details,
}
for r in self._results
],
"overall_pass": all(r.score > 0.5 for r in self._results),
}
@staticmethod
def _lcs_length(s1: str, s2: str) -> int:
"""最长公共子序列长度"""
m, n = len(s1), len(s2)
if m == 0 or n == 0:
return 0
prev = [0] * (n + 1)
for i in range(1, m + 1):
curr = [0] * (n + 1)
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
curr[j] = prev[j - 1] + 1
else:
curr[j] = max(prev[j], curr[j - 1])
prev = curr
return prev[n]
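
The framework in action, with made-up samples and a stub judge (both are illustrative assumptions; a real `judge_fn` would call an LLM-as-Judge):

```python
# Usage sketch with fabricated data; `actual` would normally come from
# running the fine-tuned model on each prompt.
evaluator = FineTuneEvaluator()

samples = [
    EvalSample(prompt="2 + 2 = ?", expected="4", actual="4"),
    EvalSample(prompt="Capital of France?", expected="Paris", actual="paris"),
]
evaluator.eval_accuracy(samples)  # 2/2 correct -> score 1.0
evaluator.eval_rouge_l(samples)

def stub_judge(prompt: str, base: str, finetuned: str) -> int:
    """Hypothetical judge: 1 = fine-tuned wins, 0 = tie, -1 = baseline wins.
    Replace with an LLM-as-Judge call in practice."""
    return 1 if len(finetuned) > len(base) else 0

evaluator.eval_win_rate(
    prompts=["Summarize the report."],
    baseline_responses=["Short answer."],
    finetuned_responses=["A longer, more grounded summary."],
    judge_fn=stub_judge,
)
print(evaluator.generate_report())
```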

## General Capability Retention Checks

```mermaid
graph LR
    A[Post-fine-tuning checks] --> B[Target task improved?]
    A --> C[General capability preserved?]
    A --> D[Safety alignment intact?]
    B --> E[✓ PASS: beats the base model on the task]
    C --> F[✓ PASS: MMLU drop under 2%]
    D --> G[✓ PASS: harmful rate below base model]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style E fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
```
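
A minimal sketch of the capability-retention check, assuming you already have benchmark scores for both models (the score values below are placeholders; produce real ones with a harness such as lm-evaluation-harness):

```python
# Sketch: flag general-capability regressions against the base model.
def capability_retained(base: dict[str, float],
                        finetuned: dict[str, float],
                        max_drop: float = 0.02) -> bool:
    """True if no benchmark dropped by more than `max_drop` (relative)."""
    for name, base_score in base.items():
        drop = (base_score - finetuned.get(name, 0.0)) / base_score
        if drop > max_drop:
            print(f"REGRESSION: {name} dropped {drop:.1%}")
            return False
    return True

base_scores = {"mmlu": 0.652, "hellaswag": 0.801}  # placeholder values
ft_scores = {"mmlu": 0.649, "hellaswag": 0.795}    # placeholder values
print(capability_retained(base_scores, ft_scores))  # True: all drops < 2%
```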

## Evaluation Checklist

| Check | Criterion | Pass condition |
| --- | --- | --- |
| Target task | Accuracy / F1 | Beats base model by 5%+ |
| General capability | Benchmarks such as MMLU | Drop < 2% |
| Generation quality | Win Rate | > 50% |
| Harmful output | Safety evaluation | Harmful rate < 1% |
| Inference latency | P95 latency | Comparable to base model |
| Overfitting | Validation loss | Gap to training loss < 20% |
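
This checklist maps directly onto an automated release gate. A minimal sketch, assuming a flat `metrics` dict whose keys and thresholds are illustrative (the latency tolerance in particular is an arbitrary stand-in for "comparable"):

```python
# Sketch: release gate enforcing the checklist above. All keys and
# thresholds are assumptions for illustration, not a fixed schema.
def release_gate(m: dict[str, float]) -> bool:
    checks = {
        "target_task":     m["task_f1"] >= m["base_task_f1"] * 1.05,  # +5% over base
        "general_ability": m["mmlu_drop"] < 0.02,                     # < 2% drop
        "gen_quality":     m["win_rate"] > 0.50,
        "safety":          m["harmful_rate"] < 0.01,
        "latency":         m["p95_latency_ratio"] <= 1.10,            # "comparable": +-10% assumed
        "overfitting":     m["val_train_loss_gap"] < 0.20,
    }
    for name, ok in checks.items():
        print(f"{'PASS' if ok else 'FAIL'} {name}")
    return all(checks.values())
```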

## Chapter Summary

| Takeaway | Notes |
| --- | --- |
| Multi-dimensional evaluation | Task performance + general capability + safety + efficiency |
| Win Rate | The essential metric for DPO/RLHF evaluation |
| Capability retention | Fine-tuning must not sacrifice general capability |
| Automation | Integrate evaluation into CI/CD so it runs automatically after every training run |

Next chapter: Model Comparison and Selection