# Fine-Tuning Evaluation: Benchmarks and Metrics

A fine-tuned model needs systematic evaluation; watching the loss curve alone is not enough. This chapter builds out a complete evaluation framework.
## Evaluation Dimensions

```mermaid
graph TB
    A[Fine-tuning evaluation] --> B[Task performance]
    A --> C[General capability]
    A --> D[Safety alignment]
    A --> E[Runtime efficiency]
    B --> B1[Target-task accuracy<br/>F1/BLEU/ROUGE]
    C --> C1[Benchmarks<br/>MMLU/HellaSwag]
    D --> D1[Harmful-output rate<br/>refusal rate/compliance]
    E --> E1[Inference latency<br/>throughput/cost]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
```
## Core Metrics

| Metric | Task type | Computation | Baseline |
|---|---|---|---|
| Accuracy | Classification | correct predictions / total | > base model |
| F1 Score | NER / classification | harmonic mean of precision and recall | > 0.85 |
| BLEU | Translation / generation | n-gram overlap | > 30 |
| ROUGE-L | Summarization | longest common subsequence | > 40 |
| Win Rate | Preference alignment | human / LLM-judge win rate | > 50% |
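The BLEU row above can be made concrete with a minimal single-reference sketch: clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. This is illustrative only; for reporting, use a standard implementation such as sacrebleu, whose scores are conventionally scaled to 0-100 (matching the "> 30" baseline in the table).

```python
import math
from collections import Counter

def bleu(candidate: list[str], reference: list[str], max_n: int = 4) -> float:
    """Minimal single-reference BLEU in [0, 1]: geometric mean of
    clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(
            tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)
        )
        ref_ngrams = Counter(
            tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)
        )
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / max(len(candidate), 1)
    )
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0 and a fully disjoint candidate scores 0.0; multiply by 100 to compare against the table's baseline.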
## Evaluation Framework Implementation

```python
"""Evaluation framework for fine-tuned models."""
from dataclasses import dataclass, field
from enum import Enum


class MetricType(Enum):
    ACCURACY = "accuracy"
    F1 = "f1"
    BLEU = "bleu"
    ROUGE = "rouge"
    WIN_RATE = "win_rate"
    LATENCY = "latency"


@dataclass
class EvalSample:
    """A single evaluation sample."""
    prompt: str
    expected: str
    actual: str = ""
    metadata: dict = field(default_factory=dict)


@dataclass
class EvalResult:
    """Result of one metric evaluation."""
    metric: MetricType
    score: float
    samples: int
    details: dict = field(default_factory=dict)


class FineTuneEvaluator:
    """Evaluator for fine-tuned models."""

    def __init__(self):
        self._results: list[EvalResult] = []

    def eval_accuracy(self, samples: list[EvalSample]) -> EvalResult:
        """Exact-match accuracy (case- and whitespace-insensitive)."""
        correct = sum(
            1 for s in samples
            if s.actual.strip().lower() == s.expected.strip().lower()
        )
        score = correct / len(samples) if samples else 0
        result = EvalResult(
            metric=MetricType.ACCURACY,
            score=score,
            samples=len(samples),
            details={"correct": correct, "total": len(samples)},
        )
        self._results.append(result)
        return result

    def eval_rouge_l(self, samples: list[EvalSample]) -> EvalResult:
        """Simplified character-level ROUGE-L (F1 over LCS precision/recall)."""
        scores = []
        for s in samples:
            lcs_len = self._lcs_length(s.expected, s.actual)
            precision = lcs_len / len(s.actual) if s.actual else 0
            recall = lcs_len / len(s.expected) if s.expected else 0
            f1 = (
                2 * precision * recall / (precision + recall)
                if (precision + recall) > 0
                else 0
            )
            scores.append(f1)
        avg_score = sum(scores) / len(scores) if scores else 0
        result = EvalResult(
            metric=MetricType.ROUGE,
            score=avg_score,
            samples=len(samples),
        )
        self._results.append(result)
        return result

    def eval_win_rate(
        self,
        prompts: list[str],
        baseline_responses: list[str],
        finetuned_responses: list[str],
        judge_fn=None,
    ) -> EvalResult:
        """Win rate of the fine-tuned model against the baseline."""
        wins = 0
        ties = 0
        losses = 0
        for prompt, base_resp, ft_resp in zip(
            prompts, baseline_responses, finetuned_responses
        ):
            if judge_fn:
                verdict = judge_fn(prompt, base_resp, ft_resp)
            else:
                # Crude length heuristic; use LLM-as-Judge in practice.
                verdict = 1 if len(ft_resp) > len(base_resp) else -1
            if verdict > 0:
                wins += 1
            elif verdict == 0:
                ties += 1
            else:
                losses += 1
        total = wins + ties + losses
        win_rate = wins / total if total else 0
        result = EvalResult(
            metric=MetricType.WIN_RATE,
            score=win_rate,
            samples=total,
            details={"wins": wins, "ties": ties, "losses": losses},
        )
        self._results.append(result)
        return result

    def generate_report(self) -> dict:
        """Aggregate all collected results into a report dict."""
        return {
            "metrics": [
                {
                    "metric": r.metric.value,
                    "score": round(r.score, 4),
                    "samples": r.samples,
                    "details": r.details,
                }
                for r in self._results
            ],
            # Rough global gate; real pipelines use per-metric thresholds.
            "overall_pass": all(r.score > 0.5 for r in self._results),
        }

    @staticmethod
    def _lcs_length(s1: str, s2: str) -> int:
        """Longest-common-subsequence length, O(n) memory."""
        m, n = len(s1), len(s2)
        if m == 0 or n == 0:
            return 0
        prev = [0] * (n + 1)
        for i in range(1, m + 1):
            curr = [0] * (n + 1)
            for j in range(1, n + 1):
                if s1[i - 1] == s2[j - 1]:
                    curr[j] = prev[j - 1] + 1
                else:
                    curr[j] = max(prev[j], curr[j - 1])
            prev = curr
        return prev[n]
```
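`eval_win_rate` accepts a `judge_fn` callable returning 1 (fine-tuned wins), 0 (tie), or -1 (baseline wins). A production judge would prompt a strong LLM with both responses; as a hedged stand-in, here is a toy keyword-coverage judge (the function name and scoring rule are illustrative assumptions, not part of the framework above):

```python
def keyword_judge(keywords: set[str]):
    """Build a toy judge: whichever response covers more of the given
    keywords wins. Purely illustrative; a real pipeline should replace
    this with an LLM-as-Judge call using the same 1/0/-1 contract."""
    def judge(prompt: str, base_resp: str, ft_resp: str) -> int:
        base_hits = sum(1 for k in keywords if k in base_resp)
        ft_hits = sum(1 for k in keywords if k in ft_resp)
        if ft_hits > base_hits:
            return 1   # fine-tuned wins
        if ft_hits == base_hits:
            return 0   # tie
        return -1      # baseline wins
    return judge
```

It plugs in directly: `evaluator.eval_win_rate(prompts, base, ft, judge_fn=keyword_judge({"loss", "lora"}))`.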
## General-Capability Retention Tests

```mermaid
graph LR
    A[Post-fine-tuning checks] --> B[Target task improved?]
    A --> C[General capability retained?]
    A --> D[Safety alignment intact?]
    B --> E[✓ PASS: beats base model on the task]
    C --> F[✓ PASS: MMLU drop < 2%]
    D --> G[✓ PASS: harmful rate < base model]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style E fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
```
## Evaluation Checklist

| Check | Criterion | Pass condition |
|---|---|---|
| Target task | accuracy / F1 | beats base model by 5%+ |
| General capability | MMLU and similar benchmarks | drop < 2% |
| Generation quality | win rate | > 50% |
| Harmful output | safety evaluation | harmful rate < 1% |
| Inference latency | P95 latency | comparable to base model |
| Overfitting | validation-set loss | gap vs. training loss < 20% |
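The overfitting row can be checked mechanically. A minimal sketch, assuming the 20% threshold refers to the relative gap between validation and training loss:

```python
def overfit_gap(train_loss: float, val_loss: float) -> float:
    """Relative gap between validation and training loss."""
    if train_loss <= 0:
        raise ValueError("train_loss must be positive")
    return (val_loss - train_loss) / train_loss

def passes_overfit_check(train_loss: float, val_loss: float,
                         threshold: float = 0.20) -> bool:
    """True when the gap is below the checklist's 20% threshold."""
    return overfit_gap(train_loss, val_loss) < threshold
```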
## Chapter Summary

| Takeaway | Notes |
|---|---|
| Multi-dimensional evaluation | task performance + general capability + safety + efficiency |
| Win Rate | the go-to metric for DPO/RLHF evaluation |
| Capability retention | fine-tuning must not sacrifice general ability |
| Automation | integrate evaluation into CI/CD; run it on every training job |
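The automation row can be enforced as a CI gate over the dict produced by `generate_report()`. A minimal sketch; the per-metric thresholds roughly mirror the checklist but are illustrative assumptions to tune against your own baselines:

```python
# Illustrative per-metric thresholds (scores in [0, 1]).
THRESHOLDS = {"accuracy": 0.8, "rouge": 0.4, "win_rate": 0.5}

def gate(report: dict) -> bool:
    """True when every reported metric meets its threshold.
    Metrics without a configured threshold are not gated."""
    return all(
        m["score"] >= THRESHOLDS[m["metric"]]
        for m in report["metrics"]
        if m["metric"] in THRESHOLDS
    )
```

A CI wrapper would load the evaluation run's JSON report and call `sys.exit(0 if gate(report) else 1)` so that any regression fails the pipeline.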
Next chapter: Model Comparison and Selection