
Human Evaluation and Crowdsourced Annotation

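Automated evaluation is fast and efficient, but it often falls short on subtle quality issues such as tone, logical coherence, and cultural appropriateness. Human evaluation is the gold standard, and it is the calibration source for every automated metric.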

Human Evaluation Architecture

graph TB
    A[Human evaluation system] --> B[Internal expert evaluation]
    A --> C[Crowdsourcing platform evaluation]
    A --> D[User feedback<br/>RLHF signal]
    B --> B1[High expertise<br/>Low scale<br/>High cost]
    C --> C1[Medium expertise<br/>High scale<br/>Medium cost]
    D --> D1[Low expertise<br/>Very high scale<br/>Low cost]
    style A fill:#ede7f6,stroke:#5e35b1,stroke-width:2px
    style B fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style D fill:#fff9c4,stroke:#f9a825,stroke-width:2px

Annotation Task Design

from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class AnnotationMode(Enum):
    LIKERT_SCALE = "likert"       # Likert scale (1–5)
    COMPARATIVE = "comparative"   # A vs. B comparison
    BINARY = "binary"             # pass/fail
    RUBRIC = "rubric"             # per-dimension rubric scoring


@dataclass
class AnnotationTask:
    """Definition of an annotation task."""
    task_id: str
    prompt: str
    response_a: str
    response_b: Optional[str]     # only set for comparative evaluation
    annotation_mode: AnnotationMode
    rubric: dict[str, str]        # scoring guidelines

    @property
    def is_comparative(self) -> bool:
        # bool(...) so the property returns True/False rather than response_b itself
        return (self.annotation_mode == AnnotationMode.COMPARATIVE
                and bool(self.response_b))


@dataclass
class AnnotationResult:
    """Result of a single annotation."""
    task_id: str
    annotator_id: str
    score: Union[float, str]      # numeric score, or "A wins" / "B wins" / "tie"
    rubric_scores: dict[str, float]
    comment: str
    time_spent_sec: int


class IAA:
    """Inter-annotator agreement (IAA) calculator."""

    @staticmethod
    def cohens_kappa(ratings_a: list[int], ratings_b: list[int]) -> float:
        """
        Compute Cohen's kappa for two annotators.

        Kappa > 0.8:   almost perfect agreement
        Kappa 0.6–0.8: substantial agreement
        Kappa 0.4–0.6: moderate agreement
        Kappa < 0.4:   low agreement; the annotation guidelines need redesigning
        """
        if len(ratings_a) != len(ratings_b):
            raise ValueError("both annotators must rate the same number of items")
        if not ratings_a:
            raise ValueError("at least one rated item is required")
        n = len(ratings_a)
        categories = sorted(set(ratings_a) | set(ratings_b))
        # Observed agreement
        po = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n
        # Agreement expected by chance
        pe = 0.0
        for cat in categories:
            pa = ratings_a.count(cat) / n
            pb = ratings_b.count(cat) / n
            pe += pa * pb
        if pe == 1.0:
            return 1.0
        return (po - pe) / (1 - pe)

    @staticmethod
    def fleiss_kappa(ratings_matrix: list[list[int]], n_categories: int) -> float:
        """
        Fleiss' kappa, for more than two annotators.

        ratings_matrix: 2-D matrix indexed as [item][annotator].
        Categories must be coded 0 .. n_categories - 1 (shift 1–5 Likert scores
        down by one first), and every item needs the same number of annotators.
        """
        n_items = len(ratings_matrix)
        n_raters = len(ratings_matrix[0])
        # Overall share of ratings that fall into each category
        pj_list = []
        for j in range(n_categories):
            pj = sum(row.count(j) for row in ratings_matrix) / (n_items * n_raters)
            pj_list.append(pj)
        Pe = sum(pj ** 2 for pj in pj_list)
        # Per-item agreement: share of annotator pairs that chose the same category
        Pi_list = []
        for row in ratings_matrix:
            counts = [row.count(j) for j in range(n_categories)]
            pi = sum(c * (c - 1) for c in counts) / (n_raters * (n_raters - 1))
            Pi_list.append(pi)
        Po = sum(Pi_list) / n_items
        if Pe == 1.0:  # every rating is in one category; avoid dividing by zero
            return 1.0
        return (Po - Pe) / (1 - Pe)


# Example: agreement between two annotators
rater_a = [4, 5, 3, 4, 2, 5, 3, 4, 4, 5]
rater_b = [4, 4, 3, 5, 2, 5, 4, 4, 3, 5]
kappa = IAA.cohens_kappa(rater_a, rater_b)
print(f"Cohen's Kappa: {kappa:.3f}")
if kappa >= 0.8:
    print("✅ Excellent agreement; the annotations can be trusted")
elif kappa >= 0.6:
    print("⚠️ Good agreement; consider clarifying the annotation guidelines")
else:
    print("❌ Insufficient agreement; the annotators need retraining")
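The sample ratings above are ordinal Likert scores, and plain Cohen's kappa counts a one-point disagreement as heavily as a four-point one. One possible refinement, which is not part of the original code, is a quadratically weighted kappa. The sketch below assumes integer ratings on an evenly spaced scale; the function name `weighted_kappa` is hypothetical.

```python
def weighted_kappa(ratings_a: list[int], ratings_b: list[int]) -> float:
    """Quadratically weighted Cohen's kappa for ordinal integer ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))

    # Observed disagreement: mean squared distance between paired ratings.
    # The usual 1 / (k - 1)^2 normalisation is omitted because it cancels
    # in the ratio below.
    observed = sum((a - b) ** 2 for a, b in zip(ratings_a, ratings_b)) / n

    # Expected disagreement if the two annotators rated independently
    expected = 0.0
    for i in categories:
        pa = ratings_a.count(i) / n
        for j in categories:
            pb = ratings_b.count(j) / n
            expected += pa * pb * (i - j) ** 2

    if expected == 0.0:  # both annotators used one identical category throughout
        return 1.0
    return 1.0 - observed / expected


rater_a = [4, 5, 3, 4, 2, 5, 3, 4, 4, 5]
rater_b = [4, 4, 3, 5, 2, 5, 4, 4, 3, 5]
kappa_w = weighted_kappa(rater_a, rater_b)
print(f"Quadratic weighted kappa: {kappa_w:.3f}")
```

On these ten ratings every disagreement is exactly one point apart, so the weighted value (about 0.775) is much higher than the unweighted 0.429. Which of the two is appropriate depends on whether near-misses should count as partial agreement for the task at hand.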

Crowdsourcing Platform Comparison

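Platform	Expertise	Scale	Quality control	Cost per item (USD)	Best suited for
Amazon MTurk	Low	Very high	Gold tasks	$0.01–0.05	Simple classification
Scale AI	High	High	Expert review	$0.5–5.0	Complex annotation
Labelbox	Medium	Medium	Built-in review	$0.1–1.0	Multimodal
Surge AI	Medium-high	Medium	Region filtering	$0.2–2.0	NLP tasks
Internal team	Very high	Low	Direct control	$5–50	Sensitive or core data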
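Because this chapter's quality-control workflow has each item labelled by three independent annotators, the per-item prices in the table multiply quickly. The following is a rough budgeting sketch, assuming the listed price applies to each individual judgment (the table does not say whether it does); `estimate_budget` and `PRICE_PER_ITEM` are hypothetical names.

```python
# Per-item price ranges in USD, copied from the comparison table above.
PRICE_PER_ITEM = {
    "Amazon MTurk": (0.01, 0.05),
    "Scale AI": (0.5, 5.0),
    "Labelbox": (0.1, 1.0),
    "Surge AI": (0.2, 2.0),
    "Internal team": (5.0, 50.0),
}


def estimate_budget(platform: str, n_items: int,
                    annotators_per_item: int = 3) -> tuple[float, float]:
    """Return the (low, high) cost in USD for labelling n_items."""
    low, high = PRICE_PER_ITEM[platform]
    # Assumption: every annotator's judgment is billed at the per-item price
    n_judgments = n_items * annotators_per_item
    return round(low * n_judgments, 2), round(high * n_judgments, 2)


for platform in PRICE_PER_ITEM:
    low, high = estimate_budget(platform, n_items=1_000)
    print(f"{platform:<14} ${low:>12,.2f} to ${high:>12,.2f}")
```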

Annotation Quality Control Workflow

graph LR
    A[Task distribution] --> B[3 independent annotations]
    B --> C{IAA ≥ 0.7?}
    C -->|Yes| D[Majority vote decides the label]
    C -->|No| E[Expert arbitration]
    D --> F[Write to dataset]
    E --> F
    style C fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style E fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    style F fill:#c8e6c9,stroke:#43a047,stroke-width:2px
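The flowchart can be sketched in code roughly as follows. This is one interpretation, not the author's implementation: it assumes the IAA check is a Fleiss' kappa computed over a whole batch, and it also sends items without a strict majority to the expert even when the batch passes. The names `route_batch`, `majority_label`, and `IAA_THRESHOLD` are hypothetical.

```python
from collections import Counter
from typing import Optional, Union

IAA_THRESHOLD = 0.7  # threshold taken from the flowchart


def fleiss_kappa(ratings: list[list[int]], categories: list[int]) -> float:
    """Fleiss' kappa for a batch; ratings[i] holds all annotators' labels for item i."""
    n_items, n_raters = len(ratings), len(ratings[0])
    total = n_items * n_raters
    # Chance agreement from the overall share of each category
    p_e = sum((sum(row.count(c) for row in ratings) / total) ** 2 for c in categories)
    # Observed agreement: share of agreeing annotator pairs, averaged over items
    p_o = sum(
        sum(k * (k - 1) for k in Counter(row).values()) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def majority_label(row: list[int]) -> Optional[int]:
    """Return the label chosen by more than half of the annotators, else None."""
    label, votes = Counter(row).most_common(1)[0]
    return label if votes > len(row) / 2 else None


def route_batch(ratings: list[list[int]],
                categories: list[int]) -> dict[int, Union[int, str]]:
    """Map item index to its accepted label, or to "expert" for arbitration."""
    batch_passes = fleiss_kappa(ratings, categories) >= IAA_THRESHOLD
    decisions: dict[int, Union[int, str]] = {}
    for idx, row in enumerate(ratings):
        label = majority_label(row) if batch_passes else None
        decisions[idx] = label if label is not None else "expert"
    return decisions


likert = [1, 2, 3, 4, 5]
batch_ok = [[4, 4, 4], [2, 2, 2], [5, 5, 5], [3, 3, 3], [1, 1, 2]]
batch_bad = [[1, 2, 3], [4, 4, 5], [2, 3, 3], [5, 1, 2]]
print(route_batch(batch_ok, likert))
print(route_batch(batch_bad, likert))
```

The first batch clears the threshold, so each item takes its majority label; the second does not, so every item is routed to expert arbitration.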

Chapter Summary

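At least 3 independent annotators: this keeps any single annotator's bias from skewing the result
IAA ≥ 0.7 (Kappa): treat results with agreement below this threshold as unreliable
Include counterexamples in the annotation guidelines: the more concrete the examples, the higher the agreement
Calibrate against gold samples: periodically insert items with known answers to detect annotation drift
Crowdsourcing for volume, experts for quality: high-stakes domains (medical, legal) require expert annotation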
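The gold-sample check in the summary could look roughly like the sketch below. It assumes gold items are mixed into the normal task stream and identified by task ID; the thresholds (`min_accuracy`, `max_drop`) and function names are illustrative assumptions, not values from this chapter.

```python
def gold_accuracy(answers: dict[str, int], gold: dict[str, int]) -> float:
    """Share of the gold items seen by an annotator that were labelled correctly."""
    seen = [task_id for task_id in gold if task_id in answers]
    if not seen:
        raise ValueError("annotator has not labelled any gold items")
    return sum(answers[t] == gold[t] for t in seen) / len(seen)


def flag_drift(accuracy_by_window: list[float],
               min_accuracy: float = 0.8, max_drop: float = 0.1) -> bool:
    """Flag an annotator whose latest gold accuracy is low or has dropped sharply."""
    latest = accuracy_by_window[-1]
    if latest < min_accuracy:
        return True
    if len(accuracy_by_window) > 1:
        # Compare against the first window, used here as the baseline
        return accuracy_by_window[0] - latest > max_drop
    return False


gold = {"g1": 4, "g2": 2, "g3": 5, "g4": 3}
week1 = {"g1": 4, "g2": 2, "g3": 5, "g4": 3, "t9": 1}  # "t9" is a regular task
week2 = {"g1": 4, "g2": 3, "g3": 4, "g4": 3}

history = [gold_accuracy(week1, gold), gold_accuracy(week2, gold)]
print(history, "drift flagged:", flag_drift(history))
```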

Next chapter: Multi-Model Comparative Evaluation