Human Evaluation and Crowdsourced Annotation

Automated evaluation is fast and cheap, but it often misses subtle quality issues such as tone, logical coherence, and cultural appropriateness. Human evaluation remains the gold standard, and it is the calibration source for every automated metric.

Architecture of a Human Evaluation System

```mermaid
graph TB
    A[Human evaluation system] --> B[Internal expert review]
    A --> C[Crowdsourced annotation]
    A --> D[User feedback<br/>RLHF signal]
    B --> B1[High expertise<br/>Low scale<br/>High cost]
    C --> C1[Medium expertise<br/>High scale<br/>Medium cost]
    D --> D1[Low expertise<br/>Very high scale<br/>Low cost]
    style A fill:#ede7f6,stroke:#5e35b1,stroke-width:2px
    style B fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style D fill:#fff9c4,stroke:#f9a825,stroke-width:2px
```

Designing Annotation Tasks

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class AnnotationMode(Enum):
    LIKERT_SCALE = "likert"       # Likert scale (1-5)
    COMPARATIVE = "comparative"   # A vs B comparison
    BINARY = "binary"             # pass/fail
    RUBRIC = "rubric"             # per-dimension rubric scoring

@dataclass
class AnnotationTask:
    """Definition of an annotation task."""
    task_id: str
    prompt: str
    response_a: str
    response_b: Optional[str]     # set only for comparative evaluation
    annotation_mode: AnnotationMode
    rubric: dict[str, str]        # scoring guidelines

    @property
    def is_comparative(self) -> bool:
        return (self.annotation_mode == AnnotationMode.COMPARATIVE
                and self.response_b is not None)

@dataclass
class AnnotationResult:
    """A single annotator's result."""
    task_id: str
    annotator_id: str
    score: float | str            # numeric score, or "A wins"/"B wins"/"tie"
    rubric_scores: dict[str, float]
    comment: str
    time_spent_sec: int

class IAA:
    """Inter-Annotator Agreement calculator."""

    @staticmethod
    def cohens_kappa(ratings_a: list[int], ratings_b: list[int]) -> float:
        """
        Cohen's Kappa agreement coefficient for two annotators.
        Kappa > 0.8:    almost perfect agreement
        Kappa 0.6-0.8:  substantial
        Kappa 0.4-0.6:  moderate
        Kappa < 0.4:    low -- redesign the annotation guidelines
        """
        assert len(ratings_a) == len(ratings_b), "rating lists must be equal length"
        n = len(ratings_a)
        categories = sorted(set(ratings_a) | set(ratings_b))
        # Observed agreement
        po = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b) / n
        # Expected (chance) agreement
        pe = 0.0
        for cat in categories:
            pa = ratings_a.count(cat) / n
            pb = ratings_b.count(cat) / n
            pe += pa * pb
        if pe == 1.0:
            return 1.0
        return (po - pe) / (1 - pe)

    @staticmethod
    def fleiss_kappa(ratings_matrix: list[list[int]], n_categories: int) -> float:
        """
        Fleiss' Kappa, for three or more annotators.
        ratings_matrix: [item][rater] matrix of category labels
        """
        n_items = len(ratings_matrix)
        n_raters = len(ratings_matrix[0])
        # Proportion of all ratings assigned to each category
        pj_list = []
        for j in range(n_categories):
            pj = sum(row.count(j) for row in ratings_matrix) / (n_items * n_raters)
            pj_list.append(pj)
        Pe = sum(pj ** 2 for pj in pj_list)
        Pi_list = []
        for row in ratings_matrix:
            counts = [row.count(j) for j in range(n_categories)]
            pi = sum(c * (c - 1) for c in counts) / (n_raters * (n_raters - 1))
            Pi_list.append(pi)
        Po = sum(Pi_list) / n_items
        return (Po - Pe) / (1 - Pe)

# Example: agreement between two annotators
rater_a = [4, 5, 3, 4, 2, 5, 3, 4, 4, 5]
rater_b = [4, 4, 3, 5, 2, 5, 4, 4, 3, 5]
kappa = IAA.cohens_kappa(rater_a, rater_b)
print(f"Cohen's Kappa: {kappa:.3f}")
if kappa >= 0.8:
    print("✅ Excellent agreement -- the labels can be trusted")
elif kappa >= 0.6:
    print("⚠️ Good agreement -- consider tightening the annotation guidelines")
else:
    print("❌ Insufficient agreement -- retrain the annotators")
```
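The example above only exercises Cohen's kappa, which is limited to two annotators; the three-annotator quality-control workflow later in this chapter relies on Fleiss' kappa instead. A standalone sketch of how it would be called (the function body mirrors `IAA.fleiss_kappa`, and the example matrix is made up):

```python
def fleiss_kappa(ratings_matrix: list[list[int]], n_categories: int) -> float:
    """Standalone copy of IAA.fleiss_kappa so this example runs on its own."""
    n_items = len(ratings_matrix)
    n_raters = len(ratings_matrix[0])
    # Proportion of all ratings assigned to each category
    pj_list = [
        sum(row.count(j) for row in ratings_matrix) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    Pe = sum(pj ** 2 for pj in pj_list)
    # Per-item agreement, averaged over items
    Po = sum(
        sum(c * (c - 1) for c in (row.count(j) for j in range(n_categories)))
        / (n_raters * (n_raters - 1))
        for row in ratings_matrix
    ) / n_items
    return (Po - Pe) / (1 - Pe)

# 4 items rated by 3 annotators with binary labels; rows are items, columns raters.
matrix = [
    [0, 0, 1],   # partial agreement
    [1, 1, 1],   # full agreement
    [0, 0, 0],   # full agreement
    [1, 0, 1],   # partial agreement
]
print(f"Fleiss' Kappa: {fleiss_kappa(matrix, n_categories=2):.3f}")  # 0.333
```

With two of four items in full agreement, observed agreement is 2/3 against a chance level of 1/2, giving kappa = 1/3: "moderate" on the scale above.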

Crowdsourcing Platform Comparison

| Platform | Expertise | Scale | Quality control | Cost / item | Best suited for |
| --- | --- | --- | --- | --- | --- |
| Amazon MTurk | | Very high | Gold tasks | $0.01–0.05 | Simple classification |
| Scale AI | | | Expert review | $0.5–5.0 | Complex annotation |
| Labelbox | | | Built-in review | $0.1–1.0 | Multimodal data |
| Surge AI | Medium-high | | Regional screening | $0.2–2.0 | NLP tasks |
| In-house team | Very high | | Direct oversight | $5–50 | Sensitive / core data |
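Per-item prices compound quickly once each item gets multiple independent labels. A back-of-the-envelope cost sketch, assuming a hypothetical campaign of 10,000 items labeled 3 times each and mid-range prices picked from the table above:

```python
def campaign_cost(n_items: int, redundancy: int, cost_per_label: float) -> float:
    """Total cost of labeling n_items, each by `redundancy` independent annotators."""
    return n_items * redundancy * cost_per_label

# Illustrative mid-range per-label prices; real quotes vary by task.
platforms = {
    "Amazon MTurk": 0.03,
    "Scale AI": 2.0,
    "Surge AI": 1.0,
}
for name, price in platforms.items():
    print(f"{name}: ${campaign_cost(10_000, 3, price):,.0f}")
```

The spread is three orders of magnitude, which is why many teams label the bulk cheaply and reserve expensive expert labels for a calibration subset.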

Annotation Quality Control Workflow

```mermaid
graph LR
    A[Distribute tasks] --> B[3 independent annotations]
    B --> C{IAA ≥ 0.7?}
    C -->|yes| D[Majority vote]
    C -->|no| E[Expert arbitration]
    D --> F[Write to dataset]
    E --> F
    style C fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style E fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    style F fill:#c8e6c9,stroke:#43a047,stroke-width:2px
```
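The decision node in this workflow can be sketched as a small aggregation function: accept the majority label when agreement is high enough, otherwise flag the item for arbitration. The `threshold` value and the tie-handling rule are illustrative choices, not a standard:

```python
from collections import Counter

def aggregate(labels: list[int], iaa: float, threshold: float = 0.7):
    """Resolve one item's labels: (winning_label, status).
    Low agreement or a tied vote escalates to expert arbitration."""
    if iaa < threshold:
        return None, "needs_arbitration"
    winner, count = Counter(labels).most_common(1)[0]
    # A strict majority is required; ties also go to arbitration.
    if count <= len(labels) // 2:
        return None, "needs_arbitration"
    return winner, "accepted"

print(aggregate([1, 1, 0], iaa=0.82))   # (1, 'accepted')
print(aggregate([1, 1, 0], iaa=0.55))   # (None, 'needs_arbitration')
```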

Chapter Summary

Human evaluation calibrates everything else: well-designed annotation tasks, agreement statistics such as Cohen's and Fleiss' kappa, a deliberate choice of annotation platform, and a multi-annotator quality-control loop are what make human labels trustworthy enough to anchor automated metrics.

Next chapter: Multi-Model Comparative Evaluation