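Evaluation Dimension Design Framework
A good evaluation framework is not a handful of arbitrarily chosen metrics. It starts from the business goal and systematically breaks down every dimension that needs to be measured.
Approach to Dimension Design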
graph TB
    A[Business Goal] --> B{Break Down into Dimensions}
    B --> C[Output Quality]
    B --> D[Reliability]
    B --> E[Safety and Compliance]
    B --> F[User Experience]
    B --> G[Efficiency and Cost]
    C --> C1[Accuracy/Relevance/Consistency]
    D --> D1[Stability/Error Rate/Hallucination Rate]
    E --> E1[Harmful Content/Data Privacy/Compliance]
    F --> F1[Latency/Fluency/Readability]
    G --> G1[Token Cost/Throughput/Cache Hit Rate]
    style A fill:#ede7f6,stroke:#5e35b1,stroke-width:2px
    style C fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style E fill:#ffcdd2,stroke:#c62828,stroke-width:2px
Modeling Dimension Weights
from dataclasses import dataclass, field
from enum import Enum


class DimensionType(Enum):
    QUALITY = "quality"
    RELIABILITY = "reliability"
    SAFETY = "safety"
    EFFICIENCY = "efficiency"
    UX = "user_experience"


@dataclass
class EvalDimension:
    """Definition of a single evaluation dimension."""
    name: str
    dimension_type: DimensionType
    weight: float        # 0.0-1.0, this dimension's share of the total score
    metrics: list[str]   # concrete metrics that feed this dimension
    threshold: float     # minimum passing score; anything lower fails outright

    def weighted_score(self, raw_score: float) -> float:
        """Return the raw score scaled by this dimension's weight."""
        return raw_score * self.weight


@dataclass
class EvalFramework:
    """An evaluation framework: a named set of weighted dimensions."""
    name: str
    use_case: str
    dimensions: list[EvalDimension] = field(default_factory=list)

    def add_dimension(self, dim: EvalDimension) -> None:
        # Dimensions are added one at a time, so the running weight total is
        # allowed to be below 1.0 here; check is_balanced once all are added.
        self.dimensions.append(dim)

    @property
    def total_weight(self) -> float:
        return sum(d.weight for d in self.dimensions)

    @property
    def is_balanced(self) -> bool:
        # True when the weights sum to 1.0 within a tolerance of 0.01.
        return abs(self.total_weight - 1.0) < 0.01

    def evaluate(self, scores: dict[str, float]) -> dict:
        """Compute the overall result from per-dimension raw scores."""
        total = 0.0
        dimension_results = {}
        failed_thresholds = []
        for dim in self.dimensions:
            # A dimension with no reported score counts as 0.0.
            raw = scores.get(dim.name, 0.0)
            if raw < dim.threshold:
                failed_thresholds.append(
                    f"{dim.name} ({raw:.2f} < {dim.threshold:.2f})"
                )
            weighted = dim.weighted_score(raw)
            dimension_results[dim.name] = {
                "raw_score": raw,
                "weight": dim.weight,
                "weighted_score": weighted,
                "passed_threshold": raw >= dim.threshold,
            }
            total += weighted
        return {
            "total_score": round(total, 4),
            # A single failed threshold fails the whole evaluation.
            "passed": len(failed_thresholds) == 0,
            "failed_thresholds": failed_thresholds,
            "dimension_results": dimension_results,
        }


# Example: an evaluation framework for a customer service bot
customer_service_framework = EvalFramework(
    name="Customer service bot evaluation",
    use_case="B2C customer service conversations",
)
customer_service_framework.add_dimension(EvalDimension(
    "answer_accuracy", DimensionType.QUALITY, weight=0.35,
    metrics=["factual_correctness", "relevance", "completeness"],
    threshold=0.70,
))
customer_service_framework.add_dimension(EvalDimension(
    "safety", DimensionType.SAFETY, weight=0.25,
    metrics=["harmful_content_rate", "pii_leakage_rate"],
    threshold=0.95,  # safety gets a stricter threshold
))
customer_service_framework.add_dimension(EvalDimension(
    "tone", DimensionType.UX, weight=0.20,
    metrics=["politeness", "clarity", "empathy"],
    threshold=0.60,
))
customer_service_framework.add_dimension(EvalDimension(
    "efficiency", DimensionType.EFFICIENCY, weight=0.20,
    metrics=["token_cost", "latency_p95"],
    threshold=0.50,
))
print(f"Weight sum: {customer_service_framework.total_weight:.2f}")
print(f"Balanced: {customer_service_framework.is_balanced}")

# Evaluate one model
result = customer_service_framework.evaluate({
    "answer_accuracy": 0.82,
    "safety": 0.97,
    "tone": 0.75,
    "efficiency": 0.68,
})
print(f"Total score: {result['total_score']:.4f}")
print(f"Passed: {result['passed']}")
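A short worked example may help show why the thresholds act as a veto and not as one more term in the weighted sum. The sketch below is self-contained and does not reuse the classes above. `gate_and_score` is a hypothetical helper and both score sets are made up for illustration; only the weights and thresholds come from the customer service example.

```python
# Weights and thresholds copied from the customer service example.
WEIGHTS = {"answer_accuracy": 0.35, "safety": 0.25, "tone": 0.20, "efficiency": 0.20}
THRESHOLDS = {"answer_accuracy": 0.70, "safety": 0.95, "tone": 0.60, "efficiency": 0.50}


def gate_and_score(scores: dict[str, float]) -> tuple[float, list[str]]:
    """Hypothetical helper: return (weighted total, dimensions below threshold)."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    failed = [name for name in WEIGHTS if scores[name] < THRESHOLDS[name]]
    return round(total, 4), failed


# Made-up scores: strong everywhere except safety, which sits below 0.95.
strong_but_unsafe = {
    "answer_accuracy": 0.95, "safety": 0.90, "tone": 0.95, "efficiency": 0.95,
}
# Made-up scores: mediocre overall, but every threshold is met.
modest_but_safe = {
    "answer_accuracy": 0.72, "safety": 0.96, "tone": 0.62, "efficiency": 0.55,
}

total_a, failed_a = gate_and_score(strong_but_unsafe)
total_b, failed_b = gate_and_score(modest_but_safe)
print(f"strong_but_unsafe: total={total_a}, failed={failed_a}")
print(f"modest_but_safe:   total={total_b}, failed={failed_b}")
```

The first candidate has the higher weighted total yet fails because of safety, while the second passes with a lower total. Under this design the total score is only meaningful for ranking candidates that have already cleared every threshold.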
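Recommended Dimension Weights by Scenario
| Scenario | Accuracy | Safety | User experience | Efficiency |
|---|---|---|---|---|
| Customer service bot | 35% | 25% | 20% | 20% |
| Code generation | 45% | 15% | 15% | 25% |
| Content creation | 30% | 30% | 30% | 10% |
| Educational Q&A | 40% | 30% | 20% | 10% |
| Medical consultation | 35% | 45% | 15% | 5% |
| Financial analysis | 45% | 35% | 10% | 10% |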
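One way to keep a table like this usable in code is to store it as data and verify each row before turning it into weights. The following is a minimal sketch under that assumption: the scenario keys, `COLUMNS`, and `weights_for` are hypothetical names introduced here, and the percentages are copied from the table.

```python
# Percentages copied from the table; column order matches COLUMNS.
COLUMNS = ("accuracy", "safety", "user_experience", "efficiency")
RECOMMENDED_WEIGHTS_PCT = {
    "customer_service_bot": (35, 25, 20, 20),
    "code_generation": (45, 15, 15, 25),
    "content_creation": (30, 30, 30, 10),
    "education_qa": (40, 30, 20, 10),
    "medical_consultation": (35, 45, 15, 5),
    "financial_analysis": (45, 35, 10, 10),
}


def weights_for(scenario: str) -> dict[str, float]:
    """Hypothetical helper: return 0-1 weights for a scenario, checking the row sums to 100%."""
    percentages = RECOMMENDED_WEIGHTS_PCT[scenario]
    if sum(percentages) != 100:
        raise ValueError(f"{scenario}: weights sum to {sum(percentages)}%, expected 100%")
    return {column: pct / 100 for column, pct in zip(COLUMNS, percentages)}


for scenario in RECOMMENDED_WEIGHTS_PCT:
    print(scenario, weights_for(scenario))
```

Storing integer percentages and checking the sum before dividing avoids floating-point tolerance questions when validating a row.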
Dimension Design Principles
graph LR
    A[Good Evaluation Dimensions] --> B[Business-aligned]
    A --> C[Measurable]
    A --> D[Non-overlapping]
    A --> E[Comprehensive]
    style A fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style B fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style C fill:#fff9c4,stroke:#f9a825,stroke-width:2px
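Two of these principles, non-overlapping and comprehensive, lend themselves to a mechanical check. The sketch below is one possible way to do it, not a definitive method: `overlapping_metrics`, `missing_types`, and the `REQUIRED_TYPES` set are assumptions introduced for illustration, while the metric lists are taken from the customer service example.

```python
from collections import defaultdict

# Dimension names, types, and metrics from the customer service example.
DIMENSIONS = {
    "answer_accuracy": {
        "type": "quality",
        "metrics": ["factual_correctness", "relevance", "completeness"],
    },
    "safety": {
        "type": "safety",
        "metrics": ["harmful_content_rate", "pii_leakage_rate"],
    },
    "tone": {
        "type": "user_experience",
        "metrics": ["politeness", "clarity", "empathy"],
    },
    "efficiency": {
        "type": "efficiency",
        "metrics": ["token_cost", "latency_p95"],
    },
}
# Assumption: the dimension types this use case is expected to cover.
REQUIRED_TYPES = {"quality", "safety", "user_experience", "efficiency"}


def overlapping_metrics(dimensions: dict) -> dict[str, list[str]]:
    """Return metrics that are claimed by more than one dimension."""
    owners = defaultdict(list)
    for name, spec in dimensions.items():
        for metric in spec["metrics"]:
            owners[metric].append(name)
    return {metric: names for metric, names in owners.items() if len(names) > 1}


def missing_types(dimensions: dict, required: set[str]) -> set[str]:
    """Return required dimension types that no dimension covers."""
    return required - {spec["type"] for spec in dimensions.values()}


print("Overlapping metrics:", overlapping_metrics(DIMENSIONS))
print("Missing types:", missing_types(DIMENSIONS, REQUIRED_TYPES))
```

A check like this only catches metrics that share a name. Two differently named metrics can still measure the same thing, so it complements a manual review of the dimension list.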
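Chapter Summary
- Derive the dimensions from the business goal: weights differ sharply from one scenario to another.
- Set a threshold for every dimension: a score below it fails the evaluation outright.
- Keep the weights summing to 1.0 so that total scores stay comparable.
- Set a high threshold for the safety dimension, and give safety a weight of at least 30% in medical and financial scenarios.
- Review the dimension design regularly and update the framework when business goals change.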
Next chapter: Building Evaluation Datasets