Evaluation Dimension Design Framework

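A good evaluation framework is not a handful of metrics picked at random. It starts from the business goal and systematically breaks that goal down into every dimension that needs to be measured.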

Approach to Dimension Design

graph TB
    A[Business goal] --> B{Break down into dimensions}
    B --> C[Output Quality]
    B --> D[Reliability]
    B --> E[Safety and Compliance]
    B --> F[User Experience]
    B --> G[Efficiency and Cost]
    C --> C1[Accuracy / Relevance / Consistency]
    D --> D1[Stability / Error rate / Hallucination rate]
    E --> E1[Harmful content / Data privacy / Compliance]
    F --> F1[Latency / Fluency / Readability]
    G --> G1[Token cost / Throughput / Cache hit rate]
    style A fill:#ede7f6,stroke:#5e35b1,stroke-width:2px
    style C fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style E fill:#ffcdd2,stroke:#c62828,stroke-width:2px
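The breakdown in the diagram can also be written down as plain data, which makes it easy to confirm that every dimension ends in concrete sub-metrics. The following is a minimal sketch: the names `DIMENSION_TREE` and `leaf_metrics` are hypothetical, and the metric identifiers are simply the diagram's labels turned into snake_case.

```python
# Hypothetical encoding of the diagram: top-level dimension -> sub-metrics.
DIMENSION_TREE: dict[str, list[str]] = {
    "output_quality": ["accuracy", "relevance", "consistency"],
    "reliability": ["stability", "error_rate", "hallucination_rate"],
    "safety": ["harmful_content", "data_privacy", "compliance"],
    "user_experience": ["latency", "fluency", "readability"],
    "efficiency": ["token_cost", "throughput", "cache_hit_rate"],
}


def leaf_metrics(tree: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten the tree into (dimension, metric) pairs."""
    return [(dim, metric) for dim, metrics in tree.items() for metric in metrics]


for dim, metric in leaf_metrics(DIMENSION_TREE):
    print(f"{dim}: {metric}")
```

Keeping the tree as data rather than prose means later steps, such as assigning weights, can iterate over the same structure.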

Modeling Dimension Weights

from dataclasses import dataclass, field
from enum import Enum


class DimensionType(Enum):
    QUALITY = "quality"
    RELIABILITY = "reliability"
    SAFETY = "safety"
    EFFICIENCY = "efficiency"
    UX = "user_experience"


@dataclass
class EvalDimension:
    """Definition of one evaluation dimension."""
    name: str
    dimension_type: DimensionType
    weight: float          # 0.0 to 1.0, this dimension's share of the total score
    metrics: list[str]     # concrete metrics that feed this dimension
    threshold: float       # minimum passing score (anything lower fails outright)

    def weighted_score(self, raw_score: float) -> float:
        """Return the raw score scaled by this dimension's weight."""
        return raw_score * self.weight


@dataclass
class EvalFramework:
    """An evaluation framework: a named set of weighted dimensions."""
    name: str
    use_case: str
    dimensions: list[EvalDimension] = field(default_factory=list)

    def add_dimension(self, dim: EvalDimension) -> None:
        # Dimensions are added one at a time, so the weights are not required
        # to sum to 1.0 here. Check is_balanced once all of them are added.
        self.dimensions.append(dim)

    @property
    def total_weight(self) -> float:
        return sum(d.weight for d in self.dimensions)

    @property
    def is_balanced(self) -> bool:
        # True when the weights sum to 1.0 within a tolerance of 0.01.
        return abs(self.total_weight - 1.0) < 0.01

    def evaluate(self, scores: dict[str, float]) -> dict:
        """Compute the overall result from per-dimension raw scores."""
        total = 0.0
        dimension_results = {}
        failed_thresholds = []
        for dim in self.dimensions:
            # A dimension with no reported score counts as 0.0.
            raw = scores.get(dim.name, 0.0)
            if raw < dim.threshold:
                failed_thresholds.append(
                    f"{dim.name} ({raw:.2f} < {dim.threshold:.2f})"
                )
            weighted = dim.weighted_score(raw)
            dimension_results[dim.name] = {
                "raw_score": raw,
                "weight": dim.weight,
                "weighted_score": weighted,
                "passed_threshold": raw >= dim.threshold,
            }
            total += weighted
        return {
            "total_score": round(total, 4),
            # Any single failed threshold fails the run, whatever the total.
            "passed": len(failed_thresholds) == 0,
            "failed_thresholds": failed_thresholds,
            "dimension_results": dimension_results,
        }
# Example: an evaluation framework for a customer service bot
customer_service_framework = EvalFramework(
    name="Customer service bot evaluation",
    use_case="B2C customer service conversations",
)
customer_service_framework.add_dimension(EvalDimension(
    "answer_accuracy", DimensionType.QUALITY, weight=0.35,
    metrics=["factual_correctness", "relevance", "completeness"],
    threshold=0.70,
))
customer_service_framework.add_dimension(EvalDimension(
    "safety", DimensionType.SAFETY, weight=0.25,
    metrics=["harmful_content_rate", "pii_leakage_rate"],
    threshold=0.95,  # safety gets a stricter threshold
))
customer_service_framework.add_dimension(EvalDimension(
    "tone", DimensionType.UX, weight=0.20,
    metrics=["politeness", "clarity", "empathy"],
    threshold=0.60,
))
customer_service_framework.add_dimension(EvalDimension(
    "efficiency", DimensionType.EFFICIENCY, weight=0.20,
    metrics=["token_cost", "latency_p95"],
    threshold=0.50,
))
print(f"Total weight: {customer_service_framework.total_weight:.2f}")
print(f"Balanced: {customer_service_framework.is_balanced}")

# Evaluate one model
result = customer_service_framework.evaluate({
    "answer_accuracy": 0.82,
    "safety": 0.97,
    "tone": 0.75,
    "efficiency": 0.68,
})
print(f"Total score: {result['total_score']:.4f}")
print(f"Passed: {result['passed']}")
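To make the scoring rule concrete, the sketch below recomputes the customer service example by hand, independently of the classes above. It uses the same weights, thresholds, and scores; the names `weights`, `thresholds`, and `score` are hypothetical helpers introduced only for this illustration. It also shows why the threshold acts as a veto: lowering the safety score from 0.97 to 0.90 moves the total only slightly, yet the run fails.

```python
# Weights and thresholds copied from the customer service example.
weights = {"answer_accuracy": 0.35, "safety": 0.25, "tone": 0.20, "efficiency": 0.20}
thresholds = {"answer_accuracy": 0.70, "safety": 0.95, "tone": 0.60, "efficiency": 0.50}


def score(scores: dict[str, float]) -> tuple[float, list[str]]:
    """Return (weighted total, names of dimensions below their threshold)."""
    total = sum(weights[name] * scores[name] for name in weights)
    failed = [name for name in weights if scores[name] < thresholds[name]]
    return round(total, 4), failed


good = {"answer_accuracy": 0.82, "safety": 0.97, "tone": 0.75, "efficiency": 0.68}
# 0.35*0.82 + 0.25*0.97 + 0.20*0.75 + 0.20*0.68 = 0.8155
print(score(good))    # (0.8155, [])

risky = dict(good, safety=0.90)
# The total drops by only 0.25 * 0.07 = 0.0175, but safety is below 0.95.
print(score(risky))   # (0.798, ['safety'])
```

The design choice here is that weights express trade-offs between dimensions, while thresholds express requirements that no amount of strength elsewhere can offset.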

Recommended Dimension Weights by Scenario

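| Scenario | Accuracy | Safety | User experience | Efficiency |
|---|---|---|---|---|
| Customer service bot | 35% | 25% | 20% | 20% |
| Code generation | 45% | 15% | 15% | 25% |
| Content creation | 30% | 30% | 30% | 10% |
| Educational Q&A | 40% | 30% | 20% | 10% |
| Medical consultation | 35% | 45% | 15% | 5% |
| Financial analysis | 45% | 35% | 10% | 10% |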
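The recommended weights can be kept as named presets so that a framework for a new project starts from the closest scenario. The sketch below is one possible encoding: `SCENARIO_WEIGHTS`, `unbalanced_presets`, and `dominant_dimension` are hypothetical names, and the numbers are the table's percentages written as fractions.

```python
# Presets copied from the table, expressed as fractions of 1.0.
SCENARIO_WEIGHTS: dict[str, dict[str, float]] = {
    "customer_service":     {"accuracy": 0.35, "safety": 0.25, "ux": 0.20, "efficiency": 0.20},
    "code_generation":      {"accuracy": 0.45, "safety": 0.15, "ux": 0.15, "efficiency": 0.25},
    "content_creation":     {"accuracy": 0.30, "safety": 0.30, "ux": 0.30, "efficiency": 0.10},
    "educational_qa":       {"accuracy": 0.40, "safety": 0.30, "ux": 0.20, "efficiency": 0.10},
    "medical_consultation": {"accuracy": 0.35, "safety": 0.45, "ux": 0.15, "efficiency": 0.05},
    "financial_analysis":   {"accuracy": 0.45, "safety": 0.35, "ux": 0.10, "efficiency": 0.10},
}


def unbalanced_presets(presets: dict[str, dict[str, float]], tol: float = 0.01) -> list[str]:
    """Return the scenarios whose weights do not sum to 1.0 within tol."""
    return [name for name, w in presets.items() if abs(sum(w.values()) - 1.0) >= tol]


def dominant_dimension(scenario: str) -> str:
    """Return the highest-weighted dimension (first one wins on a tie)."""
    w = SCENARIO_WEIGHTS[scenario]
    return max(w, key=w.get)


print(unbalanced_presets(SCENARIO_WEIGHTS))
print(dominant_dimension("medical_consultation"))
```

Running the balance check whenever a preset is edited catches rows that no longer add up to 100%.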

Dimension Design Principles

graph LR
    A[Good evaluation dimensions] --> B[Business-aligned]
    A --> C[Measurable]
    A --> D[Non-overlapping]
    A --> E[Comprehensive]
    style A fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style B fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style C fill:#fff9c4,stroke:#f9a825,stroke-width:2px
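Three of the four principles lend themselves to a mechanical check on a draft set of dimensions; whether a dimension is business-aligned still needs human judgment. The sketch below is an assumption about how such a check might look, not part of the framework code in this chapter: `check_dimensions` and the `draft` data are hypothetical. It treats "measurable" as "has at least one metric", "non-overlapping" as "no metric is listed under two dimensions", and "comprehensive" as "every required dimension is present".

```python
from collections import Counter


def check_dimensions(dimensions: dict[str, list[str]], required: set[str]) -> list[str]:
    """Return a list of human-readable problems found in a draft design."""
    problems = []
    # Measurable: every dimension needs at least one concrete metric.
    for name, metrics in dimensions.items():
        if not metrics:
            problems.append(f"{name}: no metrics (not measurable)")
    # Non-overlapping: a metric should belong to exactly one dimension.
    counts = Counter(m for metrics in dimensions.values() for m in metrics)
    for metric, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"{metric}: used by {n} dimensions (overlap)")
    # Comprehensive: every required dimension must be present.
    for name in sorted(required - set(dimensions)):
        problems.append(f"{name}: required dimension missing (coverage gap)")
    return problems


draft = {
    "answer_accuracy": ["factual_correctness", "relevance"],
    "tone": ["politeness", "relevance"],  # "relevance" is counted twice
    "efficiency": [],                     # nothing to measure yet
}
for problem in check_dimensions(
    draft, required={"answer_accuracy", "safety", "tone", "efficiency"}
):
    print(problem)
```

For this draft the check reports one problem of each kind: `efficiency` has no metrics, `relevance` is double-counted, and `safety` is missing.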

Chapter Summary

Next chapter: Building the Evaluation Dataset