A Deep Dive into Evals Frameworks
Standardized evaluation frameworks let teams spend their effort on evaluation design instead of reinventing infrastructure. This chapter compares the mainstream frameworks and shows how to integrate them into a real engineering workflow.
Overview of the Mainstream Frameworks
```mermaid
graph LR
    A[Evals ecosystem] --> B[OpenAI Evals]
    A --> C[HELM]
    A --> D[DeepEval]
    A --> E[Ragas]
    A --> F[LangSmith]
    B --> B1["YAML-driven<br/>built-in graders"]
    C --> C1["Academic benchmarks<br/>42+ scenarios"]
    D --> D1["Unit-test style<br/>Python-native"]
    E --> E1["RAG-focused<br/>4 core metrics"]
    F --> F1["LangChain ecosystem<br/>visual tracing"]
    style A fill:#ede7f6,stroke:#5e35b1,stroke-width:2px
    style D fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style E fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
```
Detailed Framework Comparison
| Dimension | OpenAI Evals | HELM | DeepEval | Ragas |
|---|---|---|---|---|
| Positioning | General-purpose evaluation | Academic benchmarking | Application testing | RAG-specific |
| Interface | CLI / Python | CLI | Python pytest | Python |
| Customization difficulty | Medium | High | Low | Medium |
| RAG support | Weak | Weak | Strong | Very strong |
| Visualization | Basic | Moderate | Basic | Moderate |
| Community activity | High | Medium | High | High |
| License | MIT | Apache 2.0 | Apache 2.0 | MIT |
Hands-On Integration with DeepEval
```python
# pip install deepeval
from deepeval import assert_test
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRecallMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase
import pytest


class LLMEvalSuite:
    """An evaluation-suite wrapper built on DeepEval."""

    def __init__(self, model_name: str = "gpt-4o"):
        self.model_name = model_name
        # Each metric calls an LLM judge; the threshold gates is_successful().
        self.metrics = {
            "relevancy": AnswerRelevancyMetric(threshold=0.7, model=model_name),
            "faithfulness": FaithfulnessMetric(threshold=0.7, model=model_name),
            "recall": ContextualRecallMetric(threshold=0.7, model=model_name),
            # Hallucination is lower-is-better: it passes when score <= threshold.
            "hallucination": HallucinationMetric(threshold=0.5, model=model_name),
        }

    def create_rag_test_case(
        self,
        query: str,
        actual_output: str,
        expected_output: str,
        retrieval_context: list[str],
    ) -> LLMTestCase:
        """Create a RAG test case."""
        return LLMTestCase(
            input=query,
            actual_output=actual_output,
            expected_output=expected_output,
            retrieval_context=retrieval_context,
            # HallucinationMetric reads `context`, so mirror the retrieval context there.
            context=retrieval_context,
        )

    def run_batch(self, test_cases: list[LLMTestCase]) -> dict:
        """Run every metric against every test case."""
        results = []
        for test_case in test_cases:
            case_results = {}
            for metric_name, metric in self.metrics.items():
                try:
                    metric.measure(test_case)
                    case_results[metric_name] = {
                        "score": metric.score,
                        "passed": metric.is_successful(),
                        "reason": metric.reason,
                    }
                except Exception as e:
                    # A failed judge call should not abort the whole batch.
                    case_results[metric_name] = {"error": str(e)}
            results.append({
                "input": test_case.input,
                "metrics": case_results,
            })
        return self._aggregate_results(results)

    def _aggregate_results(self, results: list[dict]) -> dict:
        """Aggregate per-case results into summary statistics."""
        from collections import defaultdict

        metric_scores: dict[str, list[float]] = defaultdict(list)
        metric_passes: dict[str, list[bool]] = defaultdict(list)
        for r in results:
            for metric_name, metric_result in r["metrics"].items():
                if "score" in metric_result:
                    metric_scores[metric_name].append(metric_result["score"])
                    metric_passes[metric_name].append(metric_result["passed"])
        return {
            "total_cases": len(results),
            "avg_scores": {
                k: round(sum(v) / len(v), 3)
                for k, v in metric_scores.items() if v
            },
            # Pass rates respect each metric's own threshold semantics
            # (captured via is_successful() at measure time), so lower-is-better
            # metrics such as hallucination are counted correctly.
            "pass_rates": {
                k: f"{sum(v) / len(v):.1%}"
                for k, v in metric_passes.items() if v
            },
        }
```
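To make the suite concrete, here is a minimal usage sketch. The query, answer, and context strings are hypothetical placeholders, and running it requires API access for the judge model configured above.

```python
# Minimal usage sketch (hypothetical sample data; needs judge-model API access).
suite = LLMEvalSuite(model_name="gpt-4o")

cases = [
    suite.create_rag_test_case(
        query="What is Python's GIL?",
        actual_output="The GIL is CPython's global lock that serializes bytecode execution.",
        expected_output="The Global Interpreter Lock lets only one thread run Python bytecode at a time in CPython.",
        retrieval_context=[
            "CPython's GIL is a mutex that ensures only one thread executes Python code at a time."
        ],
    ),
]

summary = suite.run_batch(cases)
print(summary)  # {"total_cases": 1, "avg_scores": {...}, "pass_rates": {...}}
```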
```python
# Ragas integration example
def evaluate_rag_with_ragas(
    questions: list[str],
    answers: list[str],
    contexts: list[list[str]],
    ground_truths: list[str],
) -> dict:
    """
    Evaluate the four core RAG metrics with Ragas:
    - Faithfulness: is the answer grounded in the retrieved context (no hallucination)?
    - Answer Relevancy: does the answer actually address the question?
    - Context Precision: precision of the retrieved context
    - Context Recall: recall of the retrieved context
    """
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    )

    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truths,
    }
    dataset = Dataset.from_dict(data)
    result = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    # Average each metric column; exclude non-numeric columns (question, answer, ...).
    return result.to_pandas().mean(numeric_only=True).to_dict()
```
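A minimal call sketch follows. The four lists are aligned by index, the sample strings are hypothetical, and Ragas needs an LLM (and embeddings) provider configured, typically via an OpenAI API key.

```python
# Hypothetical single-example run; all four lists must be index-aligned.
scores = evaluate_rag_with_ragas(
    questions=["What is Python's GIL?"],
    answers=["The GIL is CPython's lock that lets only one thread run Python bytecode at a time."],
    contexts=[[
        "CPython's GIL is a mutex that ensures only one thread executes Python code at a time."
    ]],
    ground_truths=["The Global Interpreter Lock serializes Python bytecode execution in CPython."],
)
print(scores)  # {"faithfulness": ..., "answer_relevancy": ..., ...}
```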
```python
# pytest integration: the quality gate in CI
@pytest.mark.parametrize("test_case", [
    {
        "query": "What is Python's GIL?",
        "output": "The GIL (Global Interpreter Lock) is the mechanism in CPython that "
                  "prevents multiple threads from executing Python bytecode at the same time.",
        "context": [
            "CPython's GIL is a mutex that ensures only one thread executes Python code at a time."
        ],
        "min_relevancy": 0.7,
    }
])
def test_answer_quality(test_case):
    """Quality-gate test that runs automatically in CI."""
    metric = AnswerRelevancyMetric(threshold=test_case["min_relevancy"])
    case = LLMTestCase(
        input=test_case["query"],
        actual_output=test_case["output"],
        retrieval_context=test_case["context"],
    )
    # assert_test raises if any metric misses its threshold, failing the build.
    assert_test(case, [metric])
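```

In CI, this file runs like any other pytest module, so the gate is simply a `pytest` step in the pipeline; DeepEval also provides its own `deepeval test run` wrapper around pytest with caching and richer reporting. Either way, a metric score below its threshold makes `assert_test` raise and the build fail.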
Framework Selection Guide
```mermaid
graph TB
    A{"Your scenario?"} --> B[RAG system]
    A --> C[General chat]
    A --> D[Code generation]
    A --> E[Academic research]
    B --> B1["First choice: Ragas<br/>alternative: DeepEval"]
    C --> C1["First choice: DeepEval<br/>alternative: OpenAI Evals"]
    D --> D1["First choice: HumanEval<br/>alternative: SWE-bench"]
    E --> E1["First choice: HELM<br/>alternative: BIG-bench"]
    style B1 fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style C1 fill:#c8e6c9,stroke:#43a047,stroke-width:2px
```
Chapter Summary
- Ragas is the first choice for RAG systems: its four core metrics work out of the box
- DeepEval suits Python-centric teams: the pytest style keeps the adoption cost low
- OpenAI Evals fits large-scale batch evaluation: YAML-based configuration is easy to manage
- HELM serves cross-model academic benchmarking: 42+ standardized scenarios
- Don't reinvent the wheel: start from an existing framework, then extend it where your use case demands
Next chapter: Building an Enterprise-Grade Evaluation System