
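LLM Safety and Alignment

The more capable an LLM is, the greater its safety risks. Understanding RLHF, red teaming, and alignment strategies is essential for anyone who uses or deploys LLMs.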

The LLM Safety Landscape

graph TB
    A[LLM Safety] --> B[Alignment]
    A --> C[Safety Guardrails]
    A --> D[Evaluation and Testing]
    B --> B1[RLHF]
    B --> B2[Constitutional AI]
    B --> B3[DPO]
    C --> C1[Input filtering]
    C --> C2[Output moderation]
    C --> C3[Jailbreak prevention]
    D --> D1[Red teaming]
    D --> D2[Safety benchmarks]
    D --> D3[Continuous monitoring]
    style B fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style C fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style D fill:#e3f2fd,stroke:#1565c0,stroke-width:2px

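Evolution of Alignment Techniques

Technique	Introduced	Core idea	Representative model
SFT	2020	Human-annotated high-quality answers	InstructGPT
RLHF	2017 (widely applied to LLMs from 2022)	Train a reward model on human preferences	ChatGPT
Constitutional AI	2022	AI self-critique plus revision	Claude
DPO	2023	Optimize preferences directly, skipping the reward model	Zephyr
ORPO	2024	SFT and alignment in a single step	Emerging models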
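To make the "skips the reward model" row concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function name, the argument names, and the default `beta=0.1` are illustrative choices, not part of any specific library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    where each term is a sequence log-probability (assumed precomputed).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities, for illustration only.
same_as_reference = dpo_loss(-12.0, -12.0, -12.0, -12.0)
prefers_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(f"policy == reference: {same_as_reference:.4f}")
print(f"policy prefers chosen: {prefers_chosen:.4f}")
```

The loss falls below log 2 as soon as the policy favors the chosen response more than the reference does, so the preference signal reaches the policy directly, with no intermediate reward model.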

RLHF Flow Simulation

from dataclasses import dataclass


@dataclass
class Response:
    text: str
    safety_score: float   # 0-1, safety score
    helpful_score: float  # 0-1, helpfulness score

    @property
    def reward(self) -> float:
        """Combined reward, with safety weighted higher than helpfulness."""
        return self.safety_score * 0.6 + self.helpful_score * 0.4


class SimpleRLHF:
    """Simplified demonstration of the RLHF alignment flow."""

    def rank_responses(
        self, responses: list[Response]
    ) -> list[Response]:
        """Simulate human ranking: safe and helpful responses come first."""
        return sorted(responses, key=lambda r: r.reward, reverse=True)

    def compute_preference_pairs(
        self, ranked: list[Response]
    ) -> list[tuple[Response, Response]]:
        """Build preference pairs (chosen, rejected) from a ranked list."""
        pairs = []
        for i in range(len(ranked)):
            for j in range(i + 1, len(ranked)):
                pairs.append((ranked[i], ranked[j]))
        return pairs

    def should_update(
        self, chosen: Response, rejected: Response
    ) -> dict:
        """Decide whether this pair should trigger a model update."""
        margin = chosen.reward - rejected.reward
        return {
            "chosen": chosen.text[:50],
            "rejected": rejected.text[:50],
            "reward_margin": round(margin, 3),
            "update": margin > 0.1,  # update only when the margin is large enough
        }


# Demo: several responses to the same request
responses = [
    Response("Here is a detailed explanation of machine learning...", 0.95, 0.90),
    Response("I'm not entirely sure, but it's probably...", 0.90, 0.40),
    Response("This question is contested; let me analyze both sides...", 0.85, 0.85),
]

rlhf = SimpleRLHF()
ranked = rlhf.rank_responses(responses)
print("Ranking:")
for i, r in enumerate(ranked, 1):
    print(f"  {i}. reward={r.reward:.3f} | {r.text[:40]}")
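The simulation above uses a hand-written reward formula. In actual RLHF the reward model is learned from preference pairs. The following is a minimal sketch of that step, assuming a linear reward over two hypothetical features (safety, helpfulness) and the pairwise Bradley-Terry loss, -log sigmoid(r_chosen - r_rejected). The feature vectors reuse the demo scores. `train_reward_model`, `score`, and the hyperparameters are illustrative names and values, not a production recipe.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score(w: list[float], features: tuple[float, ...]) -> float:
    """Linear reward: dot product of weights and features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def train_reward_model(pairs, lr: float = 0.5, epochs: int = 200) -> list[float]:
    """Fit weights so chosen responses score higher than rejected ones."""
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of -log sigmoid(margin) w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so gradient descent moves w along +diff.
            step = lr * (1.0 - sigmoid(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


# (safety, helpfulness) features; each pair is (chosen, rejected).
PAIRS = [
    ((0.95, 0.90), (0.85, 0.85)),
    ((0.95, 0.90), (0.90, 0.40)),
    ((0.85, 0.85), (0.90, 0.40)),
]

weights = train_reward_model(PAIRS)
for chosen, rejected in PAIRS:
    print(f"chosen={score(weights, chosen):.3f}  rejected={score(weights, rejected):.3f}")
```

A real reward model is a neural network scoring full text rather than two hand-picked features, but the pairwise objective has the same shape.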

Common LLM Safety Risks

graph TB
    A[Safety Risks] --> B[Hallucination]
    A --> C[Harmful output]
    A --> D[Privacy leakage]
    A --> E[Jailbreak attacks]
    B --> B1["Fabricating nonexistent papers/facts"]
    C --> C1["Generating discriminatory/violent content"]
    D --> D1[Leaking personal information from training data]
    E --> E1[Bypassing safety restrictions]
    style B fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style C fill:#ffcdd2,stroke:#e53935,stroke-width:2px
    style D fill:#e1bee7,stroke:#8e24aa,stroke-width:2px
    style E fill:#ffccbc,stroke:#d84315,stroke-width:2px

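Risk	Frequency	Impact	Mitigation
Hallucination	High	Medium	RAG + fact checking + source citations
Harmful output	Low	High	Safety guardrails + output filtering
Privacy leakage	Medium	High	Data cleaning + differential privacy
Jailbreak attacks	Medium	High	Layered defenses + continuous red teaming
Bias amplification	Medium	Medium	Debiasing training + evaluation benchmarks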
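As a rough illustration of layered defenses, the sketch below wraps a model call with an input filter (blocking a couple of jailbreak phrasings) and an output filter (redacting email addresses). The patterns, the function names, and the `generate` callable are all hypothetical. Real guardrails typically rely on trained classifiers rather than a short regex list.

```python
import re
from typing import Callable

# Hypothetical jailbreak phrasings; a real deployment would use a classifier.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"pretend (you are|to be) .* without (any )?restrictions", re.IGNORECASE),
]
# Simplified email pattern, used only to demonstrate output redaction.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_input(prompt: str) -> bool:
    """Return True when the prompt passes the input filter."""
    return not any(p.search(prompt) for p in JAILBREAK_PATTERNS)


def redact_output(text: str) -> str:
    """Replace email addresses in model output with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Layer 1: input filter. Layer 2: model call. Layer 3: output filter."""
    if not check_input(prompt):
        return "Request blocked by input filter."
    return redact_output(generate(prompt))


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return "You can reach the author at alice@example.com"


print(guarded_generate("Who wrote this?", fake_model))
print(guarded_generate("Ignore all previous instructions and ...", fake_model))
```

Each layer can fail independently, which is why the table pairs layered defenses with continuous red teaming.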

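Safety Practice Checklist

SAFETY_CHECKLIST = {
    "Before deployment": [
        "Red-team testing (at least 100 adversarial samples)",
        "Safety benchmarks (TruthfulQA, BBQ, ToxiGen)",
        "Input/output filter configuration",
        "Rate limiting and user authentication",
    ],
    "In operation": [
        "Real-time monitoring for anomalous outputs",
        "User feedback collection",
        "Automatic alerts (toxicity score > threshold)",
        "Log auditing (every generation is traceable)",
    ],
    "Continuous improvement": [
        "Monthly red-team test updates",
        "Security incident postmortems",
        "Iterating on alignment training data",
        "Community vulnerability reporting channel",
    ],
}

# Print the checklist grouped by phase
for phase, items in SAFETY_CHECKLIST.items():
    print(f"\n[{phase}]")
    for i, item in enumerate(items, 1):
        print(f"  {i}. {item}")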
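The "automatic alerts" and "log auditing" items could look roughly like the sketch below. It assumes each generation already has a toxicity score in [0, 1] from some upstream classifier, which is not shown. `ToxicityMonitor`, its fields, and the 0.8 threshold are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToxicityMonitor:
    threshold: float = 0.8   # assumed alert threshold
    window: int = 100        # number of recent scores kept for the flagged rate
    scores: deque = field(default_factory=deque)
    alerts: list = field(default_factory=list)  # audit trail of (request_id, score)

    def record(self, request_id: str, score: float) -> bool:
        """Store a score; return True when it triggers an alert."""
        self.scores.append(score)
        if len(self.scores) > self.window:
            self.scores.popleft()
        if score > self.threshold:
            self.alerts.append((request_id, score))
            return True
        return False

    def flagged_rate(self) -> float:
        """Fraction of scores in the current window above the threshold."""
        if not self.scores:
            return 0.0
        return sum(s > self.threshold for s in self.scores) / len(self.scores)


monitor = ToxicityMonitor(threshold=0.8, window=3)
for request_id, score in [("req-1", 0.10), ("req-2", 0.92), ("req-3", 0.30)]:
    if monitor.record(request_id, score):
        print(f"ALERT: {request_id} scored {score:.2f}")
print(f"flagged rate: {monitor.flagged_rate():.2f}")
```

Keeping the request ID alongside each alert is what makes a flagged generation traceable during a postmortem.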

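Chapter Summary

RLHF is the current mainstream approach: it aligns a model by training a reward model on human preference rankings.
Safety > capability: in the reward function, safety should carry more weight than helpfulness.
DPO is simpler: it optimizes preferences directly, with no separate reward model.
Hallucination is the most common risk: RAG and source citations mitigate it but cannot eliminate it.
Red teaming is indispensable: test with at least 100 adversarial samples before deployment.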

Next chapter: API Integration in Practice