Data Privacy and Compliance
LLM applications process large volumes of user data, so privacy and compliance are hard requirements in production. A single breach can trigger fines in the millions — GDPR alone allows penalties of up to 4% of global annual turnover.
Privacy Threat Model
```mermaid
graph TB
    A[Privacy threats] --> B[Input leakage]
    A --> C[Training data extraction]
    A --> D[Prompt injection]
    A --> E[Log leakage]
    B --> B1[User PII sent to third-party APIs]
    C --> C1[Model memorizes private training data]
    D --> D1[Malicious instructions extract the system prompt]
    E --> E1[Sensitive information in logs]
    style A fill:#ffebee,stroke:#c62828,stroke-width:3px
    style B fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style D fill:#fff3e0,stroke:#f57c00,stroke-width:2px
```
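Prompt injection (threat D above) is often screened first with cheap phrase heuristics before any heavier checks. A minimal sketch — the phrase list is illustrative, not a complete or authoritative blocklist:

```python
import re

# Illustrative phrases seen in common injection attempts (assumed list)
INJECTION_PHRASES = [
    r"ignore (all )?(previous|prior) instructions",
    r"reveal (the |your )?system prompt",
    r"you are now",
]

def looks_like_injection(text: str) -> bool:
    """Return True if the text matches a known injection phrase."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in INJECTION_PHRASES)
```

Heuristics like this are easy to evade, so they complement rather than replace model-side defenses.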
PII Detection and Masking
"""
PII 检测与脱敏框架
"""
import re
from dataclasses import dataclass
from enum import Enum
from typing import Any
class PIIType(Enum):
EMAIL = "email"
PHONE = "phone"
ID_CARD = "id_card"
CREDIT_CARD = "credit_card"
ADDRESS = "address"
NAME = "name"
BANK_ACCOUNT = "bank_account"
@dataclass
class PIIDetection:
"""PII 检测结果"""
pii_type: PIIType
original: str
start: int
end: int
confidence: float
class PIIDetector:
"""PII 检测器"""
PATTERNS: dict[PIIType, re.Pattern] = {
PIIType.EMAIL: re.compile(
r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
),
PIIType.PHONE: re.compile(
r'\b(?:1[3-9]\d{9}|\+?\d{1,3}[-.\s]?\d{3,4}[-.\s]?\d{4})\b'
),
PIIType.CREDIT_CARD: re.compile(
r'\b(?:\d{4}[-\s]?){3}\d{4}\b'
),
PIIType.ID_CARD: re.compile(
r'\b\d{17}[\dXx]\b'
),
}
def detect(self, text: str) -> list[PIIDetection]:
"""检测文本中的 PII"""
detections = []
for pii_type, pattern in self.PATTERNS.items():
for match in pattern.finditer(text):
detections.append(PIIDetection(
pii_type=pii_type,
original=match.group(),
start=match.start(),
end=match.end(),
confidence=0.9,
))
return detections
class PIISanitizer:
"""PII 脱敏器"""
MASK_MAP = {
PIIType.EMAIL: lambda s: s[:2] + "***@" + s.split("@")[-1],
PIIType.PHONE: lambda s: s[:3] + "****" + s[-4:],
PIIType.CREDIT_CARD: lambda s: "****-****-****-" + s[-4:],
PIIType.ID_CARD: lambda s: s[:6] + "********" + s[-4:],
}
def __init__(self):
self.detector = PIIDetector()
def sanitize(self, text: str) -> tuple[str, list[PIIDetection]]:
"""脱敏处理"""
detections = self.detector.detect(text)
result = text
# 从后往前替换,避免位移
for det in sorted(detections, key=lambda d: d.start, reverse=True):
mask_fn = self.MASK_MAP.get(det.pii_type, lambda s: "***")
masked = mask_fn(det.original)
result = result[:det.start] + masked + result[det.end:]
return result, detections
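The CREDIT_CARD regex above matches any 16-digit group, so it will also flag order numbers and tracking IDs. A Luhn checksum — the standard card-number check-digit algorithm, not part of the original detector — can confirm candidates before masking:

```python
def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn algorithm."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result > 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A detector could run this on each CREDIT_CARD match and lower the `confidence` (or drop the detection) when the checksum fails.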
Data-Flow Security Gateway
"""
数据安全网关 — LLM 请求前后过滤
"""
from dataclasses import dataclass, field
from enum import Enum
class SecurityAction(Enum):
ALLOW = "allow"
SANITIZE = "sanitize" # 脱敏后放行
BLOCK = "block" # 拒绝
@dataclass
class SecurityPolicy:
"""安全策略"""
allow_external_api: bool = True
pii_action: SecurityAction = SecurityAction.SANITIZE
max_input_length: int = 10000
blocked_patterns: list[str] = field(default_factory=list)
class DataSecurityGateway:
"""数据安全网关"""
def __init__(self, policy: SecurityPolicy):
self.policy = policy
self.sanitizer = PIISanitizer()
self._blocked_count = 0
def pre_process(self, user_input: str) -> tuple[str, SecurityAction]:
"""请求前处理"""
# 长度检查
if len(user_input) > self.policy.max_input_length:
self._blocked_count += 1
return "", SecurityAction.BLOCK
# 黑名单模式检查
for pattern in self.policy.blocked_patterns:
if pattern.lower() in user_input.lower():
self._blocked_count += 1
return "", SecurityAction.BLOCK
# PII 处理
if self.policy.pii_action == SecurityAction.SANITIZE:
sanitized, detections = self.sanitizer.sanitize(user_input)
if detections:
return sanitized, SecurityAction.SANITIZE
elif self.policy.pii_action == SecurityAction.BLOCK:
detections = self.sanitizer.detector.detect(user_input)
if detections:
self._blocked_count += 1
return "", SecurityAction.BLOCK
return user_input, SecurityAction.ALLOW
def post_process(self, model_output: str) -> str:
"""响应后处理 — 确保模型不泄露 PII"""
sanitized, _ = self.sanitizer.sanitize(model_output)
return sanitized
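The post-filtering step matters because the model can echo PII it saw earlier in the conversation even when the request was clean. The idea can be exercised on its own; a condensed, self-contained sketch that masks only email addresses (the full gateway reuses the sanitizer for all PII types):

```python
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def post_filter(model_output: str) -> str:
    """Mask any email address the model emitted."""
    def mask(m: re.Match) -> str:
        addr = m.group()
        return addr[:2] + "***@" + addr.split("@")[-1]
    return EMAIL_RE.sub(mask, model_output)
```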
Compliance Requirements Compared
| Regulation | Who it covers | Core requirements | Impact on LLM systems |
|---|---|---|---|
| GDPR | EU users | Data minimization, right to erasure, records of processing | EU personal data may not be sent to APIs outside the EU without transfer safeguards |
| CCPA | California residents | Right to know, right to delete, no sale of personal data | Must disclose how LLMs use personal data |
| PIPL (China) | Chinese citizens | Consent, data localization | Data must stay in-country; local deployment required |
| HIPAA | US healthcare | PHI protection | PHI may not be sent to general-purpose APIs without a business associate agreement |
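In code, these differences usually surface as routing: requests carrying regulated data are pinned to an in-region or self-hosted endpoint. A minimal sketch — the region names and endpoint URLs are hypothetical:

```python
from enum import Enum

class DataRegion(Enum):
    EU = "eu"
    CN = "cn"
    US_HEALTH = "us_health"
    OTHER = "other"

# Hypothetical endpoint map: regulated regions pin to compliant deployments
ENDPOINTS = {
    DataRegion.EU: "https://llm.eu-internal.example.com",    # EU-hosted
    DataRegion.CN: "https://llm.cn-internal.example.com",    # in-country
    DataRegion.US_HEALTH: "https://llm.baa.example.com",     # BAA-covered
    DataRegion.OTHER: "https://api.vendor.example.com",      # external allowed
}

def route(region: DataRegion) -> str:
    """Pick the only endpoint allowed for this data region."""
    return ENDPOINTS[region]
```

Keeping the mapping in one place makes the routing rule auditable, which the processing-records requirements above effectively demand.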
Compliance Architecture
```mermaid
graph TB
    A[Compliance architecture] --> B[Data localization]
    A --> C[Access control]
    A --> D[Data lifecycle]
    B --> B1[Privately deployed models]
    B --> B2[Data stays in-country]
    C --> C1[RBAC permissions]
    C --> C2[API key management]
    D --> D1[Automatic expiry and deletion]
    D --> D2[Audit log retention]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style B fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
```
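Automatic expiry and deletion in the lifecycle branch can be sketched as a TTL purge over stored records; the field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class StoredRecord:
    record_id: str
    created_at: float   # unix timestamp
    ttl_seconds: float  # retention window for this record class

def purge_expired(records: list[StoredRecord], now: float) -> list[StoredRecord]:
    """Keep only records whose retention window has not yet elapsed."""
    return [r for r in records if now - r.created_at < r.ttl_seconds]
```

A production system would run this as a scheduled job and write each deletion to the audit log, satisfying both lifecycle requirements at once.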
Chapter Summary
| Topic | Key points |
|---|---|
| PII detection | Regex pattern matching plus an extensible detector |
| Security gateway | Pre-request masking/blocking plus post-response filtering |
| Data localization | Private deployment for high-sensitivity scenarios; data stays in-country |
| Regulatory differences | GDPR, CCPA, PIPL, and HIPAA each emphasize different obligations |
Next chapter: LLM Production Case Studies