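Cost Optimization and Monitoring
The Cost Structure of Multimodal AI
Multimodal applications cost far more to run than text-only LLM applications, because images, video, and audio each consume resources in their own way. Effective cost management is a prerequisite for deploying at scale.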
graph TB
    A[Cost structure] --> B[Inference cost]
    A --> C[Storage cost]
    A --> D[Network cost]
    A --> E[GPU cost]
    B --> B1[API call fees]
    B --> B2[Token consumption]
    B --> B3[Image/video processing fees]
    C --> C1[Raw data storage]
    C --> C2[Vector index storage]
    C --> C3[Result cache]
    D --> D1[Data transfer]
    D --> D2[CDN delivery]
    E --> E1[Self-hosted GPU rental]
    E --> E2[Elastic scaling]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style B fill:#ffcdd2,stroke:#c62828,stroke-width:2px
Cost Optimization Strategies
1. Model Routing: Dispatch by Task Complexity
"""
Smart model routing: reducing multimodal cost
"""
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskComplexity(Enum):
    SIMPLE = "simple"    # simple classification, OCR
    MEDIUM = "medium"    # scene understanding, Q&A
    COMPLEX = "complex"  # multi-step reasoning, creative generation


@dataclass
class ModelOption:
    """A candidate model with its cost and quality profile."""
    name: str
    cost_per_image: float  # USD (example figure)
    quality_score: float   # 0-1
    latency_ms: int


class CostAwareRouter:
    """Cost-aware router."""

    # Keywords that hint at a reasoning-heavy prompt
    # (the original Chinese keywords plus English equivalents).
    REASONING_KEYWORDS = ("分析", "比较", "推理", "为什么",
                          "analyze", "compare", "reason", "why")

    def __init__(self):
        self.models = {
            TaskComplexity.SIMPLE: ModelOption(
                "Qwen2-VL 7B", 0.001, 0.75, 100
            ),
            TaskComplexity.MEDIUM: ModelOption(
                "Gemini 2.0 Flash", 0.003, 0.88, 200
            ),
            TaskComplexity.COMPLEX: ModelOption(
                "GPT-4o", 0.01, 0.95, 400
            ),
        }

    def classify_task(self, prompt: str, has_image: bool, has_video: bool) -> TaskComplexity:
        """Classify task complexity from a few heuristic signals."""
        complexity_signals = 0
        if has_video:
            complexity_signals += 2
        if has_image:
            complexity_signals += 1
        if any(w in prompt.lower() for w in self.REASONING_KEYWORDS):
            complexity_signals += 1
        if len(prompt) > 200:
            complexity_signals += 1

        if complexity_signals >= 3:
            return TaskComplexity.COMPLEX
        # MEDIUM needs at least two signals. With a threshold of 1, every
        # request that carries an image would skip the SIMPLE tier entirely.
        elif complexity_signals >= 2:
            return TaskComplexity.MEDIUM
        return TaskComplexity.SIMPLE

    def route(self, prompt: str, has_image: bool = True, has_video: bool = False) -> ModelOption:
        """Route the request to a suitable model."""
        complexity = self.classify_task(prompt, has_image, has_video)
        return self.models[complexity]

    def estimate_cost(self, daily_requests: int,
                      distribution: Optional[dict] = None) -> dict:
        """Estimate daily cost for a given mix of task complexities."""
        if distribution is None:
            distribution = {
                TaskComplexity.SIMPLE: 0.5,
                TaskComplexity.MEDIUM: 0.35,
                TaskComplexity.COMPLEX: 0.15,
            }
        total = 0.0
        breakdown = {}
        for complexity, ratio in distribution.items():
            model = self.models[complexity]
            # round() avoids losing a request to floating-point truncation
            count = round(daily_requests * ratio)
            cost = count * model.cost_per_image
            total += cost
            breakdown[model.name] = {
                "requests": count,
                "unit_price": f"${model.cost_per_image}",
                "subtotal": f"${cost:.2f}",
            }
        breakdown["total"] = f"${total:.2f}/day (${total * 30:.2f}/month)"
        return breakdown


# Cost estimation example
router = CostAwareRouter()
cost = router.estimate_cost(daily_requests=10000)
print("Daily cost estimate (10,000 requests/day):")
for k, v in cost.items():
    if isinstance(v, dict):
        print(f"  {k}: {v['requests']} requests × {v['unit_price']} = {v['subtotal']}")
    else:
        print(f"  Total: {v}")
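To put the routed estimate in context, it helps to compare it with a baseline that sends every request to the most expensive model. The sketch below redoes the arithmetic with the example prices and the default 50/35/15 split from the router above. The names `PRICES`, `MIX`, and `routed_vs_baseline` are illustrative, and the prices are this chapter's example figures, not quoted vendor rates.

```python
# Example per-image prices (USD) and request mix, copied from the router above.
PRICES = {"simple": 0.001, "medium": 0.003, "complex": 0.01}
MIX = {"simple": 0.5, "medium": 0.35, "complex": 0.15}


def routed_vs_baseline(daily_requests: int) -> dict:
    """Compare routed daily cost with a baseline that uses only the top-tier model."""
    routed = sum(daily_requests * MIX[tier] * PRICES[tier] for tier in MIX)
    baseline = daily_requests * PRICES["complex"]
    return {
        "routed_usd": round(routed, 2),
        "baseline_usd": round(baseline, 2),
        "savings_ratio": round(1 - routed / baseline, 3),
    }


result = routed_vs_baseline(10_000)
print(result)
```

For 10,000 requests per day this works out to about $30.50 routed versus $100 for the baseline, a saving of roughly 70%. The figure depends entirely on the assumed prices and request mix, so it is worth recomputing with real traffic data.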
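2. Caching Strategy
| Cache layer | Hit scenario | Savings ratio | TTL |
|---|---|---|---|
| Exact cache | Identical image + prompt | 100% | 24h |
| Semantic cache | Similar prompt (cosine similarity > 0.95) | 100% | 12h |
| Result template | Same image category + same prompt template | 80% | 6h |
| Partial cache | Reused OCR results | 40% | 48h |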
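The first two rows of the table can be sketched as a two-tier lookup: an exact match on a hash of the image bytes plus the prompt, then a semantic match on prompt embeddings. This is a minimal in-memory sketch under several assumptions: the class name `TwoTierCache` is hypothetical, the caller supplies the prompt embedding from whatever embedding model is in use, and a semantic hit is only accepted for the same image. The TTLs and the 0.95 threshold come from the table.

```python
import hashlib
import math
import time
from typing import Optional

EXACT_TTL_S = 24 * 3600     # exact cache TTL from the table
SEMANTIC_TTL_S = 12 * 3600  # semantic cache TTL from the table
SIM_THRESHOLD = 0.95        # cosine similarity threshold from the table


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoTierCache:
    """Exact lookup first, then a semantic lookup among prompts for the same image."""

    def __init__(self):
        # exact key -> (stored_at, result)
        self._exact: dict[str, tuple[float, str]] = {}
        # image hash -> list of (stored_at, prompt embedding, result)
        self._semantic: dict[str, list[tuple[float, list[float], str]]] = {}

    @staticmethod
    def _image_hash(image: bytes) -> str:
        return hashlib.sha256(image).hexdigest()

    def _exact_key(self, image: bytes, prompt: str) -> str:
        raw = self._image_hash(image) + "\n" + prompt
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def put(self, image: bytes, prompt: str, embedding: list[float],
            result: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._exact[self._exact_key(image, prompt)] = (now, result)
        self._semantic.setdefault(self._image_hash(image), []).append(
            (now, embedding, result)
        )

    def get(self, image: bytes, prompt: str, embedding: list[float],
            now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        hit = self._exact.get(self._exact_key(image, prompt))
        if hit is not None and now - hit[0] <= EXACT_TTL_S:
            return hit[1]
        for stored_at, emb, result in self._semantic.get(self._image_hash(image), []):
            if now - stored_at <= SEMANTIC_TTL_S and cosine(embedding, emb) > SIM_THRESHOLD:
                return result
        return None  # miss: the caller invokes the model and then calls put()


# Usage with toy 3-dimensional embeddings and explicit timestamps
cache = TwoTierCache()
img = b"fake-image-bytes"
cache.put(img, "Describe this chart", [1.0, 0.0, 0.0], "A bar chart of sales", now=0)
print(cache.get(img, "Describe this chart", [1.0, 0.0, 0.0], now=10))   # exact hit
print(cache.get(img, "Describe the chart", [0.99, 0.05, 0.0], now=10))  # semantic hit
print(cache.get(img, "Count the people", [0.0, 1.0, 0.0], now=10))      # miss
```

Expired entries are skipped here but never evicted; a production cache would more likely sit in a store with native TTL support and use a vector index for the semantic tier.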
3. Monitoring Dashboard
graph LR
    subgraph metrics[Key metrics]
        A[Request rate QPS]
        B[Cost $/day]
        C[Latency P95]
        D[Error rate %]
    end
    subgraph alerts[Alert rules]
        E[Cost over budget >120%]
        F[Latency spike P95>3s]
        G[Error rate >5%]
        H[Model unavailable]
    end
    A --> E
    B --> E
    C --> F
    D --> G
    style E fill:#ffcdd2,stroke:#c62828
    style F fill:#ffcdd2,stroke:#c62828
    style G fill:#ffcdd2,stroke:#c62828
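The three threshold rules in the diagram can be evaluated over a window of recent requests. The sketch below is one possible reading, not a fixed design: `p95` uses the nearest-rank method, and the function names, argument names, and alert message formats are all illustrative.

```python
def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile; returns 0.0 for an empty window."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def evaluate_alerts(latencies_ms: list[float], costs_usd: list[float],
                    failures: int, daily_budget_usd: float) -> list[str]:
    """Apply the diagram's thresholds: budget >120%, P95 >3s, error rate >5%."""
    alerts = []
    spend = sum(costs_usd)
    if spend > 1.2 * daily_budget_usd:
        alerts.append(f"cost over budget: {spend / daily_budget_usd:.0%} of daily budget")
    latency = p95(latencies_ms)
    if latency > 3000:
        alerts.append(f"latency spike: P95={latency:.0f}ms")
    total = len(latencies_ms)
    if total and failures / total > 0.05:
        alerts.append(f"error rate: {failures / total:.1%}")
    return alerts


# A degraded window: 10% of requests are slow, spend is 140% of budget, 6% fail
bad = evaluate_alerts([200.0] * 90 + [4000.0] * 10, [0.7] * 100,
                      failures=6, daily_budget_usd=50.0)
# A healthy window: no rule fires
good = evaluate_alerts([200.0] * 100, [0.1] * 100, failures=1, daily_budget_usd=50.0)
print(bad)
print(good)
```

Checking P95 over a window rather than each request's latency keeps a single slow call from paging anyone, which matches the "P95>3s" wording of the rule.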
"""
Multimodal service monitoring
"""
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MetricPoint:
    """A single metric data point."""
    timestamp: str  # ISO 8601 string, e.g. datetime.now().isoformat()
    model: str
    latency_ms: float
    cost_usd: float
    success: bool
    token_count: int = 0


class MultimodalMonitor:
    """Monitor for a multimodal service."""

    def __init__(self):
        self.metrics: list[MetricPoint] = []
        self.alerts: list[str] = []
        self.budget_daily_usd: float = 100.0

    def record(self, metric: MetricPoint):
        """Record a metric and check alert conditions."""
        self.metrics.append(metric)
        self._check_alerts(metric)

    def _check_alerts(self, metric: MetricPoint):
        """Check alert conditions for the newly recorded metric."""
        # Per-request check: a simplification of the dashboard's P95 > 3s rule
        if metric.latency_ms > 3000:
            self.alerts.append(
                f"⚠️ Latency alert: {metric.model} {metric.latency_ms}ms"
            )
        # The error rate is only re-evaluated when a request fails
        if not metric.success:
            error_rate = self._calc_error_rate()
            if error_rate > 0.05:
                self.alerts.append(
                    f"🔴 Error rate alert: {error_rate:.1%}"
                )

    def _calc_error_rate(self) -> float:
        """Error rate over the most recent 100 requests."""
        recent = self.metrics[-100:]
        if not recent:
            return 0.0
        errors = sum(1 for m in recent if not m.success)
        return errors / len(recent)

    def daily_summary(self) -> dict:
        """Daily summary."""
        today_metrics = self.metrics  # simplification: treat every recorded metric as today's
        total_cost = sum(m.cost_usd for m in today_metrics)
        avg_latency = (
            sum(m.latency_ms for m in today_metrics) / len(today_metrics)
            if today_metrics else 0
        )
        return {
            "date": datetime.now().strftime("%Y-%m-%d"),
            "total_requests": len(today_metrics),
            "total_cost": f"${total_cost:.2f}",
            "budget_usage": f"{total_cost / self.budget_daily_usd:.0%}",
            "avg_latency": f"{avg_latency:.0f}ms",
            "alert_count": len(self.alerts),
        }


# Example
monitor = MultimodalMonitor()
monitor.budget_daily_usd = 50.0

# Simulated records: every 20th request fails
for i in range(100):
    monitor.record(MetricPoint(
        timestamp=datetime.now().isoformat(),
        model="GPT-4o",
        latency_ms=300 + (i % 10) * 50,
        cost_usd=0.01,
        success=i % 20 != 0,
    ))

summary = monitor.daily_summary()
print("Daily summary:")
for k, v in summary.items():
    print(f"  {k}: {v}")
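Budget usage, which the daily summary already reports, can also drive automatic degradation: as spend approaches the daily budget, requests are capped at a cheaper tier. The sketch below is one possible policy rather than a prescribed one. The function name `degrade_tier` and the 80% and 100% limits are assumptions chosen for illustration; the tier names mirror the three complexity levels used by the router.

```python
TIERS = ["simple", "medium", "complex"]  # ordered from cheapest to most expensive


def degrade_tier(requested: str, spent_usd: float, budget_usd: float,
                 soft_limit: float = 0.8, hard_limit: float = 1.0) -> str:
    """Cap the requested tier according to how much of the budget is used.

    Below soft_limit: no cap. From soft_limit up to hard_limit: cap at "medium".
    At or above hard_limit: cap at "simple". Both limits are hypothetical defaults.
    """
    usage = spent_usd / budget_usd if budget_usd > 0 else float("inf")
    if usage >= hard_limit:
        cap = "simple"
    elif usage >= soft_limit:
        cap = "medium"
    else:
        cap = "complex"
    # Never upgrade a request: take the cheaper of the requested tier and the cap
    return TIERS[min(TIERS.index(requested), TIERS.index(cap))]


print(degrade_tier("complex", spent_usd=10.0, budget_usd=50.0))  # 20% used
print(degrade_tier("complex", spent_usd=42.0, budget_usd=50.0))  # 84% used
print(degrade_tier("complex", spent_usd=50.0, budget_usd=50.0))  # 100% used
```

Capping instead of rejecting keeps the service available at lower quality, which is usually preferable to failing requests outright once the budget is spent.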
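Chapter Summary
- Multimodal cost has four components: inference, storage, network, and GPU
- Smart model routing dispatches by task complexity (simple tasks to small models, complex tasks to strong models) and can save 60% or more; with this chapter's example prices and 50/35/15 request mix, the saving is about 70% compared with sending every request to GPT-4o
- A multi-layer caching strategy (exact + semantic + template) can cut repeated calls by 30-50%, depending on the hit rate
- The monitoring dashboard should cover four core metrics: QPS, cost, latency, and error rate
- Set budget alerts and an automatic degradation mechanism to keep costs from running out of control
Next chapter: evaluation methods and quality assurance for multimodal systems.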