
Token Budgets and Rate-Limiting Strategy

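Cost optimization and caching lower the cost of each call, but they do nothing about total usage running out of control. Token budgets and rate limits are the last line of defense against a runaway bill.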

Budget Control System

graph TB
    subgraph BUDGET[Budget levels]
        GLOBAL[Global monthly budget]
        TENANT[Tenant / business-line budget]
        USER[Per-user daily budget]
        API[Per-call token cap]
    end
    subgraph RATE[Rate-limit levels]
        RPM[Requests per minute RPM]
        TPM[Tokens per minute TPM]
        RPD[Requests per day RPD]
        CONCUR[Max concurrency]
    end
    subgraph ACTIONS[Response actions]
        WARN[Early warning at 80%]
        THROTTLE[Throttle and degrade]
        BLOCK[Hard block]
        NOTIFY[Notify the owner]
    end
    GLOBAL --> TENANT --> USER --> API
    RPM --> THROTTLE
    TPM --> THROTTLE
    GLOBAL --> WARN
    TENANT --> WARN
    USER --> BLOCK
    WARN --> NOTIFY
    THROTTLE --> BLOCK
    style GLOBAL fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style BLOCK fill:#ffcdd2,stroke:#c62828,stroke-width:2px
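The mapping from budget levels to response actions in the diagram can be sketched as a small pure function. This is only an illustration: the `Level` dataclass, the `evaluate` function, and the `hard` flag are hypothetical names. It assumes, as the diagram suggests, that the global and tenant levels only warn while the user level blocks.

```python
from dataclasses import dataclass


@dataclass
class Level:
    name: str      # e.g. "global", "tenant", "user"
    spent: float   # usage so far, in any consistent unit
    limit: float   # budget for this level, same unit as `spent`
    hard: bool     # True: block at 100%; False: warn only


def evaluate(levels, warn_at=0.8):
    """Return (level_name, action) pairs for every level that needs attention."""
    results = []
    for lv in levels:
        pct = lv.spent / lv.limit
        if lv.hard and pct >= 1.0:
            results.append((lv.name, "block"))
        elif pct >= warn_at:
            results.append((lv.name, "warn"))
    return results


levels = [
    Level("global", spent=4200, limit=5000, hard=False),  # 84% of budget
    Level("tenant", spent=1000, limit=3000, hard=False),  # 33% of budget
    Level("user", spent=10_000, limit=10_000, hard=True),  # 100% of budget
]
print(evaluate(levels))
```

Keeping the decision separate from the side effects (sending the alert, rejecting the request) makes the thresholds easy to unit-test.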

Token Budget Allocation Reference

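Level	Recommended approach	Example value
Global monthly	Last month's actual usage × 1.3 as the soft cap	$5,000/month
Business line	Allocate by historical share, plus a 10% buffer	Core business 60%, testing 20%, internal tools 20%
User daily	Tiered by free / paid plan	Free: 10k tokens/day; paid: 200k tokens/day
Single call	Set max_tokens per use case	Summarization 512, chat 2048, long documents 4096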
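The first two rows reduce to simple arithmetic. A minimal sketch under stated assumptions: `allocate_budget` is a hypothetical helper, the $4,000 figure for last month's spend is made up for illustration, and the 10% buffer is interpreted here as a shared reserve held back from the soft cap before splitting by historical share (other interpretations are possible).

```python
def allocate_budget(last_month_usd, shares, growth=1.3, buffer=0.10):
    """Derive a soft cap from last month's spend and split it by share."""
    soft_cap = last_month_usd * growth   # last month's usage x 1.3
    reserve = soft_cap * buffer          # assumed: buffer held back as a shared reserve
    distributable = soft_cap - reserve
    total_share = sum(shares.values())   # normalize in case shares don't sum to 1
    lines = {
        name: round(distributable * share / total_share, 2)
        for name, share in shares.items()
    }
    return {
        "soft_cap": round(soft_cap, 2),
        "reserve": round(reserve, 2),
        "lines": lines,
    }


plan = allocate_budget(4000, {"core": 0.6, "testing": 0.2, "internal_tools": 0.2})
print(plan)
```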

Rate-Limiting Middleware (Redis Sliding Window)

import time
import uuid
from functools import wraps

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


class RateLimitExceeded(Exception):
    """Raised when a caller exceeds the allowed request rate."""


def rate_limit(key_prefix: str, max_requests: int, window_seconds: int):
    """Rate-limiting decorator backed by a Redis sliding window."""
    def decorator(func):
        @wraps(func)
        def wrapper(user_id: str, *args, **kwargs):
            key = f"rate:{key_prefix}:{user_id}"
            now = time.time()
            # Unique member so two requests at the same timestamp are both counted
            member = f"{now}:{uuid.uuid4().hex}"
            pipe = r.pipeline()
            # Drop expired entries, record this request, then count the window
            pipe.zremrangebyscore(key, 0, now - window_seconds)
            pipe.zadd(key, {member: now})
            pipe.zcard(key)
            pipe.expire(key, window_seconds)
            _, _, count, _ = pipe.execute()
            # Note: rejected requests are still recorded in the window,
            # so a client that keeps retrying stays limited.
            if count > max_requests:
                raise RateLimitExceeded(
                    f"Limit exceeded: {max_requests} requests/{window_seconds}s"
                )
            return func(user_id, *args, **kwargs)
        return wrapper
    return decorator


# Usage example
@rate_limit(key_prefix="llm_api", max_requests=60, window_seconds=60)
def call_llm(user_id: str, prompt: str):
    # Actual call logic goes here
    pass
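The decorator above counts requests (RPM). Token throughput (TPM) can be limited with the same sliding-window idea by weighting each entry by its token count. Below is a minimal in-memory sketch, not Redis-backed and not safe across processes. The class name `SlidingWindowTPM` and the injectable `clock` parameter are assumptions made for illustration and testing.

```python
import time
from collections import deque


class SlidingWindowTPM:
    """In-memory sliding-window limiter that counts tokens instead of requests."""

    def __init__(self, max_tokens, window_seconds=60.0, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, tokens), oldest first
        self.used = 0          # tokens currently inside the window

    def try_consume(self, tokens):
        """Return True and record the usage if it fits in the window."""
        now = self.clock()
        # Evict entries that have fallen out of the window
        while self.events and self.events[0][0] <= now - self.window:
            _, old_tokens = self.events.popleft()
            self.used -= old_tokens
        if self.used + tokens > self.max_tokens:
            return False  # rejected calls are not recorded
        self.events.append((now, tokens))
        self.used += tokens
        return True


# Usage with a fake clock so the behaviour is deterministic
fake_now = [0.0]
limiter = SlidingWindowTPM(max_tokens=1000, window_seconds=60, clock=lambda: fake_now[0])
first = limiter.try_consume(600)   # fits
second = limiter.try_consume(500)  # would reach 1100, rejected
fake_now[0] = 61.0                 # the first entry has now expired
third = limiter.try_consume(1000)  # fits again
```

Unlike the Redis decorator, this variant does not record rejected calls, so a retry succeeds as soon as enough earlier usage has expired.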

Token Usage Tracking and Alerts

import time


class BudgetExceeded(Exception):
    """Raised when a user's spend reaches the monthly budget."""


class TokenBudgetTracker:
    def __init__(self, redis_client, monthly_budget_usd: float):
        self.r = redis_client
        self.budget = monthly_budget_usd  # applied per user, in USD
        # GPT-4o reference prices ($ per 1M tokens)
        self.price_per_1m = {"input": 2.5, "output": 10.0}

    def record_usage(self, user_id: str, input_tokens: int, output_tokens: int):
        cost = (input_tokens * self.price_per_1m["input"] +
                output_tokens * self.price_per_1m["output"]) / 1_000_000
        month_key = f"budget:{user_id}:{time.strftime('%Y-%m')}"
        current = float(self.r.incrbyfloat(month_key, cost))
        self.r.expire(month_key, 35 * 86400)  # 35-day TTL
        usage_pct = current / self.budget
        if usage_pct >= 1.0:
            raise BudgetExceeded(f"User {user_id} has exceeded the monthly budget")
        if usage_pct >= 0.8:
            self._send_alert(user_id, current, usage_pct)

    def _send_alert(self, user_id, spent, pct):
        print(f"[WARNING] User {user_id} has spent ${spent:.2f} ({pct:.0%})")

Rate-Limiting Strategy Comparison

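Strategy	Implementation complexity	Burst protection	Fairness	Typical use
Fixed-window counter	Low	Poor (bursts at window boundaries)	Medium	Internal tools
Sliding-window log	Medium	Good	High	API services
Token bucket	Medium	Allows short bursts	High	Real-time interaction
Leaky bucket	Medium	Strong (smooths traffic)	High	Batch queues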
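For comparison with the sliding-window log, here is a minimal token-bucket sketch. It is single-process and in-memory. The `TokenBucket` class, its parameter names, and the injectable `clock` are assumptions for illustration, not part of any library.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` per second, holds at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and deduct `cost` if enough tokens are available."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Usage with a fake clock: a burst of 3 passes, the 4th call is rejected
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]
fake_now[0] = 2.0  # two seconds later, two tokens have been refilled
after_wait = [bucket.allow() for _ in range(3)]
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput, which is why the table lists this strategy as allowing short bursts.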

Action Checklist

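[ ] Set monthly token budgets at each of the three levels: global, business line, and user
[ ] Trigger an alert at 80% of budget and a hard block at 100%
[ ] Deploy the Redis sliding-window rate-limiting middleware, covering both RPM and TPM
[ ] Set max_tokens on every LLM call to prevent unbounded output
[ ] Build a usage dashboard showing daily and monthly token consumption trends by user and business line
[ ] Apply different rate-limit policies to paid and free users
[ ] Review budget usage every month and adjust allocations to match actual load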

Next chapter: 04 - Observability and Stability → Monitoring and Observability