
Model Quantization and Compression

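When large models are deployed in production, GPU cost is usually the biggest bottleneck. Quantization and compression can cut resource consumption substantially while keeping model quality close to the original.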

Overview of Quantization Techniques

graph TB
    A[Model compression] --> B[Quantization]
    A --> C[Distillation]
    A --> D[Pruning]
    B --> B1[PTQ post-training quantization]
    B --> B2[QAT quantization-aware training]
    B --> B3[Mixed-precision quantization]
    C --> C1[Large model → small model]
    C --> C2[Preserve key capabilities]
    D --> D1[Structured pruning]
    D --> D2[Unstructured pruning]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style B fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
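
To make the PTQ branch concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization (often called absmax quantization). It is an illustration under simplifying assumptions, not what GPTQ or AWQ actually do: those methods add calibration data and per-group scales. The function names and sample weights are hypothetical.

```python
def quantize_absmax_int8(weights):
    """Map floats in [-max|w|, max|w|] onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.123, -0.5, 0.337, 1.27, -1.0]  # made-up sample values
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half a quantization step (`scale / 2`), which is why a single outlier weight hurts: it inflates `scale` for every other value in the tensor.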

Comparison of Quantization Methods

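Method	Precision	Compression ratio (vs. FP32)	Inference speed	Quality loss	Typical use
FP16	16-bit	2x	⭐⭐⭐	Almost none	Default choice
INT8	8-bit	4x	⭐⭐⭐⭐	Very small	Online inference
INT4 (GPTQ)	4-bit	8x	⭐⭐⭐⭐	Slight	Resource-constrained
GGUF (llama.cpp)	2-6 bit	5-16x	⭐⭐⭐	Tunable	CPU / edge
AWQ	4-bit	8x	⭐⭐⭐⭐⭐	Very small	Recommended for production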
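
The compression ratios in the table follow from bit-width arithmetic: weight memory is roughly parameters × bits / 8 bytes. The sketch below works this out for a 7B-parameter model. The helper name is hypothetical, and the optional per-group scale term is a simplified assumption about how grouped 4-bit formats store metadata (real formats may also store zero-points).

```python
def weight_memory_gb(params_billion, bits, group_size=None, scale_bits=16):
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    If group_size is given, each group of weights carries one extra scale
    of scale_bits, which adds scale_bits / group_size bits per weight.
    """
    effective_bits = bits
    if group_size:
        effective_bits += scale_bits / group_size
    return params_billion * effective_bits / 8


for bits in (32, 16, 8, 4):
    gb = weight_memory_gb(7, bits)
    print(f"{bits:>2}-bit: {gb:5.1f} GB ({32 / bits:.0f}x smaller than FP32)")

# 4-bit with one FP16 scale per 128 weights: 4.125 effective bits per weight.
print(f"4-bit, group_size=128: {weight_memory_gb(7, 4, group_size=128):.2f} GB")
```

This covers weights only. Activations and the KV cache come on top, so the actual footprint at serving time is higher.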

Quantization Manager

"""
Model quantization configuration and management.
"""
from __future__ import annotations  # lets `QuantConfig | None` work on Python < 3.10

from dataclasses import dataclass
from enum import Enum


class QuantMethod(Enum):
    FP16 = "fp16"
    INT8 = "int8"
    INT4_GPTQ = "int4_gptq"
    INT4_AWQ = "int4_awq"
    GGUF = "gguf"


@dataclass
class QuantConfig:
    """Quantization configuration."""
    method: QuantMethod
    bits: int
    group_size: int = 128       # quantization group size
    desc_act: bool = False      # activation-order quantization (GPTQ)
    use_exllama: bool = True    # kernel selection

    @property
    def memory_ratio(self) -> float:
        """Memory footprint relative to FP32."""
        return self.bits / 32


@dataclass
class ModelProfile:
    """Resource profile of a model."""
    name: str
    params_billion: float
    fp16_memory_gb: float
    quant_config: QuantConfig | None = None

    @property
    def estimated_memory_gb(self) -> float:
        """Estimated GPU memory after quantization."""
        if self.quant_config is None:
            return self.fp16_memory_gb
        return self.fp16_memory_gb * (self.quant_config.bits / 16)

    @property
    def min_gpu_memory_gb(self) -> float:
        """Minimum GPU memory required, including overhead."""
        return self.estimated_memory_gb * 1.2  # 20% overhead


class QuantizationManager:
    """Quantization manager."""

    # Profiles of common models: (name, parameters in billions, FP16 memory in GB)
    MODEL_PROFILES = {
        "llama-7b": ModelProfile("Llama-2-7B", 7, 14),
        "llama-13b": ModelProfile("Llama-2-13B", 13, 26),
        "llama-70b": ModelProfile("Llama-2-70B", 70, 140),
        "qwen-7b": ModelProfile("Qwen2.5-7B", 7, 14),
        "qwen-72b": ModelProfile("Qwen2.5-72B", 72, 144),
    }

    # GPU memory in GB
    GPU_MEMORY = {
        "A100-80GB": 80,
        "A100-40GB": 40,
        "H100-80GB": 80,
        "L40S-48GB": 48,
        "RTX4090-24GB": 24,
        "T4-16GB": 16,
    }

    def recommend_quant(self, model_key: str, gpu: str) -> QuantConfig:
        """Recommend a quantization scheme for a model/GPU pair."""
        profile = self.MODEL_PROFILES.get(model_key)
        gpu_mem = self.GPU_MEMORY.get(gpu, 24)  # unknown GPUs are treated as 24 GB
        if profile is None:
            # Unknown model: default to INT8
            return QuantConfig(method=QuantMethod.INT8, bits=8)

        # If FP16 fits, do not quantize
        if profile.fp16_memory_gb * 1.2 <= gpu_mem:
            return QuantConfig(method=QuantMethod.FP16, bits=16)

        # INT8 fits
        int8_mem = profile.fp16_memory_gb * 0.5 * 1.2
        if int8_mem <= gpu_mem:
            return QuantConfig(method=QuantMethod.INT8, bits=8)

        # INT4 AWQ fits
        int4_mem = profile.fp16_memory_gb * 0.25 * 1.2
        if int4_mem <= gpu_mem:
            return QuantConfig(method=QuantMethod.INT4_AWQ, bits=4)

        # Extreme case: fall back to low-bit GGUF (3-bit here).
        # No fit check is made, so the model may still exceed a single GPU.
        return QuantConfig(method=QuantMethod.GGUF, bits=3)

    def estimate_throughput(self, quant: QuantConfig, gpu: str) -> dict:
        """Estimate inference throughput (rough heuristic)."""
        # Baseline tokens/s per GPU; GPUs missing here (e.g. A100-40GB) fall back to 50
        base_tps = {"A100-80GB": 100, "H100-80GB": 150, "L40S-48GB": 80,
                    "RTX4090-24GB": 70, "T4-16GB": 30}
        base = base_tps.get(gpu, 50)

        # Speed-up factor relative to FP16
        multipliers = {
            QuantMethod.FP16: 1.0,
            QuantMethod.INT8: 1.3,
            QuantMethod.INT4_GPTQ: 1.5,
            QuantMethod.INT4_AWQ: 1.8,
            QuantMethod.GGUF: 0.8,
        }
        multiplier = multipliers.get(quant.method, 1.0)
        return {
            "tokens_per_second": int(base * multiplier),
            "gpu": gpu,
            "quant": quant.method.value,
        }
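
The selection ladder in `recommend_quant` is easy to trace by hand. The sketch below is a condensed, standalone version of the same fit check, so it runs without the classes above. The helper name `pick_bits` is hypothetical; the memory figures are taken from the model and GPU tables in the manager.

```python
OVERHEAD = 1.2  # same 20% headroom factor the manager uses


def pick_bits(fp16_memory_gb, gpu_memory_gb):
    """Return the widest bit-width whose footprint (with overhead) fits the GPU."""
    for bits in (16, 8, 4):
        needed = fp16_memory_gb * (bits / 16) * OVERHEAD
        if needed <= gpu_memory_gb:
            return bits
    return 3  # fallback tier; no fit check, mirrors the GGUF branch


cases = [
    # (model, FP16 memory in GB, GPU, GPU memory in GB)
    ("Llama-2-7B", 14, "RTX4090-24GB", 24),
    ("Llama-2-7B", 14, "T4-16GB", 16),
    ("Llama-2-13B", 26, "RTX4090-24GB", 24),
    ("Llama-2-70B", 140, "A100-80GB", 80),
    ("Qwen2.5-72B", 144, "RTX4090-24GB", 24),
]
for model, fp16_gb, gpu, gpu_gb in cases:
    print(f"{model} on {gpu}: {pick_bits(fp16_gb, gpu_gb)}-bit")
```

Two cases are worth noting. Llama-2-7B does not fit a T4 in FP16 because 14 GB × 1.2 = 16.8 GB exceeds 16 GB. For Qwen2.5-72B on a 24 GB card, even the 3-bit fallback would need about 144 × 3/16 × 1.2 = 32.4 GB, so the fallback signals "use more GPUs" rather than guaranteeing a fit.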

Choosing a Deployment Framework

graph LR
    A{Model size?} -->|<7B| B[llama.cpp / Ollama]
    A -->|7-70B| C{Need high throughput?}
    A -->|>70B| D[Multi-GPU tensor parallelism]
    C -->|Yes| E[vLLM]
    C -->|No| F[TGI / Ollama]
    D --> G[vLLM + TP]
    D --> H[TensorRT-LLM]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style E fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
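
The flowchart can be read as a small decision function. The sketch below encodes it directly; the function name and parameters are hypothetical, and the thresholds are the ones drawn in the chart (7B and 70B both fall in the middle branch).

```python
def choose_framework(params_billion, need_high_throughput=False):
    """Return the candidate serving frameworks for a given model size."""
    if params_billion < 7:
        return ["llama.cpp", "Ollama"]
    if params_billion <= 70:
        return ["vLLM"] if need_high_throughput else ["TGI", "Ollama"]
    # Above 70B the chart goes straight to multi-GPU tensor parallelism.
    return ["vLLM + TP", "TensorRT-LLM"]


print(choose_framework(3))
print(choose_framework(13, need_high_throughput=True))
print(choose_framework(13))
print(choose_framework(72))
```

Treat the result as a starting shortlist. The chart considers only model size and throughput, not quantization format or hardware.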

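Framework	Highlights	Quantization support	Throughput	Typical use
vLLM	PagedAttention	AWQ/GPTQ/FP8	Highest	High-concurrency inference
TGI	Official Hugging Face server	GPTQ/AWQ	High	HF ecosystem
Ollama	One-command deployment	GGUF	Medium	Development / small scale
llama.cpp	CPU-friendly	GGUF	Medium	Edge / local
TensorRT-LLM	NVIDIA-optimized	INT8/FP8	Very high	A100/H100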
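
Once a quantization format is chosen, the table can be inverted to see which frameworks can serve it. The sketch below copies the "quantization support" column into a lookup. It is a snapshot of the table, not of the frameworks themselves: actual support changes between releases and should be checked against each project's documentation. The names used here are hypothetical.

```python
# Quantization formats per framework, copied from the table above.
FRAMEWORK_QUANT_SUPPORT = {
    "vLLM": {"AWQ", "GPTQ", "FP8"},
    "TGI": {"GPTQ", "AWQ"},
    "Ollama": {"GGUF"},
    "llama.cpp": {"GGUF"},
    "TensorRT-LLM": {"INT8", "FP8"},
}


def frameworks_supporting(quant_format):
    """List frameworks whose table entry includes the given format."""
    fmt = quant_format.upper()
    return [name for name, formats in FRAMEWORK_QUANT_SUPPORT.items()
            if fmt in formats]


for fmt in ("awq", "gguf", "fp8"):
    print(fmt, "->", frameworks_supporting(fmt))
```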

Chapter Summary

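Topic	Key points
Quantization methods	AWQ (first choice for production) / GPTQ (broad compatibility) / GGUF (CPU)
Selection logic	If FP16 fits, do not quantize; otherwise try INT8, then INT4
Framework choice	vLLM (high throughput) / TGI (HF ecosystem) / Ollama (development)
Core metrics	Memory footprint × inference throughput × quality loss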

Next chapter: GPU Resource Management