
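Tokenizers and How Tokenization Works

LLMs do not operate on text directly; they operate on tokens. The tokenizer is the "translator" between text and the model, and it directly shapes both what the model can do and what it costs to run.

Where the Tokenizer Sits in an LLM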

graph LR
    A[Raw text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Embedding layer]
    D --> E[Transformer]
    E --> F[Output probabilities]
    F --> G[Tokenizer decoding]
    G --> H[Generated text]
    style B fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style E fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style G fill:#fff9c4,stroke:#f9a825,stroke-width:2px
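To make the two tokenizer steps in the diagram concrete, here is a minimal sketch of the encode/decode round trip. `ToyTokenizer` is a hypothetical character-level tokenizer invented for illustration; real tokenizers use subword vocabularies, but the text → IDs → text contract is the same.

```python
class ToyTokenizer:
    """Hypothetical character-level tokenizer: one ID per distinct character."""

    def __init__(self, corpus: str) -> None:
        # Build the vocabulary from the sorted set of characters in the corpus.
        chars = sorted(set(corpus))
        self.token_to_id = {ch: i for i, ch in enumerate(chars)}
        self.id_to_token = {i: ch for ch, i in self.token_to_id.items()}

    def encode(self, text: str) -> list[int]:
        """Text -> token IDs (what the embedding layer consumes)."""
        return [self.token_to_id[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        """Token IDs -> text (applied to the IDs the model generates)."""
        return "".join(self.id_to_token[i] for i in ids)


tokenizer = ToyTokenizer("hello world")
ids = tokenizer.encode("hello")
print(ids)                    # [3, 2, 4, 4, 5]
print(tokenizer.decode(ids))  # hello
```

The model itself only ever sees the integer IDs; everything before `encode` and after `decode` is plain text.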

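Tokenization Algorithms Compared

Algorithm	Representative models	How it works	Vocabulary size	Strength
BPE	GPT series	Repeatedly merges the most frequent adjacent pair (bytes, in byte-level BPE)	~50,000-200,000	Handles unseen words well
WordPiece	BERT	Chooses subword merges that maximize training-data likelihood	~30,000	Long-standing standard in academic NLP
Unigram	T5, ALBERT	Starts from a large candidate vocabulary and prunes low-probability subwords	~30,000-50,000	Multilingual-friendly
SentencePiece	Many (a toolkit implementing BPE and Unigram)	Language-agnostic processing of raw text	Configurable	No pre-tokenization required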
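The table describes how each algorithm builds its vocabulary. At tokenization time, BERT-style WordPiece segments a word by greedy longest-match-first lookup, marking non-initial pieces with `##`. The sketch below illustrates that lookup; `wordpiece_tokenize` and `toy_vocab` are hypothetical names with a hand-written vocabulary, not BERT's actual one.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "##a", "##ff"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Note the contrast with byte-level BPE, which can always fall back to raw bytes and so does not need an unknown token.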

BPE Tokenization Simulation

"""
A simplified implementation of Byte-Pair Encoding (BPE).
"""
import re
from collections import Counter


def get_pair_counts(vocab: dict[str, int]) -> Counter:
    """Count how often each adjacent symbol pair occurs."""
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += count
    return pairs


def merge_pair(
    pair: tuple[str, str], vocab: dict[str, int]
) -> dict[str, int]:
    """Merge every occurrence of the given symbol pair into one symbol."""
    new_vocab = {}
    # Match whole symbols only, so merging ("a", "b") leaves "xa b" untouched.
    # A plain str.replace would wrongly merge across symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    for word, count in vocab.items():
        new_word = pattern.sub(lambda _: replacement, word)
        new_vocab[new_word] = count
    return new_vocab


def train_bpe(text: str, num_merges: int = 10) -> list[tuple[str, str]]:
    """Train a BPE tokenizer and return the learned merges in order."""
    # Initialization: split each word into space-separated characters
    # and append the end-of-word marker </w>.
    vocab: dict[str, int] = {}
    for word in text.split():
        spaced = " ".join(word) + " </w>"
        vocab[spaced] = vocab.get(spaced, 0) + 1
    merges = []
    for i in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best_pair = pairs.most_common(1)[0][0]
        vocab = merge_pair(best_pair, vocab)
        merges.append(best_pair)
        print(f"Merge {i + 1}: '{best_pair[0]}' + '{best_pair[1]}' → '{''.join(best_pair)}'")
    return merges


# Demo
text = "low lower lowest new newer newest"
merges = train_bpe(text, num_merges=5)
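Training only produces the merge list. To tokenize a new word, the merges are replayed in the order they were learned. The sketch below shows that step under simplifying assumptions: `apply_merges` is a hypothetical helper, and `example_merges` is written by hand rather than taken from the `train_bpe` output above.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment one word by replaying learned merges in training order."""
    # Start from single characters plus the end-of-word marker.
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever it appears as two adjacent symbols.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, written by hand for illustration.
example_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]
print(apply_merges("lower", example_merges))  # ['low', 'er</w>']
print(apply_merges("slow", example_merges))   # ['s', 'low', '</w>']
```

The second call shows why BPE copes with unseen words: "slow" never appears in the merge list as a whole, yet it still decomposes into known pieces.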

Token Counting and Cost

from dataclasses import dataclass


@dataclass
class TokenEstimate:
    """Rough token estimate for a piece of text."""
    text: str                  # first 50 characters, kept as a preview
    char_count: int
    estimated_tokens_en: int   # if the text is English: ~4 chars/token
    estimated_tokens_zh: int   # if the text is Chinese: ~1.5 chars/token

    @classmethod
    def from_text(cls, text: str) -> "TokenEstimate":
        chars = len(text)
        # Rough heuristics only; a real tokenizer gives exact counts
        en_tokens = chars // 4
        zh_tokens = int(chars / 1.5)
        return cls(text[:50], chars, en_tokens, zh_tokens)


@dataclass
class CostCalculator:
    """Estimates the cost of an API call."""
    model: str
    input_price_per_1k: float    # input price per 1,000 tokens (USD)
    output_price_per_1k: float   # output price per 1,000 tokens (USD)

    def estimate_cost(
        self, input_tokens: int, output_tokens: int
    ) -> dict:
        input_cost = input_tokens / 1000 * self.input_price_per_1k
        output_cost = output_tokens / 1000 * self.output_price_per_1k
        return {
            "model": self.model,
            "input_cost": f"${input_cost:.4f}",
            "output_cost": f"${output_cost:.4f}",
            "total": f"${input_cost + output_cost:.4f}",
        }


# Pricing comparison for popular models (a snapshot; provider prices change)
MODELS = [
    CostCalculator("GPT-4o", 0.0025, 0.010),
    CostCalculator("GPT-4o-mini", 0.00015, 0.0006),
    CostCalculator("Claude-3.5-Sonnet", 0.003, 0.015),
    CostCalculator("DeepSeek-V3", 0.00014, 0.00028),
]

# Estimate the cost of one typical conversation
for m in MODELS:
    result = m.estimate_cost(input_tokens=2000, output_tokens=1000)
    print(f"{result['model']}: {result['total']}")
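Real prompts often mix Chinese and English, so applying a single chars-per-token ratio to the whole string can be far off. A minimal sketch of a mixed-text estimate follows, reusing the same rough ratios (~4 and ~1.5 chars/token). `estimate_tokens` is a hypothetical helper, and it treats only the basic CJK Unified Ideographs range as Chinese, which is a simplifying assumption.

```python
def estimate_tokens(
    text: str,
    en_chars_per_token: float = 4.0,
    zh_chars_per_token: float = 1.5,
) -> int:
    """Heuristic token count for mixed Chinese/English text."""
    # Assumption: only U+4E00..U+9FFF counts as Chinese here.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return round(cjk / zh_chars_per_token + other / en_chars_per_token)


print(estimate_tokens("hello world!"))  # 3  (12 chars / 4)
print(estimate_tokens("人工智能"))        # 3  (4 chars / 1.5, rounded)
print(estimate_tokens("GPT 人工智能"))    # 4  (1 + 2.67, rounded)
```

For billing-grade numbers, count with the provider's actual tokenizer; this estimate is only useful for quick budgeting.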

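Special Challenges of Chinese Tokenization

graph TB
    A[Challenges of Chinese tokenization] --> B[No natural spaces]
    A --> C[One character, many meanings]
    A --> D[New words keep emerging]
    B --> B1["'人工智能' (artificial intelligence) → how many tokens?"]
    C --> C1["'开发' = develop? open?"]
    D --> D1["'摆烂' (recent slang) → may be split into single characters"]
    style A fill:#e3f2fd,stroke:#1565c0,stroke-width:2px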

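Language	Average token efficiency	Reason
English	~4 chars/token	Latin-script text merges well under BPE
Chinese	~1.2-1.5 chars/token	Each character may be its own token
Japanese	~1.5 chars/token	Mix of kanji and kana
Code	~3 chars/token	Keywords merge well; indentation wastes tokens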
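One way to see where the gap comes from: byte-level BPE (as used by GPT-style tokenizers) starts from UTF-8 bytes, and each common Chinese character occupies 3 bytes where an ASCII letter occupies 1. A character with no learned merges can therefore fall back to as many as 3 byte-level tokens. The sketch below only measures bytes; `utf8_profile` is a hypothetical helper, and actual token counts depend on the specific tokenizer's merges.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


for sample in ["tokenizer", "人工智能", "摆烂"]:
    chars, nbytes = utf8_profile(sample)
    print(f"{sample}: {chars} chars, {nbytes} bytes")
# tokenizer: 9 chars, 9 bytes
# 人工智能: 4 chars, 12 bytes
# 摆烂: 2 chars, 6 bytes
```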

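Chapter Summary

BPE is the mainstream choice: the GPT series uses BPE, which builds subwords automatically by merging frequent pairs.
Token ≠ character or word: a single Chinese character may take 1-3 tokens.
Cost follows directly: the number of tokens determines what an API call costs.
Chinese is less efficient: for the same content, Chinese typically consumes about 1.5-2x as many tokens as English, depending on the tokenizer.
Choose the model with this in mind: DeepSeek's tokenizer encodes Chinese more efficiently than the GPT series.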

Next chapter: Prompt Engineering Techniques