
Transformer Architecture Basics

Understanding the Transformer is key to mastering LLMs: it underpins virtually every modern LLM.

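Introduction to the Transformer

The Transformer is a neural network architecture introduced by Google researchers in 2017 (in the paper "Attention Is All You Need") that reshaped the field of NLP.

Why does it matter?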

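Traditional RNN	Transformer
Sequential processing	Parallel computation
Struggles to capture long-range dependencies	Self-attention links any two positions directly
Slow to train	Trains much faster on parallel hardware (often cited as 10-100x, depending on the setup)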

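Core innovations

Self-Attention - captures long-range dependencies
Multi-Head Attention - looks at the sequence from several perspectives at once
Positional Encoding - preserves the order of the sequence
Residual connections - keep deep networks stable during training
Layer Normalization (LayerNorm) - speeds up convergence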
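
To make the first item concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The weight matrices `w_q`, `w_k`, `w_v` and all sizes are hypothetical random values chosen for illustration; a trained model would learn them, and real implementations add multiple heads, masking, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: [seq_len, d_model]; w_q, w_k, w_v: [d_model, d_k] (hypothetical weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # [seq_len, seq_len]
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Every row of `weights` spans all positions in the sequence, which is why a token can draw on a distant token in a single step rather than through many recurrent updates.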

Transformer architecture

graph TB
    A[Input text] --> B[Token embedding]
    B --> C[Positional encoding]
    C --> D[Encoder layer x N]
    D --> E[Encoder output]
    D --> D1[Multi-head attention]
    D1 --> D2[Add & Norm]
    D2 --> D3[Feed-forward network]
    D3 --> D4[Add & Norm]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style E fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style D fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style D1 fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style D2 fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style D3 fill:#b2dfdb,stroke:#00897b,stroke-width:2px
    style D4 fill:#f8bbd0,stroke:#c2185b,stroke-width:2px

Encoder vs. decoder

The Transformer has two main parts:

Encoder

graph TB
    A[Input sequence] --> B[Self-attention]
    B --> C[Residual connection]
    C --> D[Layer normalization]
    D --> E[Feed-forward network]
    E --> F[Residual connection]
    F --> G[Layer normalization]
    G --> H[Contextual features]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style H fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style B fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style E fill:#ffccbc,stroke:#d84315,stroke-width:2px

Processes the input sequence
Extracts contextual features
Suited to understanding tasks

Used in: BERT, RoBERTa
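
The "residual connection, then layer normalization" step that appears twice in the encoder can be sketched in a few lines of NumPy. This is a simplified illustration: the learnable scale and shift parameters of a real LayerNorm are left out, and `sublayer_out` is a random stand-in for the attention or feed-forward output.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learnable scale and shift of a real LayerNorm are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization (post-norm order).
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))             # [seq_len, d_model]
sublayer_out = rng.normal(size=(10, 16))  # stand-in for attention or FFN output
y = add_and_norm(x, sublayer_out)
print(y.shape)
```

Because the sub-layer output is added to its own input, the shapes must match, which is why every sub-layer in a Transformer block maps `d_model` back to `d_model`.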

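Decoder

graph TB
    A[Input sequence] --> B[Masked self-attention]
    B --> C[Residual + Norm]
    C --> D[Cross-attention]
    D --> E[Residual + Norm]
    E --> F[Feed-forward network]
    F --> G[Residual + Norm]
    G --> H[Output sequence]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style H fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style B fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    style D fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style F fill:#ffccbc,stroke:#d84315,stroke-width:2px

Generates the output sequence
Uses masked self-attention, so each position attends only to itself and earlier positions
Suited to generation tasks

Used in: GPT series, Llama, Mistral (these are decoder-only models, so they omit the cross-attention sub-layer shown in the diagram)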
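
The masking in the decoder's self-attention can be illustrated with a small NumPy sketch. The raw `scores` matrix here is random and purely hypothetical; the point is only how a lower-triangular mask removes attention to future positions.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may see positions 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions become -inf, so they get zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # hypothetical raw attention scores
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
```

The printed matrix is lower-triangular: the first token attends only to itself, and no token places any weight on a later one, which is what lets the model be trained to predict the next token without seeing it.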

Architecture comparison

graph LR
    subgraph "Encoder-decoder architecture"
        A[Input] --> B[Encoder]
        B --> C[Decoder]
        C --> D[Output]
    end
    subgraph "Decoder-only architecture"
        E[Input] --> F[Decoder]
        F --> G[Output]
    end
    A -.->|T5, BART| B
    E -.->|GPT, Llama| F
    style B fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style C fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style F fill:#b3e5fc,stroke:#0277bd,stroke-width:2px

Vocabulary

Tokenization

LLMs do not process raw text directly. They first convert the text into a sequence of integer IDs.

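# Tokenization example
text = "Hello, world!"
# Split the text into tokens
tokens = ["Hello", ",", "world", "!"]
# Map each token to an ID (illustrative values; real IDs depend on the tokenizer)
token_ids = [15496, 11, 1917, 0]
# Vocabulary size
vocab_size = 32000  # typically 32K-128K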

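Common tokenizers

Tokenizer	Characteristics	Used by
BPE	Byte-Pair Encoding: repeatedly merges the most frequent symbol pair	GPT-2, Llama
WordPiece	Subword-level	BERT
Unigram	Selects subwords with a unigram language model	T5
SentencePiece	Language-agnostic toolkit that works on raw text (implements BPE and Unigram)	Multilingual models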
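
As a rough illustration of the BPE row, the sketch below learns three merges from a tiny made-up corpus using only the standard library. The corpus, its frequencies, and the helper names are hypothetical; production tokenizers work on bytes, handle word boundaries, and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to its frequency in the corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency, initially split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

On this corpus the first two merges build the shared suffix "est", so "newest" and "widest" end up sharing a subword even though they are different words.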

# Using the tiktoken library (OpenAI's tokenizer)
import tiktoken
# Load the tokenizer that matches the model
enc = tiktoken.encoding_for_model("gpt-4")
# Encode text into token IDs
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")
# Decode the IDs back into text
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")

Embedding Layer

Maps each token ID to a high-dimensional vector.

import torch
import torch.nn as nn

vocab_size = 32000
embedding_dim = 768  # hidden dimension

# Embedding layer: a learnable lookup table of shape [vocab_size, embedding_dim]
embedding = nn.Embedding(vocab_size, embedding_dim)

# Input: [batch_size, seq_len]
token_ids = torch.tensor([[15496, 11, 1917, 0]])

# Output: [batch_size, seq_len, embedding_dim]
embeddings = embedding(token_ids)
print(f"Input shape: {token_ids.shape}")
print(f"Output shape: {embeddings.shape}")

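Why are embeddings needed?

graph LR
    A[Text] --> B[Token ID]
    B --> C[Embedding vector]
    C --> D[Neural network]
    B -.-> B1[15496, 11, 1917]
    C -.-> C1[768-dimensional vector]
    style A fill:#fff9c4,stroke:#f9a825,stroke-width:2px
    style B fill:#ffccbc,stroke:#d84315,stroke-width:2px
    style C fill:#c8e6c9,stroke:#43a047,stroke-width:2px
    style D fill:#b3e5fc,stroke:#0277bd,stroke-width:2px

Advantages:
- Semantically similar words end up close together in the vector space
- Similarity between words can be measured directly
- The representation is learned during training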
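
A common way to measure "closeness" in the vector space is cosine similarity. The sketch below uses three hand-picked 4-dimensional vectors that are purely hypothetical; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 means same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings, hand-picked for illustration.
vectors = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_car = cosine_similarity(vectors["cat"], vectors["car"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~car: {sim_cat_car:.3f}")
```

With these toy values, "cat" and "dog" score close to 1 while "cat" and "car" score close to 0, which is the pattern a trained embedding table is expected to show for related and unrelated words.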

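Positional Encoding

The Transformer has no recurrence, and self-attention by itself ignores token order, so position information must be encoded explicitly.

How it works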
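
For reference, the sinusoidal scheme from the original Transformer paper can be written as follows, where pos is the token position, i indexes a sine/cosine pair of dimensions, and d_model is the embedding dimension:

```latex
\[
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
PE_{(pos,\,2i+1)} &= \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\end{aligned}
\]
```

Even dimensions use the sine and odd dimensions the cosine of the same angle, with wavelengths that grow geometrically across the dimensions.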

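import numpy as np
import torch

def positional_encoding(max_len, d_model):
    """
    Create sinusoidal positional encodings of shape [max_len, d_model].
    Assumes d_model is even.
    """
    # Column vector of positions: [max_len, 1]
    position = np.arange(max_len)[:, np.newaxis]
    # Frequencies 1 / 10000^(2i / d_model), one per sin/cos pair
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions
    return torch.FloatTensor(pe)

# Example
max_len = 100
d_model = 512
pe = positional_encoding(max_len, d_model)
print(f"Positional encoding shape: {pe.shape}")
# torch.Size([100, 512]): one 512-dimensional vector per position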
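
In the original Transformer design the positional encoding is simply added to the token embeddings before the first layer. The NumPy sketch below shows that step under stated assumptions: `sinusoidal_pe` is a NumPy-only restatement of the function above, and `token_embeddings` is random stand-in data rather than the output of a trained embedding layer.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Same sinusoidal scheme as above, returned as a NumPy array.
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

rng = np.random.default_rng(0)
batch_size, seq_len, d_model = 2, 5, 16
token_embeddings = rng.normal(size=(batch_size, seq_len, d_model))  # stand-in values

pe = sinusoidal_pe(max_len=100, d_model=d_model)
# Slice to the actual sequence length; broadcasting adds it to every batch item.
x = token_embeddings + pe[:seq_len]
print(x.shape)
```

The table is computed once for the maximum length and sliced per batch, so sequences shorter than `max_len` need no recomputation.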

Visualization

import matplotlib.pyplot as plt

# Reuses `pe` from the previous example
plt.figure(figsize=(10, 6))
plt.imshow(pe.numpy(), aspect='auto', cmap='coolwarm')
plt.colorbar()
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('Positional encoding visualization')
plt.show()

Feed-Forward Network

Every Transformer layer contains a two-layer fully connected network that is applied to each position independently.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # expand
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)  # project back

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        x = self.linear1(x)
        x = torch.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x

# Usage example
d_model = 768
d_ff = 3072  # usually 4x d_model
ffn = FeedForward(d_model, d_ff)
# Reuses `embeddings` ([1, 4, 768]) from the embedding-layer example
output = ffn(embeddings)
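
To get a feel for how much of a layer's capacity sits in the feed-forward network, the following sketch counts its parameters by hand. It assumes two linear layers with biases, as in the `FeedForward` class above; the helper name `ffn_param_count` is made up for this example.

```python
def ffn_param_count(d_model, d_ff):
    # Each linear layer has a weight matrix plus a bias vector.
    expand = d_model * d_ff + d_ff        # linear1: d_model -> d_ff
    project = d_ff * d_model + d_model    # linear2: d_ff -> d_model
    return expand + project

params = ffn_param_count(d_model=768, d_ff=3072)
print(f"{params:,}")  # 4,722,432
```

At d_model = 768 that is roughly 4.7 million parameters per layer for the feed-forward network alone, and the count grows quadratically with d_model when d_ff stays at 4x.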

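Real-world model configurations

GPT-4 configuration (unofficial: OpenAI has not published GPT-4's architecture, so treat these values as unconfirmed estimates)

Parameter	Value
Parameter count	~1.76 trillion (rumored)
Context length	128K tokens (GPT-4 Turbo)
Layers	~96 (unconfirmed)
Attention heads	96-128 (unconfirmed)
Embedding dimension	12288 (unconfirmed)

Llama 3 70B configuration

Parameter	Value
Parameter count	70B
Context length	8K tokens (Llama 3.1 extends this to 128K)
Layers	80
Attention heads	64
Embedding dimension	8192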
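
The attention-head count and embedding dimension in these tables are linked: in standard multi-head attention the embedding is split evenly across heads. A small sketch of that arithmetic, using a hypothetical helper named `head_dim`:

```python
def head_dim(d_model, num_heads):
    # Multi-head attention splits d_model evenly across the heads.
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads

print(head_dim(8192, 64))  # 128, using the Llama 3 70B values above
print(head_dim(512, 8))    # 64, the base setting in the original Transformer paper
```

This is why configurations are chosen so that the embedding dimension is a multiple of the head count.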

Practice: a simplified Transformer

import torch
import torch.nn as nn

class SimpleTransformerBlock(nn.Module):
    """Simplified Transformer block (post-norm). Uses the FeedForward class defined earlier."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        # Multi-head attention; batch_first=True so inputs are [batch, seq_len, d_model]
        self.attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        # Feed-forward network
        self.ffn = FeedForward(d_model, d_model * 4, dropout)
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention + residual connection
        attn_output, _ = self.attention(x, x, x)
        x = x + self.dropout(attn_output)
        x = self.norm1(x)
        # Feed-forward network + residual connection
        ffn_output = self.ffn(x)
        x = x + self.dropout(ffn_output)
        x = self.norm2(x)
        return x

# Test
block = SimpleTransformerBlock(d_model=512, num_heads=8)
x = torch.randn(32, 10, 512)  # [batch, seq_len, d_model]
output = block(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

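Key takeaways

✅ Self-attention is the core of the Transformer
✅ Encoder-decoder vs. decoder-only architectures
✅ Tokenization is the first step in how an LLM reads text
✅ The embedding layer turns tokens into vector representations
✅ Positional encoding preserves sequence order
✅ The feed-forward network adds non-linear expressive power

Next step: a deeper look at the mathematics behind the attention mechanism 🔬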