Building Large Models with PyTorch¶
Overview¶
- The basic ingredients needed to train a model
- A bottom-up discussion of Tensors, building models, optimizers, and training loops
- A focus on training efficiency: how to use resources, both memory and compute
Motivating Questions¶
- QS1: How long would it take to train a 70 Billion parameter model on 15 Trillion tokens on 1024 H100s?
- Formulas:
- Estimating the FLOPs needed to train a Transformer
- \text{Total_FLOPS} = 6 \times P \times T
- P: number of parameters
- T: number of tokens
- Estimating the compute of the GPU cluster
- \text{Cluster_Performance} = N \times Pk \times MFU
- N: number of GPUs
- Pk: Peak performance
- MFU: Model FLOPs Utilization
- Final time estimate
- \text{Time} = \frac{Total \_ FLOPS}{Cluster \_ Performance}
- Assume MFU = 50% and peak performance = 989 TFLOP/s
- Answer:
- \text{Time} = \frac{6 \times 7 \times 10^{10} \times 15 \times 10^{12}}{1024 \times 0.5 \times 989 \times 10^{12}} \approx 1.24 \times 10^{7} \text{ s} \approx 144 \text{ days}
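A quick back-of-the-envelope check of the numbers above (a sketch; 989 TFLOP/s is the assumed per-GPU peak from the formula):

```python
params = 70e9          # 70B parameters
tokens = 15e12         # 15T tokens
total_flops = 6 * params * tokens   # Total_FLOPs = 6 * P * T

n_gpus = 1024
peak_flops = 989e12    # assumed peak performance per GPU, FLOP/s
mfu = 0.5
cluster_flops = n_gpus * peak_flops * mfu  # Cluster_Performance

seconds = total_flops / cluster_flops
print(f"{seconds:.3e} s = {seconds / 86400:.0f} days")
```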
- QS2: What's the largest model you can train on 8 H100s using AdamW (naively)?
- Each H100 has 80 GB of memory
- Assume values are stored in FP32, i.e. 4 bytes each
- Each parameter needs the parameter itself, one gradient, and two moments (AdamW's first and second moments)
- Bytes_per_parameter = 4 \times 4 = 16 bytes
- Answer:
- \text{Num_Parameters} = \frac{8 \times 80 \times 10^{9}}{16} = 4 \times 10^{10}
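The same accounting as code (a sketch):

```python
gpus = 8
mem_per_gpu = 80e9       # H100: 80 GB per card
bytes_per_param = 4 * 4  # fp32: parameter + gradient + two AdamW moments

# Largest parameter count that fits if all memory goes to these four states
max_params = gpus * mem_per_gpu / bytes_per_param
print(f"{max_params:.0e}")  # 4e+10
```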
Memory Accounting¶
Tensor Basics¶
- Tensors are the basic building block for storing everything: parameters, gradients, optimizer state, data, activations.
- How to use Tensors? Check the API
Tensors in Memory¶
- Stored as floating-point numbers!
- Check the different float types:
- float32 (4 bytes)
  - sign: 1, exponent: 8, fraction: 23
  - This is about the highest precision used in deep learning
  - Drawback: too large!
- float16 (2 bytes)
  - sign: 1, exponent: 5, fraction: 10
  - Drawback: the dynamic range is small
- bfloat16 (2 bytes)
  - sign: 1, exponent: 8, fraction: 7
  - Difference: range of float32, size of float16, and a little loss in precision
- fp8 (1 byte, CRAZY)
  - developed by NVIDIA with 2 variants
  - a small-range version (E4M3, [-448, 448]) and a bigger one (E5M2, [-57344, 57344])
- Training implications
  - fp32 requires a large amount of memory
  - fp8, fp16, and even bf16 may cause instability
  - SOLUTION: mixed-precision training
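`torch.finfo` makes these trade-offs concrete (a small sketch):

```python
import torch

# Compare the range/precision trade-off of the common training dtypes.
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")

# bfloat16 keeps float32's 8 exponent bits, so its range dwarfs float16's,
# while its 7 fraction bits make it much less precise than float32.
assert torch.finfo(torch.bfloat16).max > torch.finfo(torch.float16).max
```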
Computation Accounting¶
Tensor in GPU¶
- When a tensor is created, it lives in the CPU's RAM by default. You need to move it to the GPU's DRAM to exploit the GPU's parallelism.
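A minimal sketch of moving a tensor to the GPU (falls back to CPU when no CUDA device is visible):

```python
import torch

x = torch.zeros(32, 32)
print(x.device)  # tensors are created in CPU RAM by default -> "cpu"

# Move to GPU DRAM only if a CUDA device is actually available.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
x = x.to(device)  # .to() copies the storage onto the target device
```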
Tensor Storage¶
| 0 | 1 | 2 | 3 |
|---|---|---|---|
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 |
- a tensor in memory is one long, flat array
- a tensor in this grid has 2 dimensions, leading to 2 strides
- stride[0] skips 4 elements, stride[1] skips 1 element; to find (r, c) = (1, 2): address = 1 \times 4 + 2 \times 1 = 6
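The stride arithmetic for the 4×4 example above can be checked directly (a sketch):

```python
import torch

x = torch.arange(16).view(4, 4)  # the 4x4 layout shown above
print(x.stride())  # (4, 1): one row skips 4 elements, one column skips 1

# Flat-storage address of element (r, c): r * stride[0] + c * stride[1]
r, c = 1, 2
flat_index = r * x.stride(0) + c * x.stride(1)  # 1*4 + 2*1 = 6
assert x.flatten()[flat_index].item() == x[r, c].item() == 6
```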
Tensor Slicing¶
```python
import torch

# create a tensor named x
x = torch.tensor([
    [1., 2, 3],
    [4., 5, 6],
])

y = x[0]
# y is a tensor: the 1st row of x
# y is a view of x, and the storage never changes

z = x[:, 1]
# z is also a tensor: the 2nd column of x
# z still shares the same storage with x

y = x.view(3, 2)
# y is [[1., 2], [3, 4], [5, 6]]
# view only changes the shape and strides; the storage is untouched
# the condition for view is that x must be contiguous

y = x.transpose(1, 0)
# as the name suggests: swaps the two dimensions without copying

# y = x.transpose(1, 0).view(2, 3)  # ERROR: the transposed tensor is not contiguous
y = x.transpose(1, 0).contiguous().view(2, 3)  # correct
# however, contiguous() makes a copy of its object
```
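To convince yourself that slicing produces views rather than copies, compare storage pointers (a sketch):

```python
import torch

x = torch.tensor([[1., 2, 3], [4., 5, 6]])
y = x[0]
# A view shares the same underlying storage as its base tensor.
assert y.data_ptr() == x.data_ptr()

# Writing through the view mutates the base tensor too.
y[0] = 100.
assert x[0, 0].item() == 100.

# contiguous() on a non-contiguous tensor allocates fresh storage.
t = x.transpose(1, 0)
assert not t.is_contiguous()
c = t.contiguous()
assert c.data_ptr() != x.data_ptr()
```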
Tensor Elementwise¶
- nothing special
- `rsqrt()`: for every x_i returns x_{i\_new} = \frac{1}{\sqrt{x_{i\_old}}}
- `triu()`: takes the upper triangular part of a matrix
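Both operations in one small sketch:

```python
import torch

x = torch.tensor([1., 4., 16.])
print(x.rsqrt())  # element-wise 1/sqrt(x_i) -> tensor([1.0000, 0.5000, 0.2500])

m = torch.ones(3, 3)
print(torch.triu(m))  # keeps the upper triangle, zeroes everything below the diagonal
```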
Tensor Matrix Multiplication¶
- In general, we perform operations for every example in a batch and every token in a sequence.
- So we work with tensors of shape (Batch, Channel, Height, Width).
Tensor Einops (generated by Google Gemini)¶
- rearrange
  - The most commonly used function; it changes the dimension structure of a tensor (a combination of view, reshape, transpose, and permute).
  - Basic syntax: `rearrange(tensor, 'input_pattern -> output_pattern', **axes_lengths)`
  - Common usages:
    - Transpose: `rearrange(x, 'b c h w -> b c w h')`
    - Flatten: `rearrange(x, 'b c h w -> b (c h w)')`
    - Space-to-depth: `rearrange(x, 'b c (h p1) (w p2) -> b (c p1 p2) h w', p1=2, p2=2)`
    - Patch embedding (prelude): `rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)`
- reduce
  - reduce changes dimensions while applying an aggregation over the specified axes (mean, max, etc.), replacing tensor.mean(), tensor.max(), and handling the dimension reduction automatically.
  - Basic syntax: `reduce(tensor, 'input_pattern -> output_pattern', reduction)`
  - Common usages:
    - Global average pooling: `reduce(x, 'b c h w -> b c', reduction='mean')`
    - Max pooling: `reduce(x, 'b c (h 2) (w 2) -> b c h w', reduction='max')`
    - Color to grayscale (mean over channels): `reduce(x, 'b c h w -> b 1 h w', reduction='mean')`
- repeat
  - Copies a tensor along certain axes, replacing tensor.repeat() or np.tile() with clearer semantics.
  - Basic syntax: `repeat(tensor, 'input_pattern -> output_pattern', **axes_lengths)`
  - Common usages:
    - Add a new dimension and copy: `repeat(x, 'h w -> c h w', c=3)`
    - Upsampling (nearest neighbour): `repeat(x, 'b c h w -> b c (h 2) (w 2)')`
Computation Cost of Tensor Operation¶
- Linear Model
- N points, D dimensions, K output dimensions
- so, FLOPs = 2 \times N \times D \times K
- an elementwise operation on an m \times n matrix requires O(mn) FLOPs
- Interpretation:
  - N: number of data points
  - (D K): number of parameters
  - FLOPs for the forward pass = 2 (# tokens) (# parameters)
- It turns out this generalizes to Transformers (to a first-order approximation).
- MFU: Model FLOPs Utilization, MFU = \frac{actual \_ flop/s}{promised \_ flop/s}
- usually, MFU \ge 0.5 is quite good
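A rough sketch of FLOPs counting and MFU measurement for one matmul (the peak value here is a made-up placeholder, not any real chip's number):

```python
import time
import torch

B, D, K = 1024, 1024, 1024
x = torch.randn(B, D)
w = torch.randn(D, K)

flops = 2 * B * D * K  # one multiply + one add per (b, d, k) triple

start = time.perf_counter()
y = x @ w
elapsed = time.perf_counter() - start

achieved = flops / elapsed  # FLOP/s actually delivered
peak = 1e12                 # hypothetical peak; substitute your hardware's spec
print(f"MFU = {achieved / peak:.2%}")
```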
Gradients¶
- Intro
  - We have constructed tensors and passed them through operations (forward); now we need to compute the gradients (backward).
- A simple linear model
  - x --w--> pred_y
  - loss function: loss = 0.5 (x \times w - 5)^2
  - Forward pass: compute the loss
  - Backward pass: compute the gradient
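The forward/backward passes for this toy loss, written with PyTorch autograd (a sketch; the numbers are arbitrary):

```python
import torch

x = torch.tensor(2.0)                      # a fixed input
w = torch.tensor(3.0, requires_grad=True)  # the parameter

loss = 0.5 * (x * w - 5) ** 2  # forward pass: compute the loss
loss.backward()                # backward pass: compute d(loss)/dw

# Analytically, d(loss)/dw = (x*w - 5) * x = (6 - 5) * 2 = 2
print(w.grad)  # tensor(2.)
```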
- A 2-layer linear model
  - x --w1--> h1 --w2--> h2 --> loss
  - x : B \times D
  - w_1 : D \times D
  - w_2 : D \times K
  - Functions: $$h_1 = x w_1, \quad h_2 = h_1 w_2, \quad loss = \mathbb{E}[{h_2}^2]$$
  - FLOPs in one forward pass = 2 \cdot (BD^2 + BDK)
  - One backward pass must compute:
    - h1.grad, h2.grad
    - w1.grad, w2.grad
    - WHY? Check the chain rule
    - Answer = 2 \cdot (BD^2 + 2 \cdot BDK) (x.grad is skipped, since x is input data)
- Generally, each parameter costs 2 FLOPs per token in the forward pass and 4 in the backward pass, adding up to 6 FLOPs per parameter per token (hence \text{Total_FLOPs} = 6 \times P \times T).
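The 2-layer accounting above, written out (a sketch with arbitrary sizes):

```python
B, D, K = 64, 128, 32  # arbitrary batch and layer sizes

# Forward: h1 = x @ w1 costs 2*B*D*D; h2 = h1 @ w2 costs 2*B*D*K
forward_flops = 2 * (B * D * D + B * D * K)

# Backward: w2.grad and h1.grad each cost 2*B*D*K; w1.grad costs 2*B*D*D.
# x.grad is skipped since x is input data, which is why backward here is
# slightly less than 2x forward; for interior layers of a deep net the
# input gradient is needed too, giving the usual 2:1 backward/forward ratio.
backward_flops = 2 * (B * D * D + 2 * B * D * K)

print(forward_flops, backward_flops)
```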
Model¶
Parameter Initialization¶
- Suppose we are dealing with a linear model
- If we write `w = nn.Parameter(torch.randn(input_dim, output_dim))`, the outputs may get very large, leading to unstable training
- So, we use `w = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))`
- In this way, the output stays at unit scale (its variance no longer grows with input_dim)
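A quick check of why the 1/sqrt(input_dim) scaling matters (a sketch; the sizes are arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, output_dim, batch = 4096, 128, 256
x = torch.randn(batch, input_dim)

w_naive = nn.Parameter(torch.randn(input_dim, output_dim))
w_scaled = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))

naive_std = (x @ w_naive).std().item()    # grows like sqrt(input_dim), here ~64
scaled_std = (x @ w_scaled).std().item()  # stays around 1
print(naive_std, scaled_std)
```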