Building Large Models with PyTorch¶
Overview¶
- The basic ingredients needed to train a model
- A bottom-up discussion of Tensors, building models, optimizers, and training loops
- A focus on training efficiency: how to use resources, both memory and compute
Motivating Questions¶
- QS1: How long would it take to train a 70 Billion parameter model on 15 Trillion tokens on 1024 H100s?
- Formulas:
- Estimating the FLOPs needed to train a Transformer
- \text{Total_FLOPS} = 6 \times P \times T
- P: number of parameters
- T: number of tokens
- Estimating the compute of the GPU cluster
- \text{Cluster_Performance} = N \times Pk \times MFU
- N: number of GPUs
- Pk: Peak performance
- MFU: Model FLOPs Utilization
- Final time estimate
- \text{Time} = \frac{Total \_ FLOPS}{Cluster \_ Performance}
- Assume MFU = 50% and peak performance = 989 TFLOP/s
- Answer:
- \text{Time} = \frac{6 \times 7 \times 10^{10} \times 15 \times 10^{12}}{1024 \times 0.5 \times 989 \times 10^{12}} \approx 1.24 \times 10^{7} \text{ s} \approx 144 \text{ days}
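A quick back-of-the-envelope check of the numbers above (a sketch; 989 TFLOP/s is the assumed per-GPU peak from the formula):

```python
params = 70e9          # 70B parameters
tokens = 15e12         # 15T tokens
total_flops = 6 * params * tokens   # Total_FLOPs = 6 * P * T

n_gpus = 1024
peak_flops = 989e12    # assumed peak performance per GPU, FLOP/s
mfu = 0.5
cluster_flops = n_gpus * peak_flops * mfu  # Cluster_Performance

seconds = total_flops / cluster_flops
print(f"{seconds:.3e} s = {seconds / 86400:.0f} days")
```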
- QS2: What's the largest model you can train on 8 H100s using AdamW (naively)?
- Each H100 has 80 GB of memory
- Assume values are stored in FP32, i.e. 4 bytes each
- Each parameter needs the parameter itself, one gradient, and two moments (AdamW's first and second moments)
- Bytes_per_parameter = 4 \times 4 = 16 bytes
- Answer:
- \text{Num_Parameters} = \frac{8 \times 80 \times 10^{9}}{16} = 4 \times 10^{10}
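The same accounting as code (a sketch):

```python
gpus = 8
mem_per_gpu = 80e9       # H100: 80 GB per card
bytes_per_param = 4 * 4  # fp32: parameter + gradient + two AdamW moments

# Largest parameter count that fits if all memory goes to these four states
max_params = gpus * mem_per_gpu / bytes_per_param
print(f"{max_params:.0e}")  # 4e+10
```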
Memory Accounting¶
Tensor Basics¶
- Tensors are the basic building block for storing everything: parameters, gradients, optimizer state, data, activations.
- How to use Tensors? Check the API
Tensors in Memory¶
- Stored as floating-point numbers!
- Check the different float types:
- float32 (4 bytes)
  - sign: 1, exponent: 8, fraction: 23
  - This is about the highest precision used in deep learning
  - Drawback: too large!
- float16 (2 bytes)
  - sign: 1, exponent: 5, fraction: 10
  - Drawback: the dynamic range is small
- bfloat16 (2 bytes)
  - sign: 1, exponent: 8, fraction: 7
  - Difference: range of float32, size of float16, and a little loss in precision
- fp8 (1 byte, CRAZY)
  - developed by NVIDIA with 2 variants
  - a small-range version (E4M3, [-448, 448]) and a bigger one (E5M2, [-57344, 57344])
- Training implications
  - fp32 requires a large amount of memory
  - fp8, fp16, and even bf16 may cause instability
  - SOLUTION: mixed-precision training
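`torch.finfo` makes these trade-offs concrete (a small sketch):

```python
import torch

# Compare the range/precision trade-off of the common training dtypes.
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}, smallest normal={info.tiny:.3e}, eps={info.eps:.3e}")

# bfloat16 keeps float32's 8 exponent bits, so its range dwarfs float16's,
# while its 7 fraction bits make it much less precise than float32.
assert torch.finfo(torch.bfloat16).max > torch.finfo(torch.float16).max
```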
Computation Accounting¶
Tensor in GPU¶
- When a tensor is created, it lives in the CPU's RAM by default. You need to move it to the GPU's DRAM to exploit the GPU's parallelism.
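A minimal sketch of moving a tensor to the GPU (falls back to CPU when no CUDA device is visible):

```python
import torch

x = torch.zeros(32, 32)
print(x.device)  # tensors are created in CPU RAM by default -> "cpu"

# Move to GPU DRAM only if a CUDA device is actually available.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
x = x.to(device)  # .to() copies the storage onto the target device
```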
Tensor Storage¶
| 0 | 1 | 2 | 3 |
|---|---|---|---|
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 |
- a tensor in memory is one long, flat array
- a tensor in this grid has 2 dimensions, leading to 2 strides
- stride[0] skips 4 elements, stride[1] skips 1 element; to find (r, c) = (1, 2): address = 1 \times 4 + 2 \times 1 = 6
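The stride arithmetic for the 4×4 example above can be checked directly (a sketch):

```python
import torch

x = torch.arange(16).view(4, 4)  # the 4x4 layout shown above
print(x.stride())  # (4, 1): one row skips 4 elements, one column skips 1

# Flat-storage address of element (r, c): r * stride[0] + c * stride[1]
r, c = 1, 2
flat_index = r * x.stride(0) + c * x.stride(1)  # 1*4 + 2*1 = 6
assert x.flatten()[flat_index].item() == x[r, c].item() == 6
```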
Tensor Slicing¶
```python
import torch

# create a tensor named x
x = torch.tensor([
    [1., 2, 3],
    [4., 5, 6],
])

y = x[0]
# y is a tensor: the 1st row of x
# y is a view of x, and the storage never changes

z = x[:, 1]
# z is also a tensor: the 2nd column of x
# z still shares the same storage with x

y = x.view(3, 2)
# y is [[1., 2], [3, 4], [5, 6]]
# view only changes the shape and strides; the storage is untouched
# the condition for view is that x must be contiguous

y = x.transpose(1, 0)
# as the name suggests: swaps the two dimensions without copying

# y = x.transpose(1, 0).view(2, 3)  # ERROR: the transposed tensor is not contiguous
y = x.transpose(1, 0).contiguous().view(2, 3)  # correct
# however, contiguous() makes a copy of its object
```
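To convince yourself that slicing produces views rather than copies, compare storage pointers (a sketch):

```python
import torch

x = torch.tensor([[1., 2, 3], [4., 5, 6]])
y = x[0]
# A view shares the same underlying storage as its base tensor.
assert y.data_ptr() == x.data_ptr()

# Writing through the view mutates the base tensor too.
y[0] = 100.
assert x[0, 0].item() == 100.

# contiguous() on a non-contiguous tensor allocates fresh storage.
t = x.transpose(1, 0)
assert not t.is_contiguous()
c = t.contiguous()
assert c.data_ptr() != x.data_ptr()
```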
Tensor Elementwise¶
- nothing special
- `rsqrt()`: for every x_i returns x_{i\_new} = \frac{1}{\sqrt{x_{i\_old}}}
- `triu()`: takes the upper triangular part of a matrix
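Both operations in one small sketch:

```python
import torch

x = torch.tensor([1., 4., 16.])
print(x.rsqrt())  # element-wise 1/sqrt(x_i) -> tensor([1.0000, 0.5000, 0.2500])

m = torch.ones(3, 3)
print(torch.triu(m))  # keeps the upper triangle, zeroes everything below the diagonal
```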
Tensor Matrix Multiplication¶
- In general, we perform operations for every example in a batch and every token in a sequence.
- So we work with tensors of shape (Batch, Channel, Height, Width).
Tensor Einops (generated by Google Gemini)¶
- rearrange
  - The most commonly used function; it changes the dimension structure of a tensor (a combination of view, reshape, transpose, and permute).
  - Basic syntax: `rearrange(tensor, 'input_pattern -> output_pattern', **axes_lengths)`
  - Common usages:
    - Transpose: `rearrange(x, 'b c h w -> b c w h')`
    - Flatten: `rearrange(x, 'b c h w -> b (c h w)')`
    - Space-to-depth: `rearrange(x, 'b c (h p1) (w p2) -> b (c p1 p2) h w', p1=2, p2=2)`
    - Patch embedding (prelude): `rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)`
- reduce
  - reduce changes dimensions while applying an aggregation over the specified axes (mean, max, etc.), replacing tensor.mean(), tensor.max(), and handling the dimension reduction automatically.
  - Basic syntax: `reduce(tensor, 'input_pattern -> output_pattern', reduction)`
  - Common usages:
    - Global average pooling: `reduce(x, 'b c h w -> b c', reduction='mean')`
    - Max pooling: `reduce(x, 'b c (h 2) (w 2) -> b c h w', reduction='max')`
    - Color to grayscale (mean over channels): `reduce(x, 'b c h w -> b 1 h w', reduction='mean')`
- repeat
  - Copies a tensor along certain axes, replacing tensor.repeat() or np.tile() with clearer semantics.
  - Basic syntax: `repeat(tensor, 'input_pattern -> output_pattern', **axes_lengths)`
  - Common usages:
    - Add a new dimension and copy: `repeat(x, 'h w -> c h w', c=3)`
    - Upsampling (nearest neighbour): `repeat(x, 'b c h w -> b c (h 2) (w 2)')`
Computation Cost of Tensor Operation¶
- Linear Model
- N points, D dimensions, K output dimensions
- so, FLOPs = 2 \times N \times D \times K
- an elementwise operation on an m \times n matrix requires O(mn) FLOPs
- Interpretation:
  - N: number of data points
  - (D K): number of parameters
  - FLOPs for the forward pass = 2 (# tokens) (# parameters)
- It turns out this generalizes to Transformers (to a first-order approximation).
- MFU: Model FLOPs Utilization, MFU = \frac{actual \_ flop/s}{promised \_ flop/s}
- usually, MFU \ge 0.5 is quite good
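A rough sketch of FLOPs counting and MFU measurement for one matmul (the peak value here is a made-up placeholder, not any real chip's number):

```python
import time
import torch

B, D, K = 1024, 1024, 1024
x = torch.randn(B, D)
w = torch.randn(D, K)

flops = 2 * B * D * K  # one multiply + one add per (b, d, k) triple

start = time.perf_counter()
y = x @ w
elapsed = time.perf_counter() - start

achieved = flops / elapsed  # FLOP/s actually delivered
peak = 1e12                 # hypothetical peak; substitute your hardware's spec
print(f"MFU = {achieved / peak:.2%}")
```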
Gradients¶
- Intro
  - We have constructed tensors and passed them through operations (forward); now we need to compute the gradients (backward).
- A simple linear model
  - x --w--> pred_y
  - loss function: loss = 0.5 (x \times w - 5)^2
  - Forward pass: compute the loss
  - Backward pass: compute the gradient
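The forward/backward passes for this toy loss, written with PyTorch autograd (a sketch; the numbers are arbitrary):

```python
import torch

x = torch.tensor(2.0)                      # a fixed input
w = torch.tensor(3.0, requires_grad=True)  # the parameter

loss = 0.5 * (x * w - 5) ** 2  # forward pass: compute the loss
loss.backward()                # backward pass: compute d(loss)/dw

# Analytically, d(loss)/dw = (x*w - 5) * x = (6 - 5) * 2 = 2
print(w.grad)  # tensor(2.)
```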
- A 2-layer linear model
  - x --w1--> h1 --w2--> h2 --> loss
  - x : B \times D
  - w_1 : D \times D
  - w_2 : D \times K
  - Functions: $$h_1 = x w_1, \quad h_2 = h_1 w_2, \quad loss = \mathbb{E}[{h_2}^2]$$
  - FLOPs in one forward pass = 2 \cdot (BD^2 + BDK)
  - One backward pass must compute:
    - h1.grad, h2.grad
    - w1.grad, w2.grad
    - WHY? Check the chain rule
    - Answer = 2 \cdot (BD^2 + 2 \cdot BDK) (x.grad is skipped, since x is input data)
- Generally, each parameter costs 2 FLOPs per token in the forward pass and 4 in the backward pass, adding up to 6 FLOPs per parameter per token (hence \text{Total_FLOPs} = 6 \times P \times T).
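The 2-layer accounting above, written out (a sketch with arbitrary sizes):

```python
B, D, K = 64, 128, 32  # arbitrary batch and layer sizes

# Forward: h1 = x @ w1 costs 2*B*D*D; h2 = h1 @ w2 costs 2*B*D*K
forward_flops = 2 * (B * D * D + B * D * K)

# Backward: w2.grad and h1.grad each cost 2*B*D*K; w1.grad costs 2*B*D*D.
# x.grad is skipped since x is input data, which is why backward here is
# slightly less than 2x forward; for interior layers of a deep net the
# input gradient is needed too, giving the usual 2:1 backward/forward ratio.
backward_flops = 2 * (B * D * D + 2 * B * D * K)

print(forward_flops, backward_flops)
```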
Model¶
Parameter Initialization¶
- Suppose we are dealing with a linear model
- If we write `w = nn.Parameter(torch.randn(input_dim, output_dim))`, the outputs may get very large, leading to unstable training
- So, we use `w = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))`
- In this way, the output stays at unit scale (its variance no longer grows with input_dim)
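A quick check of why the 1/sqrt(input_dim) scaling matters (a sketch; the sizes are arbitrary):

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, output_dim, batch = 4096, 128, 256
x = torch.randn(batch, input_dim)

w_naive = nn.Parameter(torch.randn(input_dim, output_dim))
w_scaled = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))

naive_std = (x @ w_naive).std().item()    # grows like sqrt(input_dim), here ~64
scaled_std = (x @ w_scaled).std().item()  # stays around 1
print(naive_std, scaled_std)
```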