
Building Large Models with PyTorch

Overview

  • The basic ingredients needed to train a model
  • A bottom-up walk through Tensors, building models, optimizers, and the training loop
  • A focus on training efficiency: how we use resources, both memory and compute

Motivating Questions

  • QS1: How long would it take to train a 70-billion-parameter model on 15 trillion tokens on 1024 H100s?
  • Formulas
  • Estimating the FLOPs needed to train a Transformer
    • \text{Total\_FLOPs} = 6 \times P \times T
    • P: number of parameters
    • T: number of tokens
  • Estimating the compute of the GPU cluster
    • \text{Cluster\_Performance} = N \times P_k \times \text{MFU}
    • N: number of GPUs
    • P_k: peak performance of one GPU (FLOP/s)
    • MFU: Model FLOPs Utilization
  • Final time estimate
    • \text{Time} = \frac{\text{Total\_FLOPs}}{\text{Cluster\_Performance}}
  • Assume MFU = 50% and peak performance = 989 TFLOP/s (H100, BF16, dense)
  • Answer:

    • \text{Time} = \frac{6 \times 7 \times 10^{10} \times 15 \times 10^{12}}{1024 \times 0.5 \times 989 \times 10^{12}} \approx 1.24 \times 10^{7}\ \text{s} \approx 144\ \text{days}
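The estimate above is easy to check in a few lines of Python; the 989 TFLOP/s BF16 peak and 50% MFU are the assumptions stated above:

```python
# Back-of-the-envelope training-time estimate for QS1.
P = 70e9                 # parameters
T = 15e12                # tokens
total_flops = 6 * P * T  # the 6 * P * T rule of thumb

n_gpus = 1024
peak_flops = 989e12      # H100 BF16 dense peak (FLOP/s)
mfu = 0.5                # assumed Model FLOPs Utilization
cluster_flops = n_gpus * peak_flops * mfu

seconds = total_flops / cluster_flops
days = seconds / 86400
print(f"{seconds:.3e} s ≈ {days:.0f} days")  # ≈ 1.244e7 s ≈ 144 days
```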
  • QS2: What's the largest model that you can train on 8 H100s using AdamW (naively)?

  • Each H100 has 80 GB of HBM
  • Assume everything is stored in FP32, i.e. 4 bytes per value
  • Each parameter needs the parameter itself, one gradient, and two AdamW moment estimates
  • Bytes_per_parameter = 4 values × 4 bytes = 16 bytes
  • Answer:
  • \text{Num\_Parameters} = \frac{8 \times 80 \times 10^9}{16} = 4 \times 10^{10}
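The same accounting in code (FP32 everywhere, which is the naive setup the question asks about):

```python
# Naive AdamW memory accounting for QS2.
n_gpus = 8
bytes_per_gpu = 80e9     # 80 GB HBM per H100
bytes_per_value = 4      # FP32
values_per_param = 4     # weight + gradient + 2 AdamW moments
bytes_per_param = bytes_per_value * values_per_param  # 16 bytes

max_params = n_gpus * bytes_per_gpu / bytes_per_param
print(f"{max_params:.0e} parameters")  # 4e+10, i.e. a 40B model
```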

Memory Accounting

Tensor Basics

Tensors in Memory

  • Stored as floating-point numbers!
  • Compare the different float types

  • float32   4 bytes

  • sign: 1   exponent: 8   fraction: 23
  • this is typically the highest precision used in deep learning
  • DRAWBACK: too large!
  • float16   2 bytes
  • sign: 1   exponent: 5   fraction: 10
  • DRAWBACK: the dynamic range is small
  • bfloat16   2 bytes
  • sign: 1   exponent: 8   fraction: 7
  • Difference: range of float32, size of float16, at the cost of a little precision
  • fp8   1 byte   (CRAZY)
  • developed by NVIDIA with 2 variants
  • E4M3 with a smaller range [-448, 448] and E5M2 with a bigger one [-57344, 57344]
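These properties can be read straight out of torch.finfo. (fp8 dtypes such as torch.float8_e4m3fn exist only in recent PyTorch builds, so only the first three are inspected here.)

```python
import torch

# Inspect size and dynamic range of each float type.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, info.bits // 8, "bytes, max =", info.max)

# bf16 keeps a float32-like exponent range in half the bytes,
# while fp16's representable range tops out at 65504.
assert torch.finfo(torch.bfloat16).max > 1e38
assert torch.finfo(torch.float16).max == 65504.0
```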

  • Training Implications

  • fp32 requires a large amount of memory
  • fp8, fp16, and even bf16 may cause training instability
  • SOLUTION: mixed-precision training

Computation Accounting

Tensors on the GPU

  • when a tensor is created, it lives in the CPU's RAM by default; you need to move it to the GPU's DRAM to maximize the GPU's parallelism
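A minimal sketch of the default placement and the device move (falls back to CPU when no GPU is present):

```python
import torch

# Tensors start on the CPU by default.
x = torch.ones(4, 4)
assert x.device.type == "cpu"

# Move to GPU DRAM when one is available; otherwise stay on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)
print(x.device)
```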

Tensor Storage

x = torch.tensor([
  [0., 1, 2, 3],
  [4 , 5, 6, 7],
  [8 , 9,10,11],
  [12,13,14,15]
])
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
  • a tensor in memory is one long, flat array
  • a tensor drawn as a grid has 2 dimensions, leading to 2 strides
  • stride(0) skips 4 elements, stride(1) skips 1 element
  • to find (r, c) = (1, 2):
    index = r * x.stride(0) + c * x.stride(1)
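PyTorch exposes these strides via the stride() method; a quick check of the (r, c) = (1, 2) lookup above:

```python
import torch

x = torch.arange(16, dtype=torch.float32).reshape(4, 4)
assert x.stride() == (4, 1)  # row stride 4, column stride 1

r, c = 1, 2
# Manual index into the flat underlying storage:
index = r * x.stride(0) + c * x.stride(1)
assert index == 6
assert x.flatten()[index].item() == x[r, c].item() == 6.0
```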
    

Tensor Slicing

# create a tensor named x
x = torch.tensor([
  [1., 2, 3],
  [4 , 5, 6]
])
y = x[0]
# y is a tensor: the 1st row of x
# y is a view of x, so the underlying storage never changes
z = x[:, 1]
# z is also a tensor, but the 2nd column of x
# z still shares the same storage with x
y = x.view(3, 2)
# y is [[1., 2], [3, 4], [5, 6]]
# the point is that view only changes the shape and strides
# y is just a view, so the storage is shared
# the condition for view is that 'x' must be contiguous
y = x.transpose(1, 0)
# as the name suggests: swaps the two dims (a non-contiguous view)
# x.transpose(1, 0).view(2, 3) would FAIL: not contiguous
y = x.transpose(1, 0).contiguous().view(2, 3)  # correct
# however, 'contiguous()' makes a copy of the data
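Whether two tensors share storage can be checked directly; a short sketch confirming the view-vs-copy behavior above:

```python
import torch

x = torch.tensor([[1., 2, 3], [4, 5, 6]])

y = x[0]                 # a view: same underlying storage
y[0] = 100.
assert x[0, 0].item() == 100.  # writing through the view mutates x

z = x.transpose(1, 0).contiguous()  # contiguous() makes a copy
z[0, 0] = -1.
assert x[0, 0].item() == 100.       # x is untouched by the copy

# Same storage <=> same data pointer.
assert x.data_ptr() == x.view(6).data_ptr()
assert x.data_ptr() != z.data_ptr()
```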

Tensor Elementwise

  • nothing special
  • rsqrt(): for every x_i, returns x_{i,\text{new}} = \frac{1}{\sqrt{x_{i,\text{old}}}}
  • triu() takes the upper triangular part of a matrix (useful for causal attention masks)
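A quick sanity check of both ops:

```python
import torch

# rsqrt: elementwise 1 / sqrt(x)
x = torch.tensor([1., 4., 16.])
assert torch.allclose(x.rsqrt(), torch.tensor([1., 0.5, 0.25]))

# triu: keep the upper triangle (including the diagonal), zero the rest
m = torch.ones(3, 3)
expected = torch.tensor([[1., 1., 1.],
                         [0., 1., 1.],
                         [0., 0., 1.]])
assert torch.equal(m.triu(), expected)
```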

Tensor Matrix Mutiplication

x = torch.ones(16,32)
y = torch.ones(32,2)
w = x @ y
  • In general, we perform operations for every example in a batch and every token in a sequence.
  • So tensors often carry extra leading dimensions, e.g. (Batch, Channel, Height, Width).
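In PyTorch, @ batches over leading dimensions automatically; a small sketch reusing the shapes above (the 4-D shapes are illustrative):

```python
import torch

x = torch.ones(16, 32)
y = torch.ones(32, 2)
w = x @ y
assert w.shape == (16, 2)
assert w[0, 0].item() == 32.  # each entry is a 32-term dot product

# With leading batch dimensions, @ multiplies only the last two dims:
xb = torch.ones(4, 8, 16, 32)  # e.g. (batch, heads, seq, dim)
yb = torch.ones(4, 8, 32, 2)
assert (xb @ yb).shape == (4, 8, 16, 2)
```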

Tensor Einops   (generated by Google Gemini)

  1. rearrange
  2. The most commonly used function; it restructures a tensor's dimensions (effectively a combination of view, reshape, transpose, and permute).
  3. Basic syntax:
    from einops import rearrange
    output = rearrange(tensor, 'input pattern -> output pattern', **axis_lengths)
    
  4. Common usages: transpose: rearrange(x, 'b c h w -> b c w h'); flatten: rearrange(x, 'b c h w -> b (c h w)'); space-to-depth: rearrange(x, 'b c (h p1) (w p2) -> b (c p1 p2) h w', p1=2, p2=2); patch extraction (prelude to patch embedding): rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)

  5. reduce

  6. reduce reshapes while applying an aggregation (mean, max, etc.) over the specified axes; it replaces tensor.mean(), tensor.max(), etc. and handles the dimension reduction automatically.
  7. Basic syntax:
    from einops import reduce
    output = reduce(tensor, 'input pattern -> output pattern', reduction='mean')
    
  8. Common usages:
  9. Global average pooling: reduce(x, 'b c h w -> b c', reduction='mean')
  10. Max pooling: reduce(x, 'b c (h 2) (w 2) -> b c h w', reduction='max')
  11. Color to grayscale (mean over channels): reduce(x, 'b c h w -> b 1 h w', reduction='mean')

  12. repeat

  13. Copies a tensor along chosen axes; replaces tensor.repeat() or np.tile() with clearer semantics.
  14. Basic syntax:
    from einops import repeat
    output = repeat(tensor, 'input pattern -> output pattern', **repeat_counts)
    
  15. Common usages:
  16. Add a new dimension and copy: repeat(x, 'h w -> c h w', c=3)
  17. Upsample (nearest neighbor): repeat(x, 'b c h w -> b c (h 2) (w 2)')

Computation Cost of Tensor Operation

  • Linear Model
  • N points,   D input dimensions,   K output dimensions
  • so, FLOPs = 2 \times N \times D \times K
  • an elementwise operation on an m \times n matrix requires O(mn) FLOPs

  • Interpretation:

  • B: number of data points
  • (D K): number of parameters (the weight matrix is D \times K)
  • FLOPs for the forward pass = 2 \times (\#\text{tokens}) \times (\#\text{parameters})
  • It turns out this generalizes to Transformers (to a first-order approximation).

  • MFU: Model FLOPs Utilization: \text{MFU} = \frac{\text{actual FLOP/s}}{\text{promised FLOP/s}}

  • usually, \text{MFU} \ge 0.5 is quite good
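As a worked example with hypothetical numbers (the 450 TFLOP/s "measurement" is made up for illustration; only the 989 TFLOP/s peak comes from above):

```python
# MFU: fraction of the hardware's promised FLOP/s actually achieved.
actual_flops_per_s = 450e12    # hypothetical profiled throughput
promised_flops_per_s = 989e12  # H100 BF16 dense peak

mfu = actual_flops_per_s / promised_flops_per_s
print(f"MFU = {mfu:.2f}")      # ~0.46, just under the 0.5 "quite good" bar
```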

Gradients

  • Intro
  • We have constructed tensors and passed them through operations (forward). And we need to compute the gradients (backward).

  • A simple linear model

  • x --w--> pred_y
  • loss function: \text{loss} = 0.5 \, (x \cdot w - 5)^2
  • Forward Pass : compute loss
      x = torch.tensor([1., 2, 3])
      w = torch.tensor([1., 1, 1], requires_grad=True) 
      # store gradient in w.grad
      pred_y = x @ w
      loss = 0.5 * (pred_y - 5).pow(2)
    
  • Backward pass : compute gradient

      loss.backward()
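For this loss, the chain rule gives w.grad = (pred_y − 5) · x, which the autograd result can be checked against:

```python
import torch

x = torch.tensor([1., 2, 3])
w = torch.tensor([1., 1, 1], requires_grad=True)

pred_y = x @ w                    # = 6
loss = 0.5 * (pred_y - 5).pow(2)  # = 0.5
loss.backward()

# d(loss)/dw = (pred_y - 5) * x = 1 * [1, 2, 3]
assert torch.allclose(w.grad, (pred_y.detach() - 5) * x)
assert torch.equal(w.grad, torch.tensor([1., 2., 3.]))
```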
    

  • A 2-layer linear model

  • x --w1--> h1 --w2--> h2 --> loss
  • x : B \times D
  • w_1 : D \times D
  • w_2 : D \times K
  • Functions: h_1 = x w_1, \quad h_2 = h_1 w_2, \quad \text{loss} = \mathbb{E}[h_2^2]
  • FLOPs in one forward pass = 2 \cdot (BD^2 + BDK)
  • FLOPs in one backward pass come from computing:
  • h1.grad   h2.grad
  • w1.grad   w2.grad
  • WHY? Check the chain rule: w2.grad, h1.grad, and w1.grad each cost one matrix multiply (x.grad is skipped since x is the input)
  • Answer = 2 \cdot (BD^2 + 2 \cdot BDK)
  • Generally, each (token, parameter) pair costs 2 FLOPs in the forward pass and 4 in the backward pass, adding up to the factor 6 in \text{Total\_FLOPs} = 6 \times P \times T
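The 2-forward / 4-backward accounting can be written out explicitly for the 2-layer model above (skipping x.grad is why the first layer contributes only 2BD² on the backward pass):

```python
# FLOPs accounting for x (B x D) --w1 (D x D)--> h1 --w2 (D x K)--> h2.
B, D, K = 64, 1024, 512

# Forward: h1 = x @ w1 costs 2*B*D*D; h2 = h1 @ w2 costs 2*B*D*K.
forward = 2 * B * D * D + 2 * B * D * K

# Backward: one matmul each for w2.grad, h1.grad, w1.grad (x.grad skipped).
backward = 2 * B * D * K + 2 * B * D * K + 2 * B * D * D

assert forward == 2 * (B * D * D + B * D * K)
assert backward == 2 * (B * D * D + 2 * B * D * K)
# With x.grad included, backward would be exactly 2x forward,
# giving the 2 + 4 = 6 FLOPs per (token, parameter) rule.
```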

Model

Parameter Initialization

  • Suppose we are dealing with a linear model
  • if we write w = nn.Parameter(torch.randn(input_dim, output_dim))
  • the outputs may get very large, leading to unstable training
  • So, we use w = nn.Parameter(torch.randn(input_dim, output_dim) / np.sqrt(input_dim))
  • this way, each output has variance around 1, independent of input_dim
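A quick empirical check of the 1/sqrt(input_dim) scaling (the dimensions and seed are arbitrary, chosen for illustration):

```python
import torch

torch.manual_seed(0)
input_dim, n = 4096, 1000

x = torch.randn(n, input_dim)

w_bad = torch.randn(input_dim)                       # unscaled init
w_good = torch.randn(input_dim) / input_dim ** 0.5   # scaled by 1/sqrt(D)

# Each output sums input_dim independent terms, so unscaled outputs
# have std ~ sqrt(input_dim) = 64, while scaled outputs stay O(1).
bad_std = (x @ w_bad).std().item()
good_std = (x @ w_good).std().item()
print(bad_std, good_std)
assert bad_std > 10
assert 0.5 < good_std < 2.0
```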