Weight Initialization in Megatron-LM

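Background: when training a large model on a certain platform, memory was insufficient, so model parallelism was required.
Take a model structured as downsampling (convolutions and similar layers) ⇒ SwinTransformer ⇒ upsampling (convolutions and similar layers); Megatron-LM only handles initialization for the Transformer part.
With tp=6 and all weights randomly initialized, the transformer layers hold different weights on different ranks (each rank keeps a different shard of the transformer weights), whereas layers that cannot be parallelized (for example convolutions) were found, by printing them, to have identical initial weights on all 6 ranks.
This raised the question of how weights are actually initialized in Megatron-LM and PyTorch.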

Random initialization

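Take torch.nn.Conv2d as an example. Its random initialization is defined in its base class:

torch.nn.modules.conv._ConvNd torch/nn/modules/conv.py

The reference page can be found by searching Google for "caffe2.ai: xxxtorch api":

Caffe2 - Python API: torch.nn.modules.conv._ConvNd Class Reference

nn.Module is the base class of all neural network modules. By convention, built-in layers such as _ConvNd put their parameter initialization in a reset_parameters(self) method: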

class _ConvNd(Module):
 
  __constants__ = ['stride', 'padding', 'dilation', 'groups', 'bias', 'padding_mode']
 
  def __init__(self, in_channels, out_channels, kernel_size, stride,
               padding, dilation, transposed, output_padding,
               groups, bias, padding_mode):
      ......
      if transposed:
          self.weight = Parameter(torch.Tensor(
              in_channels, out_channels // groups, *kernel_size))
      else:
          self.weight = Parameter(torch.Tensor(
              out_channels, in_channels // groups, *kernel_size))
      if bias:
          self.bias = Parameter(torch.Tensor(out_channels))
      else:
          self.register_parameter('bias', None)
      self.reset_parameters()
      ......
  def reset_parameters(self):
      n = self.in_channels
      init.kaiming_uniform_(self.weight, a=math.sqrt(5))
      if self.bias is not None:
          fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
          bound = 1 / math.sqrt(fan_in)
          init.uniform_(self.bias, -bound, bound)

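As shown, random initialization of the parameters takes two steps.

First, the parameter tensors are allocated:

self.weight = Parameter(torch.Tensor(out_channels, in_channels // groups, *kernel_size))

self.bias = Parameter(torch.Tensor(out_channels))

torch.nn.Parameter is a subclass of torch.Tensor. A tensor created through Parameter:
- has requires_grad set to True by default;
- is added to the model's named_parameters when it is assigned as a module attribute.
At this point the values are still uninitialized memory, and without the next step they would stay that way.

Second, reset_parameters re-initializes self.weight and self.bias: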
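- weight: init.kaiming_uniform_(self.weight, a=math.sqrt(5)). This is He (Kaiming) initialization sampled from a uniform distribution on [-bound, bound], with
$$
bound=\sqrt{\frac{6}{(1+a^2)\times fan_{in}}}
$$

where $fan_{in}$ is the number of input units, i.e. (input channels per group) × $K^2$ for a K×K kernel. With $a=\sqrt{5}$ this reduces to $bound=\sqrt{1/fan_{in}}$.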
- bias: first compute $fan_{in}$ and $fan_{out}$ from weight, then sample uniformly from [-bound, bound] with
$$
bound=\sqrt{\frac{1}{fan_{in}}}
$$

So here bias and weight are both sampled uniformly from the same interval.
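The two bounds can be checked numerically. The snippet below is a minimal sketch, assuming a recent PyTorch in which Conv2d still uses kaiming_uniform_ with a=sqrt(5) for its default initialization; the layer sizes are arbitrary and chosen only for illustration.

```python
import math
import torch

torch.manual_seed(1234)
conv = torch.nn.Conv2d(in_channels=20, out_channels=10, kernel_size=2)

# fan_in = (input channels per group) * kernel_height * kernel_width
fan_in = conv.weight.shape[1] * conv.weight.shape[2] * conv.weight.shape[3]

a = math.sqrt(5)
weight_bound = math.sqrt(6.0 / ((1 + a ** 2) * fan_in))  # Kaiming-uniform bound
bias_bound = 1.0 / math.sqrt(fan_in)                     # bias bound

# Largest absolute values actually drawn for this layer
max_weight = conv.weight.abs().max().item()
max_bias = conv.bias.abs().max().item()

print(f"fan_in={fan_in}, weight_bound={weight_bound:.6f}, bias_bound={bias_bound:.6f}")
print(f"max|weight|={max_weight:.6f}, max|bias|={max_bias:.6f}")
```

Both bounds come out equal, and every sampled weight and bias should fall inside that interval.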

For the sampling methods in torch.nn.init, see this blog post:

https://blog.csdn.net/imblackcat/article/details/131473894

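Now to another question: given the initialization method, how does it tie in with the random seed?

First, Megatron's random seed is set through the args.seed argument, which defaults to 1234.

The seed is actually applied by the function _set_random_seed: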

def _set_random_seed(seed_, data_parallel_random_init=False):
    """Set random seed for reproducability."""
    if seed_ is not None and seed_ > 0:
        # Ensure that different pipeline MP stages get different seeds.
        seed = seed_ + (100 * mpu.get_pipeline_model_parallel_rank())
        # Ensure different data parallel ranks get different seeds
        if data_parallel_random_init:
            seed = seed + (10 * mpu.get_data_parallel_rank())
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.device_count() > 0:
            tensor_parallel.model_parallel_cuda_manual_seed(seed)
    else:
        raise ValueError('Seed ({}) should be a positive integer.'.format(seed_))

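A few details are worth noting:

By default, each pipeline-parallel stage gets a different seed (offset by 100 × its pipeline rank). This ensures that the Transformer layers held by different stages are initialized differently, but it also means that even unsplit layers get different random parameters on different pipeline ranks.
Optionally (data_parallel_random_init), the model replicas on different data-parallel ranks can also be given different seeds.
Three generators are seeded:
- random.seed: affects random.random, random.randint, etc.
- np.random.seed: affects np.random.rand(), np.random.randn(), etc.
- torch.manual_seed: affects torch.rand(), torch.randn(), torch.randint(), etc.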
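The seed arithmetic in _set_random_seed can be isolated in a small sketch. derive_seed is a hypothetical helper written for illustration, not part of Megatron-LM; the ranks are plain integers passed in by hand rather than read from mpu.

```python
def derive_seed(base_seed, pipeline_rank, data_parallel_rank,
                data_parallel_random_init=False):
    """Mirror the per-rank seed arithmetic of _set_random_seed (illustrative only)."""
    if base_seed is None or base_seed <= 0:
        raise ValueError(f"Seed ({base_seed}) should be a positive integer.")
    # Each pipeline stage is offset by 100.
    seed = base_seed + 100 * pipeline_rank
    # Data-parallel replicas are offset by 10 only when explicitly requested.
    if data_parallel_random_init:
        seed += 10 * data_parallel_rank
    return seed


# The tensor-parallel rank is not an input at all, so every rank inside
# one tensor-parallel group ends up with the same seed.
for pp_rank in range(3):
    print(pp_rank, derive_seed(1234, pp_rank, data_parallel_rank=0))
```

Note that the tensor-parallel rank never enters the computation, which is what the following paragraphs rely on.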

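This means:

Within a tensor-parallel group every rank uses the same seed, so the generator produces the same random sequence on each of them, and unsplit layers end up with exactly the same initial parameters.

Layers created in a different order, however, get different random parameters even when they have the same size and settings, because they draw from different positions in the random sequence.
Likewise, layers of different sizes or types get different parameters even when they occupy the same position in the sequence across two runs, because the sampling range and sampling method differ.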
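The effect of construction order can be reproduced on a single process. This is a minimal sketch: build is a hypothetical helper, and reseeding before each call stands in for separate ranks that share one seed.

```python
import torch


def build(order, seed=1234):
    """Create the named layers in the given order after reseeding."""
    torch.manual_seed(seed)
    layers = {}
    for name in order:
        if name == "conv":
            layers[name] = torch.nn.Conv2d(3, 3, kernel_size=2)
        else:
            layers[name] = torch.nn.Linear(8, 8)
    return layers


rank0 = build(["conv", "linear"])
rank1 = build(["conv", "linear"])    # same seed, same order
swapped = build(["linear", "conv"])  # same seed, different order

same_order_equal = torch.equal(rank0["conv"].weight, rank1["conv"].weight)
swapped_equal = torch.equal(rank0["conv"].weight, swapped["conv"].weight)
print(f"same order: {same_order_equal}, swapped order: {swapped_equal}")
```

With the same seed and order the convolution weights match exactly; once the linear layer is created first, the convolution draws from a later position in the sequence and its weights differ.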

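In our experiments, each torch.nn.init.* sampling call appeared to consume one step of the generator's state. For example, a convolution layer initializes two parameters and so consumes two steps, and the next convolution (or other layer) starts from the third. If that holds, the first convolution can be replaced by two torch.nn.init.* calls and the next convolution will still receive the same random parameters. How much state a call consumes is an implementation detail that can differ by device and PyTorch version (on the stock CPU generator it generally grows with the number of elements drawn), so it is safest to keep the replacement tensors the same shape as the parameters they stand in for. The script used for the experiment: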

import torch
import random
import numpy as np
seed=1234

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# Both "base" and "1_1" give the same bias and the same layer2 weight.
#layer1 = torch.nn.Conv2d(10, 10, kernel_size=4) # base: bias shape (c_out,) = 10; weight shape (c_out, c_in, k1, k2) = 1600 elements
#layer1 = torch.nn.Conv2d(40, 10, kernel_size=2) # 1_1: bias shape (c_out,) = 10; weight shape (10, 40, 2, 2) = 1600 elements; same bias sampling range, same number of samples

layer1 = torch.nn.Conv2d(20, 10, kernel_size=2)   # 1_2 (control): different sampling range and sample count, yet layer2's weight was observed to be the same

# Hypothesis: one complete sampling call consumes one unit of generator state.
# If so, adding layer3 should be equivalent to two sampling calls; check whether layer2's weight stays the same.
#layer3 = torch.nn.Conv2d(20, 10, kernel_size=2) 

#weight = torch.ones([10,20,2,2])
#bias = torch.ones([1,10])
#torch.nn.init.kaiming_uniform_(weight)
#torch.nn.init.uniform_(bias)
#
## Further test: add one more uniform_ call on a bias; layer2's weight should change
#bias2 = torch.ones([1,10])
#torch.nn.init.uniform_(bias2)

## One more test: call uniform_ twice so that layer2 stays the same
#bias2 = torch.ones([1,10])
#torch.nn.init.uniform_(bias2)
#torch.nn.init.uniform_(bias2)

layer2 = torch.nn.Conv2d(3,3, kernel_size=2)

#same_weight = torch.equal(layer1.weight,layer2.weight)
#same_bias = torch.equal(layer1.bias, layer2.bias)

#print(f"weight same is {same_weight}, bias same is {same_bias}")

print(f"layer1.bias {layer1.bias}, layer2.weight {layer2.weight}")

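(Could we arrange for the parallel convolution layers to sit at the same position in the random sequence, with the same sampling range and sampling method, when their parameters are generated?)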
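One conservative way to try this is to stand in for a layer with init calls on tensors of exactly the same shapes, so that the same sampling calls happen in the same order. The sketch below assumes Conv2d's default initialization (kaiming_uniform_ on the weight, then uniform_ on the bias); both function names are hypothetical.

```python
import math
import torch


def layer2_after_conv(seed=1234):
    torch.manual_seed(seed)
    torch.nn.Conv2d(20, 10, kernel_size=2)  # layer1: samples weight, then bias
    return torch.nn.Conv2d(3, 3, kernel_size=2)


def layer2_after_manual_init(seed=1234):
    torch.manual_seed(seed)
    # Stand-ins with the same shapes as layer1's weight and bias
    weight = torch.empty(10, 20, 2, 2)
    bias = torch.empty(10)
    torch.nn.init.kaiming_uniform_(weight, a=math.sqrt(5))
    bound = 1 / math.sqrt(20 * 2 * 2)  # 1 / sqrt(fan_in)
    torch.nn.init.uniform_(bias, -bound, bound)
    return torch.nn.Conv2d(3, 3, kernel_size=2)


ref = layer2_after_conv()
sub = layer2_after_manual_init()
layer2_matches = (torch.equal(ref.weight, sub.weight)
                  and torch.equal(ref.bias, sub.bias))
print(f"layer2 identical: {layer2_matches}")
```

Matching the shapes avoids relying on how much generator state a single call consumes, which is the part that may vary between backends.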

So how does tensor parallelism make the split layers on different ranks get different weights?

The answer is in ColumnParallelLinear and RowParallelLinear in megatron-sw/megatron/core/tensor_parallel/layers.py:

if use_cpu_initialization:
    self.weight = Parameter(torch.empty(self.output_size,
                                        self.input_size_per_partition,
                                        dtype=params_dtype))
    if perform_initialization:
        self.master_weight = _initialize_affine_weight_cpu(
            self.weight, self.output_size, self.input_size,
            self.input_size_per_partition, 1, init_method,
            stride=stride, return_master_weight=keep_master_weight_for_test,
            params_dtype=params_dtype)

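The weight is allocated with its per-partition shape; in this snippet the columns (dim 1, the input dimension) are the ones that are split. With CPU initialization there is also a master_weight, which is only returned when keep_master_weight_for_test is set: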

def _initialize_affine_weight_cpu(weight, output_size, input_size,
                                  per_partition_size, partition_dim,
                                  init_method, stride=1,
                                  return_master_weight=False,
                                  *, params_dtype=torch.float32):
    """Initialize affine weight for model parallel.

    Build the master weight on all processes and scatter
    the relevant chunk."""

    set_tensor_model_parallel_attributes(tensor=weight,
                                         is_parallel=True,
                                         dim=partition_dim,
                                         stride=stride)

    # Initialize master weight
    master_weight = torch.empty(output_size, input_size,
                                dtype=torch.float,
                                requires_grad=False)
    init_method(master_weight)
    master_weight = master_weight.to(dtype=params_dtype)

    # Split and copy
    per_partition_per_stride_size = divide(per_partition_size, stride)
    weight_list = torch.split(master_weight, per_partition_per_stride_size,
                              dim=partition_dim)
    rank = get_tensor_model_parallel_rank()
    world_size = get_tensor_model_parallel_world_size()
    my_weight_list = weight_list[rank::world_size]

    with torch.no_grad():
        torch.cat(my_weight_list, dim=partition_dim, out=weight)
    if return_master_weight:
        return master_weight
    return None

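Here a full master_weight copy is created and split, and each rank picks its slices with weight_list[rank::world_size], i.e. start=rank and step=world_size. When the number of chunks equals world_size (stride=1), this simply selects the chunk that belongs to rank. The strided form also covers stride > 1, where the master weight is cut into stride × world_size chunks and each rank takes every world_size-th chunk starting from its own rank.

The slices for this rank are then concatenated with cat and written back into weight.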
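The split-and-select step can be tried without any distributed setup. This is a minimal sketch: shard_master_weight is a hypothetical helper that takes rank and world_size as plain arguments instead of querying the tensor-parallel group.

```python
import torch


def shard_master_weight(master_weight, rank, world_size, partition_dim, stride=1):
    """Return the slice of master_weight that one tensor-parallel rank keeps."""
    per_partition_size = master_weight.shape[partition_dim] // world_size
    per_partition_per_stride_size = per_partition_size // stride
    chunks = torch.split(master_weight, per_partition_per_stride_size,
                         dim=partition_dim)
    # start=rank, step=world_size: one chunk when stride=1, `stride` chunks otherwise
    return torch.cat(chunks[rank::world_size], dim=partition_dim)


torch.manual_seed(1234)
master = torch.randn(4, 8)  # every rank would build this same tensor
world_size = 2

shards = [shard_master_weight(master, r, world_size, partition_dim=1)
          for r in range(world_size)]
strided = [shard_master_weight(master, r, world_size, partition_dim=1, stride=2)
           for r in range(world_size)]

print([tuple(s.shape) for s in shards])
print(strided[0])
```

With stride=1 the shards are contiguous column blocks that concatenate back into the master weight; with stride=2 rank 0 keeps columns 0, 1, 4, 5 and rank 1 keeps columns 2, 3, 6, 7.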

The GPU version, in contrast, initializes weight directly:

def _initialize_affine_weight_gpu(weight, init_method,
                                  partition_dim, stride=1):
    """Initialize affine weight for model parallel on GPU."""

    set_tensor_model_parallel_attributes(tensor=weight,
                                         is_parallel=True,
                                         dim=partition_dim,
                                         stride=stride)
    with get_cuda_rng_tracker().fork():
        init_method(weight)

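fork() temporarily switches to a separate RNG state kept by the CUDA RNG tracker. That state is seeded by model_parallel_cuda_manual_seed (called from _set_random_seed above) with a value that depends on the tensor-parallel rank, so each GPU draws different values for its shard even though the base seed is the same.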
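The idea of a rank-dependent RNG stream sitting next to the shared default stream can be imitated on CPU. This is only an analogy, not Megatron's CUDA RNG tracker: init_shard and init_replicated are hypothetical helpers, and the offset 2718 is my recollection of the constant used in model_parallel_cuda_manual_seed, so treat it as an assumption.

```python
import torch


def init_shard(base_seed, tp_rank, shape=(4, 4)):
    """Initialize a split weight from a generator private to this rank."""
    gen = torch.Generator()
    # Assumed offset; the point is only that the seed depends on tp_rank.
    gen.manual_seed(base_seed + 2718 + tp_rank)
    return torch.empty(shape).uniform_(-1.0, 1.0, generator=gen)


def init_replicated(base_seed, shape=(4,)):
    """Initialize an unsplit parameter from the shared default generator."""
    torch.manual_seed(base_seed)
    return torch.empty(shape).uniform_(-1.0, 1.0)


shard_rank0 = init_shard(1234, tp_rank=0)
shard_rank1 = init_shard(1234, tp_rank=1)
replica_a = init_replicated(1234)
replica_b = init_replicated(1234)

print(torch.equal(shard_rank0, shard_rank1))  # split weights differ per rank
print(torch.equal(replica_a, replica_b))      # unsplit parameters match
```

Using a dedicated generator for the split weights leaves the default stream untouched, so unsplit layers created afterwards still see the same sequence on every rank.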

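To summarize how tensor parallelism initializes the parameters of a linear layer:

On GPU, a forked RNG state lets each GPU generate different random parameters.

On CPU, every rank generates the same full-size random copy and then picks its own slice according to its tensor-parallel rank.

Under tensor parallelism, unsplit layers are initialized identically, because all ranks share the same seed and initialize in the same order.

Pipeline parallelism is different: when Megatron sets the seed, each pipeline stage rank gets a different one.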

Reference blog: Megatron-LM源码系列(一)：模型并行初始化 | MLTalks

保住饭碗

#Megatron-LM #PyTorch #深度学习

http://example.com/posts/76dfb04d/

Published on December 1, 2024
