(https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/?source=post_page---------------------------#16-bit-mixed-precision), without modifying your model or doing any of the steps above.
- trainer = Trainer(amp_level='O2', use_amp=True)
- trainer.fit(model)
8. Move to multiple GPUs
Now things get interesting. There are 3 (maybe more?) ways to train on multiple GPUs.
(1) Split-batch training
A) Copy the model onto each GPU; B) give each GPU a slice of the batch.
The first approach is called split-batch training. This strategy copies the model onto every GPU, and each GPU receives a portion of the batch.
- # copy model on each GPU and give a fourth of the batch to each
- model = DataParallel(model, device_ids=[0, 1, 2, 3])
-
- # out has 4 outputs (one for each gpu)
- out = model(x.cuda(0))
In Lightning, you can simply tell the trainer how many GPUs to use, without doing any of the above yourself.
- # ask lightning to use 4 GPUs for training
- trainer = Trainer(gpus=[0, 1, 2, 3])
- trainer.fit(model)
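Under the hood, `DataParallel` splits each incoming batch into one chunk per device. Here is a minimal CPU-only sketch of that scatter step; the helper `scatter_batch` is hypothetical, written only to illustrate the chunking, not part of PyTorch or Lightning:

```python
def scatter_batch(batch, device_ids):
    """Split a batch into one roughly equal chunk per device,
    mimicking how DataParallel scatters inputs (illustrative only)."""
    n = len(device_ids)
    chunk = (len(batch) + n - 1) // n  # ceiling division, like torch.chunk
    return [batch[i * chunk:(i + 1) * chunk] for i in range(n)]

# a "batch" of 8 samples scattered across 4 GPUs -> 2 samples per GPU
chunks = scatter_batch(list(range(8)), device_ids=[0, 1, 2, 3])
print(chunks)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each replica then runs its forward pass on its own chunk, and the per-GPU outputs are gathered back, which is why `out` above holds 4 outputs.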
(2) Split-model training
Put different parts of the model on different GPUs, and route each batch through them in sequence.
Sometimes a model may be too large to fit in memory. For example, a sequence-to-sequence model with an encoder and a decoder can take 20GB of RAM when generating output. In that case, we want to put the encoder and the decoder on separate GPUs.
- # each model is sooo big we can't fit both in memory
- encoder_rnn.cuda(0)
- decoder_rnn.cuda(1)
-
- # run input through encoder on GPU 0
- out = encoder_rnn(x.cuda(0))
-
- # run output through decoder on the next GPU
- out = decoder_rnn(x.cuda(1))
-
- # normally we want to bring all outputs back to GPU 0
- out = out.cuda(0)
For this kind of training, don't assign any GPUs to the Lightning trainer. Instead, move your submodules to the correct GPUs yourself inside the LightningModule:
- class MyModule(LightningModule):
-
-     def __init__(self):
-         super().__init__()
-         self.encoder = RNN(...)
-         self.decoder = RNN(...)
-
-     def forward(self, x):
-
-         # models won't be moved after the first forward because
-         # they are already on the correct GPUs
-         self.encoder.cuda(0)
-         self.decoder.cuda(1)
-
-         out = self.encoder(x)
-         out = self.decoder(out.cuda(1))
-
- # don't pass GPUs to trainer
- model = MyModule()
- trainer = Trainer()
- trainer.fit(model)
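The one rule in this pattern is that modules stay put while activations move between devices. A CPU-only sketch of the same forward pass, using a hypothetical `FakeTensor` stand-in (not a real torch class) whose `.cuda(i)` just tags the data with a device:

```python
class FakeTensor:
    """Stands in for a torch tensor; only tracks values and a device tag."""
    def __init__(self, values, device="cpu"):
        self.values, self.device = values, device
    def cuda(self, i):
        return FakeTensor(self.values, device=f"cuda:{i}")

def encoder(x):
    assert x.device == "cuda:0", "encoder weights live on GPU 0"
    return FakeTensor([v * 2 for v in x.values], device="cuda:0")

def decoder(h):
    assert h.device == "cuda:1", "decoder weights live on GPU 1"
    return FakeTensor([sum(h.values)], device="cuda:1")

x = FakeTensor([1, 2, 3]).cuda(0)   # input starts on the encoder's GPU
h = encoder(x)                      # encoder output stays on GPU 0
out = decoder(h.cuda(1))            # move the activation before decoding
print(out.values, out.device)       # [12] cuda:1
```

Forgetting the `.cuda(1)` move is the classic "expected device cuda:1 but got cuda:0" error in real model-parallel code.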
(3) Mixing the two approaches
In the example above, the encoder and decoder can each still benefit from parallelizing their internal operations. Now we can get more creative.
- # change these lines
- self.encoder = RNN(...)
- self.decoder = RNN(...)
-
- # to these
- # now each RNN is based on a different gpu set
- self.encoder = DataParallel(self.encoder, device_ids=[0, 1, 2, 3])
- self.decoder = DataParallel(self.decoder, device_ids=[4, 5, 6, 7])
-
- # in forward...
- out = self.encoder(x.cuda(0))
-
- # notice inputs on first gpu in the device group
- out = self.decoder(out.cuda(4)) # <--- the 4 here
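Why `.cuda(4)`? A `DataParallel` module scatters from, and gathers to, the first GPU of its device group, so its inputs must start there. A tiny sketch of that placement rule for the 8-GPU split used above (the helper `input_device` is hypothetical, for illustration):

```python
def input_device(device_ids):
    """A DataParallel module scatters from, and gathers to, the
    first GPU in its group, so inputs must start on that device."""
    return device_ids[0]

encoder_ids = [0, 1, 2, 3]  # encoder replicas
decoder_ids = [4, 5, 6, 7]  # decoder replicas

print(input_device(encoder_ids))  # 0 -> x.cuda(0) before the encoder
print(input_device(decoder_ids))  # 4 -> out.cuda(4) before the decoder
```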