(https://williamfalcon.github.io/pytorch-lightning/Trainer/Distributed%20training/?source=post_page---------------------------#16-bit-mixed-precision), without modifying your model or doing any of the steps above.
- trainer = Trainer(amp_level='O2', use_amp=True)
- trainer.fit(model)
8. Move to multiple GPUs
Now things get interesting. There are 3 (maybe more?) ways to train on multiple GPUs.
(1) Split-batch training
A) Copy the model onto each GPU; B) give each GPU a slice of the batch.
The first approach is called split-batch training. This strategy copies the model onto every GPU, and each GPU receives a portion of the batch.
- # copy model on each GPU and give a fourth of the batch to each
- model = DataParallel(model, device_ids=[0, 1, 2, 3])
-
- # out has 4 outputs (one for each gpu)
- out = model(x.cuda(0))
In Lightning, you can simply tell the trainer how many GPUs to use, without doing any of the above yourself.
- # ask lightning to use 4 GPUs for training
- trainer = Trainer(gpus=[0, 1, 2, 3])
- trainer.fit(model)
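Under the hood, `DataParallel` splits each incoming batch into one chunk per device. Here is a minimal CPU-only sketch of that scatter step; the helper `scatter_batch` is hypothetical, written only to illustrate the chunking, not part of PyTorch or Lightning:

```python
def scatter_batch(batch, device_ids):
    """Split a batch into one roughly equal chunk per device,
    mimicking how DataParallel scatters inputs (illustrative only)."""
    n = len(device_ids)
    chunk = (len(batch) + n - 1) // n  # ceiling division, like torch.chunk
    return [batch[i * chunk:(i + 1) * chunk] for i in range(n)]

# a "batch" of 8 samples scattered across 4 GPUs -> 2 samples per GPU
chunks = scatter_batch(list(range(8)), device_ids=[0, 1, 2, 3])
print(chunks)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each replica then runs its forward pass on its own chunk, and the per-GPU outputs are gathered back, which is why `out` above holds 4 outputs.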
(2) Split-model training
Put different parts of the model on different GPUs, and route each batch through them in sequence.
Sometimes a model may be too large to fit in memory. For example, a sequence-to-sequence model with an encoder and a decoder can take 20GB of RAM when generating output. In that case, we want to put the encoder and the decoder on separate GPUs.
- # each model is sooo big we can't fit both in memory
- encoder_rnn.cuda(0)
- decoder_rnn.cuda(1)
-
- # run input through encoder on GPU 0
- out = encoder_rnn(x.cuda(0))
-
- # run output through decoder on the next GPU
- out = decoder_rnn(x.cuda(1))
-
- # normally we want to bring all outputs back to GPU 0
- out = out.cuda(0)
For this kind of training, don't assign any GPUs to the Lightning trainer. Instead, move your submodules to the correct GPUs yourself inside the LightningModule:
- class MyModule(LightningModule):
-
-     def __init__(self):
-         super().__init__()
-         self.encoder = RNN(...)
-         self.decoder = RNN(...)
-
-     def forward(self, x):
-
-         # models won't be moved after the first forward because
-         # they are already on the correct GPUs
-         self.encoder.cuda(0)
-         self.decoder.cuda(1)
-
-         out = self.encoder(x)
-         out = self.decoder(out.cuda(1))
-
- # don't pass GPUs to trainer
- model = MyModule()
- trainer = Trainer()
- trainer.fit(model)
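The one rule in this pattern is that modules stay put while activations move between devices. A CPU-only sketch of the same forward pass, using a hypothetical `FakeTensor` stand-in (not a real torch class) whose `.cuda(i)` just tags the data with a device:

```python
class FakeTensor:
    """Stands in for a torch tensor; only tracks values and a device tag."""
    def __init__(self, values, device="cpu"):
        self.values, self.device = values, device
    def cuda(self, i):
        return FakeTensor(self.values, device=f"cuda:{i}")

def encoder(x):
    assert x.device == "cuda:0", "encoder weights live on GPU 0"
    return FakeTensor([v * 2 for v in x.values], device="cuda:0")

def decoder(h):
    assert h.device == "cuda:1", "decoder weights live on GPU 1"
    return FakeTensor([sum(h.values)], device="cuda:1")

x = FakeTensor([1, 2, 3]).cuda(0)   # input starts on the encoder's GPU
h = encoder(x)                      # encoder output stays on GPU 0
out = decoder(h.cuda(1))            # move the activation before decoding
print(out.values, out.device)       # [12] cuda:1
```

Forgetting the `.cuda(1)` move is the classic "expected device cuda:1 but got cuda:0" error in real model-parallel code.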
(3) Mixing the two approaches
In the example above, the encoder and decoder can each still benefit from parallelizing their internal operations. Now we can get more creative.
- # change these lines
- self.encoder = RNN(...)
- self.decoder = RNN(...)
-
- # to these
- # now each RNN is based on a different gpu set
- self.encoder = DataParallel(self.encoder, device_ids=[0, 1, 2, 3])
- self.decoder = DataParallel(self.decoder, device_ids=[4, 5, 6, 7])
-
- # in forward...
- out = self.encoder(x.cuda(0))
-
- # notice inputs on first gpu in the device group
- out = self.decoder(out.cuda(4)) # <--- the 4 here
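Why `.cuda(4)`? A `DataParallel` module scatters from, and gathers to, the first GPU of its device group, so its inputs must start there. A tiny sketch of that placement rule for the 8-GPU split used above (the helper `input_device` is hypothetical, for illustration):

```python
def input_device(device_ids):
    """A DataParallel module scatters from, and gathers to, the
    first GPU in its group, so inputs must start on that device."""
    return device_ids[0]

encoder_ids = [0, 1, 2, 3]  # encoder replicas
decoder_ids = [4, 5, 6, 7]  # decoder replicas

print(input_device(encoder_ids))  # 0 -> x.cuda(0) before the encoder
print(input_device(decoder_ids))  # 4 -> out.cuda(4) before the decoder
```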