当使用单卡训练时运行命令:python tools/train.py ${CONFIG_FILE} [ARGS]
是可以跑通的,但是使用官方提供的:bash ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
进行单机多卡训练时却报如下错误:
....
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB. GPU 0 has a total capacty of 23.64 GiB of which 59.25 MiB is free. Process 727402 has 1.89 GiB memory in use. Including non-PyTorch memory, this process has 21.32 GiB memory in use. Of the allocated memory 20.56 GiB is allocated by PyTorch, and 312.04 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-02-06 16:12:08,473] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 727401) of binary:
....
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:time : 2024-02-06_16:12:08host : yons-MS-7E06rank : 1 (local_rank: 1)exitcode : 1 (pid: 727402)error_file: <N/A>traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:time : 2024-02-06_16:12:08host : yons-MS-7E06rank : 0 (local_rank: 0)exitcode : 1 (pid: 727401)error_file: <N/A>traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
下面说一下这个问题的解决过程。
首先我是在一台双卡主板上跑模型,一开始只用单卡训练,batch size为50:
然后看了下官方文档,单机多卡是要运行另一个sh文件,batch size也为64,运行命令报上面那个错误。有显存溢出的错误也有torch.distributed.elastic.multiprocessing.errors.ChildFailedError的错误,我感觉很不解,照理说单卡50多卡应该100都能行。
因为openmmlab封装的比较复杂,这部分的底层源码比较不容易看到,所以一直百度或者github上看别人提的isuue好像也没有发现解决方法。
后面我逐渐下调batch size至32才无报错。看了下此时的显卡使用情况:
显卡竟然都是占满的,此时我无意间看了一下旁边一台单卡主板(同24g显存)跑的同样模型以及同样的数据集,batch size也为32:
当我看到546和1092我瞬间明白,原来这里的batch size是指定每张卡的batch size而不是总共的batch size,折磨了我一整天的问题终于解决…
但是我还有一个问题还没有解决,就是我的双卡设备在跑训练时,如果使用单卡batch size能到50,但是如果使用双卡时每张卡的batch size却只能到32,这是为什么呢?