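319: H20 training errors:
Problem 1: on an NVIDIA H20 machine, training crashes with: Caught signal 8 (Floating point exception: integer divide by zero)
Fix: install the pinned cuBLAS wheel and point LD_LIBRARY_PATH at its library directory: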
pip3 install nvidia-cublas-cu12==12.3.4.1
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
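
The path above is hardcoded for /opt/conda with Python 3.8. On a machine with a different layout, the same directory can be derived from the active interpreter. A minimal sketch, assuming the wheel keeps its libraries under nvidia/cublas/lib in site-packages as in the path above; the helper names cublas_lib_dir and prepend_ld_path are made up for this note. It only prints an export line, because changing LD_LIBRARY_PATH from inside an already running Python process does not affect libraries that process has already loaded.

```python
import os
import posixpath
import sysconfig


def cublas_lib_dir(site_packages):
    """Return the directory where the nvidia-cublas-cu12 wheel is assumed to put its libraries."""
    return posixpath.join(site_packages, "nvidia", "cublas", "lib")


def prepend_ld_path(new_dir, current):
    """Put new_dir first on a colon-separated search path, keeping the other entries."""
    entries = [p for p in current.split(":") if p and p != new_dir]
    return ":".join([new_dir] + entries)


if __name__ == "__main__":
    # purelib is the site-packages directory of the interpreter running this script.
    lib_dir = cublas_lib_dir(sysconfig.get_paths()["purelib"])
    if not os.path.isdir(lib_dir):
        # Printed as a shell comment so the output can still be passed to eval.
        print(f"# warning: {lib_dir} not found; is nvidia-cublas-cu12 installed?")
    value = prepend_ld_path(lib_dir, os.environ.get("LD_LIBRARY_PATH", ""))
    print(f"export LD_LIBRARY_PATH={value}")
```

Run it with the same python3 that runs training and paste the printed line into the shell. Unlike the plain export above, this keeps any entries already on LD_LIBRARY_PATH.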
Problem 2: cannot import name '_get_socket_with_port' from 'torch.distributed.elastic.agent.server.api'. See the DeepSpeed issue:
https://github.com/deepspeedai/DeepSpeed/issues/5603
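
These notes do not record a fix for problem 2; the linked issue is the place to look. To check whether a given environment is affected before launching a job, something like the following may help. It is a diagnostic sketch only: it tests whether the symbol named in the error is importable and prints the installed torch and deepspeed versions. The helper name has_symbol is made up for this note.

```python
import importlib
from importlib import metadata


def has_symbol(module_name, attr):
    """Return True if module_name imports cleanly and exposes attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    module_name = "torch.distributed.elastic.agent.server.api"
    attr = "_get_socket_with_port"
    print("torch:", installed_version("torch"))
    print("deepspeed:", installed_version("deepspeed"))
    if has_symbol(module_name, attr):
        print(f"{attr} is importable from {module_name}")
    else:
        print(f"{attr} is NOT importable from {module_name}; see the linked issue")
```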