Training on a rented server, using Deformable_DETR as an example
Download the code
From the GitHub repository given in the paper, fundamentalvision/Deformable-DETR (Deformable DETR: Deformable Transformers for End-to-End Object Detection), download the zip to your local machine.
Rent a server
Rent a server on the AutoDL platform. After registering an account and topping it up, go to the marketplace and pick a GPU; prefer machines with plenty of idle GPUs, otherwise the card you want may be occupied when you need it.
Then choose an image below. You can pick a base image and set the PyTorch and Python versions yourself, or use a community image: if the project you want to reproduce has been reproduced by many people, you can directly reuse an environment someone else has already configured.
Here I use a community image (searching for the project name finds it directly), although I would recommend starting from a base image and choosing the versions the project needs yourself.
Connect to the server
Here I use PyCharm to connect to the server and transfer files (there are faster ways to transfer files; search for them yourself, or see the sketch below).
Start the instance in no-GPU mode first, which costs less, then in PyCharm go to Settings -> Project -> Python Interpreter -> Add Interpreter -> On SSH.
Copy the SSH login string from the container page: put the port number into the port field, delete the '@' and everything before it from the host field, then enter the password and the connection should go through. I'm getting tired of writing this part out, so here is a Bilibili walkthrough: BV1gn4y1o7PB.
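If you prefer the command line over PyCharm's deployment tool, a plain scp upload also works. This is only a sketch: replace the port and host with the values from your own instance's SSH login string; /root/autodl-tmp is the data-disk path that shows up in the logs later in this post.
scp -r -P <port> ./Deformable_DETR root@<host>:/root/autodl-tmp/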
Start training
(Actually this part was written first, because once training was running I wanted to shut the machine down as soon as possible to save money. Don't judge; you would do the same.)
First install the required libraries. torch and the like are already installed in the image, so just follow the instructions on GitHub:
pip install -r requirements.txt
Compile the CUDA ops:
cd ./models/ops
sh ./make.sh
python test.py
Download the dataset. The project uses COCO, but the full COCO is large: a training run takes a long time and uploading it to the server is also slow. Here I use subsets cut from the COCO dataset, in 50-, 1000-, and 3000-image versions (the original full set is included as well): **Link: https://pan.baidu.com/s/17wpMzJvzSQ-a3qbaqIbmdg?pwd=data  extraction code: data**
Arrange the dataset as follows:
Deformable_DETR/
├── data/
│   └── coco/
│       ├── train2017/
│       ├── val2017/
│       └── annotations/
│           ├── instances_train2017.json
│           └── instances_val2017.json
├── configs/
└── ......
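Before launching training, it can be worth a quick check that the annotations load from this layout (a small sketch, not part of the repo; pycocotools is among the project's dependencies):
from pycocotools.coco import COCO

# Quick check that the annotation files sit where main.py expects them (run from the project root).
for split in ("train2017", "val2017"):
    coco = COCO(f"./data/coco/annotations/instances_{split}.json")
    print(split, "images:", len(coco.imgs), "annotations:", len(coco.anns))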
Start training. For a single GPU on a single node, just change the 8 in the command from the original repo to 1; simply deleting it should also work.
GPUS_PER_NODE=1 ./tools/run_dist_launch.sh 1 ./configs/r50_deformable_detr.sh
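As the launch trace below shows, the script ultimately just runs python -u main.py --output_dir exps/r50_deformable_detr, so for a quick single-GPU smoke test you can also skip the distributed launcher and call main.py directly (a sketch; --coco_path defaults to ./data/coco, as the argument dump further down confirms, so it only needs to be set if your data lives elsewhere):
python -u main.py --output_dir exps/r50_deformable_detr --coco_path ./data/coco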
Before switching images I ran into all sorts of problems; after switching to the community image they mostly went away. The main issue was that the image I had configured myself did not have the versions the community image uses.
With the community image, the error that still came up was:
(MV) root@autodl-container-8250118952-357c0e14:~/autodl-tmp/Deformable_DETR# GPUS_PER_NODE=1 ./tools/run_dist_launch.sh 1 ./configs/r50_deformable_detr.sh
+ GPUS=1
+ RUN_COMMAND=./configs/r50_deformable_detr.sh
+ '[' 1 -lt 8 ']'
+ GPUS_PER_NODE=1
+ MASTER_ADDR=127.0.0.1
+ MASTER_PORT=29500
+ NODE_RANK=0
+ let NNODES=GPUS/GPUS_PER_NODE
+ python ./tools/launch.py --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 29500 --nproc_per_node 1 ./configs/r50_deformable_detr.sh
+ EXP_DIR=exps/r50_deformable_detr
+ PY_ARGS=
+ python -u main.py --output_dir exps/r50_deformable_detr
Traceback (most recent call last):File "main.py", line 21, in <module>import datasetsFile "/root/autodl-tmp/Deformable_DETR/datasets/__init__.py", line 13, in <module>from .coco import build as build_cocoFile "/root/autodl-tmp/Deformable_DETR/datasets/coco.py", line 22, in <module>from util.misc import get_local_rank, get_local_sizeFile "/root/autodl-tmp/Deformable_DETR/util/misc.py", line 32, in <module>from torchvision.ops.misc import _NewEmptyTensorOp
ImportError: cannot import name '_NewEmptyTensorOp' from 'torchvision.ops.misc' (/root/miniconda3/envs/MV/lib/python3.8/site-packages/torchvision/ops/misc.py)
Traceback (most recent call last):File "./tools/launch.py", line 192, in <module>main()File "./tools/launch.py", line 187, in mainraise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['./configs/r50_deformable_detr.sh']' returned non-zero exit status 1.
The error message says that the program cannot find `_NewEmptyTensorOp` when importing from `torchvision.ops.misc`. This is caused by an incompatibility between the installed `torchvision` version and the API referenced in the code.
First, check whether your torch and torchvision versions are compatible with Deformable-DETR.
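A quick way to see what the environment actually has (just a small sanity-check snippet, not part of the repo):
import torch
import torchvision

# Print the installed versions and the CUDA toolkit torch was built against.
print(torch.__version__, torchvision.__version__, torch.version.cuda, torch.cuda.is_available())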
If you would rather keep the installed versions, you can manually patch the `_NewEmptyTensorOp` import. In the file util/misc.py, locate:
from torchvision.ops.misc import _NewEmptyTensorOp
and replace that line with:
try:
    from torchvision.ops.misc import _NewEmptyTensorOp
except ImportError:
    # Define a fallback for _NewEmptyTensorOp if it is not available
    class _NewEmptyTensorOp(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, new_shape):
            return x.new_empty(new_shape)

        @staticmethod
        def backward(ctx, grad):
            return grad, None
What this change does:
1. Check and fix the `_NewEmptyTensorOp` import
`_NewEmptyTensorOp` is a utility class provided by torchvision, used here for tensor operations. The import can break for the following reasons:
- Version incompatibility: different torchvision versions define and place `_NewEmptyTensorOp` differently.
- Removal or renaming: newer versions have dropped the class, so the import fails.
Why this matters: the `try`/`except` block checks at runtime whether `_NewEmptyTensorOp` is available. If it is not, a custom implementation is substituted, so the code runs on both old and new torchvision versions.
2. Custom `_NewEmptyTensorOp` implementation
The fallback covers the case where `_NewEmptyTensorOp` cannot be imported:
- `forward`: creates an empty tensor with the new shape `new_shape`, keeping the dtype and device of the original tensor.
- `backward`: passes the gradient through so the op can take part in backpropagation.
Effect:
- Keeps the code intact: the program no longer aborts just because `_NewEmptyTensorOp` is missing.
- Supports gradient computation: the custom `forward` and `backward` make sure this tensor op does not break autograd during training.
After the fix, clean the cached builds of the CUDA ops and recompile them (run these in ./models/ops again):
python setup.py clean
python setup.py build develop
Then re-run the training command from above.
It will automatically download the pretrained backbone weights and then start training:
# Downloading the pretrained weights
Namespace(aux_loss=True, backbone='resnet50', batch_size=2, bbox_loss_coef=5, cache_mode=False, clip_max_norm=0.1, cls_loss_coef=2, coco_panoptic_path=None, coco_path='./data/coco', dataset_file='coco', dec_layers=6, dec_n_points=4, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=1024, dist_backend='nccl', dist_url='env://', distributed=True, dropout=0.1, enc_layers=6, enc_n_points=4, epochs=50, eval=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, lr=0.0002, lr_backbone=2e-05, lr_backbone_names=['backbone.0'], lr_drop=40, lr_drop_epochs=None, lr_linear_proj_mult=0.1, lr_linear_proj_names=['reference_points', 'sampling_offsets'], mask_loss_coef=1, masks=False, nheads=8, num_feature_levels=4, num_queries=300, num_workers=2, output_dir='exps/r50_deformable_detr', position_embedding='sine', position_embedding_scale=6.283185307179586, rank=0, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, sgd=False, start_epoch=0, two_stage=False, weight_decay=0.0001, with_box_refine=False, world_size=1)
/root/miniconda3/envs/MV/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.warnings.warn(
/root/miniconda3/envs/MV/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████████████████████████████████████████████████████████| 97.8M/97.8M [00:08<00:00, 12.0MB/s]
# Training starts, 50 epochs in total
Start training
/root/autodl-tmp/Deformable_DETR/models/position_encoding.py:49: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
/root/miniconda3/envs/MV/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Epoch: [0] [ 0/500] eta: 0:17:25 lr: 0.000200 class_error: 90.91 grad_norm: 90.38 loss: 33.3953 (33.3953) loss_ce: 2.1406 (2.1406) loss_bbox: 1.6266 (1.6266) loss_giou: 1.6883 (1.6883) loss_ce_0: 2.0890 (2.0890) loss_bbox_0: 1.7336 (1.7336) loss_giou_0: 1.7052 (1.7052) loss_ce_1: 2.1477 (2.1477) loss_bbox_1: 1.7186 (1.7186) loss_giou_1: 1.6883 (1.6883) loss_ce_2: 2.2523 (2.2523) loss_bbox_2: 1.7217 (1.7217) loss_giou_2: 1.6883 (1.6883) loss_ce_3: 2.3523 (2.3523) loss_bbox_3: 1.6521 (1.6521) loss_giou_3: 1.6883 (1.6883) loss_ce_4: 2.1289 (2.1289) loss_bbox_4: 1.6850 (1.6850) loss_giou_4: 1.6883 (1.6883) loss_ce_unscaled: 1.0703 (1.0703) class_error_unscaled: 90.9091 (90.9091) loss_bbox_unscaled: 0.3253 (0.3253) loss_giou_unscaled: 0.8442 (0.8442) cardinality_error_unscaled: 294.5000 (294.5000) loss_ce_0_unscaled: 1.0445 (1.0445) loss_bbox_0_unscaled: 0.3467 (0.3467) loss_giou_0_unscaled: 0.8526 (0.8526) cardinality_error_0_unscaled: 293.0000 (293.0000) loss_ce_1_unscaled: 1.0739 (1.0739) loss_bbox_1_unscaled: 0.3437 (0.3437) loss_giou_1_unscaled: 0.8442 (0.8442) cardinality_error_1_unscaled: 294.5000 (294.5000) loss_ce_2_unscaled: 1.1262 (1.1262) loss_bbox_2_unscaled: 0.3443 (0.3443) loss_giou_2_unscaled: 0.8442 (0.8442) cardinality_error_2_unscaled: 294.5000 (294.5000) loss_ce_3_unscaled: 1.1762 (1.1762) loss_bbox_3_unscaled: 0.3304 (0.3304) loss_giou_3_unscaled: 0.8442 (0.8442) cardinality_error_3_unscaled: 294.5000 (294.5000) loss_ce_4_unscaled: 1.0644 (1.0644) loss_bbox_4_unscaled: 0.3370 (0.3370) loss_giou_4_unscaled: 0.8442 (0.8442) cardinality_error_4_unscaled: 294.5000 (294.5000) time: 2.0913 data: 0.0000 max mem: 2595
Epoch: [0] [ 10/500] eta: 0:04:08 lr: 0.000200 class_error: 100.00 grad_norm: 59.73 loss: 33.8059 (36.6621) loss_ce: 2.1492 (2.2435) loss_bbox: 1.9925 (2.2554) loss_giou: 1.7524 (1.7051) loss_ce_0: 2.0454 (2.0046) loss_bbox_0: 2.0563 (2.2853) loss_giou_0: 1.7421 (1.7168) loss_ce_1: 2.1477 (2.1195) loss_bbox_1: 2.0127 (2.2745) loss_giou_1: 1.7430 (1.7075) loss_ce_2: 2.1196 (2.0959) loss_bbox_2: 1.9863 (2.2725) loss_giou_2: 1.7475 (1.7035) loss_ce_3: 2.0785 (2.1923) loss_bbox_3: 1.9909 (2.2674) loss_giou_3: 1.7291 (1.7033) loss_ce_4: 2.1289 (2.1414) loss_bbox_4: 1.9995 (2.2686) loss_giou_4: 1.7426 (1.7052) loss_ce_unscaled: 1.0746 (1.1217) class_error_unscaled: 100.0000 (93.7190) loss_bbox_unscaled: 0.3985 (0.4511) loss_giou_unscaled: 0.8762 (0.8525) cardinality_error_unscaled: 295.0000 (294.2273) loss_ce_0_unscaled: 1.0227 (1.0023) loss_bbox_0_unscaled: 0.4113 (0.4571) loss_giou_0_unscaled: 0.8710 (0.8584) cardinality_error_0_unscaled: 295.0000 (293.7727) loss_ce_1_unscaled: 1.0739 (1.0597) loss_bbox_1_unscaled: 0.4025 (0.4549) loss_giou_1_unscaled: 0.8715 (0.8538) cardinality_error_1_unscaled: 295.0000 (294.2273) loss_ce_2_unscaled: 1.0598 (1.0479) loss_bbox_2_unscaled: 0.3973 (0.4545) loss_giou_2_unscaled: 0.8738 (0.8518) cardinality_error_2_unscaled: 295.0000 (294.2273) loss_ce_3_unscaled: 1.0393 (1.0962) loss_bbox_3_unscaled: 0.3982 (0.4535) loss_giou_3_unscaled: 0.8645 (0.8516) cardinality_error_3_unscaled: 295.0000 (294.2273) loss_ce_4_unscaled: 1.0644 (1.0707) loss_bbox_4_unscaled: 0.3999 (0.4537) loss_giou_4_unscaled: 0.8713 (0.8526) cardinality_error_4_unscaled: 295.0000 (294.2273) time: 0.5081 data: 0.0000 max mem: 8155
Epoch: [0] [ 20/500] eta: 0:03:13 lr: 0.000200 class_error: 96.00 grad_norm: 66.10 loss: 33.1548 (35.3786) loss_ce: 2.1193 (2.1767) loss_bbox: 1.6732 (2.0712) loss_giou: 1.7230 (1.7049) loss_ce_0: 1.9877 (2.0215) loss_bbox_0: 1.7229 (2.0959) loss_giou_0: 1.7421 (1.7148) loss_ce_1: 2.1756 (2.1125) loss_bbox_1: 1.6997 (2.0856) loss_giou_1: 1.7295 (1.7072) loss_ce_2: 2.0581 (2.0707) loss_bbox_2: 1.6692 (2.0807) loss_giou_2: 1.7245 (1.7052) loss_ce_3: 2.0403 (2.1157) loss_bbox_3: 1.6682 (2.0794) loss_giou_3: 1.7242 (1.7036) loss_ce_4: 2.1486 (2.1487) loss_bbox_4: 1.6742 (2.0805) loss_giou_4: 1.7218 (1.7038) loss_ce_unscaled: 1.0597 (1.0884) class_error_unscaled: 100.0000 (92.4284) loss_bbox_unscaled: 0.3346 (0.4142) loss_giou_unscaled: 0.8615 (0.8525) cardinality_error_unscaled: 295.0000 (293.8810) loss_ce_0_unscaled: 0.9938 (1.0107) loss_bbox_0_unscaled: 0.3446 (0.4192) loss_giou_0_unscaled: 0.8710 (0.8574) cardinality_error_0_unscaled: 295.0000 (293.6429) loss_ce_1_unscaled: 1.0878 (1.0563) loss_bbox_1_unscaled: 0.3399 (0.4171) loss_giou_1_unscaled: 0.8648 (0.8536) cardinality_error_1_unscaled: 295.0000 (293.8810) loss_ce_2_unscaled: 1.0290 (1.0354) loss_bbox_2_unscaled: 0.3338 (0.4161) loss_giou_2_unscaled: 0.8623 (0.8526) cardinality_error_2_unscaled: 295.0000 (293.8810) loss_ce_3_unscaled: 1.0201 (1.0579) loss_bbox_3_unscaled: 0.3336 (0.4159) loss_giou_3_unscaled: 0.8621 (0.8518) cardinality_error_3_unscaled: 295.0000 (293.8810) loss_ce_4_unscaled: 1.0743 (1.0744) loss_bbox_4_unscaled: 0.3348 (0.4161) loss_giou_4_unscaled: 0.8609 (0.8519) cardinality_error_4_unscaled: 295.0000 (293.8810) time: 0.3184 data: 0.0000 max mem: 8155