Accelerating Vision Transformers with torch.compile

2025/4/2 1:03:58 · Source: https://www.cnblogs.com/wujianming-110117/p/18801446

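The Vision Transformer (ViT) is a BERT-like transformer encoder model pretrained in a supervised fashion on a large image collection, namely ImageNet-21k at a resolution of 224×224 pixels. The example below uses the vit-base-patch16-224 checkpoint, which is additionally fine-tuned on ImageNet-1k, to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes.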

from transformers import ViTImageProcessor, ViTForImageClassification

from PIL import Image

import requests

import torch

import matplotlib.pyplot as plt

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'

image = Image.open(requests.get(url, stream=True).raw)

plt.imshow(image)

plt.axis('off')  # hide the axes

plt.show()

# Load the image processor and the model

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = processor(images=image, return_tensors="pt")

# Move the inputs and the model to the GPU when one is available
if torch.cuda.is_available():
    inputs = inputs.to('cuda')
    model.to('cuda')

outputs = model(**inputs)

logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes

predicted_class_idx = logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

The test image, an Egyptian cat, is shown in Figure 5-3.

Figure 5-3 Test image: Egyptian cat

Output:

Predicted class: Egyptian cat
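The example above reports only the top-1 class. To see how confident the model is, the logits can be converted to probabilities and the top few classes listed. The following is a minimal sketch and not part of the original walkthrough: `top_k_labels` is a hypothetical helper, and the toy logits and label map stand in for `outputs.logits` and `model.config.id2label`.

```python
import torch


def top_k_labels(logits, id2label, k=5):
    """Return the k most probable (label, probability) pairs for one image.

    Hypothetical helper for illustration; `logits` has shape (1, num_classes).
    """
    # Compute the softmax in float32 so low-precision logits keep their accuracy.
    probs = torch.softmax(logits.float(), dim=-1)
    values, indices = torch.topk(probs, k, dim=-1)
    return [(id2label[i], p) for i, p in zip(indices[0].tolist(), values[0].tolist())]


# Toy stand-ins for `outputs.logits` and `model.config.id2label`.
toy_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
toy_id2label = {0: "class_a", 1: "class_b", 2: "class_c", 3: "class_d"}

top2 = top_k_labels(toy_logits, toy_id2label, k=2)
for label, prob in top2:
    print(f"{label}: {prob:.3f}")
```

With the real model, the call would presumably be `top_k_labels(outputs.logits, model.config.id2label)`.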

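The model and the environment both look good. Next, we follow the same test procedure used for ResNet-152: run the model in each mode and compare the performance at the end. In each mode we run 10 warm-up iterations, followed by 20 further iterations whose average gives the model's inference time.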

n_warmup = 10

n_test = 20

dtype = torch.bfloat16

inference_time=[]

mode=[]
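The `timed` helper called in the following sections is not defined in this part of the text; it is carried over from the ResNet-152 test. A minimal sketch of what such a helper might look like is given below, assuming it runs `fn` for `n_test` iterations under autocast and returns the mean latency in milliseconds together with the last output. The body is an assumption rather than the author's exact code: it uses wall-clock timing with explicit synchronization, and it falls back to the CPU so that it can run without a GPU.

```python
import time

import torch


def timed(fn, n_test, dtype):
    """Run fn() n_test times; return (mean latency in ms, last output)."""
    use_cuda = torch.cuda.is_available()
    device_type = "cuda" if use_cuda else "cpu"
    latencies_ms = []
    output = None
    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=dtype):
        for _ in range(n_test):
            if use_cuda:
                torch.cuda.synchronize()  # finish pending GPU work first
            start = time.perf_counter()
            output = fn()
            if use_cuda:
                torch.cuda.synchronize()  # wait for the kernels launched by fn()
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return sum(latencies_ms) / len(latencies_ms), output


# Toy usage on a small linear layer instead of the ViT model.
layer = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)
if torch.cuda.is_available():
    layer, x = layer.to("cuda"), x.to("cuda")
t_ms, out = timed(lambda: layer(x), 5, torch.bfloat16)
print(f"mean latency: {t_ms:.4f} ms, output shape: {tuple(out.shape)}")
```

Synchronizing before and after each call matters on a GPU because kernel launches are asynchronous; without it the timer would mostly measure launch overhead.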

5.1.7 Evaluating the Vision Transformer in eager mode

torch._dynamo.reset()

t_warmup, _ = timed(lambda:model(**inputs), n_warmup, dtype)

t_test, output = timed(lambda:model(**inputs), n_test, dtype)

print(f"Average inference time for ViT (warmup): dt_test={t_warmup} ms")

print(f"Average inference time for ViT (test): dt_test={t_test} ms")

inference_time.append(t_test)

mode.append("eager")

# The model predicts one of the 1,000 ImageNet classes

predicted_class_idx = output.logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

Output:

Average inference time for ViT (warmup): dt_test=8.17105770111084 ms

Average inference time for ViT (test): dt_test=7.561385631561279 ms

Predicted class: Egyptian cat

Evaluating the Vision Transformer in torch.compile(default) mode

torch._dynamo.reset()

model_opt1 = torch.compile(model, fullgraph=True)

t_compilation, _ = timed(lambda:model_opt1(**inputs), 1, dtype)

t_warmup, _ = timed(lambda:model_opt1(**inputs), n_warmup, dtype)

t_test, output = timed(lambda:model_opt1(**inputs), n_test, dtype)

print(f"Compilation time: dt_compilation={t_compilation} ms")

print(f"Average inference time for ViT (warmup): dt_test={t_warmup} ms")

print(f"Average inference time for ViT (test): dt_test={t_test} ms")

inference_time.append(t_test)

mode.append("default")

# The model predicts one of the 1,000 ImageNet classes

predicted_class_idx = output.logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

Output:

Compilation time: dt_compilation=13211.912631988525 ms

Average inference time for ViT (warmup): dt_test=7.065939903259277 ms

Average inference time for ViT (test): dt_test=7.033288478851318 ms

Predicted class: Egyptian cat

Evaluating the Vision Transformer in torch.compile(reduce-overhead) mode

torch._dynamo.reset()

model_opt2 = torch.compile(model, mode="reduce-overhead", fullgraph=True)

t_compilation, _ = timed(lambda:model_opt2(**inputs), 1, dtype)

t_warmup, _ = timed(lambda:model_opt2(**inputs), n_warmup, dtype)

t_test, output = timed(lambda:model_opt2(**inputs), n_test, dtype)

print(f"Compilation time: dt_compilation={t_compilation} ms")

print(f"Average inference time for ViT (warmup): dt_test={t_warmup} ms")

print(f"Average inference time for ViT (test): dt_test={t_test} ms")

inference_time.append(t_test)

mode.append("reduce-overhead")

# The model predicts one of the 1,000 ImageNet classes

predicted_class_idx = output.logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

Output:

Compilation time: dt_compilation=10051.868438720703 ms

Average inference time for ViT (warmup): dt_test=30.241727828979492 ms

Average inference time for ViT (test): dt_test=3.2375097274780273 ms

Predicted class: Egyptian cat

Evaluating the Vision Transformer in torch.compile(max-autotune) mode

torch._dynamo.reset()

model_opt3 = torch.compile(model, mode="max-autotune", fullgraph=True)

t_compilation, _ = timed(lambda:model_opt3(**inputs), 1, dtype)

t_warmup, _ = timed(lambda:model_opt3(**inputs), n_warmup, dtype)

t_test, output = timed(lambda:model_opt3(**inputs), n_test, dtype)

print(f"Compilation time: dt_compilation={t_compilation} ms")

print(f"Average inference time for ViT (warmup): dt_test={t_warmup} ms")

print(f"Average inference time for ViT (test): dt_test={t_test} ms")

inference_time.append(t_test)

mode.append("max-autotune")

# The model predicts one of the 1,000 ImageNet classes

predicted_class_idx = output.logits.argmax(-1).item()

print("Predicted class:", model.config.id2label[predicted_class_idx])

Output:

AUTOTUNE convolution(1x3x224x224, 768x3x16x16)

convolution 0.0995 ms 100.0%

triton_convolution_2191 0.2939 ms 33.9%

triton_convolution_2190 0.3046 ms 32.7%

triton_convolution_2194 0.3840 ms 25.9%

triton_convolution_2195 0.4038 ms 24.6%

triton_convolution_2188 0.4170 ms 23.9%

...

AUTOTUNE addmm(197x768, 197x768, 768x768)

bias_addmm 0.0278 ms 100.0%

addmm 0.0278 ms 100.0%

triton_mm_2213 0.0363 ms 76.7%

triton_mm_2212 0.0392 ms 71.0%

triton_mm_2207 0.0438 ms 63.5%

triton_mm_2209 0.0450 ms 61.9%

triton_mm_2206 0.0478 ms 58.2%

triton_mm_2197 0.0514 ms 54.2%

triton_mm_2208 0.0533 ms 52.3%

triton_mm_2196 0.0538 ms 51.8%

...

AUTOTUNE addmm(1x1000, 1x768, 768x1000)

bias_addmm 0.0229 ms 100.0%

addmm 0.0229 ms 100.0%

triton_mm_4268 0.0338 ms 67.8%

triton_mm_4269 0.0338 ms 67.8%

triton_mm_4266 0.0382 ms 59.8%

triton_mm_4267 0.0382 ms 59.8%

triton_mm_4272 0.0413 ms 55.4%

triton_mm_4273 0.0413 ms 55.4%

triton_mm_4260 0.0466 ms 49.1%

triton_mm_4261 0.0466 ms 49.1%

SingleProcess AUTOTUNE takes 8.9279 seconds

Compilation time: dt_compilation=103891.38770103455 ms

Average inference time for ViT (warmup): dt_test=31.742525100708004 ms

Average inference time for ViT (test): dt_test=3.2366156578063965 ms

Predicted class: Egyptian cat
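The tensor shapes in the AUTOTUNE log can be traced back to the model configuration. The sketch below assumes the standard ViT-Base settings implied by the checkpoint name (16×16 patches, 224×224 input), plus a hidden size of 768 and one extra [CLS] token; the variable names are illustrative.

```python
image_size = 224    # input resolution, from the checkpoint name
patch_size = 16     # patch edge length, from the checkpoint name
hidden_size = 768   # ViT-Base embedding width (assumed; matches the log)
num_classes = 1000  # ImageNet-1k classes

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
seq_len = num_patches + 1  # +1 for the [CLS] token

# Patch embedding is a strided convolution: input 1x3x224x224, weight 768x3x16x16.
conv_weight_shape = (hidden_size, 3, patch_size, patch_size)
# Linear layers inside the encoder act on a (seq_len x hidden_size) activation.
encoder_activation_shape = (seq_len, hidden_size)
# The classifier head maps the [CLS] embedding to the class logits.
head_weight_shape = (hidden_size, num_classes)

print(num_patches, seq_len)  # 196 197
print(conv_weight_shape, encoder_activation_shape, head_weight_shape)
```

These values line up with the log: the convolution entry is the patch embedding, the 197x768 addmm entries are linear layers in the encoder, and the 1x768 by 768x1000 addmm is the classification head.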

Finally, we compare the ViT inference times obtained in the four modes above.

# Draw a bar chart

plt.bar(mode, inference_time)

print(inference_time)

print(mode)

# Add the axis labels and the title

plt.xlabel('mode')

plt.ylabel('Inference time (ms)')

plt.title('ViT')

# Show the plot

plt.show()

Output:

[7.561385631561279, 7.033288478851318, 3.2375097274780273, 3.2366156578063965]

['eager', 'default', 'reduce-overhead', 'max-autotune']

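torch.compile significantly improves ViT performance. On an AMD MI210 with ROCm, the reduce-overhead and max-autotune modes cut the average inference time from about 7.56 ms to about 3.24 ms, a speedup of more than 2.3x, while the default mode gives only a modest gain (about 7.03 ms), as shown in Figure 5-3.

Figure 5-3 torch.compile speeds up ViT by more than 2.3x on ROCm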
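The speedup figures and the cost of compilation can be worked out directly from the numbers printed above. The following sketch copies the measured values from the outputs in this section. The break-even count is an illustrative calculation that assumes the per-call saving stays constant, and `break_even_calls` is a hypothetical helper.

```python
# Average inference times (ms) and compile times (ms) reported in this section.
inference_ms = {
    "eager": 7.561385631561279,
    "default": 7.033288478851318,
    "reduce-overhead": 3.2375097274780273,
    "max-autotune": 3.2366156578063965,
}
compile_ms = {
    "default": 13211.912631988525,
    "reduce-overhead": 10051.868438720703,
    "max-autotune": 103891.38770103455,
}

eager_ms = inference_ms["eager"]
speedup = {mode: eager_ms / t for mode, t in inference_ms.items()}


def break_even_calls(mode):
    """Calls needed before the time saved per call repays the compile time."""
    saved_per_call = eager_ms - inference_ms[mode]
    return compile_ms[mode] / saved_per_call


for mode in compile_ms:
    print(f"{mode}: {speedup[mode]:.2f}x speedup, "
          f"break-even after about {break_even_calls(mode):,.0f} calls")
```

Under these assumptions, reduce-overhead repays its compile time after roughly 2,300 calls, whereas default and max-autotune need roughly 25,000 and 24,000 calls respectively. In this measurement max-autotune is only marginally faster than reduce-overhead but takes about ten times longer to compile.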
