[Deformable Attention (1)] Multi-scale Deformable Attention Transformers

Table of Contents

  • Preface
  • Notes on the multi-scale deformable attention in the paper "Deformable DETR: Deformable Transformers for End-to-End Object Detection"
    • DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION
  • 2. Deformable Attention Module
      • Deformable Attention Module
  • 3. Multi-Scale Deformable Attention Module
      • Multi-scale Deformable Attention Module
      • Deformable Transformer Encoder
      • Deformable Transformer Decoder
    • A.2 CONSTRUCTING MULTI-SCALE FEATURE MAPS FOR DEFORMABLE DETR
  • References

Preface

1. Most modern object detection frameworks benefit from multi-scale feature maps (Liu et al., 2020).

2. The deformable attention module proposed in "Deformable DETR: Deformable Transformers for End-to-End Object Detection" can be naturally extended to multi-scale feature maps.

Notes on the multi-scale deformable attention in the paper "Deformable DETR: Deformable Transformers for End-to-End Object Detection"

DEFORMABLE TRANSFORMERS FOR END-TO-END OBJECT DETECTION

2. Deformable Attention Module

  • Given an input feature map x of size C×H×W, let q index a query element with content feature z_q and a 2-d reference point p_q; the deformable attention feature is calculated by:

DeformAttn(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x(p_q + \Delta p_{mqk}) \right]

  • where m indexes the attention head (M is the number of heads), k indexes the sampled keys, and K is the total number of sampled keys (K ≪ HW).

[Figure: Deformable Attention Module]

  • Δp_{mqk} and A_{mqk} denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively.
  • As p_q + Δp_{mqk} is fractional, bilinear interpolation is applied.
  • Both Δp_{mqk} and A_{mqk} are obtained via linear projection over the query feature z_q.
  • The query feature z_q is fed to a linear projection operator of 3MK channels, where the first 2MK channels encode the sampling offsets Δp_{mqk}, and the remaining MK channels are fed to a softmax operator to obtain the attention weights A_{mqk}.

In brief, two sets of MK channels encode the offsets in the x and y directions, and the remaining set of MK channels encodes the attention weights.

These offsets are learned, which is similar in concept to DCN.
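As a rough illustration (not the paper's released code), the sketch below shows how a single 3MK-channel linear projection over z_q could be split into 2MK offset channels and MK softmax-normalized weight channels; the names, shapes, and the values of M, K, and C are assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of the 3*M*K-channel projection described above (assumed sizes).
M, K, C = 8, 4, 256              # heads, sampling points per head, channels
proj = nn.Linear(C, 3 * M * K)   # hypothetical projection layer

z_q = torch.randn(2, 100, C)     # (batch, num_queries, C) dummy query features
out = proj(z_q)                  # (batch, num_queries, 3*M*K)

# first 2*M*K channels -> (x, y) sampling offsets, unconstrained range
offsets = out[..., : 2 * M * K].view(2, 100, M, K, 2)
# remaining M*K channels -> attention weights, softmax over the K samples per head
weights = out[..., 2 * M * K :].view(2, 100, M, K).softmax(dim=-1)
```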

  • Let N_q be the number of query elements. When MK is relatively small, the complexity of the deformable attention module is:

O(2 N_q C^2 + \min(H W C^2, N_q K C^2))

  • When it is applied in the DETR encoder, where N_q = HW, the complexity becomes O(HWC²), which is linear in the spatial size.
  • When it is applied as the cross-attention modules in the DETR decoder, where N_q = N (N is the number of object queries), the complexity becomes O(NKC²), which is irrelevant to the spatial size HW.

Deformable Attention Module

Deformable Attention Module. The core issue of applying Transformer attention on image feature maps is that it would look over all possible spatial locations.
To address this, we propose a deformable attention module.
Inspired by deformable convolution (Dai et al., 2017; Zhu et al., 2019b), the deformable attention module only attends to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. As shown in Figure 2, by assigning only a small fixed number of keys to each query, the issues of convergence and feature spatial resolution can be mitigated.

[Figure 2: illustration of the proposed deformable attention module]

Given an input feature map x ∈ R^{C×H×W}, let q index a query element with content feature z_q and a 2-d reference point p_q; the deformable attention feature is calculated by
DeformAttn(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x(p_q + \Delta p_{mqk}) \right]    (2)
where m indexes the attention head, k indexes the sampled keys, and K is the total number of sampled keys (K ≪ HW). Δp_{mqk} and A_{mqk} denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively.

The scalar attention weight A_{mqk} lies in the range [0, 1], normalized by \sum_{k=1}^{K} A_{mqk} = 1.

Δp_{mqk} ∈ R^2 is a 2-d real number with an unconstrained range.

As p_q + Δp_{mqk} is fractional, bilinear interpolation as in Dai et al. (2017) is applied in computing x(p_q + Δp_{mqk}).

Both Δp_{mqk} and A_{mqk} are obtained via linear projection over the query feature z_q.

In implementation, the query feature z_q is fed to a linear projection operator of 3MK channels, where the first 2MK channels encode the sampling offsets Δp_{mqk}, and the remaining MK channels are fed to a softmax operator to obtain the attention weights A_{mqk}.
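Since p_q + Δp_{mqk} is fractional, x(p_q + Δp_{mqk}) must be bilinearly interpolated. Below is a minimal PyTorch sketch of such sampling using torch.nn.functional.grid_sample; it only illustrates the interpolation step (the released Deformable DETR code uses a fused CUDA kernel for the full module), and the function name and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_at(x, points):
    """Bilinearly sample feature map x (C, H, W) at fractional (x, y)
    pixel coordinates `points` of shape (N, 2); returns (N, C)."""
    C, H, W = x.shape
    # grid_sample expects coordinates normalized to [-1, 1]
    norm = torch.empty_like(points)
    norm[:, 0] = 2.0 * points[:, 0] / max(W - 1, 1) - 1.0   # x -> width axis
    norm[:, 1] = 2.0 * points[:, 1] / max(H - 1, 1) - 1.0   # y -> height axis
    grid = norm.view(1, 1, -1, 2)                            # (1, 1, N, 2)
    sampled = F.grid_sample(x.unsqueeze(0), grid,
                            mode='bilinear', align_corners=True)
    return sampled.view(C, -1).t()                           # (N, C)

# usage: sample one head's K offset locations around a reference point
x = torch.randn(256, 32, 32)
ref = torch.tensor([10.0, 20.0])
offsets = torch.randn(4, 2)                 # K = 4 learned offsets (unconstrained)
features = sample_at(x, ref + offsets)      # (4, 256), one vector per sampling point
```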

The deformable attention module is designed to process convolutional feature maps as key elements.
Let N_q be the number of query elements. When MK is relatively small, the complexity of the deformable attention module is O(2 N_q C^2 + min(HWC^2, N_q K C^2)) (see Appendix A.1 for details).
When it is applied in the DETR encoder, where N_q = HW, the complexity becomes O(HWC^2), which is linear in the spatial size.
When it is applied as the cross-attention modules in the DETR decoder, where N_q = N (N is the number of object queries), the complexity becomes O(NKC^2), which is irrelevant to the spatial size HW.

3. Multi-Scale Deformable Attention Module

Multi-scale deformable attention modules are used to replace the Transformer attention modules that process feature maps.

  • Let {x^l}_{l=1}^{L} be the input multi-scale feature maps, where x^l has size C×H_l×W_l. Let p̂_q ∈ [0, 1]² be the normalized coordinates of the reference point for each query element q; then the multi-scale deformable attention module is applied as:

MSDeformAttn(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m\, x^l(\phi_l(\hat{p}_q) + \Delta p_{mlqk}) \right]

  • The normalized coordinates (0, 0) and (1, 1) indicate the top-left and the bottom-right image corners, respectively. φ_l(p̂_q) rescales the normalized coordinates p̂_q to the input feature map of the l-th level.
  • The multi-scale deformable attention is very similar to the previous single-scale version, except that it samples LK points from multi-scale feature maps instead of K points from single-scale feature maps.
  • The proposed attention module degenerates to deformable convolution, as in DCN, when L = 1, K = 1, and W'_m is fixed as an identity matrix.

The proposed (multi-scale) deformable attention module can also be perceived as an efficient variant of Transformer attention, where a pre-filtering mechanism is introduced by the deformable sampling locations.

Multi-scale Deformable Attention Module

Multi-scale Deformable Attention Module. Most modern object detection frameworks benefit from multi-scale feature maps (Liu et al., 2020). Our proposed deformable attention module can be naturally extended to multi-scale feature maps.

Let {x^l}_{l=1}^{L} be the input multi-scale feature maps, where x^l ∈ R^{C×H_l×W_l}.
Let p̂_q ∈ [0, 1]^2 be the normalized coordinates of the reference point for each query element q;
then the multi-scale deformable attention module is applied as

MSDeformAttn(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m\, x^l(\phi_l(\hat{p}_q) + \Delta p_{mlqk}) \right]    (3)
where m indexes the attention head,
l indexes the input feature level,
and k indexes the sampling point.
Δp_{mlqk} and A_{mlqk} denote the sampling offset and attention weight of the k-th sampling point in the l-th feature level and the m-th attention head, respectively.
The scalar attention weight A_{mlqk} is normalized by \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} = 1.

Here, we use normalized coordinates p̂_q ∈ [0, 1]^2 for the clarity of scale formulation, in which the normalized coordinates (0, 0) and (1, 1) indicate the top-left and the bottom-right image corners, respectively.

Function φ_l(p̂_q) in Equation 3 rescales the normalized coordinates p̂_q to the input feature map of the l-th level.
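As a tiny illustration of this rescaling (the helper name and feature-map sizes are made up for the example), the same normalized point maps to different pixel positions on each feature level:

```python
import torch

def phi_l(p_hat, H_l, W_l):
    """Hypothetical helper: rescale normalized (x, y) coordinates in [0, 1]^2
    to pixel coordinates of an H_l x W_l feature map."""
    return p_hat * torch.tensor([W_l, H_l], dtype=p_hat.dtype)

p_hat = torch.tensor([0.5, 0.25])
print(phi_l(p_hat, H_l=32, W_l=32))   # tensor([16.,  8.])
print(phi_l(p_hat, H_l=64, W_l=64))   # tensor([32., 16.])
```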

The multi-scale deformable attention is very similar to the previous single-scale version, except that it samples LK points from multi-scale feature maps instead of K points from single-scale feature maps.

The proposed attention module will degenerate to deformable convolution when L = 1, K = 1, and W'_m ∈ R^{C_v×C} is fixed as an identity matrix.

Deformable convolution is designed for single-scale inputs, focusing only on one sampling point for each attention head. However, our multi-scale deformable attention looks over multiple sampling points from multi-scale inputs. The proposed (multi-scale) deformable attention module can also be perceived as an efficient variant of Transformer attention, where a pre-filtering mechanism is introduced by the deformable sampling locations.
When the sampling points traverse all possible spatial locations, the (multi-scale) deformable attention module becomes equivalent to Transformer attention.
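To make the multi-scale aggregation in Equation 3 concrete, here is a pure-PyTorch sketch of the sampling-and-weighting core (bilinear sampling of K points on each of the L levels, weighted by A_{mlqk}); the function name, argument shapes, and the use of grid_sample are assumptions for illustration rather than the released implementation, and the output projection W_m is omitted.

```python
import torch
import torch.nn.functional as F

def ms_deform_attn_core(value_list, sampling_locations, attention_weights):
    """Sketch of the multi-scale deformable attention core (Equation 3).
    value_list:          list of L tensors, each (B, M, C_v, H_l, W_l)
    sampling_locations:  (B, Nq, M, L, K, 2), normalized to [0, 1]
    attention_weights:   (B, Nq, M, L, K), normalized over the L*K samples
    returns:             (B, Nq, M * C_v)  (before the output projection W_m)
    """
    B, Nq, M, L, K, _ = sampling_locations.shape
    sampled = []
    for l, value in enumerate(value_list):
        Cv, H_l, W_l = value.shape[2:]
        # [0, 1] -> [-1, 1] grid coordinates expected by grid_sample
        grid = 2.0 * sampling_locations[:, :, :, l] - 1.0         # (B, Nq, M, K, 2)
        grid = grid.permute(0, 2, 1, 3, 4).flatten(0, 1)          # (B*M, Nq, K, 2)
        val = value.flatten(0, 1)                                 # (B*M, C_v, H_l, W_l)
        s = F.grid_sample(val, grid, mode='bilinear',
                          padding_mode='zeros', align_corners=False)
        sampled.append(s)                                         # (B*M, C_v, Nq, K)
    sampled = torch.stack(sampled, dim=-2)                        # (B*M, C_v, Nq, L, K)
    w = attention_weights.permute(0, 2, 1, 3, 4).flatten(0, 1)    # (B*M, Nq, L, K)
    out = (sampled * w.unsqueeze(1)).sum(dim=(-2, -1))            # (B*M, C_v, Nq)
    return out.reshape(B, M * Cv, Nq).permute(0, 2, 1)            # (B, Nq, M*C_v)
```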

Deformable Transformer Encoder

We replace the Transformer attention modules processing feature maps in DETR with the proposed multi-scale deformable attention module. Both the input and output of the encoder are multi-scale feature maps with the same resolutions.

In the encoder, we extract multi-scale feature maps from the output feature maps of stages C3 through C5 in ResNet (He et al., 2016) (transformed by a 1×1 convolution), where C_l has resolution 2^l lower than the input image. The lowest-resolution feature map x^L is obtained via a 3×3 stride-2 convolution on the final C5 stage, and is denoted as C6. All the multi-scale feature maps have C = 256 channels. Note that the top-down structure in FPN (Lin et al., 2017a) is not used, because our proposed multi-scale deformable attention can by itself exchange information among multi-scale feature maps. Appendix A.2 also illustrates the construction of the multi-scale feature maps. Experiments in Section 5.2 show that adding FPN does not improve performance.
In applying the multi-scale deformable attention module in the encoder, the output is multi-scale feature maps with the same resolutions as the input. Both the key and query elements are pixels from the multi-scale feature maps.
For each query pixel, the reference point is itself. To identify which feature level each query pixel lies in, we add a scale-level embedding (denoted as e_l) to the feature representation, in addition to the positional embedding. Different from the positional embedding with fixed encodings, the scale-level embeddings {e_l} are randomly initialized and trained jointly with the network.
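A minimal sketch of this scale-level embedding under assumed shapes (L levels, C channels, per-level positional embeddings flattened to (H_l·W_l, C)); the variable names are illustrative only:

```python
import torch
import torch.nn as nn

L_levels, C = 4, 256
# {e_l}: randomly initialized and learned jointly with the network
level_embed = nn.Parameter(torch.randn(L_levels, C))

def add_level_embed(pos_embeds):
    """pos_embeds: list of L positional-embedding tensors, each (H_l*W_l, C).
    Adds the scale-level embedding e_l to every pixel of level l."""
    return [pos + level_embed[l] for l, pos in enumerate(pos_embeds)]
```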

Deformable Transformer Decoder

There are cross-attention and self-attention modules in the decoder. The query elements for both types of attention modules are the object queries. In the cross-attention modules, object queries extract features from the feature maps, where the key elements are the output feature maps from the encoder. In the self-attention modules, object queries interact with each other, where the key elements are the object queries. Since our proposed deformable attention module is designed for processing convolutional feature maps as key elements, we only replace each cross-attention module with the multi-scale deformable attention module, leaving the self-attention modules unchanged.
For each object query, the 2-d normalized coordinate of the reference point p̂_q is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.
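A minimal sketch of this prediction step, with illustrative names and sizes (C = 256 channels and 300 object queries are assumptions):

```python
import torch
import torch.nn as nn

C, num_queries = 256, 300
ref_point_head = nn.Linear(C, 2)              # learnable linear projection

query_embed = torch.randn(num_queries, C)     # object query embeddings
# sigmoid keeps the reference points as normalized coordinates in [0, 1]^2
reference_points = ref_point_head(query_embed).sigmoid()   # (num_queries, 2)
```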

Because the multi-scale deformable attention module extracts image features around the reference point, we let the detection head predict the bounding box as relative offsets w.r.t. the reference point, which further reduces the optimization difficulty. The reference point serves as the initial guess of the box center, and the detection head predicts the relative offsets w.r.t. the reference point (see Appendix A.3 for details). In this way, the learned decoder attention has a strong correlation with the predicted bounding boxes, which also accelerates training convergence.

By replacing the Transformer attention modules in DETR with the (multi-scale) deformable attention modules, we build an efficient and fast-converging detection system, called Deformable DETR (see Figure 1).
[Figure 1: illustration of Deformable DETR]

A.2 CONSTRUCTING MULTI-SCALE FEATURE MAPS FOR DEFORMABLE DETR

As discussed in Section 4.1 and illustrated in Figure 4, the input multi-scale feature maps of the encoder {x^l}_{l=1}^{L-1} (L = 4) are extracted from the output feature maps of stages C3 through C5 in ResNet (He et al., 2016) (transformed by a 1×1 convolution). The lowest-resolution feature map x^L is obtained via a 3×3 stride-2 convolution on the final C5 stage. Note that FPN (Lin et al., 2017a) is not used, because our proposed multi-scale deformable attention can by itself exchange information among multi-scale feature maps.
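A sketch of this construction under common assumptions (ResNet C3/C4/C5 channel widths of 512/1024/2048; layer names invented for the example; the normalization layers that usually follow each convolution are omitted):

```python
import torch
import torch.nn as nn

C = 256
in_channels = [512, 1024, 2048]                       # ResNet stages C3, C4, C5
lateral = nn.ModuleList([nn.Conv2d(c, C, kernel_size=1) for c in in_channels])
extra = nn.Conv2d(2048, C, kernel_size=3, stride=2, padding=1)  # x^L from C5

def build_feature_maps(c3, c4, c5):
    """c3, c4, c5: output feature maps of the ResNet stages (B, C_i, H_i, W_i)."""
    feats = [conv(x) for conv, x in zip(lateral, (c3, c4, c5))]  # x^1 .. x^3
    feats.append(extra(c5))                                      # x^4, lowest resolution
    return feats
```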

[Figure 4: constructing multi-scale feature maps for Deformable DETR]

References

https://sh-tsang.medium.com/review-deformable-transformers-for-end-to-end-object-detection-e29786ef2b4c

(End of article)
