Pix2Seq Algorithm Reading Notes

Contents

Forward pass

Training process:

Network structure


Forward pass

batch_preds --> tgt --> tgt = cat(tgt, padding) --> tgt_embedding --> tgt_mask, tgt_padding_mask

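From an NLP point of view, tgt is the target token sequence, with each entry an index into the vocabulary. The encoder processes the image directly and outputs a sequence of patch embeddings; with 16x downsampling, the final output sequence has length 576.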
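The sequence length of 576 can be checked with a little arithmetic. The sketch below assumes a square 384x384 input, which is not stated in these notes but is consistent with 576 = 24 x 24; the patch size of 16 matches the Conv2d(3, 384, kernel_size=(16, 16), stride=(16, 16)) patch embedding listed in the network structure section. The function name num_patches is hypothetical.

```python
def num_patches(img_size: int, patch_size: int) -> int:
    """Number of tokens a ViT-style encoder emits for a square image (no class token)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    side = img_size // patch_size  # patches along one side after 16x downsampling
    return side * side


# img_size=384 is an assumption; patch_size=16 comes from the patch embedding layer.
seq_len = num_patches(img_size=384, patch_size=16)
print(seq_len)  # 576
```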

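In the decoder, a causal mask is generated from the maximum sequence length; its lower triangle (including the diagonal) is all 0. There is also a padding mask: the first entry, which corresponds to the start token, is False, and all remaining entries mark padding positions.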
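A minimal sketch of the two masks, written in plain Python so the values are easy to inspect. It assumes the common additive-mask convention in which positions above the diagonal hold -inf (the notes only say the lower triangle is 0), and a tiny max_len chosen for readability. The helper names are hypothetical; the token ids 404 and 406 are the ones used in these notes.

```python
from typing import List


def causal_mask(size: int) -> List[List[float]]:
    """Additive attention mask: 0.0 on and below the diagonal, -inf above it."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(size)]
            for row in range(size)]


def padding_mask(tokens: List[int], pad_idx: int) -> List[bool]:
    """True marks a padding position that attention should ignore."""
    return [tok == pad_idx for tok in tokens]


BOS, PAD = 404, 406   # start and padding token ids from the notes
max_len = 5           # hypothetical value, kept small for readability
tokens = [BOS] + [PAD] * (max_len - 1)

mask = causal_mask(max_len)
pad = padding_mask(tokens, PAD)
print(pad)  # [False, True, True, True, True]
```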

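The decoder input sequence is initialized entirely with the padding token (406), with a start token placed in front, so the input token embeddings are all initialized from this sequence.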

The decoder takes four inputs: the encoder output sequence plus positional encoding, the initialized token embeddings, the causal mask, and the padding mask.

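Because detection runs on a single image, whereas an NLP task has to predict a sentence that may contain several words, the output only uses

return self.output(preds)[:, length-1, :]

for prediction, that is, the logits at the last filled position (length - 1).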
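The effect of returning only the logits at position length - 1 is easiest to see in a greedy decoding loop. The sketch below is not the repository's code: toy_model is a hypothetical stand-in for the decoder that replays a fixed script, and greedy_decode simply takes the arg-max token at each step until EOS appears or max_len is reached.

```python
from typing import Callable, List

BOS, EOS = 404, 405  # start and end token ids from the notes


def greedy_decode(next_logits: Callable[[List[int]], List[float]],
                  max_len: int) -> List[int]:
    """Generate tokens one at a time, always taking the arg-max logit."""
    seq = [BOS]
    while len(seq) < max_len:
        logits = next_logits(seq)  # logits for the last filled position only
        token = max(range(len(logits)), key=logits.__getitem__)
        seq.append(token)
        if token == EOS:
            break
    return seq


def toy_model(seq: List[int]) -> List[float]:
    """Hypothetical decoder stand-in: one box (4 tokens), one class token, then EOS."""
    script = [10, 20, 110, 120, 387, EOS]
    target = script[len(seq) - 1]
    return [1.0 if i == target else 0.0 for i in range(407)]


print(greedy_decode(toy_model, max_len=16))
# [404, 10, 20, 110, 120, 387, 405]
```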

Note

For sentence generation in NLP, greedy search, and top_k_top_p_filtering, see here

Predictions are generated autoregressively; the predictions produced by the forward pass are visualized as follows

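The 404 there comes from

num_bins + class

that is, the number of coordinate bins plus the number of classes. After discretization the vocabulary actually holds 406 tokens once the start (404) and end (405) tokens are added; counting the padding token (406) as well gives 407 entries, which matches Embedding(407, 256) in the network structure.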
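The token-id layout can be written out explicitly. In the sketch below, num_bins=384 and num_classes=20 are assumptions: they are one split that sums to 404, and the notes do not state the individual values. The padding id 406 and the vocabulary size 407 agree with the padding token and the Embedding(407, 256) layer mentioned elsewhere in these notes. The function name build_vocab is hypothetical.

```python
def build_vocab(num_bins: int, num_classes: int) -> dict:
    """Token-id layout: coordinate bins first, then class ids, then special tokens."""
    bos = num_bins + num_classes
    return {
        "coord_range": (0, num_bins - 1),
        "class_range": (num_bins, num_bins + num_classes - 1),
        "bos": bos,
        "eos": bos + 1,
        "pad": bos + 2,
        "vocab_size": bos + 3,
    }


# Assumed split: 384 coordinate bins + 20 classes = 404.
vocab = build_vocab(num_bins=384, num_classes=20)
print(vocab["bos"], vocab["eos"], vocab["pad"], vocab["vocab_size"])  # 404 405 406 407
```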

Once the network's output predictions above are available, they are post-processed.

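1. Find the index of the first end-of-sequence token (EOS), index.
2. Check whether index - 1 is a multiple of 5; if it is not, skip this prediction and treat it as containing no detected objects.
3. Remove the extra padding tokens (noise).
4. Iterate over the sequence, taking 5 tokens at a time.
5. The first 4 tokens carry the box; the 5th carries the class.
6. Subtract num_bins from the predicted class token to get the final class.
7. De-quantize the box: box / (num_bins - 1), which gives normalized box coordinates at the output scale.
8. Rescale the box to the input image size; the box format is (Xmin, Ymin, Xmax, Ymax).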
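The eight steps above can be sketched as a single function. This is an illustrative version under stated assumptions, not the repository's implementation: the name decode_sequence and its parameters are hypothetical, num_bins=384 is assumed, and a second length check after padding removal is added as a defensive guard.

```python
from typing import List, Tuple


def decode_sequence(tokens: List[int], num_bins: int, eos: int, pad: int,
                    img_w: int, img_h: int) -> List[Tuple[List[float], int]]:
    """Turn a predicted token sequence (starting with BOS) into boxes and labels."""
    if eos not in tokens:
        return []
    eos_index = tokens.index(eos)                 # step 1: first EOS
    if (eos_index - 1) % 5 != 0:                  # step 2: BOS occupies position 0
        return []
    body = [t for t in tokens[1:eos_index] if t != pad]   # step 3: drop padding
    if len(body) % 5 != 0:                        # extra guard, not in the notes
        return []
    results = []
    for i in range(0, len(body), 5):              # step 4: 5 tokens per object
        *box, label_tok = body[i:i + 5]           # step 5: 4 box tokens + 1 class
        label = label_tok - num_bins              # step 6: class id
        xmin, ymin, xmax, ymax = (b / (num_bins - 1) for b in box)  # step 7
        results.append(([xmin * img_w, ymin * img_h,
                         xmax * img_w, ymax * img_h], label))       # step 8
    return results


# Assumed values: num_bins=384, a 640x480 input image.
pred = [404, 0, 0, 383, 383, 387, 405, 406, 406]
print(decode_sequence(pred, num_bins=384, eos=405, pad=406, img_w=640, img_h=480))
# [([0.0, 0.0, 640.0, 480.0], 3)]
```

A sequence whose first EOS does not fall on a 5-token boundary, such as [404, 1, 2, 3, 405], yields an empty list, which corresponds to "no objects detected" in step 2.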

Training process:

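As for the various training settings described in the paper, including the sequence augmentation technique, I did not come across them and have not studied them closely.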

Loss function: the paper states that it uses maximum likelihood estimation; in its final form this is the cross-entropy loss.
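Maximizing the likelihood of the target tokens is the same as minimizing the mean negative log-probability of each target token, which is the cross-entropy. The sketch below computes it in plain Python for a toy vocabulary. Skipping padding positions is an assumption (a common choice, not confirmed by these notes), and the function name token_cross_entropy is hypothetical.

```python
import math
from typing import List


def token_cross_entropy(logits: List[List[float]], targets: List[int],
                        pad_idx: int) -> float:
    """Mean negative log-likelihood of the target tokens, skipping padding."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == pad_idx:
            continue  # assumed: padding positions do not contribute to the loss
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]  # -log softmax(row)[target]
        count += 1
    return total / count if count else 0.0


# Toy vocabulary of 4 tokens, with id 3 standing in for padding.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = token_cross_entropy(uniform, targets=[0, 2, 3], pad_idx=3)
print(abs(loss - math.log(4)) < 1e-9)  # True: uniform logits give log(vocab size)
```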

Network structure

EncoderDecoder(
  (encoder): Encoder(
    (model): VisionTransformer(
      (patch_embed): PatchEmbed(
        (proj): Conv2d(3, 384, kernel_size=(16, 16), stride=(16, 16))
        (norm): Identity()
      )
      (pos_drop): Dropout(p=0.0, inplace=False)
      (blocks): Sequential(
        (0-11): 12 x Block(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=384, out_features=1152, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=384, out_features=384, bias=True)
            (proj_drop): Dropout(p=0.0, inplace=False)
          )
          (ls1): LayerScale()
          (drop_path1): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Mlp(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (act): GELU()
            (drop1): Dropout(p=0.0, inplace=False)
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
            (drop2): Dropout(p=0.0, inplace=False)
          )
          (ls2): LayerScale()
          (drop_path2): Identity()
        )
      )
      (norm): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
      (fc_norm): Identity()
      (head): Identity()
    )
    (bottleneck): AdaptiveAvgPool1d(output_size=256)
  )
  (decoder): Decoder(
    (embedding): Embedding(407, 256)
    (decoder_pos_drop): Dropout(p=0.05, inplace=False)
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-5): 6 x TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (output): Linear(in_features=256, out_features=407, bias=True)
    (encoder_pos_drop): Dropout(p=0.05, inplace=False)
  )
)
