Calculating Model Parameter Counts (Transformer as an Example)

Source: https://www.cnblogs.com/zh-jp/p/18615421 (2024/12/18)

Preface

Common trainable layers in a model include convolutional layers and linear layers. This post gives the formulas for their parameter counts and verifies them in PyTorch.

A helper to count a model's parameters:

import torch.nn as nn

def cal_params(model: nn.Module):
    num_learnable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    num_non_learnable_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return num_learnable_params, num_non_learnable_params
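
As a quick illustration (using an arbitrary small layer, not one from the rest of the post), freezing parameters with requires_grad=False moves them into the second count:

frozen = nn.Linear(8, 4)  # (8 + 1) * 4 = 36 parameters
for p in frozen.parameters():
    p.requires_grad = False  # freeze the layer: its parameters become non-learnable
print(cal_params(frozen))
# (0, 36)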

Convolutional layer

\[n_\text{param} = (k_h \times k_w \times c_\text{in} + 1) \times c_\text{out} \]

Here \(k_h,\ k_w\) are the height and width of the convolution kernel, \(c_\text{in},\ c_\text{out}\) are the numbers of input and output channels, and the \(+1\) accounts for the bias term.
Example: for a convolutional layer with 3 input channels, 64 output channels, and a \(3\times3\) kernel, the parameter count is:

\[n_\text{param} = (3 \times 3 \times 3 + 1) \times 64 = 1792 \]

Test


conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)
print(cal_params(conv))
# (1792, 0)

Linear layer

A linear layer computes \(y = Wx+b\); its parameters are the weight \(W\) and the bias \(b\), so the parameter count is:

\[n_\text{param} = (n_\text{in} + 1) \times n_\text{out} \]

Example: for a linear layer with input dimension 64 and output dimension 10, the parameter count is:

\[n_\text{param} = (64 + 1) \times 10 = 650 \]

Test

linear = nn.Linear(in_features=64, out_features=10)
print(cal_params(linear))
# (650, 0)

Parameter count of a Transformer

Encoder layer

In the Transformer architecture, the parameters of each encoder layer come from a multi-head attention block, a feed-forward block, and two LayerNorm layers. Let the feature dimension be \(d_\text{model}=d\).

Multi-head attention

There are four linear layers \(W_Q,\ W_K,\ W_V,\ W_O\), so the parameter count is: $$n_\text{atten}=(d + 1) \times d \times 4=4d^2 + 4d$$

Feed forward

It consists of two linear layers with a hidden dimension of \(d_\text{ff}\), giving \((d + 1) \times d_\text{ff} + (d_\text{ff} + 1) \times d\) parameters. In the paper \(d_\text{ff}=4d\), so:

\[n_\text{ff}=(d + 1) \times 4d + (4d + 1) \times d=8d^2 + 5d \]
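
This can likewise be checked with two stand-alone linear layers (assuming \(d=512\) and \(d_\text{ff}=2048\)):

ff_check = nn.Sequential(nn.Linear(512, 2048), nn.Linear(2048, 512))
print(cal_params(ff_check))
# (2099712, 0), i.e. 8*512^2 + 5*512 = 2,099,712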

LayerNorm

There are two LayerNorm layers; each has two parameter vectors \(\gamma,\ \beta\) of length \(d\), so a single LayerNorm layer has \(n_\text{ln}=2d\) parameters.
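
PyTorch's built-in nn.LayerNorm carries the same two parameter vectors, so a single layer can be checked directly:

print(cal_params(nn.LayerNorm(512)))
# (1024, 0), i.e. 2*512 = 1,024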

The total parameter count of one encoder layer is therefore:

\[n_\text{encoder layer} = n_\text{atten} + n_\text{ff} + 2n_\text{ln} = 4d^2 + 4d + 8d^2 + 5d + 4d = 12d^2 + 13d \]

Example: with \(d_\text{model}=512\), the parameter count of an encoder layer is:

\[n_\text{encoder layer} = 12 \times 512^2 + 13 \times 512 = 3,152,384 \]

Implementation of the encoder layer:

import math

import torch


class AddAndNorm(nn.Module):
    def __init__(self, d_model, dropout_prob, eps=1e-6):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
        self.dropout = nn.Dropout(p=dropout_prob)

    def norm(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.w * (x - mean) / (std + self.eps) + self.b

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))


def attention(q, k, v, mask=None, dropout=None):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        masking_value = -1e9 if scores.dtype == torch.float32 else -1e4
        scores = scores.masked_fill(mask == 0, masking_value)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, v), p_attn


class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout_prob=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, q, k, v, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # the same mask is applied to all h attention heads
        batch_size = q.size(0)
        # 1) apply the linear projections and split d_model into h heads of dimension d_k
        q = self.w_q(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        # 2) compute scaled dot-product attention
        x, self.attn = attention(q, k, v, mask=mask, dropout=self.dropout)
        # 3) concatenate the heads and apply the output projection
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(x)


class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout_prob=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))


class EncoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(2)])

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.attn(x, x, x, mask))
        return self.sublayer[1](x, self.ff)


d_model = 512
d_ff = 4 * d_model
h = 8
dropout_prob = 0.1
encoder_layer = EncoderLayer(d_model, h, d_ff, dropout_prob)
print(cal_params(encoder_layer))
# (3152384, 0)

Note: in "Attention Is All You Need" the Add & Norm sublayer is defined as $${\rm LayerNorm}(x + {\rm Sublayer}(x))$$ whereas the code above implements $$x+{\rm Sublayer}({\rm LayerNorm}(x))$$ This pre-norm arrangement has been discussed online [Discussion]; in short, it tends to train more stably and is the variant used in most practical implementations. Either way, the parameter count is unchanged.
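
For comparison, a minimal sketch of the paper's post-norm residual (a hypothetical variant reusing the AddAndNorm module above; only the order of operations changes, the parameters are the same):

class PostNormAddAndNorm(AddAndNorm):
    # post-norm as in the original paper: LayerNorm(x + Sublayer(x))
    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(sublayer(x)))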

Decoder layer

Comparing the two structures, each decoder layer has one more multi-head attention block (the cross-attention over the encoder output) and one more LayerNorm layer than an encoder layer, so its parameter count is:

\[n_\text{decoder layer} = n_\text{encoder layer} + n_\text{atten} + n_\text{ln} \]

\[= (12d^2 + 13d) + (4d^2 + 4d) + 2d = 16d^2 + 19d \]

Example: with \(d_\text{model}=512\), the parameter count of a decoder layer is:

\[n_\text{decoder layer} = 16 \times 512^2 + 19 \times 512 = 4,204,032 \]

Test

class DecoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.self_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.src_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(3)])

    def forward(self, x, memory, src_mask, tgt_mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, memory, memory, src_mask))
        return self.sublayer[2](x, self.ff)


d_model = 512
d_ff = 4 * d_model
h = 8
dropout_prob = 0.1
decoder_layer = DecoderLayer(d_model, h, d_ff, dropout_prob)
print(cal_params(decoder_layer))
# (4204032, 0)
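
As a sanity check, the measured difference between the two layer types matches the extra attention block plus one LayerNorm:

\[4,204,032 - 3,152,384 = 1,051,648 = (4 \times 512^2 + 4 \times 512) + 2 \times 512 = n_\text{atten} + n_\text{ln} \]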

Embedding layers and the generator

The encoder and decoder each use their own embedding layer. Let the source vocabulary size be \(n_\text{src vocab}\), the target vocabulary size be \(n_\text{tgt vocab}\), and the feature dimension be \(d_\text{model}=d\). The encoder embedding layer then has:

\[n_\text{src emb}=n_\text{src vocab} \times d \]

and the decoder embedding layer has:

\[n_\text{tgt emb}=n_\text{tgt vocab} \times d \]

The generator is a single linear layer, so its parameter count is \(n_\text{gen}=(d + 1) \times n_\text{tgt vocab}\). The positional encodings and residual connections have no trainable parameters.

Example: with \(d_\text{model}=512\) and \(n_\text{src vocab}=n_\text{tgt vocab}=10000\), each embedding layer has $$n_\text{src emb}=n_\text{tgt emb}=512 \times 10000=5,120,000$$ parameters, and the generator has \(n_\text{gen}=(512 + 1) \times 10000=5,130,000\).

Test

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(num_embeddings=vocab, embedding_dim=d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)


d_model = 512
d_ff = 4 * d_model
h = 8
dropout_prob = 0.1
vocab_size = 10000
embeddings = Embeddings(d_model, vocab_size)
print(cal_params(embeddings))
# (5120000, 0)
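
The generator can be verified the same way; with \(d=512\) and \(n_\text{tgt vocab}=10000\) it contributes \((512 + 1) \times 10000 = 5,130,000\) parameters:

generator = nn.Linear(in_features=512, out_features=10000)
print(cal_params(generator))
# (5130000, 0)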

Let the encoder and decoder each have \(l\) layers. The parameter count of the whole transformer is then:

\[n = n_\text{src emb} + n_\text{tgt emb} + n_\text{gen} + l\times (n_\text{encoder layer} + n_\text{decoder layer}) \]

\[=n_\text{src vocab}\times d + n_\text{tgt vocab}\times d + (d + 1)\times n_\text{tgt vocab} + l\times (12d^2 + 13d + 16d^2 + 19d) \]

\[=d\times(n_\text{src vocab}+2n_\text{tgt vocab}) + n_\text{tgt vocab} + l\times(28d^2 + 32d) \]

Example: with \(d_\text{model}=512,\ n_\text{src vocab}=n_\text{tgt vocab}=10000,\ l=6\), the total parameter count of the transformer is:

\[512\times30000 + 10000 + 6\times(28\times512^2 + 32\times512) = 59,508,496 \]
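
For convenience, the closed-form expression can be wrapped in a small helper (a hypothetical function, useful for cross-checking the measured count below):

def transformer_param_count(d, n_src_vocab, n_tgt_vocab, l):
    # embeddings + generator + l encoder layers + l decoder layers, per the formula above
    return d * (n_src_vocab + 2 * n_tgt_vocab) + n_tgt_vocab + l * (28 * d ** 2 + 32 * d)

print(transformer_param_count(512, 10000, 10000, 6))
# 59508496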

Test

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model, d_ff, h, dropout_prob, layers):
        super().__init__()
        # PositionalEncoding is defined in the complete code listing at the end of this post
        self.src_embed = nn.Sequential(Embeddings(d_model, src_vocab), PositionalEncoding(d_model, dropout_prob))
        self.tgt_embed = nn.Sequential(Embeddings(d_model, tgt_vocab), PositionalEncoding(d_model, dropout_prob))
        self.encoder = nn.ModuleList([EncoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encode(src, src_mask)
        return self.decode(memory, src_mask, tgt, tgt_mask)

    def encode(self, src, mask):
        x = self.src_embed(src)
        for layer in self.encoder:
            x = layer(x, mask)
        return x

    def decode(self, memory, src_mask, tgt, tgt_mask):
        x = self.tgt_embed(tgt)
        for layer in self.decoder:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.generator(x)


d_model = 512
d_ff = 4 * d_model
h = 8
layer = 6
dropout_prob = 0.1
vocab_size = 10000
model = Transformer(vocab_size, vocab_size, d_model, d_ff, h, dropout_prob, layer)
print(cal_params(model))
# (59508496, 0)

Environment

Package                   Version
------------------------- -----------
torch                     2.5.0

Complete code:

import torch
import math

from torch import nn


def cal_params(model: nn.Module):
    num_learnable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    num_non_learnable_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return num_learnable_params, num_non_learnable_params


class AddAndNorm(nn.Module):
    def __init__(self, d_model, dropout_prob, eps=1e-6):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d_model))
        self.b = nn.Parameter(torch.zeros(d_model))
        self.eps = eps
        self.dropout = nn.Dropout(p=dropout_prob)

    def norm(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.w * (x - mean) / (std + self.eps) + self.b

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))


def attention(q, k, v, mask=None, dropout=None):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        masking_value = -1e9 if scores.dtype == torch.float32 else -1e4
        scores = scores.masked_fill(mask == 0, masking_value)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, v), p_attn


class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout_prob=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, q, k, v, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # the same mask is applied to all h attention heads
        batch_size = q.size(0)
        # 1) apply the linear projections and split d_model into h heads of dimension d_k
        q = self.w_q(q).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
        # 2) compute scaled dot-product attention
        x, self.attn = attention(q, k, v, mask=mask, dropout=self.dropout)
        # 3) concatenate the heads and apply the output projection
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.w_o(x)


class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout_prob=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))


class EncoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(2)])

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.attn(x, x, x, mask))
        return self.sublayer[1](x, self.ff)


class DecoderLayer(nn.Module):
    def __init__(self, d_model, h, d_ff, dropout_prob):
        super().__init__()
        self.self_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.src_attn = MultiHeadedAttention(h, d_model, dropout_prob)
        self.ff = FeedForward(d_model, d_ff, dropout_prob)
        self.sublayer = nn.ModuleList([AddAndNorm(d_model, dropout_prob) for _ in range(3)])

    def forward(self, x, memory, src_mask, tgt_mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, memory, memory, src_mask))
        return self.sublayer[2](x, self.ff)


class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(num_embeddings=vocab, embedding_dim=d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout_prob, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout_prob)
        # compute the positional encodings once
        pe = torch.zeros(max_len, d_model)  # Shape: max_len x d_model
        position = torch.arange(0, max_len).unsqueeze(1)  # Shape: max_len x 1
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000) / d_model))
        res = position * div_term  # Shape: max_len x d_model/2
        pe[:, 0::2] = torch.sin(res)
        pe[:, 1::2] = torch.cos(res)
        pe = pe.unsqueeze(0)  # Shape: 1 x max_len x d_model
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)].requires_grad_(False)
        return self.dropout(x)


class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model, d_ff, h, dropout_prob, layers):
        super().__init__()
        self.src_embed = nn.Sequential(Embeddings(d_model, src_vocab), PositionalEncoding(d_model, dropout_prob))
        self.tgt_embed = nn.Sequential(Embeddings(d_model, tgt_vocab), PositionalEncoding(d_model, dropout_prob))
        self.encoder = nn.ModuleList([EncoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, h, d_ff, dropout_prob) for _ in range(layers)])
        self.generator = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encode(src, src_mask)
        return self.decode(memory, src_mask, tgt, tgt_mask)

    def encode(self, src, mask):
        x = self.src_embed(src)
        for layer in self.encoder:
            x = layer(x, mask)
        return x

    def decode(self, memory, src_mask, tgt, tgt_mask):
        x = self.tgt_embed(tgt)
        for layer in self.decoder:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.generator(x)


if __name__ == '__main__':
    d_model = 512
    d_ff = 4 * d_model
    h = 8
    layer = 6
    dropout_prob = 0.1
    vocab_size = 10000
    model = Transformer(vocab_size, vocab_size, d_model, d_ff, h, dropout_prob, layer)
    print(cal_params(model))

