Using ChatGPT to Assemble BERT, GPT, and Transformer with TensorFlow Keras

    • Implement the Transformer model with TensorFlow Keras
    • Implement the BERT model with TensorFlow Keras
    • Implement the GPT model with TensorFlow Keras

This article focuses on the relationships and differences among the neural network structures of the Transformer, BERT, and GPT. There is a lot of material online, but not much of it lays this relationship out clearly. This article is organized as a supplementary reference, and it also uses ChatGPT to assemble corresponding mini implementations in TensorFlow Keras to aid understanding.

From this assembly you can see directly that:

  • Transformer: uses both the Encoder and the Decoder modules
  • BERT: assembles only the Transformer's Encoder as its building block
  • GPT: assembles only the Transformer's Decoder as its building block

Implement the Transformer model with TensorFlow Keras

[Figure: the original Transformer encoder-decoder architecture]

There is plenty of material online explaining what each layer of the Transformer does; my own understanding of that deserves a separate article. This document assumes the reader already understands it.

import tensorflow as tf

# Note: PositionalEncoding, FeedForwardNetwork and scaled_dot_product_attention are
# used below; minimal sketches of them follow after this snippet.


class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, d_ff,
                 input_vocab_size, target_vocab_size, dropout_rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, d_ff,
                               input_vocab_size, dropout_rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, d_ff,
                               target_vocab_size, dropout_rate)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs, targets, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inputs, enc_padding_mask)
        dec_output = self.decoder(targets, enc_output, look_ahead_mask, dec_padding_mask)
        final_output = self.final_layer(dec_output)
        return final_output


class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, d_ff, vocab_size, dropout_rate=0.1):
        super(Encoder, self).__init__()
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(vocab_size, d_model)
        self.encoder_layers = [EncoderLayer(d_model, num_heads, d_ff, dropout_rate)
                               for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, padding_mask):
        embedded_input = self.embedding(inputs)
        positional_encoded_input = self.positional_encoding(embedded_input)
        encoder_output = self.dropout(positional_encoded_input)
        for i in range(self.num_layers):
            encoder_output = self.encoder_layers[i](encoder_output, padding_mask)
        return encoder_output


class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, d_ff, vocab_size, dropout_rate=0.1):
        super(Decoder, self).__init__()
        self.num_layers = num_layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(vocab_size, d_model)
        self.decoder_layers = [DecoderLayer(d_model, num_heads, d_ff, dropout_rate)
                               for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, encoder_output, look_ahead_mask, padding_mask):
        embedded_input = self.embedding(inputs)
        positional_encoded_input = self.positional_encoding(embedded_input)
        decoder_output = self.dropout(positional_encoded_input)
        for i in range(self.num_layers):
            decoder_output = self.decoder_layers[i](decoder_output, encoder_output,
                                                    look_ahead_mask, padding_mask)
        return decoder_output


class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout_rate=0.1):
        super(EncoderLayer, self).__init__()
        self.multi_head_attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, padding_mask):
        # Self-attention + residual connection + layer norm
        attention_output, _ = self.multi_head_attention(inputs, inputs, inputs, padding_mask)
        attention_output = self.dropout1(attention_output)
        attention_output = self.layer_norm1(inputs + attention_output)
        # Position-wise feed-forward + residual connection + layer norm
        ffn_output = self.ffn(attention_output)
        ffn_output = self.dropout2(ffn_output)
        encoder_output = self.layer_norm2(attention_output + ffn_output)
        return encoder_output


class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout_rate=0.1):
        super(DecoderLayer, self).__init__()
        self.multi_head_attention1 = MultiHeadAttention(d_model, num_heads)
        self.multi_head_attention2 = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForwardNetwork(d_model, d_ff)
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, encoder_output, look_ahead_mask, padding_mask):
        # Masked self-attention over the target sequence
        attention1_output, _ = self.multi_head_attention1(inputs, inputs, inputs, look_ahead_mask)
        attention1_output = self.dropout1(attention1_output)
        attention1_output = self.layer_norm1(inputs + attention1_output)
        # Cross-attention over the encoder output
        attention2_output, _ = self.multi_head_attention2(attention1_output, encoder_output,
                                                          encoder_output, padding_mask)
        attention2_output = self.dropout2(attention2_output)
        attention2_output = self.layer_norm2(attention1_output + attention2_output)
        # Position-wise feed-forward
        ffn_output = self.ffn(attention2_output)
        ffn_output = self.dropout3(ffn_output)
        decoder_output = self.layer_norm3(attention2_output + ffn_output)
        return decoder_output


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, query, key, value, mask):
        batch_size = tf.shape(query)[0]
        q = self.wq(query)
        k = self.wk(key)
        v = self.wv(value)
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        # Merge heads back: (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output, attention_weights
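The snippet references three helpers that were not included in the output: PositionalEncoding, FeedForwardNetwork and scaled_dot_product_attention. The sketch below is one minimal way to fill them in, assuming sinusoidal position encodings and a two-layer ReLU feed-forward block; these implementations are illustrative stand-ins, not part of the original ChatGPT answer. Note that the Encoder/Decoder above pass vocab_size as the maximum position, which works as long as sequences are shorter than the vocabulary size. A toy forward pass with made-up hyperparameters is shown at the end.

import numpy as np
import tensorflow as tf


def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        # Masked positions receive a large negative logit before the softmax
        scaled_attention_logits += (mask * -1e9)
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    output = tf.matmul(attention_weights, v)
    return output, attention_weights


class FeedForwardNetwork(tf.keras.layers.Layer):
    # Position-wise feed-forward block: Dense -> ReLU -> Dense
    def __init__(self, d_model, d_ff):
        super(FeedForwardNetwork, self).__init__()
        self.dense1 = tf.keras.layers.Dense(d_ff, activation='relu')
        self.dense2 = tf.keras.layers.Dense(d_model)

    def call(self, x):
        return self.dense2(self.dense1(x))


class PositionalEncoding(tf.keras.layers.Layer):
    # Sinusoidal position encodings, precomputed once and added to the embeddings
    def __init__(self, max_position, d_model):
        super(PositionalEncoding, self).__init__()
        positions = np.arange(max_position)[:, np.newaxis]
        dims = np.arange(d_model)[np.newaxis, :]
        angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / np.float32(d_model))
        angle_rads = positions * angle_rates
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        self.pos_encoding = tf.cast(angle_rads[np.newaxis, ...], tf.float32)

    def call(self, x):
        seq_len = tf.shape(x)[1]
        return x + self.pos_encoding[:, :seq_len, :]


# Illustrative forward pass with made-up hyperparameters (masks omitted for brevity)
transformer = Transformer(num_layers=2, d_model=128, num_heads=4, d_ff=512,
                          input_vocab_size=8000, target_vocab_size=8000)
src = tf.random.uniform((2, 10), maxval=8000, dtype=tf.int32)
tgt = tf.random.uniform((2, 12), maxval=8000, dtype=tf.int32)
logits = transformer(src, tgt, enc_padding_mask=None,
                     look_ahead_mask=None, dec_padding_mask=None)  # shape (2, 12, 8000)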

Implement the BERT model with TensorFlow Keras

[Figure: BERT model structure; each Trm block on the left is the Transformer encoder shown enlarged on the right]

In the figure, each Trm on the left corresponds to the enlarged diagram on the right, which is the Encoder part of the original Transformer. You can also see that, on the left, BERT assembles the Transformer blocks bidirectionally. BERT's pre-training tasks are MLM (Masked Language Model) and NSP (Next Sentence Prediction). BERT's training is called unsupervised because MLM simply masks some tokens in the corpus, so the answer the model must predict (the label) is already contained in the corpus itself; this is why the setup is usually described as self-supervised. From that angle, it is effectively a form of supervised learning as well.
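As a minimal sketch of that idea (the token ids, the [MASK] id, and the masking rate are made up for illustration), MLM can derive both its inputs and its labels from the raw corpus:

import tensorflow as tf

MASK_ID = 103       # assumed id of the [MASK] token in the vocabulary
MASK_PROB = 0.15    # fraction of tokens to mask

def create_mlm_example(token_ids):
    # Randomly choose positions to mask
    mask = tf.random.uniform(tf.shape(token_ids)) < MASK_PROB
    # Inputs: original ids with the chosen positions replaced by [MASK]
    masked_inputs = tf.where(mask, tf.fill(tf.shape(token_ids), MASK_ID), token_ids)
    # Labels: the original ids at masked positions, -100 elsewhere (ignored by the loss)
    labels = tf.where(mask, token_ids, tf.fill(tf.shape(token_ids), -100))
    return masked_inputs, labels

tokens = tf.constant([[5, 17, 42, 8, 91, 23]])
masked_inputs, labels = create_mlm_example(tokens)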

import tensorflow as tf

# Note: FeedForward is used below; a minimal sketch of it follows after this snippet.


class BERT(tf.keras.Model):
    def __init__(self, vocab_size, hidden_size, num_attention_heads,
                 num_transformer_layers, intermediate_size):
        super(BERT, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.transformer_layers = [TransformerLayer(hidden_size, num_attention_heads, intermediate_size)
                                   for _ in range(num_transformer_layers)]
        self.dropout = tf.keras.layers.Dropout(0.1)

    def call(self, inputs, attention_mask):
        embedded_input = self.embedding(inputs)
        hidden_states = embedded_input
        # A stack of encoder-only Transformer layers
        for transformer_layer in self.transformer_layers:
            hidden_states = transformer_layer(hidden_states, attention_mask)
        hidden_states = self.dropout(hidden_states)
        return hidden_states


class TransformerLayer(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_attention_heads, intermediate_size):
        super(TransformerLayer, self).__init__()
        self.attention = MultiHeadAttention(hidden_size, num_attention_heads)
        self.feed_forward = FeedForward(hidden_size, intermediate_size)
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(0.1)
        self.dropout2 = tf.keras.layers.Dropout(0.1)

    def call(self, inputs, attention_mask):
        # Bidirectional self-attention + residual + layer norm
        attention_output = self.attention(inputs, inputs, inputs, attention_mask)
        attention_output = self.dropout1(attention_output)
        attention_output = self.layer_norm1(inputs + attention_output)
        # Feed-forward + residual + layer norm
        feed_forward_output = self.feed_forward(attention_output)
        feed_forward_output = self.dropout2(feed_forward_output)
        layer_output = self.layer_norm2(attention_output + feed_forward_output)
        return layer_output


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_attention_heads):
        super(MultiHeadAttention, self).__init__()
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = hidden_size // num_attention_heads
        self.query = tf.keras.layers.Dense(hidden_size)
        self.key = tf.keras.layers.Dense(hidden_size)
        self.value = tf.keras.layers.Dense(hidden_size)
        self.dense = tf.keras.layers.Dense(hidden_size)

    def call(self, query, key, value, attention_mask):
        query = self.query(query)
        key = self.key(key)
        value = self.value(value)
        query = self._split_heads(query)
        key = self._split_heads(key)
        value = self._split_heads(value)
        # Scaled dot-product attention with an additive mask
        attention_scores = tf.matmul(query, key, transpose_b=True)
        attention_scores /= tf.math.sqrt(tf.cast(self.attention_head_size, attention_scores.dtype))
        if attention_mask is not None:
            attention_scores += attention_mask
        attention_probs = tf.nn.softmax(attention_scores, axis=-1)
        context_layer = tf.matmul(attention_probs, value)
        # Merge heads back: (batch, heads, seq, head_size) -> (batch, seq, hidden_size)
        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
        context_layer = tf.reshape(context_layer, (tf.shape(context_layer)[0], -1, self.hidden_size))
        attention_output = self.dense(context_layer)
        return attention_output

    def _split_heads(self, x):
        # (batch, seq, hidden_size) -> (batch, heads, seq, head_size)
        batch_size = tf.shape(x)[0]
        length = tf.shape(x)[1]
        x = tf.reshape(x, (batch_size, length, self.num_attention_heads, self.attention_head_size))
        return tf.transpose(x, perm=[0, 2, 1, 3])
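The snippet references a FeedForward block that is not defined here, and it also omits BERT's position and segment embeddings. Below is one minimal stand-in for FeedForward, assuming the GELU feed-forward block used in BERT (the 'gelu' activation string requires TensorFlow 2.4+), followed by an illustrative forward pass with made-up hyperparameters:

import tensorflow as tf

class FeedForward(tf.keras.layers.Layer):
    # Position-wise feed-forward block: Dense -> GELU -> Dense
    def __init__(self, hidden_size, intermediate_size):
        super(FeedForward, self).__init__()
        self.dense1 = tf.keras.layers.Dense(intermediate_size, activation='gelu')
        self.dense2 = tf.keras.layers.Dense(hidden_size)

    def call(self, x):
        return self.dense2(self.dense1(x))

# Illustrative forward pass with made-up hyperparameters
bert = BERT(vocab_size=30522, hidden_size=256, num_attention_heads=4,
            num_transformer_layers=4, intermediate_size=1024)
token_ids = tf.random.uniform((2, 16), maxval=30522, dtype=tf.int32)
hidden_states = bert(token_ids, attention_mask=None)  # shape (2, 16, 256)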

Implement the GPT model with TensorFlow Keras

[Figure: GPT model structure; each Trm block on the left is the Transformer decoder shown enlarged on the right]

In the figure, each Trm on the left expands into the diagram on the right, which is the Decoder part of the original Transformer structure. You can also see that, on the left, GPT assembles the Transformer blocks unidirectionally. GPT's training task is simply to generate the next token. GPT is called unsupervised because, from a machine-learning point of view, the label the output needs (the next token) is already provided by the corpus; again, this is usually described as self-supervised. From that angle, it is effectively a form of supervised learning as well.
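A minimal sketch of that training setup (token ids made up for illustration): the target sequence is just the input shifted by one position, so the corpus supplies its own labels.

import tensorflow as tf

tokens = tf.constant([[11, 42, 7, 99, 3, 25]])
inputs = tokens[:, :-1]    # the model sees   [11, 42, 7, 99, 3]
targets = tokens[:, 1:]    # it must predict  [42, 7, 99, 3, 25]
# With the GPT below producing a softmax over the vocabulary at every position,
# the loss is ordinary sparse categorical cross-entropy against `targets`.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()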

import tensorflow as tf

# Note: PositionalEncoding and FeedForward are used below; the sketches from the
# Transformer and BERT sections above can be reused for them.


class GPT(tf.keras.Model):
    def __init__(self, vocab_size, hidden_size, num_layers, num_heads,
                 intermediate_size, max_seq_length):
        super(GPT, self).__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, hidden_size)
        self.positional_encoding = PositionalEncoding(max_seq_length, hidden_size)
        self.transformer_blocks = [TransformerBlock(hidden_size, num_heads, intermediate_size)
                                   for _ in range(num_layers)]
        self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.final_dense = tf.keras.layers.Dense(vocab_size, activation='softmax')

    def call(self, inputs):
        embedded_input = self.embedding(inputs)
        positional_encoded_input = self.positional_encoding(embedded_input)
        hidden_states = positional_encoded_input
        # A stack of decoder-only Transformer blocks
        for transformer_block in self.transformer_blocks:
            hidden_states = transformer_block(hidden_states)
        final_output = self.final_layer_norm(hidden_states)
        final_output = self.final_dense(final_output)
        return final_output


class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_heads, intermediate_size):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(hidden_size, num_heads)
        self.feed_forward = FeedForward(hidden_size, intermediate_size)
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        # Masked self-attention + residual + layer norm
        attention_output = self.attention(inputs, inputs, inputs)
        attention_output = self.layer_norm1(inputs + attention_output)
        # Feed-forward + residual + layer norm
        feed_forward_output = self.feed_forward(attention_output)
        layer_output = self.layer_norm2(attention_output + feed_forward_output)
        return layer_output


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, hidden_size, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.attention_head_size = hidden_size // num_heads
        self.query = tf.keras.layers.Dense(hidden_size)
        self.key = tf.keras.layers.Dense(hidden_size)
        self.value = tf.keras.layers.Dense(hidden_size)
        self.dense = tf.keras.layers.Dense(hidden_size)

    def call(self, query, key, value):
        query = self.query(query)
        key = self.key(key)
        value = self.value(value)
        query = self._split_heads(query)
        key = self._split_heads(key)
        value = self._split_heads(value)
        attention_scores = tf.matmul(query, key, transpose_b=True)
        attention_scores /= tf.math.sqrt(tf.cast(self.attention_head_size, attention_scores.dtype))
        # Causal (look-ahead) mask so each position only attends to itself and earlier positions
        seq_len = tf.shape(attention_scores)[-1]
        causal_mask = 1.0 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
        attention_scores += causal_mask * -1e9
        attention_probs = tf.nn.softmax(attention_scores, axis=-1)
        context_layer = tf.matmul(attention_probs, value)
        # Merge heads back: (batch, heads, seq, head_size) -> (batch, seq, hidden_size)
        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
        context_layer = tf.reshape(context_layer, (tf.shape(context_layer)[0], -1, self.hidden_size))
        attention_output = self.dense(context_layer)
        return attention_output

    def _split_heads(self, x):
        # (batch, seq, hidden_size) -> (batch, heads, seq, head_size)
        batch_size = tf.shape(x)[0]
        length = tf.shape(x)[1]
        x = tf.reshape(x, (batch_size, length, self.num_heads, self.attention_head_size))
        return tf.transpose(x, perm=[0, 2, 1, 3])
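The same helper sketches from the earlier sections (PositionalEncoding and FeedForward) can be dropped in here unchanged. With made-up hyperparameters, a forward pass looks like this:

import tensorflow as tf

# Assumes PositionalEncoding and FeedForward from the earlier sketches are in scope.
gpt = GPT(vocab_size=50257, hidden_size=256, num_layers=4, num_heads=4,
          intermediate_size=1024, max_seq_length=128)
token_ids = tf.random.uniform((2, 16), maxval=50257, dtype=tf.int32)
next_token_probs = gpt(token_ids)  # shape (2, 16, 50257), a softmax over the vocabulary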
