I. Where the question comes from:

from transformers import AutoTokenizer, AutoModel
import torch
# Load model from HuggingFace Hub
MODEL_NAME_PATH = 'xxxx/model/bge-large-zh'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME_PATH)
model = AutoModel.from_pretrained(MODEL_NAME_PATH)

The model structure is as follows:

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 1024, padding_idx=0)
    (position_embeddings): Embedding(512, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-23): 24 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=1024, out_features=4096, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=4096, out_features=1024, bias=True)
          (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=1024, out_features=1024, bias=True)
    (activation): Tanh()
  )
)

Q1. Are the [CLS] value and the pooler value the same?
Q2. How does the final pooler layer relate to the hidden layers?

II. Experimental verification:

Q1. Are the [CLS] value and the pooler value the same?

# Sentences we want sentence embeddings for
sentences = ["开心", "快乐", "难过", "天气", "今天会有大大的台风吗？"]
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=200)
# for retrieval task, add an instruction to query
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

print('cls:', model_output[0][:, 0, :])

cls: tensor([[ 0.3269, -0.6412, -0.2382,  ...,  0.0255, -0.1801, -0.3025],[ 0.1351, -0.5155, -0.1700,  ...,  0.1093, -0.3750, -0.1323],[ 0.2752, -0.1703, -0.2730,  ...,  0.0376, -0.0339, -0.3541],[ 0.1346, -0.0378, -0.5070,  ...,  0.0078,  0.0472, -0.1815],[-0.4051,  0.1123, -0.3873,  ...,  0.3585,  0.4913,  0.3192]])

print('pooler:', model_output[1])

pooler: tensor([[ 0.3888, -0.2329, -0.1749,  ...,  0.1678,  0.3938, -0.3191],[ 0.3949, -0.2882, -0.0945,  ...,  0.1802,  0.2705, -0.1891],[ 0.4765, -0.1235, -0.2330,  ...,  0.3005,  0.3487, -0.1290],[ 0.3851, -0.1853, -0.3189,  ...,  0.2757,  0.3601, -0.3220],[ 0.3008, -0.3742, -0.4550,  ...,  0.4318,  0.2130, -0.1575]])

The [CLS] value and the pooler value are not the same.
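The comparison can also be done programmatically instead of by eye. The sketch below is only an illustration: it assumes a hypothetical tiny `BertConfig` with random weights (not bge-large-zh, so the numbers differ from the printouts above) and uses the named fields `last_hidden_state` and `pooler_output`, which correspond to `model_output[0]` and `model_output[1]`.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Assumption: a tiny, randomly initialised config used as a stand-in for bge-large-zh.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64)
tiny_model = BertModel(config).eval()

input_ids = torch.randint(1, 100, (2, 6))  # 2 fake "sentences" of 6 token ids each
with torch.no_grad():
    out = tiny_model(input_ids=input_ids)

cls_vec = out.last_hidden_state[:, 0]  # same tensor as out[0][:, 0]
pooled = out.pooler_output             # same tensor as out[1]

same_shape = cls_vec.shape == pooled.shape      # both are (batch, hidden_size)
same_values = torch.allclose(cls_vec, pooled)   # element-wise comparison
print(same_shape, same_values)
```

The two tensors share a shape, which is why they are easy to confuse, but their values differ.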

Q2. How does the final pooler layer relate to the hidden layers?

In theory:

The method transformers.models.bert.modeling_bert.BertModel.forward contains these lines:

sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

Definition of pooler:

self.pooler = BertPooler(config) if add_pooling_layer else None
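To see what the `add_pooling_layer` switch does in practice, here is a minimal sketch. It assumes a hypothetical tiny `BertConfig` with random weights purely for illustration; only the `add_pooling_layer` argument comes from the source line above.

```python
import torch
from transformers import BertConfig, BertModel

# Assumption: a tiny, randomly initialised config used only for illustration.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64)

# With add_pooling_layer=False the model is built without a BertPooler.
no_pool_model = BertModel(config, add_pooling_layer=False).eval()

with torch.no_grad():
    out_no_pool = no_pool_model(input_ids=torch.randint(1, 100, (2, 6)))

print(no_pool_model.pooler)                 # no pooler module
print(out_no_pool.pooler_output)            # no pooled output either
print(out_no_pool.last_hidden_state.shape)  # hidden states are still returned
```

In that case the [CLS] vector has to be taken from `last_hidden_state[:, 0]` by hand, as in the cls pooling code earlier.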

Definition of BertPooler:

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

As the source above shows, pooler_output is the [CLS] embedding after one more fully connected layer followed by a Tanh activation.

In practice:

model.pooler(model_output[0])

tensor([[ 0.3888, -0.2329, -0.1749,  ...,  0.1678,  0.3938, -0.3191],[ 0.3949, -0.2882, -0.0945,  ...,  0.1802,  0.2705, -0.1891],[ 0.4765, -0.1235, -0.2330,  ...,  0.3005,  0.3487, -0.1290],[ 0.3851, -0.1853, -0.3189,  ...,  0.2757,  0.3601, -0.3220],[ 0.3008, -0.3742, -0.4550,  ...,  0.4318,  0.2130, -0.1575]],grad_fn=<TanhBackward0>)

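These values are identical to the pooler output printed earlier, which confirms that pooler_output is the [CLS] embedding after one more fully connected layer (plus Tanh).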
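The same check can be taken one step further by spelling out the two sub-steps of the pooler by hand. The sketch below is only an illustration: it assumes a hypothetical tiny `BertConfig` with random weights rather than bge-large-zh, and applies `pooler.dense` and `torch.tanh` manually to the [CLS] hidden state.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Assumption: a tiny, randomly initialised config used as a stand-in for bge-large-zh.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64)
tiny_model = BertModel(config).eval()

input_ids = torch.randint(1, 100, (2, 6))
with torch.no_grad():
    out = tiny_model(input_ids=input_ids)
    cls_vec = out.last_hidden_state[:, 0]                         # [CLS] hidden state
    manual_pooled = torch.tanh(tiny_model.pooler.dense(cls_vec))  # dense, then Tanh

# Compare the hand-computed value with what the model returned.
matches = torch.allclose(manual_pooled, out.pooler_output, atol=1e-6)
print(matches)
```

Running the forward pass under `torch.no_grad()` also avoids the `grad_fn=<TanhBackward0>` annotation seen in the printout above.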

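III. Conclusion:

The pooler takes the hidden state of the [CLS] token and passes it through a fully connected layer plus a Tanh activation; the result serves as the feature vector of the sentence.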
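Written as a formula, the conclusion reads as follows. The symbols are notation introduced here for illustration: h is the last-layer hidden state of the [CLS] token, W and b are the weight and bias of the pooler's dense layer, and H is the hidden size (1024 in the structure printed earlier).

```latex
\mathrm{pooler\_output} = \tanh\left(W\, h_{\mathrm{[CLS]}} + b\right),
\qquad W \in \mathbb{R}^{H \times H},\quad b \in \mathbb{R}^{H}
```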

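IV. Where BERT's pooler_output comes from

As we know, BERT's pre-training consists of two tasks: MLM and NSP (Next Sentence Prediction). Readers unfamiliar with them can refer to two articles: BERT源码实现与解读(Pytorch) and 【论文阅读】BERT.

MLM masks out tokens and asks BERT to predict what was masked. This task predicts from the token embeddings.

Next Sentence Prediction predicts whether the two sentences BERT receives form a pair. For example, "窗前明月光" followed by "疑是地上霜" is True, while "窗前明月光" followed by "李白打开窗" is False.

So the NSP task needs the semantic information of the whole sentence to make its prediction. Let's look at how the source code does it.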

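class BertForNextSentencePrediction(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.cls = BertOnlyNSPHead(config)  # essentially an nn.Linear(config.hidden_size, 2)
        ...

    def forward(...):
        ...
        outputs = self.bert(...)
        pooled_output = outputs[1]  # take pooler_output
        seq_relationship_scores = self.cls(pooled_output)  # feed pooler_output to the fully connected layer that follows for prediction
        ...

As the source above shows, NSP training does not pass the [CLS] token's embedding directly to the classification head as the sentence feature; it uses pooler_output instead. My personal guess is that using the [CLS] embedding directly did not work well enough.
The MLM task, by contrast, uses last_hidden_state directly; have a look at the source if you are interested.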
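The difference between the two heads can be checked with a short sketch. It assumes a hypothetical tiny `BertConfig` with random weights (so the logits themselves are meaningless) and only verifies the wiring: the NSP logits equal the NSP head applied to `pooler_output`, while the MLM model's encoder is built without a pooler.

```python
import torch
from transformers import BertConfig, BertForMaskedLM, BertForNextSentencePrediction

torch.manual_seed(0)

# Assumption: a tiny, randomly initialised config used only to inspect the wiring.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=4, intermediate_size=64)

nsp_model = BertForNextSentencePrediction(config).eval()
input_ids = torch.randint(1, 100, (2, 6))

with torch.no_grad():
    nsp_logits = nsp_model(input_ids=input_ids).logits            # shape (batch, 2)
    pooled = nsp_model.bert(input_ids=input_ids).pooler_output    # pooled [CLS]
    manual_logits = nsp_model.cls(pooled)                         # NSP head on pooler_output

nsp_matches = torch.allclose(nsp_logits, manual_logits, atol=1e-6)

# The MLM model works on last_hidden_state, so its encoder carries no pooler.
mlm_model = BertForMaskedLM(config)
mlm_has_pooler = mlm_model.bert.pooler is not None

print(nsp_matches, mlm_has_pooler)
```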