Generation Strategies for Large-Model Decoders

This article covers the following:

Introduction
Greedy Search
Beam Search
Sampling
Top-K Sampling
Top-p (nucleus) sampling
Summary

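I. Introduction

1. Overview

In recent years, open-ended language generation has attracted growing attention thanks to the rise of large Transformer-based language models trained on millions of web pages, including OpenAI's well-known GPT2 model. The results on conditioned open-ended language generation are impressive. Besides improved Transformer architectures and massive unsupervised training data, better decoding methods have also played an important role.

This article gives a brief overview of different decoding strategies and shows how to implement them with very little effort using huggingface's transformers library!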

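Everything covered here applies to autoregressive language generation. In short, autoregressive language generation rests on the assumption that the probability distribution of a word sequence can be decomposed into a product of conditional next-word distributions:
$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \quad \text{with } w_{1:0} = \emptyset$$
where $W_0$ is the initial context word sequence. The length $T$ of the word sequence is usually determined on the fly: generation can stop at the timestep $t$ at which the EOS token is produced.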
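The factorization above can be sketched in a few lines of Python. This is only an illustration: the per-step probabilities are made-up numbers rather than model outputs, and the function names are hypothetical.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_{1:t-1}, W_0) of one
# generated sequence, one entry per timestep (illustrative numbers only).
conditional_probs = [0.5, 0.4, 0.9]


def sequence_probability(step_probs):
    """Chain rule: multiply the conditional probability of every step."""
    return math.prod(step_probs)


def sequence_log_probability(step_probs):
    """Same quantity in log space, which avoids underflow for long sequences."""
    return sum(math.log(p) for p in step_probs)


prob = sequence_probability(conditional_probs)
log_prob = sequence_log_probability(conditional_probs)
print(prob, math.exp(log_prob))
```

Working in log space is the usual choice in practice, because a product of many probabilities below 1 quickly becomes too small for floating-point numbers.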

2. Environment setup

a) Installing the transformers library

pip install -q git+https://github.com/huggingface/transformers.git
pip install -q torch

b) Loading the gpt2 model

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# the examples below pass PyTorch tensors (return_tensors='pt'), so load the PyTorch model
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)

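II. Greedy Search

Greedy search is the simplest decoding method. At each timestep $t$ it selects the word with the highest probability as the next word: $w_t = \operatorname{argmax}_{w} P(w \mid w_{1:t-1})$. The figure below illustrates greedy search.
[Figure: greedy search on the example word tree]
Starting from the word "The", the algorithm picks the most probable word at every step, choosing "nice" and then "woman", so the resulting sequence has an overall probability of 0.5 * 0.4 = 0.2.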
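To make the procedure concrete, here is a minimal sketch of greedy search on a toy word tree. The values 0.5, 0.4 and 0.9 come from the example in the text; the remaining words and probabilities, as well as the names `NEXT_WORD_PROBS` and `greedy_search`, are assumptions added only so that the table is complete.

```python
# Toy next-word table. 0.5 ("nice"), 0.4 ("dog", "woman") and 0.9 ("has")
# follow the example in the text; all other entries are made-up fillers
# chosen so that every row sums to 1.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(context, steps):
    """Pick the single most probable next word at every step."""
    sequence = tuple(context)
    probability = 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        word = max(candidates, key=candidates.get)  # argmax over next words
        probability *= candidates[word]
        sequence += (word,)
    return sequence, probability


sequence, probability = greedy_search(["The"], steps=2)
print(sequence, probability)
```

On this table the sketch follows "The", "nice", "woman" and reproduces the overall probability 0.5 * 0.4 = 0.2 from the example.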

Next, we use GPT2 to generate a continuation of the context ("I", "enjoy", "walking", "with", "my", "cute", "dog"). Let's see how greedy search is used in transformers:

# encode context the generation is conditioned on
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)

# generate 40 new tokens
greedy_output = model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.
# I'm not sure if I'll

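Great! We have generated our first short text with GPT2. The words generated after the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation, and it seems to be even more pronounced in greedy search and beam search.

The major drawback of greedy search, however, is that it misses high-probability words hidden behind a low-probability word, as the example above shows:

The word "has", with its high conditional probability of 0.9, is hidden behind the word "dog", which has only the second-highest conditional probability, so greedy search misses the word sequence ("The", "dog", "has").

Fortunately, beam search alleviates this problem!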

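III. Beam Search

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each timestep and finally choosing the hypothesis with the highest overall probability. Let's illustrate with num_beams=2:
[Figure: beam search with num_beams=2 on the example word tree]
At timestep 1, besides the most likely hypothesis ("The", "nice"), beam search also keeps track of the second most likely one, ("The", "dog"). At timestep 2, beam search finds that the word sequence ("The", "dog", "has"), with a probability of 0.36, beats ("The", "nice", "woman"), which has a probability of 0.2. Great, it has found the most likely word sequence in our example!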

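Beam search typically finds an output sequence with a higher probability than greedy search does, but it is not guaranteed to find the most likely output: a hypothesis that is pruned from the beam early on can never be recovered.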
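The num_beams=2 walkthrough can be sketched on the same kind of toy word tree. As before, only 0.5, 0.4 and 0.9 come from the text; the other entries and the names `NEXT_WORD_PROBS` and `beam_search` are illustrative assumptions. The sketch also ignores EOS handling and length normalization, which a real implementation needs.

```python
# Toy next-word table: 0.5, 0.4 and 0.9 follow the text, the rest is made up.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(context, steps, num_beams):
    """Keep the num_beams most probable hypotheses after every step."""
    beams = [(tuple(context), 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, probability in beams:
            # extend every surviving hypothesis by every possible next word
            for word, p in NEXT_WORD_PROBS[sequence].items():
                expanded.append((sequence + (word,), probability * p))
        # prune: keep only the highest-scoring hypotheses
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams


beams = beam_search(["The"], steps=2, num_beams=2)
best_sequence, best_probability = beams[0]
print(best_sequence, best_probability)
```

With two beams the hypothesis ("The", "dog") survives the first step, so the search reaches ("The", "dog", "has") with a probability of about 0.36. With num_beams=1 the same function degenerates to greedy search.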

Let's see how beam search is used in the transformers library. We set num_beams > 1 and early_stopping=True, so that generation ends once all beam hypotheses have reached the EOS token.

# activate beam search and early_stopping
# (model_inputs is the encoded context from the greedy search example)
beam_output = model.generate(
    **model_inputs,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.
# I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll

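Although the result is arguably more fluent, the output still contains repetitions of the same word sequences.
A simple remedy is to introduce n-gram penalties (an n-gram being a sequence of n words), as introduced by Paulus et al. (2017) and Klein et al. (2017). The most common n-gram penalty ensures that no n-gram appears twice by manually setting to 0 the probability of any next word that would create an already seen n-gram.

Let's try it out by setting no_repeat_ngram_size=2, so that no 2-gram appears twice: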

# set no_repeat_ngram_size to 2
# (model_inputs is the encoded context from the greedy search example)
beam_output = model.generate(
    **model_inputs,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.
# I've been thinking about this for a while now, and I think it's time for me to take a break

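Nice, that looks much better! The repeated word sequences are gone. Nevertheless, n-gram penalties have to be used with care. An article about the city of New York should not use a 2-gram penalty, otherwise the name of the city would appear only once in the whole text!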

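Another important feature of beam search is that, after generation, we can compare the top beams and pick the one that best fits our purpose.

In the transformers library, we simply set the parameter num_return_sequences to the number of highest-scoring beams that should be returned. Make sure, though, that num_return_sequences <= num_beams!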

# set num_return_sequences > 1
# (model_inputs is the encoded context from the greedy search example)
beam_outputs = model.generate(
    **model_inputs,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
# Output:
# ----------------------------------------------------------------------------------------------------
# 0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.
# I've been thinking about this for a while now, and I think it's time for me to take a break
# 1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.
# I've been thinking about this for a while now, and I think it's time for me to get back to
# 2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.
# I've been thinking about this for a while now, and I think it's time for me to take a break
# 3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.
# I've been thinking about this for a while now, and I think it's time for me to get back to
# 4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.
# I've been thinking about this for a while now, and I think it's time for me to take a step

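As we can see, the five beam hypotheses are only marginally different from each other, which should not be too surprising when using only 5 beams.

For open-ended generation, several reasons have recently been put forward why beam search might not be the best possible option:

Beam search works very well in tasks where the length of the desired generation is more or less predictable, such as machine translation or summarization; see Murray et al. (2018) and Yang et al. (2018). This is not the case for open-ended generation, where the desired output length can vary greatly, e.g. dialog and story generation.
We have seen that beam search suffers heavily from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation, since finding a good trade-off between enforced "no repetition" and repeating cycles of identical n-grams requires a lot of fine-tuning.
As argued by Ari Holtzman et al. (2019), high-quality human language does not follow a distribution of high-probability next words. In other words, as humans we want generated text to surprise us rather than be boring or predictable. The authors show this nicely by plotting the probability a model assigns to human text against the probability of what beam search produces.

So let's stop being boring and introduce some randomness.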

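IV. Sampling

In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution:
$$w_t \sim P(w \mid w_{1:t-1})$$
Taking the example from above, the following figure visualizes language generation when sampling.
[Figure: sampling on the example word tree]
It becomes obvious that language generation using sampling is no longer deterministic. The word "car" is sampled from the conditional probability distribution P(w | "The"), followed by sampling "drives" from P(w | "The", "car").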
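A minimal sketch of plain sampling on a toy word tree, using only the standard library. The words "car" and "drives" and the values 0.5, 0.4 and 0.9 appear in the text; everything else in the table, and the names `NEXT_WORD_PROBS` and `sample_sequence`, are assumptions made for illustration.

```python
import random

# Toy next-word table: mostly made-up numbers, every row sums to 1.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def sample_sequence(context, steps, rng):
    """Draw every next word at random, weighted by its conditional probability."""
    sequence = tuple(context)
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        words = list(candidates)
        weights = [candidates[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        sequence += (word,)
    return sequence


rng = random.Random(0)  # fixed seed so that repeated runs give the same draws
samples = [sample_sequence(["The"], steps=2, rng=rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

Unlike greedy or beam search, different seeds give different sequences, and even a low-probability path such as ("The", "car", "drives") is picked from time to time.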

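In the transformers library, we set do_sample=True and deactivate Top-K sampling (more on this later) via top_k=0. For illustration purposes, we fix the random seed with set_seed(42). Feel free to change the seed to play around with the model.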

# set seed to reproduce results. Feel free to change the seed though to get different results
from transformers import set_seed
set_seed(42)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog for the rest of the day, but this had me staying in an unusual room and not going on nights out with friends (which will always be wondered for a mere minute or so at this point).

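Interesting! The text seems alright, but on closer inspection it is not very coherent and does not sound like it was written by a human. That is the big problem when sampling word sequences: the models often generate incoherent gibberish, cf. Ari Holtzman et al. (2019).

One trick is to make the distribution $P(w|w_{1:t-1})$ sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by lowering the so-called temperature of the softmax.

Applying temperature to the example above could look as follows.
[Figure: the example word tree after applying temperature]
The conditional next-word distribution at step t=1 becomes much sharper, leaving almost no chance for the word "car" to be selected.

Let's see how to make the distribution sharper by setting temperature=0.6: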

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog, but I don't like to chew on it. I like to eat it and not chew on it. I like to be able to walk with my dog."
# So how did you decide

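OK. There are fewer weird n-grams and the output is a bit more coherent now! While applying temperature makes a distribution less random, in the limit, as the temperature approaches 0, temperature-scaled sampling becomes equal to greedy decoding and suffers from the same problems as before.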
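The effect of temperature can be sketched directly on a small distribution. The sketch assumes the usual formulation, in which the logits are divided by the temperature before the softmax; the distribution and the function name `apply_temperature` are illustrative.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a distribution as softmax(log(p) / temperature).

    Assumes every probability is strictly positive.
    """
    scaled = {w: math.log(p) / temperature for w, p in probs.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - max_logit) for w, v in scaled.items()}
    total = sum(exps.values())
    return {w: v / total for w, v in exps.items()}


# illustrative next-word distribution after "The"
probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}
sharper = apply_temperature(probs, temperature=0.6)
almost_greedy = apply_temperature(probs, temperature=0.01)
print(sharper)
print(almost_greedy)
```

A temperature below 1 moves probability mass toward the most likely word, and a value close to 0 puts almost all of it there, which is why the limit behaves like greedy decoding.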

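V. Top-K Sampling

Fan et al. (2018) introduced a simple but very powerful sampling scheme called Top-K sampling. In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

To better illustrate Top-K sampling, we extend the range of words used for both sampling steps in the example above from 3 words to 10 words.
[Figure: Top-K sampling with K=6 on the two sampling steps]
Having set K=6, in both sampling steps we limit our sampling pool to 6 words. While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only about two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step. Nevertheless, we see that it successfully eliminates the rather weird candidates ("not", "the", "small", "told") in the second sampling step.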
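The filtering step can be sketched as below. The 10-word distribution is made up for illustration (its top 6 words hold about two-thirds of the mass, as in the description above), and `top_k_filter` is a hypothetical helper name.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}


# made-up, fairly flat next-word distribution over 10 words
step_one_probs = {
    "nice": 0.15, "dog": 0.13, "car": 0.11, "woman": 0.10, "guy": 0.09,
    "man": 0.09, "people": 0.085, "big": 0.085, "house": 0.08, "cat": 0.08,
}

filtered = top_k_filter(step_one_probs, k=6)
kept_mass = sum(step_one_probs[w] for w in filtered)
print(filtered)
print(kept_mass)
```

Sampling then proceeds as before, but only over the words that survive the filter.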

Let's see how Top-K can be used in the library by setting top_k=50:

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(42)

# set top_k to 50
sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog for the rest of the day, but this time it was hard for me to figure out what to do with it. (One reason I asked this for a few months back is that I had a

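Not bad at all! This text is arguably the most human-sounding so far. One concern with Top-K sampling, though, is that it does not dynamically adapt the number of words that are filtered from the next-word probability distribution $P(w|w_{1:t-1})$. This can be problematic, as some words might be sampled from a very sharp distribution (the distribution on the right in the figure above), whereas others are sampled from a much flatter distribution (the distribution on the left in the figure above).

At step t=1, Top-K eliminates the possibility of sampling ("people", "big", "house", "cat"), which seem like reasonable candidates. At step t=2, on the other hand, the method includes the arguably ill-fitting words ("down", "a") in the sampling pool. Limiting the sample pool to a fixed size K could therefore lead the model to produce gibberish for sharp distributions and limit its creativity for flat distributions. This intuition led Ari Holtzman et al. (2019) to create Top-p (or nucleus) sampling.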

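VI. Top-p (nucleus) sampling

Instead of sampling only from the K most likely words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (i.e. the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. Let's visualize this with the following figure.
[Figure: Top-p sampling with p=0.92 on the two sampling steps]
Having set p=0.92, Top-p sampling picks the minimum number of words that together exceed 92% of the probability mass, defined as $V_{\text{top-p}}$. In the first example, this includes the 9 most likely words, whereas in the second example it only has to pick the top 3 words to exceed 92%. Quite simple, actually! It can be seen that it keeps a wide range of words where the next word is arguably less predictable, e.g. P(w | "The"), and only a few words when the next word seems more predictable, e.g. P(w | "The", "car").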
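A minimal sketch of the nucleus filter, run on one flat and one sharp distribution. Both distributions are made-up numbers chosen to mirror the 9-word and 3-word cases described above, and `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total mass reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # renormalize so that the kept words form a distribution again
    return {w: q / cumulative for w, q in nucleus}


# made-up flat distribution (next word hard to predict)
flat_probs = {
    "nice": 0.15, "dog": 0.13, "car": 0.11, "woman": 0.10, "guy": 0.09,
    "man": 0.09, "people": 0.09, "big": 0.09, "house": 0.08, "cat": 0.07,
}
# made-up sharp distribution (next word easy to predict)
sharp_probs = {
    "drives": 0.5, "is": 0.3, "turns": 0.15, "stops": 0.02, "down": 0.01,
    "a": 0.008, "not": 0.005, "the": 0.004, "small": 0.002, "told": 0.001,
}

flat_nucleus = top_p_filter(flat_probs, p=0.92)
sharp_nucleus = top_p_filter(sharp_probs, p=0.92)
print(len(flat_nucleus), len(sharp_nucleus))
```

The same threshold keeps many candidates when the distribution is flat and only a handful when it is sharp, which is the dynamic behavior that a fixed K cannot provide.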

Alright, time to check it out in the transformers library! We activate Top-p sampling by setting 0 < top_p < 1:

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
# (model_inputs is the encoded context from the greedy search example)
sample_output = model.generate(
    **model_inputs,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
# Output:
# ----------------------------------------------------------------------------------------------------
# I enjoy walking with my cute dog. He will never be the same. I watch him play.
# Guys, my dog needs a name. Especially if he is found with wings.
# What was that? I had a lot o

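Great, that sounds like it could have been written by a human. Well, maybe not quite yet.

While in theory Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low-ranked words while still allowing for some dynamic selection.

Finally, to get multiple independently sampled outputs, we can again set the parameter num_return_sequences > 1: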

# set seed to reproduce results. Feel free to change the seed though to get different results
set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
# (model_inputs is the encoded context from the greedy search example)
sample_outputs = model.generate(
    **model_inputs,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
# Output:
# ----------------------------------------------------------------------------------------------------
# 0: I enjoy walking with my cute dog. It's so good to have the chance to walk with a dog. But I have this problem with the dog and how he's always looking at us and always trying to make me see that I can do something
# 1: I enjoy walking with my cute dog, she loves taking trips to different places on the planet, even in the desert! The world isn't big enough for us to travel by the bus with our beloved pup, but that's where I find my love
# 2: I enjoy walking with my cute dog and playing with our kids," said David J. Smith, director of the Humane Society of the US.
# "So as a result, I've got more work in my time," he said.

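Cool, now you should have all the tools to let your model write stories for you with transformers!

VII. Summary

Top-p and Top-K sampling seem to produce more fluent text than traditional greedy search and beam search on open-ended language generation. Recently, however, there has been more evidence that the apparent flaws of greedy search and beam search (mainly generating repetitive word sequences) are caused by the model (especially the way the model is trained) rather than by the decoding method, cf. Welleck et al. (2019). Moreover, as shown by Welleck et al. (2020), Top-K and Top-p sampling also suffer from generating repetitive word sequences.

In Welleck et al. (2019), the authors show through human evaluations that beam search can generate more fluent text than Top-p sampling when the model's training objective is adapted.

Open-ended language generation is a rapidly evolving field of research and, as is often the case, there is no one-size-fits-all decoding method, so one has to find out which method works best for the specific use case.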

References:

How to generate text: using different decoding methods for language generation with Transformers
The right way to generate text
An introduction to Huggingface's generate method: top_p sampling, top_k sampling, greedy_search, beam_search