(19) Decoding strategies in transformers

2025/2/21 3:48:33 / Article source: https://www.cnblogs.com/zhangxianrong/p/18384993

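Text generation strategies

Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and more. It also plays a role in a variety of mixed-modality applications that produce text as output, such as speech-to-text and vision-to-text. Models that can generate text include GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, and Whisper.

Here are a few tasks for which the generate() method is used to produce text output:

Text summarization
Image captioning
Audio transcription

Note that the inputs to the generate method depend on the model's modality. They are returned by the model's preprocessor class, such as AutoTokenizer or AutoProcessor. If a model's preprocessor creates more than one kind of input, pass all of the inputs to generate(). You can learn more about an individual model's preprocessor in the corresponding model's documentation.

The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy that the generate() method will use. Modifying a decoding strategy does not change the values of any trainable parameters. However, it can have a noticeable impact on the quality of the generated output: it can help reduce repetition in the text and make it more coherent.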

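This guide describes:

the default generation configuration
common decoding strategies and their main parameters
saving and sharing custom generation configurations together with your fine-tuned model on the 🤗 Hub

Default text generation configuration

A model's decoding strategy is defined in its generation configuration. When a pretrained model is used for inference within a pipeline(), the model calls the PreTrainedModel.generate() method, which applies a default generation configuration under the hood. The default configuration is also used when no custom configuration has been saved with the model.

When you load a model explicitly, you can inspect the generation configuration that comes with it through model.generation_config: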

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
>>> model.generation_config
GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256
}
<BLANKLINE>

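Printing model.generation_config shows only the values that differ from the default generation configuration; it does not list any of the default values.

The default generation configuration limits the size of the output combined with the input prompt to a maximum of 20 tokens, to avoid running into resource limitations. The default decoding strategy is greedy search, the simplest decoding strategy, which picks the token with the highest probability as the next token. For many tasks and small output sizes this works well. However, when used to generate longer outputs, greedy search can start producing highly repetitive results.

Customize text generation

You can override any generation_config value by passing the parameter and its value directly to the generate method: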

>>> my_model.generate(**inputs, num_beams=4, do_sample=True)

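Even if the default decoding strategy mostly works for your task, you can still tweak a few things. Some of the commonly adjusted parameters include:

max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence, not counting the tokens in the prompt. As an alternative to using the output length as a stopping criterion, you can choose to stop generation once the full generation exceeds some amount of time. To learn more, check out StoppingCriteria.
num_beams: by specifying a number of beams greater than 1, you effectively switch from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis with the overall highest probability for the entire sequence. Its advantage is that it can identify high-probability sequences that start with a lower-probability initial token and would have been ignored by greedy search. The original guide links to a visualization of how it works.
do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling, and Top-p sampling. All of these strategies select the next token from the probability distribution over the entire vocabulary, with various strategy-specific adjustments.
num_return_sequences: the number of sequence candidates to return for each input. This option is only available for decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies such as greedy search and contrastive search return a single output sequence.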
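
The "strategy-specific adjustments" mentioned for do_sample can be pictured with a small, library-free sketch of Top-K and Top-p filtering. The function names and the toy distribution below are made up for illustration and are not the transformers implementation; the sketch only assumes the usual definitions (Top-K keeps the k most probable tokens, Top-p keeps the smallest set of most probable tokens whose cumulative probability reaches p) and renormalizes what remains before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize (hypothetical helper)."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return {i: probs[i] / kept_mass for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


# Toy next-token distribution over a 4-token vocabulary (illustrative numbers).
toy_probs = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(toy_probs, k=2)        # only tokens 0 and 1 survive
nucleus = top_p_filter(toy_probs, p=0.9)   # tokens 0, 1 and 2 survive
```

Sampling would then draw from the filtered dictionary instead of from the full vocabulary, which removes the low-probability tail while keeping some randomness.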

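Save a custom decoding strategy with your model

If you would like to share your fine-tuned model with a specific generation configuration, you can:

Create a GenerationConfig class instance
Specify the decoding strategy parameters
Save your generation configuration with GenerationConfig.save_pretrained(), making sure to leave its config_file_name argument empty
Set push_to_hub to True to upload your configuration to the model's repo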

>>> from transformers import AutoModelForCausalLM, GenerationConfig
>>> model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
>>> generation_config = GenerationConfig(
...     max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
... )
>>> generation_config.save_pretrained("my_account/my_model", push_to_hub=True)

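You can also store several generation configurations in a single directory by using the config_file_name argument of GenerationConfig.save_pretrained(). You can later instantiate them with GenerationConfig.from_pretrained(). This is useful if you want to store several generation configurations for a single model (e.g. one for creative text generation with sampling, and one for summarization with beam search). You must have the right Hub permissions to add configuration files to a model.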

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
>>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
>>> translation_generation_config = GenerationConfig(
...     num_beams=4,
...     early_stopping=True,
...     decoder_start_token_id=0,
...     eos_token_id=model.config.eos_token_id,
...     pad_token_id=model.config.pad_token_id,
... )
>>> # Tip: add `push_to_hub=True` to push to the Hub
>>> translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")
>>> # You could then use the named generation config file to parameterize generation
>>> generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
>>> inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
>>> outputs = model.generate(**inputs, generation_config=generation_config)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Les fichiers de configuration sont faciles à utiliser!']

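Streaming

generate() supports streaming through its streamer input. The streamer input is compatible with any instance of a class that has the following methods: put() and end(). Internally, put() is used to push new tokens and end() is used to flag the end of text generation.

The API of the streamer classes is still under development and may change in the future.

In practice, you can craft your own streaming class for all sorts of purposes! There are also basic streaming classes ready for you to use. For example, you can use the TextStreamer class to stream the output of generate() to your screen, one word at a time: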

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
>>> streamer = TextStreamer(tok)
>>> # Despite returning the usual output, the streamer will also print the generated text to stdout.
>>> _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,
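
Since the only requirement stated above is a pair of put() and end() methods, a home-made streamer can be very small. The sketch below is an assumption-laden illustration: CollectingStreamer and fake_generate are invented names, and fake_generate merely stands in for generate() so the example runs without a model. With the real method, put() would receive token IDs rather than ready-made strings, so a real streamer would also need to decode them.

```python
class CollectingStreamer:
    """Hypothetical streamer: stores whatever is pushed and notes when generation ends."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Called once per newly produced piece of output.
        self.chunks.append(value)

    def end(self):
        # Called once, after the last piece has been pushed.
        self.finished = True


def fake_generate(tokens, streamer):
    """Stand-in for generate(): pushes tokens one by one, then signals the end."""
    for token in tokens:
        streamer.put(token)
    streamer.end()


streamer = CollectingStreamer()
fake_generate(["one,", "two,", "three"], streamer)
```

The same two-method shape could feed a queue, a websocket, or a progress bar instead of a list.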

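Watermarking

generate() supports watermarking the generated text by randomly marking a portion of the tokens as "green". During generation, a small 'bias' value is added to the logits of the "green" tokens, so they have a higher chance of being generated. Watermarked text can be detected by computing the proportion of "green" tokens in the text and estimating how likely it is, statistically, to obtain that number of "green" tokens in human-written text. This watermarking strategy was proposed in the paper "On the Reliability of Watermarks for Large Language Models". For more information on the inner workings of watermarking, refer to the paper.

Watermarking can be used with any generative model and does not require an extra classification model to detect watermarked text. To trigger watermarking, pass a WatermarkingConfig with the needed arguments directly to the generate() method, or add it to the GenerationConfig. Watermarked text can later be detected with a WatermarkDetector.

The WatermarkDetector internally relies on the proportion of "green" tokens and on whether the generated text follows the coloring pattern. That is why it is recommended to strip off the prompt text if it is much longer than the generated text. The same effect can occur when one sequence in a batch is much longer than the others, causing the other rows to be padded. Additionally, the detector must be initialized with the same watermark configuration arguments that were used for generation.

Let's generate some text with watermarking. In the code snippet below, we set the bias to 2.5, which is the value that will be added to the logits of the "green" tokens. After generating watermarked text, we can pass it directly to the WatermarkDetector to check whether the text is machine-generated (it outputs True for machine-generated text and False otherwise).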

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, WatermarkDetector, WatermarkingConfig
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
>>> tok.pad_token_id = tok.eos_token_id
>>> tok.padding_side = "left"
>>> inputs = tok(["This is the beginning of a long story", "Alice and Bob are"], padding=True, return_tensors="pt")
>>> input_len = inputs["input_ids"].shape[-1]
>>> watermarking_config = WatermarkingConfig(bias=2.5, seeding_scheme="selfhash")
>>> out = model.generate(**inputs, watermarking_config=watermarking_config, do_sample=False, max_length=20)
>>> detector = WatermarkDetector(model_config=model.config, device="cpu", watermarking_config=watermarking_config)
>>> detection_out = detector(out, return_dict=True)
>>> detection_out = detector(out, return_dict=True)
>>> detection_out.prediction
array([True, True])
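
The statistical test behind this kind of detection can be sketched without any model. The toy below is not the WatermarkDetector implementation: the hash-based coloring rule, the vocabulary size, and the green fraction of 0.25 are all assumptions chosen for illustration. It only shows the idea described above, namely counting "green" tokens and measuring, as a z-score, how far that count sits above what unwatermarked text would be expected to give.

```python
import hashlib
import math

VOCAB_SIZE = 50          # assumed toy vocabulary of token ids 0..49
GREEN_FRACTION = 0.25    # assumed share of the vocabulary marked green at each step


def is_green(prev_token, token):
    """Toy coloring rule: hash the (previous token, token) pair into [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GREEN_FRACTION


def z_score(tokens):
    """Standardized excess of green tokens over the unwatermarked expectation."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed the coloring
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * scored
    std = math.sqrt(scored * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (green - expected) / std


def watermarked_sequence(length, start=0):
    """Extreme case of a very large bias: always emit some green token."""
    tokens = [start]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(VOCAB_SIZE) if is_green(prev, t)))
    return tokens


marked_z = z_score(watermarked_sequence(40))
plain_z = z_score(list(range(40)))  # arbitrary text that ignores the coloring
```

A detector would compare the z-score against a threshold; a finite bias such as 2.5 nudges the model toward green tokens instead of forcing them, so real watermarked text lands somewhere between the two extremes shown here.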

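Decoding strategies

Certain combinations of the generate() parameters, and ultimately of generation_config, can be used to enable specific decoding strategies. If you are new to this concept, we recommend reading the blog post linked from the original guide, which illustrates how common decoding strategies work.

Here, we'll show some of the parameters that control the decoding strategies and illustrate how to use them.

Selecting a given decoding strategy is not the only way to influence the outcome of generate() with your model. Decoding strategies act (mostly) on the logits, the distribution of probabilities for the next token, so choosing a good logits manipulation strategy can go a long way. In other words, manipulating the logits is another dimension you can act on, in addition to selecting a decoding strategy. Popular logits manipulation strategies include top_p, min_p, and repetition_penalty; you can check the full list in the GenerationConfig class.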
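
As one concrete picture of logits manipulation, here is a minimal sketch of a repetition penalty applied to a list of logits. It is written from the commonly described rule (lower the score of every token that has already been generated, dividing positive logits and multiplying negative ones by the penalty) and is not copied from the library; the function name and the numbers are illustrative.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Make tokens that were already generated less likely (assumes penalty > 1)."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit, or multiplying a negative one, both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.4, -1.0, 0.6, 3.0]
# Tokens 0 and 1 already appear in the generated text; tokens 2 and 3 do not.
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=1.2)
```

Whatever decoding strategy runs afterwards (greedy, sampling, beam search) then sees the adjusted scores, which is why the two choices are independent of each other.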

Greedy Search

generate uses greedy search decoding by default, so you don't have to pass any parameters to enable it. This means that num_beams is set to 1 and do_sample=False.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> prompt = "I look forward to"
>>> checkpoint = "distilbert/distilgpt2"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
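
To make the rule itself concrete, here is a toy greedy decoder that needs no model download. The "model" is a hand-written table of next-token probabilities (all tokens and numbers are invented), and the loop simply takes the most probable next token until it reaches an end marker or a length limit.

```python
# Hypothetical bigram "model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"dog": 0.4, "cat": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def greedy_decode(model, max_new_tokens=20):
    """Always append the single most probable next token."""
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        distribution = model[tokens[-1]]
        next_token = max(distribution, key=distribution.get)
        tokens.append(next_token)
        if next_token == "</s>":
            break
    return tokens


greedy_tokens = greedy_decode(TOY_MODEL)
```

Here the decoder commits to "the" because it is locally the best choice, even though the sequence starting with "a" has a higher overall probability (0.4 × 0.9 = 0.36 versus 0.6 × 0.4 = 0.24).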

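Contrastive search

The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation. It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search works, see the blog post linked from the original guide. The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k: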

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> checkpoint = "openai-community/gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> prompt = "Hugging Face Company is"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best
in the business and our customer service is second to none.\n\nIf you have any questions about our
products or services, feel free to contact us at any time. We look forward to hearing from you!']
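
The roles of penalty_alpha and top_k can be sketched with the scoring rule from the contrastive search paper: among the top-k candidates, each one is scored by its probability minus a penalty for being too similar to what has already been generated. The code below is a simplified, library-free illustration; the two-dimensional "hidden states" and probabilities are invented, and contrastive_pick is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, context_states, penalty_alpha):
    """candidates: (token, probability, hidden_state) triples for the top-k tokens."""

    def score(candidate):
        _, prob, state = candidate
        # Degeneration penalty: similarity to the closest previous hidden state.
        degeneration = max(cosine(state, h) for h in context_states)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration

    return max(candidates, key=score)[0]


# Hidden states of the tokens generated so far (toy 2-d vectors).
context_states = [[1.0, 0.0], [0.8, 0.6]]
candidates = [
    ("again", 0.5, [1.0, 0.0]),  # most probable, but identical to an earlier state
    ("today", 0.4, [0.0, 1.0]),  # slightly less probable, points somewhere new
]

with_penalty = contrastive_pick(candidates, context_states, penalty_alpha=0.6)
without_penalty = contrastive_pick(candidates, context_states, penalty_alpha=0.0)
```

With penalty_alpha=0 the rule reduces to picking the most probable candidate; raising it trades some probability for novelty.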

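Multinomial sampling

As opposed to greedy search, which always chooses the token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token according to the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, which reduces the risk of repetition.

To enable multinomial sampling, set do_sample=True and num_beams=1.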

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
>>> set_seed(0)  # For reproducibility
>>> checkpoint = "openai-community/gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> prompt = "Today was an amazing day because"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Today was an amazing day because we received these wonderful items by the way of a gift shop. The box arrived on a Thursday and I opened it on Monday afternoon to receive the gifts. Both bags featured pieces from all the previous years!\n\nThe box had lots of surprises in it, including some sweet little mini chocolate chips! I don't think I'd eat all of these. This was definitely one of the most expensive presents I have ever got, I actually got most of them for free!\n\nThe first package came"]
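
A single sampling step can be sketched with the standard library alone. The helper below (a hypothetical name, not a transformers function) turns logits into softmax weights, optionally rescaled by a temperature, and draws one token index in proportion to those weights. The logits are invented.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
logits = [2.0, 1.0, 0.0]
draws = [sample_next_token(logits, rng=rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(3)]
```

Greedy search would return index 0 every time; sampling returns it most often but still visits the other tokens, roughly in proportion to their probabilities.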

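Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis with the overall highest probability for the entire sequence. Its advantage is that it can identify high-probability sequences that start with a lower-probability initial token and would have been ignored by greedy search.

You can see how beam-search decoding works in the interactive demo linked from the original guide: type an input sentence and play with the parameters to see how the decoding beams change.

To enable this decoding strategy, set num_beams (the number of hypotheses to keep track of) to a value greater than 1.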

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> prompt = "It is astonishing how one can"
>>> checkpoint = "openai-community/gpt2-medium"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of
time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
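
The advantage described above (finding a high-probability sequence that starts with a lower-probability token) is easy to reproduce on a toy problem. The sketch below is a bare-bones beam search over a hand-written probability table; it is an illustration under simplifying assumptions (no length penalty, no early-stopping options) and not the library's implementation. All tokens and numbers are invented.

```python
import math

# Hypothetical bigram "model": next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"dog": 0.4, "cat": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}


def beam_search(model, num_beams, max_new_tokens=20):
    """Keep the num_beams best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished hypotheses carry over
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1]


single_beam = beam_search(TOY_MODEL, num_beams=1)  # behaves like greedy search
two_beams = beam_search(TOY_MODEL, num_beams=2)
```

With one beam the search commits to "the" (probability 0.6) and ends with a sequence of probability 0.24; with two beams it keeps "a" alive long enough to discover "a cat", whose overall probability is 0.36.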

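Beam-search multinomial sampling

As the name implies, this decoding strategy combines beam search with multinomial sampling. To use it, set num_beams to a value greater than 1 and set do_sample=True.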

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed
>>> set_seed(0)  # For reproducibility
>>> prompt = "translate English to German: The house is wonderful."
>>> checkpoint = "google-t5/t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Das Haus ist wunderbar.'

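Diverse beam search decoding

The diverse beam search decoding strategy is an extension of beam search that allows generating a more diverse set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty. The diversity penalty ensures that the outputs are distinct across groups, and beam search is used within each group.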

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
>>> checkpoint = "google/pegasus-xsum"
>>> prompt = (
...     "The Permaculture Design Principles are a set of universal design principles "
...     "that can be applied to any location, climate and culture, and they allow us to design "
...     "the most efficient and sustainable human habitation and food production systems. "
...     "Permaculture is a design system that encompasses a wide variety of disciplines, such "
...     "as ecology, landscape design, environmental science and energy conservation, and the "
...     "Permaculture design principles are drawn from these various disciplines. Each individual "
...     "design principle itself embodies a complete conceptual framework based on sound "
...     "scientific principles. When we bring all these separate  principles together, we can "
...     "create a design system that both looks at whole systems, the parts that these systems "
...     "consist of, and how those parts interact with each other to create a complex, dynamic, "
...     "living system. Each design principle serves as a tool that allows us to integrate all "
...     "the separate parts of a design, referred to as elements, into a functional, synergistic, "
...     "whole system, where the elements harmoniously interact and work together in the most "
...     "efficient way possible."
... )
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'The Design Principles are a set of universal design principles that can be applied to any location, climate and
culture, and they allow us to design the'
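
How a diversity penalty keeps groups apart can be shown for a single decoding step. The sketch below assumes a Hamming-style penalty, in which a token's score is lowered by diversity_penalty for every earlier group that already chose it at this step; pick_per_group and the log-probabilities are invented for illustration, and each group is reduced to one beam to keep the example short.

```python
def pick_per_group(group_logprobs, diversity_penalty):
    """Choose one token per beam group, penalizing tokens earlier groups already chose."""
    chosen = []
    for logprobs in group_logprobs:
        adjusted = {
            token: lp - diversity_penalty * chosen.count(token)
            for token, lp in logprobs.items()
        }
        chosen.append(max(adjusted, key=adjusted.get))
    return chosen


# All three groups see the same next-token log-probabilities at this step.
step_logprobs = {"The": -0.2, "Permaculture": -0.5, "Design": -0.9}
same_for_all = [step_logprobs] * 3

no_penalty = pick_per_group(same_for_all, diversity_penalty=0.0)
with_penalty = pick_per_group(same_for_all, diversity_penalty=1.0)
```

Without the penalty every group picks "The"; with diversity_penalty=1.0 each later group is pushed to a different token, which is what makes the returned beams differ.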

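This guide illustrates the main parameters that enable the various decoding strategies. More advanced parameters exist for the generate method, which give you even finer control over its behavior. For the complete list of available parameters, refer to the API documentation.

Speculative Decoding

Speculative decoding (also known as assisted decoding) is a modification of the decoding strategies above that uses an assistant model (ideally a much smaller one) to generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. If do_sample=True, the token validation with resampling introduced in the speculative decoding paper is used.

Currently, assisted decoding supports only greedy search and sampling, and it does not support batched inputs. To learn more about assisted decoding, see the blog post linked from the original guide.

To enable assisted decoding, set the assistant_model argument to a model.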

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
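
The draft-then-verify loop can be sketched for the greedy case with two toy "models" that are plain Python functions. Everything here is hypothetical (assisted_step, the target sentence, the assistant's single mistake); the sketch only illustrates the idea stated above: the assistant drafts several tokens, the main model checks them, the longest agreeing prefix is kept, and the main model's own token replaces the first disagreement.

```python
def assisted_step(prefix, main_next, assistant_next, num_candidates):
    """One round of greedy assisted decoding on toy next-token functions."""
    # 1. The assistant drafts a few tokens autoregressively.
    draft = []
    for _ in range(num_candidates):
        draft.append(assistant_next(prefix + draft))
    # 2. The main model checks each drafted position
    #    (a real model can score all positions in one forward pass).
    accepted = []
    for token in draft:
        expected = main_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the main model's token, stop
            return accepted
        accepted.append(token)
    # 3. Every draft matched, so the same pass also yields one extra token.
    accepted.append(main_next(prefix + accepted))
    return accepted


TARGET = ["Alice", "and", "Bob", "are", "sitting", "in", "a", "bar"]


def main_next(tokens):
    """Toy main model: always continues the target sentence."""
    return TARGET[len(tokens)]


def assistant_next(tokens):
    """Toy assistant: agrees with the main model except at position 4."""
    return "standing" if len(tokens) == 4 else TARGET[len(tokens)]


new_tokens = assisted_step(["Alice", "and"], main_next, assistant_next, num_candidates=3)
all_accepted = assisted_step(["Alice"], main_next, assistant_next, num_candidates=2)
```

In both calls the result is exactly what the main model would have produced on its own with greedy search; the gain is that several tokens are confirmed per main-model pass instead of one.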

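When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness, just as in multinomial sampling. In assisted decoding, however, lowering the temperature may help improve latency.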

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> set_seed(42)  # For reproducibility
>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob, a couple of friends of mine, who are both in the same office as']

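Alternatively, you can set prompt_lookup_num_tokens to trigger n-gram based assisted decoding instead of model-based assisted decoding. The original guide links to further reading on this.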
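
The n-gram variant replaces the assistant model with a lookup in the text seen so far. The sketch below is a guess at the general idea rather than the library's code: it searches for an earlier occurrence of the trailing n-gram and proposes the tokens that followed it as draft candidates. The function name and its ngram_size parameter are assumptions; num_candidates plays the role that prompt_lookup_num_tokens appears to play in generate().

```python
def prompt_lookup_candidates(tokens, ngram_size, num_candidates):
    """Propose draft tokens by finding an earlier copy of the trailing n-gram."""
    tail = tokens[-ngram_size:]
    # Walk backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_candidates]
    return []  # no match: nothing to draft this round


tokens = "the quick brown fox jumps . the quick".split()
candidates = prompt_lookup_candidates(tokens, ngram_size=2, num_candidates=3)
no_match = prompt_lookup_candidates("a b c d".split(), ngram_size=2, num_candidates=3)
```

The drafted tokens would then be verified by the main model exactly as in model-based assisted decoding, which tends to pay off when the output repeats parts of the input (summaries, code edits, question answering over a passage).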

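DoLa Decoding

Decoding by Contrasting Layers (DoLa) is a contrastive decoding strategy for improving factuality and reducing the hallucinations of LLMs, as described in the ICLR 2024 paper DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models.

DoLa works by contrasting the logits obtained from the final layer with those obtained from earlier layers, thereby amplifying the factual knowledge localized in particular parts of the transformer layers.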
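
The contrast itself can be pictured as a difference of log-probabilities between two layers. The sketch below is a deliberate simplification with invented numbers: it leaves out how the premature layer is chosen and any filtering of implausible tokens that the paper may apply, and contrast_layers is a hypothetical helper rather than part of transformers.

```python
import math


def log_softmax(logits):
    """Convert logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def contrast_layers(final_logits, premature_logits):
    """Score each token by how much its log-probability grows between the two layers."""
    final_lp = log_softmax(final_logits)
    early_lp = log_softmax(premature_logits)
    return [f - e for f, e in zip(final_lp, early_lp)]


# Token 0 is already likely in the early layer; token 1 only emerges in the final layer.
premature_logits = [3.0, 0.0, 0.0]
final_logits = [3.0, 2.5, 0.0]

scores = contrast_layers(final_logits, premature_logits)
final_choice = final_logits.index(max(final_logits))
contrast_choice = scores.index(max(scores))
```

The final layer alone would pick token 0, while the contrast favors token 1, the token whose probability rose the most in the later layers.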

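Follow these two steps to activate DoLa decoding when calling the model.generate function:

Set the dola_layers argument, which can be either a string or a list of integers.
- If set to a string, it can be either low or high.
- If set to a list of integers, it should be a list of layer indices between 0 and the total number of layers in the model. Layer 0 is the word embedding, layer 1 is the first transformer layer, and so on.
Setting repetition_penalty = 1.2 is suggested to reduce repetition in DoLa decoding.

The following examples show DoLa decoding with the 32-layer LLaMA-7B model.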

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
>>> model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", torch_dtype=torch.float16)
>>> device = 'cuda' if torch.cuda.is_available() else 'cpu'
>>> model.to(device)
>>> set_seed(42)
>>> text = "On what date was the Declaration of Independence officially signed?"
>>> inputs = tokenizer(text, return_tensors="pt").to(device)
>>> # Vanilla greedy decoding
>>> vanilla_output = model.generate(**inputs, do_sample=False, max_new_tokens=50)
>>> tokenizer.batch_decode(vanilla_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
['\nThe Declaration of Independence was signed on July 4, 1776.\nWhat was the date of the signing of the Declaration of Independence?\nThe Declaration of Independence was signed on July 4,']
>>> # DoLa decoding with contrasting higher part of layers (layers 16,18,...,30)
>>> dola_high_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers='high')
>>> tokenizer.batch_decode(dola_high_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
['\nJuly 4, 1776, when the Continental Congress voted to separate from Great Britain. The 56 delegates to the Continental Congress signed the Declaration on August 2, 1776.']
>>> # DoLa decoding with contrasting specific layers (layers 28 and 30)
>>> dola_custom_output = model.generate(**inputs, do_sample=False, max_new_tokens=50, dola_layers=[28,30], repetition_penalty=1.2)
>>> tokenizer.batch_decode(dola_custom_output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
['\nIt was officially signed on 2 August 1776, when 56 members of the Second Continental Congress, representing the original 13 American colonies, voted unanimously for the resolution for independence. The 2']

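Understanding the dola_layers argument

dola_layers stands for the candidate layers in premature layer selection, as described in the DoLa paper. The selected premature layer is contrasted with the final layer.

Setting dola_layers to 'low' or 'high' selects the lower or the higher part of the layers to contrast, respectively.

For N-layer models with N <= 40, the layers in range(0, N // 2, 2) and range(N // 2, N, 2) are used for 'low' and 'high', respectively.
For models with N > 40 layers, the layers in range(0, 20, 2) and range(N - 20, N, 2) are used for 'low' and 'high', respectively.
If the model has tied word embeddings, the word embeddings (layer 0) are skipped and selection starts from layer 2, because an early exit from the word embeddings would become the identity function.
Set dola_layers to a list of integer layer indices to contrast manually specified layers. For example, setting dola_layers=[28,30] contrasts the final layer (layer 32) with layers 28 and 30.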
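
The selection rules above translate almost directly into code. The helper below is a sketch written from this description alone, not taken from the library; its name and the tied_embeddings flag are made up, and it assumes the "start from layer 2" rule only matters for the 'low' range, since the 'high' range never contains layer 0.

```python
def candidate_premature_layers(dola_layers, num_layers, tied_embeddings=False):
    """Return the layer indices contrasted with the final layer, per the rules above."""
    if isinstance(dola_layers, list):
        return dola_layers  # manually specified layers are used as given
    start = 2 if tied_embeddings else 0  # skip the word embeddings when they are tied
    if dola_layers == "low":
        stop = num_layers // 2 if num_layers <= 40 else 20
        return list(range(start, stop, 2))
    if dola_layers == "high":
        first = num_layers // 2 if num_layers <= 40 else num_layers - 20
        return list(range(first, num_layers, 2))
    raise ValueError("dola_layers must be 'low', 'high' or a list of integers")


high_32 = candidate_premature_layers("high", num_layers=32)
low_32_tied = candidate_premature_layers("low", num_layers=32, tied_embeddings=True)
high_80 = candidate_premature_layers("high", num_layers=80)
```

For a 32-layer model, 'high' yields layers 16, 18, ..., 30, which matches the comment in the LLaMA-7B example.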