LLMs: A Light Analysis of DeepSeek (Part 3): Reproducing R1's Reinforcement Learning

  DeepSeek-R1's most distinctive innovation is its reward design together with the GRPO algorithm it introduced; a detailed explanation is here: https://www.cnblogs.com/theseventhson/p/18696408
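  At its core, GRPO samples a group of completions for each prompt, scores each with the reward functions, and normalizes every completion's reward against the group mean and standard deviation to get an advantage, so no separate critic/value model is needed. Below is a minimal sketch of that group-relative advantage; it is illustrative only, and `group_relative_advantages` is a made-up helper name, not TRL code:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, num_generations), one total reward per sampled completion
    mean = rewards.mean(dim=1, keepdim=True)   # per-prompt group mean
    std = rewards.std(dim=1, keepdim=True)     # per-prompt group std
    return (rewards - mean) / (std + eps)      # how much better each completion is than its own group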

  

  This is how the two RL versions, R1-Zero and R1, were trained. Fortunately, the GRPO algorithm has already been implemented and integrated into Hugging Face (the TRL library), so we can call it directly; the demo is here: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb?permalink_comment_id=5417630

  1. Training obviously needs data. Here we use the https://huggingface.co/datasets/openai/gsm8k dataset, which looks like this: question, answer (reasoning + result). The question is on the left, the answer on the right, and the answer contains the reasoning steps followed by the final result.
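  A quick way to peek at one raw sample (field names as used below; the final number in the answer is separated from the reasoning by "####"):

from datasets import load_dataset

sample = load_dataset("openai/gsm8k", "main")["train"][0]
print(sample["question"])   # the word problem
print(sample["answer"])     # reasoning steps, then "#### <final number>"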

  

  With the dataset in place, the next step is preparing the training samples (essentially a dataset-building step); the code is as follows:

# Extract the final result from the answer field. These are math problems whose ground truth
# is a single number, separated from the reasoning by "####", so we split on "####".
def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build the chat prompt and pull the answer out separately
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split]  # type: ignore
    data = data.map(lambda x: {  # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })  # type: ignore
    return data  # type: ignore
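  The trainer further down references a `dataset` variable; presumably it comes from calling this helper once (this line is implied by the demo rather than shown above):

dataset = get_gsm8k_questions()   # chat-style prompts plus the extracted numeric ground truth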

  2. In DeepSeek's training, the reward looks not only at the final result but also at whether the response is correctly formatted, e.g. it must contain a reasoning process as well as a final answer. So we first define the required response format:

import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer

# Load and prep dataset: the required format is a reasoning process plus a final result
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()
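  A quick check of what the extractor returns for a completion written in the required format (a made-up string, just for illustration):

demo = XML_COT_FORMAT.format(reasoning="15 * 10 = 150; 150 * 1/10 = 15; 150 - 15 = 135", answer="135")
print(extract_xml_answer(demo))   # -> "135"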

  Now for the core part: how is the reward actually implemented in code? The reward functions for the GSM8K dataset:

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # Print the question, the ground-truth answer, the LLM's response, and the result extracted from it
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    # If the LLM's result matches the training sample's answer, the response is correct: reward = 2.0
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

# The ground truth is a number, so the LLM's final output only earns reward = 0.5 if it is a digit string
def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

# Only if the response contains the <reasoning> and <answer> tags does it meet the required format: reward = 0.5
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

# Same as above, but with a looser regex check
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
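  To make the expected data shape concrete: each completion arrives as a list of chat messages, so completion[0]['content'] is the generated text. A tiny made-up call outside of training, just to show inputs and outputs:

fake_completions = [[{'role': 'assistant',
                      'content': XML_COT_FORMAT.format(reasoning="2 + 2 = 4", answer="4")}]]
print(int_reward_func(fake_completions))             # [0.5]: extracted "4" is a digit string
print(strict_format_reward_func(fake_completions))   # [0.5]: single-line reasoning matches the strict pattern
print(xmlcount_reward_func(fake_completions))        # [0.5]: 0.125 for each well-placed tag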

  Choose the base model:

model_name = "Qwen/Qwen2.5-1.5B-Instruct"   # swap in another model as needed

output_dir = "outputs/Qwen2.5-1.5B-Instruct-GRPO"
run_name = "Qwen-1.5B-GRPO-gsm8k"

training_args = GRPOConfig(
    output_dir=output_dir,
    run_name=run_name,
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # ~8k samples, so roughly 2k gradient-update steps
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,       # cap on the reasoning/answer length
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=False,
    vllm_gpu_memory_utilization=.3,
    vllm_device="cuda:0",
    report_to="none"                 # disable wandb
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

  Then we simply start training with the off-the-shelf API: at every step the reward is computed from these 5 reward helper functions, and the gradients are updated accordingly.

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,        # custom format reward
        soft_format_reward_func,     # custom format reward
        strict_format_reward_func,   # custom format reward
        int_reward_func,             # custom "result is a number" reward
        correctness_reward_func      # custom correctness reward
    ],
    args=training_args,
    train_dataset=dataset,
    # peft_config=peft_config
)
trainer.train()
trainer.save_model(output_dir)
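  Conceptually, the reward GRPO sees for each completion is (roughly) the sum of what the five functions return; TRL combines them internally, so the sketch below is only to make the idea concrete, not the trainer's actual code:

# Illustrative only: total reward per completion, assuming equal weights for all reward functions
def total_reward(prompts, completions, answer):
    funcs = [xmlcount_reward_func, soft_format_reward_func, strict_format_reward_func,
             int_reward_func, correctness_reward_func]
    per_func = [f(prompts=prompts, completions=completions, answer=answer) for f in funcs]
    return [sum(vals) for vals in zip(*per_func)]   # one scalar per completion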

  See how easy that is? Since the GRPO algorithm is already implemented for us, all we have to do is define the reward functions and call GRPOTrainer.
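  After training, a quick sanity check is to prompt the fine-tuned model with the same SYSTEM_PROMPT and see whether it emits the <reasoning>/<answer> format. This is only a sketch, assuming the checkpoint saved in output_dir loads the usual transformers way, and it reuses one of the GSM8K questions shown below as an example:

ft_model = AutoModelForCausalLM.from_pretrained(output_dir, torch_dtype=torch.bfloat16).to("cuda")
messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': "There are 15 tables in the school's cafeteria. Each table can seat 10 people. Usually, only 1/10 of the seats are left unseated. How many seats are usually taken?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = ft_model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))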

  3. Interpreting the results

  At the very start of the training loop, the base model's answers look exactly as they did before fine-tuning: no trace of the required format and no reliably extractable answer. The results are poor and reflect only the base model's raw capability.

-------------------- Question: # the training sample's question
There are 15 tables in the school's cafeteria. Each table can seat 10 people. Usually, only 1/10 of the seats are left unseated. How many seats are usually taken? 
Answer: # the training sample's ground truth
135 
Response: # the model's reply: no structure at all
To find out how many seats are usually taken, we can follow these steps:
1. **Calculate the total number of seats in the cafeteria:**- There are 15 tables, and each table seats 10 people.\[\text{Total seats} = 15 \text{ tables} \times 10 \text{ seats/table} = 150 \text{ seats}\]
2. **Determine the number of seats that are usually left unseated:**- Typically, only 1/10 of the seats are left unseated.\[\text{Unseated seats} = 150 \text{ seats} \times \frac{1}{10} = 15 \text{ seats}\]
3. **Calculate the number of seats that are usually taken:**- The total number of seats is 150, and 
Extracted: # the result extracted from the reply
To find out how many seats are usually taken, we can follow these steps:
1. **Calculate the total number of seats in the cafeteria:**- There are 15 tables, and each table seats 10 people.\[\text{Total seats} = 15 \text{ tables} \times 10 \text{ seats/table} = 150 \text{ seats}\]
2. **Determine the number of seats that are usually left unseated:**- Typically, only 1/10 of the seats are left unseated.\[\text{Unseated seats} = 150 \text{ seats} \times \frac{1}{10} = 15 \text{ seats}\]
3. **Calculate the number of seats that are usually taken:**- The total number of seats is 150, and

  After roughly 100+ steps, the model finally produces the required format:

-------------------- Question:
There are 18 green leaves on each of the 3 tea leaf plants. One-third of them turn yellow and fall off on each of the tea leaf plants.  How many green leaves are left on the tea leaf plants? 
Answer:
36 
Response:
<reasoning>
Initially, there are 18 * 3 = 54 green leaves on all the tea leaf plants. One-third of them turn yellow on each plant, which means 54 / 3 = 18 leaves turn yellow. After the yellow leaves turn yellow, there are 54 - 18 = 36 leaves left on all the tea leaf plants.</reasoning>
<answer>
36 green leaves
</answer> 
Extracted:
36 green leaves

  In other words, the first 100+ steps are basically flailing around, pure wasted compute. R1-Zero presumably faced exactly this situation, which is why, when training R1, the base model is first SFT-ed (with part of the data generated by R1-Zero) and only then trained with GRPO-based RL. This greatly reduces the compute wasted in the early phase.

Summary:

  1. The way reinforcement learning is used here is not complicated in principle; in fact it is quite simple: all you really need is a well-defined reward.

 

References:

1. Hands-on code reproduction of DeepSeek-R1 reinforcement-learning training: https://www.bilibili.com/video/BV1XDPQeeEF7/?spm_id_from=333.999.0.0&vd_source=241a5bcb1c13e6828e519dd1f78f35b2

2. How to quickly fine-tune the DeepSeek-R1-8b model and visualize the training process: https://www.bilibili.com/video/BV13ZPdejE1K/?spm_id_from=333.788.recommend_more_video.3&vd_source=241a5bcb1c13e6828e519dd1f78f35b2

3. A detailed walkthrough of the full DeepSeek R1 training pipeline: https://www.bilibili.com/video/BV13jPde5EPk/?spm_id_from=333.788.recommend_more_video.2&vd_source=241a5bcb1c13e6828e519dd1f78f35b2

4. GRPO reinforcement-learning demo: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb?permalink_comment_id=5417630

5. Reproducing GRPO on a Qwen base model: https://github.com/waylandzhang/DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k/blob/main/train-checkpoint-900.ipynb
