LLM in Practice: Unsloth + Llama3, a Fine-Tuning Accelerator for LLMs

1. Background

After the May Day holiday, yours truly (本qiang~) dove back into the sea of LLM technology. This installment introduces a fine-tuning accelerator: Unsloth.

As the official Unsloth pitch puts it: "Easily finetune & train LLMs; Get faster with unsloth." Fine-tuning an LLM with it is noticeably faster, and GPU memory usage drops significantly as well.

One caveat: the open-source part of Unsloth currently supports only single-machine fine-tuning; the more efficient variants require paying for Unsloth Pro.

2. About Unsloth

2.1 Key Features

(1) All kernels are written in OpenAI's Triton language, with a hand-written backpropagation engine. Triton is a GPU kernel language well suited to accelerating LLM training.

(2) Zero loss of accuracy: there are no approximation methods; the computation is exactly the same as the standard implementation.

(3) No hardware changes required. NVIDIA GPUs from 2018 onward are supported (V100, T4, Titan V, RTX 20/30/40 series, A100, H100, L40, etc.; GTX 1070/1080 also work, but more slowly). The minimum CUDA compute capability is 7.0; a quick way to check your card is shown right after this list.

(4) Works on Linux, and on Windows via WSL.

(5) Supports 4-bit and 16-bit QLoRA/LoRA fine-tuning via the bitsandbytes package.

(6) The open-source version delivers up to a 5x training speed-up; Unsloth Pro claims up to 30x.
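As a quick aside (my own addition, not from the Unsloth docs), you can check whether your card clears the compute-capability bar, and whether it supports bf16, straight from PyTorch:

import torch

# Compute capability of the first visible GPU; Unsloth targets >= 7.0 (V100 and newer).
# Older cards such as the GTX 1070/1080 (6.1) still run, just more slowly.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
print("bf16 supported:", torch.cuda.is_bf16_supported())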

2.2 Currently Supported Models

Because the low-level operators have to be rewritten in Triton, adapting some open-source models can take a relatively long time. Models currently supported by Unsloth include Qwen1.5 (7B, 14B, 32B, 72B), Llama3-8B, Mistral-7B, Gemma-7B, the ORPO and DPO Zephyr recipes, Phi-3 (3.8B), and TinyLlama.

2.3 Acceleration Results

The Qwen1.5-7B integration was packaged and validated by the author of Firefly, showing a 30%+ speed-up and a 40%+ reduction in GPU memory usage; see reference 2 for details.

2.4 Installation

conda create --name unsloth_env python=3.10
conda activate unsloth_env
conda install pytorch-cuda=<12.1/11.8> pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
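After installing, a minimal smoke test (my addition; the 4-bit checkpoint name below is an assumption, swap in your own local path) is to load a model through FastLanguageModel and confirm the stack works end to end:

import torch
from unsloth import FastLanguageModel

# Load a 4-bit quantized Llama3 through unsloth; dtype=None lets unsloth pick
# bfloat16 on Ampere+ cards and float16 on older ones.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # assumed public 4-bit checkpoint; replace with a local path if needed
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
print(type(model).__name__, f"{torch.cuda.memory_allocated() / 1024 ** 3:.2f} GB allocated")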

3. Hands-On

In the spirit of "reading something a thousand times is no substitute for working through it once," I put together a comparison experiment for Unsloth. The test environments were P40, A40, and A800 GPUs, and the model was the freshly released Llama3 (8B).

3.1 Comparison Dimensions

Dimension                     | Parameter / note
------------------------------|---------------------------------
GPU                           | whether the card supports bf16
Maximum sequence length       | max_seq_length
Batch size                    | per_device_train_batch_size
Gradient accumulation steps   | gradient_accumulation_steps
LoRA rank                     | r
Dropout                       | lora_dropout
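Note that batch size and gradient accumulation trade off against each other: what the optimizer effectively sees is per_device_train_batch_size × gradient_accumulation_steps (× number of GPUs, which is 1 here). A tiny illustration (my addition) for the configurations used in the script below:

# Effective batch size for the single-GPU configurations compared in section 3.2.
configs = [(1, 16), (4, 4), (16, 4)]  # (per_device_train_batch_size, gradient_accumulation_steps)
for bs, accum in configs:
    print(f"per_device={bs:>2}, accum={accum:>2} -> effective batch size {bs * accum}")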

3.2 Source Code

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer, AutoModelForCausalLM, set_seed, AutoTokenizer, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
import gc

set_seed(42)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


def train_unsloth(dtype, max_seq_length, per_device_train_batch_size, gradient_accumulation_steps, rank,
                  lora_alpha=16, lora_dropout=0, max_steps=50, save_steps=50, seed=42,
                  warmup_steps=5, learning_rate=2e-4, logging_steps=5):
    """Fine-tune with the unsloth framework."""
    print(f'dtype:{dtype}, max_seq_length:{max_seq_length}, per_device_train_batch_size:{per_device_train_batch_size}, gradient_accumulation_steps:{gradient_accumulation_steps}, rank:{rank}, lora_dropout:{lora_dropout}')

    # Load the base model in 4-bit via unsloth
    load_in_4bit = True
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name='pretrain_models/llama/llama3-8B-Instruct',
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit)

    # Attach LoRA adapters through unsloth's patched PEFT entry point
    model = FastLanguageModel.get_peft_model(
        model,
        r=rank,
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        bias='none',
        use_gradient_checkpointing=True,
        random_state=seed,
        use_rslora=False)

    EOS_TOKEN = tokenizer.eos_token

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)
        return {"text": texts}

    dataset = load_dataset("yahma/alpaca-cleaned", split="train")
    dataset = dataset.map(formatting_prompts_func, batched=True)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field='text',
        max_seq_length=max_seq_length,
        packing=False,
        args=TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            learning_rate=learning_rate,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=logging_steps,
            optim='adamw_8bit',
            weight_decay=0.01,
            lr_scheduler_type='linear',
            seed=seed,
            output_dir='output/llame3-8b-instruct-unsloth',
            save_steps=save_steps,
            max_steps=max_steps))

    # Record GPU memory before training
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    print(f"{start_gpu_memory} GB of memory reserved.")

    trainer_stats = trainer.train()

    # Report runtime and peak memory after training
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

    model.save_pretrained("output/llame3-8b-instruct-unsloth-lora")  # Local saving
    tokenizer.save_pretrained("output/llame3-8b-instruct-unsloth-lora")
    # model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")  # Merge to 16bit
    # model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")   # Merge to 4bit
    # model.save_pretrained_merged("model", tokenizer, save_method="lora")          # Just LoRA adapters
    # model.save_pretrained_gguf("model", tokenizer)                                # Save to 8bit Q8_0
    # model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")     # Save to 16bit GGUF
    # model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")  # Save to q4_k_m GGUF

    # Free GPU memory so the next run starts from a clean slate
    del model
    del tokenizer
    torch.cuda.empty_cache()
    for _ in range(3):
        gc.collect()


def train_trans(dtype, max_seq_length, per_device_train_batch_size, gradient_accumulation_steps, rank,
                lora_alpha=16, lora_dropout=0, max_steps=50, save_steps=50, seed=42,
                warmup_steps=5, learning_rate=2e-4, logging_steps=5):
    """Fine-tune with the plain transformers + peft stack (baseline)."""
    print(f'dtype:{dtype}, max_seq_length:{max_seq_length}, per_device_train_batch_size:{per_device_train_batch_size}, gradient_accumulation_steps:{gradient_accumulation_steps}, rank:{rank}, lora_dropout:{lora_dropout}')

    model_path = 'pretrain_models/llama/llama3-8B-Instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='right', model_max_length=8192)
    tokenizer.add_special_tokens({"pad_token": '<|reserved_special_token_250|>'})
    tokenizer.pad_token = '<|reserved_special_token_250|>'

    # 4-bit NF4 quantization to mirror unsloth's load_in_4bit
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=dtype,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=dtype,
        quantization_config=quantization_config)
    model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
    model.enable_input_require_grads()

    config = LoraConfig(
        r=rank,
        lora_alpha=lora_alpha,
        target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        use_rslora=False)
    model = get_peft_model(model, peft_config=config)
    model.gradient_checkpointing_enable()

    EOS_TOKEN = tokenizer.eos_token

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)
        return {"text": texts}

    dataset = load_dataset("yahma/alpaca-cleaned", split="train")
    dataset = dataset.map(formatting_prompts_func, batched=True)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field='text',
        max_seq_length=max_seq_length,
        packing=False,
        args=TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            warmup_steps=warmup_steps,
            learning_rate=learning_rate,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
            logging_steps=logging_steps,
            optim='adamw_8bit',
            weight_decay=0.01,
            lr_scheduler_type='linear',
            seed=seed,
            output_dir='output/llame3-8b-instruct-unsloth',
            save_steps=save_steps,
            max_steps=max_steps))

    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
    print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    print(f"{start_gpu_memory} GB of memory reserved.")

    trainer_stats = trainer.train()

    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
    print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
    print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory % of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

    model.save_pretrained("output/llame3-8b-instruct-unsloth-lora")  # Local saving
    tokenizer.save_pretrained("output/llame3-8b-instruct-unsloth-lora")

    del model
    del tokenizer
    torch.cuda.empty_cache()
    for _ in range(3):
        gc.collect()


def infer():
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name='output/llame3-8b-instruct-unsloth-lora',
        max_seq_length=2048,
        dtype=torch.float16,
        load_in_4bit=True)
    # Enable unsloth's ~2x faster native inference mode
    FastLanguageModel.for_inference(model)

    inputs = tokenizer(
        [alpaca_prompt.format('Continue the fibonnaci sequence.', '1, 1, 2, 3, 5, 8', '')],
        return_tensors="pt").to('cuda')

    outputs = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
    print(tokenizer.batch_decode(outputs))

    # Same generation, but streamed token by token
    text_streamer = TextStreamer(tokenizer)
    outputs = model.generate(**inputs, max_new_tokens=1024, streamer=text_streamer)
    print(tokenizer.batch_decode(outputs))


if __name__ == '__main__':
    train_unsloth(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=8, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0.05)
    train_unsloth(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=16, gradient_accumulation_steps=4, rank=64, lora_dropout=0.05)

    train_trans(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=8, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=1024, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=1, gradient_accumulation_steps=16, rank=64, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0)
    train_trans(dtype=torch.bfloat16, max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4, rank=64, lora_dropout=0.05)
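To line the runs up against each other, I found it handy to condense the printed statistics into one row per configuration. A rough helper along these lines (my addition; it only relies on metrics that HuggingFace's Trainer reports, such as train_runtime and train_samples_per_second) could be dropped next to the two training functions:

def summarize_run(tag, trainer_stats, used_memory, used_memory_for_lora):
    # Condense the per-run prints above into a single comparison row.
    m = trainer_stats.metrics
    return (f"{tag} | runtime {m['train_runtime']:.1f}s"
            f" | {m.get('train_samples_per_second', float('nan')):.2f} samples/s"
            f" | peak reserved {used_memory} GB (training {used_memory_for_lora} GB)")

# Hypothetical usage inside train_unsloth / train_trans, after trainer.train():
# print(summarize_run("unsloth bs=1 accum=16 r=8", trainer_stats, used_memory, used_memory_for_lora))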

4. Experimental Results

4.1 P40

4.2 A40

4.3 A800

4.4 Conclusions

Comparing Unsloth-based training of Llama3-8B against training with the plain transformers framework, the conclusions are as follows:

(1) With Unsloth integrated, GPU memory usage is indeed lower and training is indeed faster, across every dimension tested.

(2) On the P40, increasing the batch size raises GPU memory usage but also lengthens training time, i.e., the P40's throughput degrades on larger batches; on the A40 and A800, increasing the batch size raises memory usage but shortens training time.

(3) At batch_size=1 the A800 trains less efficiently than the A40, but once batch_size rises to 16 the A800 is nearly twice as fast as the A40. The A800 is therefore better suited to large-batch workloads; for small batch sizes, there is no point bringing a sledgehammer to crack a nut.

5. Summary

One sentence says it all~

This post ran efficient fine-tuning experiments on Llama3 with the Unsloth framework, providing full comparison code and a comparative analysis of the results.

A follow-up comparison on Qwen1.5 is coming next; stay tuned~

6. References

1. unsloth: https://github.com/unslothai/unsloth

2. Qwen1.5 + Unsloth: "Support Qwen2" by yangjianxin1, Pull Request #428, unslothai/unsloth (GitHub)

