LoRA Fine-Tuning of Baichuan and Tsinghua's ChatGLM with LLM-Tuning

LLM-Tuning project source code:

GitHub - beyondguo/LLM-Tuning: Tuning LLMs with no tears💦, sharing LLM-tools with love❤️. https://github.com/beyondguo/LLM-Tuning

1. Environment setup

Training host configuration:

NVIDIA A100-PCIE-40GB
Python 3.8.10
CUDA: 11.2

Install the dependencies:

pip install xformers==0.0.20
pip install torch==2.0.1
pip install transformers==4.29.1
pip install datasets==2.12.0
pip install accelerate==0.19.0
pip install sentencepiece==0.1.99
pip install tensorboard==2.13.0
pip install peft==0.3.0

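2. Data preparation

Preparing the raw files

Instruction-tuning data generally has two parts, an input and an output, where the output is the answer you want the model to give. Organize the data in JSON format, one object per line; the input and output field names can be chosen freely.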

{"q": "请计算：39 * 0 = 什么？", "a": "这是简单的乘法运算，39乘以0得到的是0"}
{"q": "题目：51/186的答案是什么?", "a": "这是简单的除法运算，51除以186大概为0.274"}
{"q": "鹿妈妈买了24个苹果，她想平均分给她的3只小鹿吃，每只小鹿可以分到几个苹果？", "a": "鹿妈妈买了24个苹果，平均分给3只小鹿吃，那么每只小鹿可以分到的苹果数就是总苹果数除以小鹿的只数。\n24÷3=8\n每只小鹿可以分到8个苹果。所以，答案是每只小鹿可以分到8个苹果。"}
...

Once the data is organized, save it as a .json or .jsonl file and put it in the data/ folder of the project directory.
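As a minimal sketch of this step, the following Python writes question/answer records to a JSON Lines file and reads them back. The helper names `write_jsonl` and `read_jsonl` and the file name `demo_qa.jsonl` are hypothetical and chosen only for illustration; the `q`/`a` field names and the two records come from the example above. The demo writes into a temporary directory so that it does not touch a real data/ folder.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps Chinese text readable."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"q": "请计算：39 * 0 = 什么？", "a": "这是简单的乘法运算，39乘以0得到的是0"},
    {"q": "题目：51/186的答案是什么?", "a": "这是简单的除法运算，51除以186大概为0.274"},
]

# Round-trip the records through a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "data" / "demo_qa.jsonl"
    write_jsonl(records, demo_path)
    loaded = read_jsonl(demo_path)

print(len(loaded), "records round-tripped")
```

Writing one object per line keeps the file easy to append to and to inspect with ordinary text tools.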

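Tokenizing the dataset

To avoid re-tokenizing the dataset every time you train, tokenize it once up front and save the resulting features as a dataset that can be used for training directly.

For example:

The raw instruction-tuning file is simple_math_4op.json in the data/ folder
The input field is q and the output field is a
After tokenization, the result should be saved to a folder named simple_math_4op under data/tokenized_data/
The maximum text length is set to 2000

Then the following command (the contents of tokenize.sh) does the processing: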

CUDA_VISIBLE_DEVICES=0,1 python tokenize_dataset_rows.py \
    --model_checkpoint THUDM/chatglm-6b \
    --input_file simple_math_4op.json \
    --prompt_key q \
    --target_key a \
    --save_name simple_math_4op \
    --max_seq_length 2000 \
    --skip_overlength False

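When processing finishes, a folder named simple_math_4op is created under data/tokenized_data/. This is the data that the next step uses directly for training.

For a different LLM, change the model_checkpoint parameter in tokenize.sh accordingly.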
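The sketch below illustrates the kind of preprocessing this step performs: encode the prompt and the target, concatenate them, and either skip or truncate rows longer than the maximum length. It is not the code of tokenize_dataset_rows.py. The function `toy_encode` is a stand-in for a real tokenizer, `build_features` and the `prompt_len` field are hypothetical names, and the reading of `skip_overlength` (drop overlong rows when true, truncate them otherwise) is an assumption.

```python
def toy_encode(text):
    """Stand-in tokenizer (assumption): one token id per character."""
    return [ord(ch) for ch in text]


def build_features(rows, prompt_key, target_key, max_seq_length,
                   skip_overlength, eos_id=2):
    """Turn raw rows into token-id features that could be saved for training."""
    features = []
    for row in rows:
        prompt_ids = toy_encode(row[prompt_key])
        target_ids = toy_encode(row[target_key])
        input_ids = prompt_ids + target_ids + [eos_id]
        if len(input_ids) > max_seq_length:
            if skip_overlength:
                continue  # assumed behaviour: drop the overlong row
            input_ids = input_ids[:max_seq_length]  # otherwise truncate it
        features.append({
            "input_ids": input_ids,
            # Length of the prompt part, so a trainer could mask it out of the loss.
            "prompt_len": min(len(prompt_ids), len(input_ids)),
        })
    return features


rows = [
    {"q": "1+1=?", "a": "2"},          # short row
    {"q": "x" * 50, "a": "y" * 50},    # deliberately overlong row
]
kept = build_features(rows, "q", "a", max_seq_length=20, skip_overlength=True)
cut = build_features(rows, "q", "a", max_seq_length=20, skip_overlength=False)
print(len(kept), len(cut), len(cut[1]["input_ids"]))
```

Doing this once and saving the features means training runs start from ready-made token ids instead of repeating the encoding.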

3. Fine-tuning with LoRA

Once you have the tokenized dataset, you can run chatglm_lora_tuning.py directly to train the LoRA model.

Each LLM has its own Python script:

ChatGLM-6B: use chatglm_lora_tuning.py
ChatGLM2-6B: use chatglm2_lora_tuning.py
baichuan-7B: use baichuan_lora_tuning.py
baichuan2-7B: use baichuan2_lora_tuning.py
internlm-chat/base-7b: use intermlm_lora_tuning.py
chinese-llama2/alpaca2-7b: use chinese_llama2_alpaca2_lora_tuning.py

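The main configurable parameters are:

tokenized_dataset: the tokenized dataset, that is, the folder name under data/tokenized_data/
lora_rank: the rank of the LoRA matrices; 4 or 8 is recommended, and 8 if you have enough GPU memory
per_device_train_batch_size: the batch size on each GPU
gradient_accumulation_steps: gradient accumulation, which increases the effective batch size without increasing GPU memory usage
max_steps: the number of training steps
save_steps: how many steps between checkpoint saves
save_total_limit: how many checkpoints to keep
logging_steps: how many steps between training log lines (loss, lr, etc.)
output_dir: where the model files are saved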
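To give a feel for what lora_rank controls: LoRA learns the update to a d_out × d_in weight matrix as the product of two low-rank matrices, B (d_out × r) and A (r × d_in), so each adapted matrix adds r × (d_in + d_out) trainable parameters. The sketch below computes that count. The hidden size of 4096 and the count of 28 adapted matrices are illustrative assumptions, not values read from any of the models listed above.

```python
def lora_trainable_params(d_in, d_out, rank, num_matrices=1):
    """Parameters added by LoRA: A is (rank x d_in), B is (d_out x rank)."""
    return num_matrices * rank * (d_in + d_out)


hidden = 4096    # assumed hidden size, for illustration only
adapted = 28     # assumed number of adapted weight matrices, for illustration only

for rank in (4, 8):
    added = lora_trainable_params(hidden, hidden, rank, num_matrices=adapted)
    full = adapted * hidden * hidden
    print(f"rank={rank}: {added:,} trainable parameters "
          f"({added / full:.3%} of the adapted matrices)")
```

The count grows linearly with the rank, so rank 8 trains twice as many parameters as rank 4 and needs correspondingly more memory for gradients and optimizer state.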

For example, if the dataset is simple_math_4op and the weights should be saved to weights/simple_math_4op, run the following command (the contents of train.sh):

CUDA_VISIBLE_DEVICES=0,1 python chatglm_lora_tuning.py \
    --tokenized_dataset simple_math_4op \
    --lora_rank 8 \
    --per_device_train_batch_size 10 \
    --gradient_accumulation_steps 1 \
    --max_steps 100000 \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 1e-4 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir weights/simple_math_4op
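As a rough check of what these values imply, the sketch below computes the effective batch size and the checkpoint schedule. It assumes plain data parallelism across the two visible GPUs and that only the most recent checkpoints are retained; both are assumptions, and how the script actually places the model on the devices is not covered here. The function names are hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_devices


def checkpoint_steps(max_steps, save_steps, save_total_limit):
    """Steps at which a checkpoint is written, and those still on disk at the end."""
    written = list(range(save_steps, max_steps + 1, save_steps))
    return written, written[-save_total_limit:]


batch = effective_batch_size(per_device_batch=10, grad_accum_steps=1, num_devices=2)
written, kept = checkpoint_steps(max_steps=100000, save_steps=200, save_total_limit=2)
print(f"effective batch size: {batch}")
print(f"checkpoints written: {len(written)}, kept at the end: {kept}")
```

Raising gradient_accumulation_steps multiplies the effective batch size in the same way as adding devices, but without holding more samples in GPU memory at once.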

After training, the LoRA weights can be found in output_dir, mainly as two files: adapter_model.bin and adapter_config.json.
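Before moving the weights elsewhere, it can be worth checking that the directory really contains both files. The sketch below does this with the standard library only. `check_adapter_dir` is a hypothetical helper name, and the `r` key it prints is assumed to be how the LoRA rank is stored in adapter_config.json, so it is read defensively with `.get`.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("adapter_model.bin", "adapter_config.json")


def check_adapter_dir(output_dir):
    """Return (missing_files, config) for a LoRA output directory."""
    output_dir = Path(output_dir)
    missing = [name for name in REQUIRED_FILES
               if not (output_dir / name).is_file()]
    config = {}
    config_path = output_dir / "adapter_config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    return missing, config


missing, config = check_adapter_dir("weights/simple_math_4op")
if missing:
    print("missing files:", ", ".join(missing))
else:
    # "r" is assumed to hold the LoRA rank; .get avoids a KeyError if it does not.
    print("adapter looks complete, rank:", config.get("r"))
```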

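Viewing TensorBoard:

In output_dir, find the runs folder and copy the path of the sub-folder with the latest date; call it your_log_path
Run tensorboard --logdir your_log_path, which starts TensorBoard at http://localhost:6006/
If TensorBoard is running on a server, you also need to forward the port to your local machine.
To set up the forwarding manually, add -L 6006:127.0.0.1:6006 to your ssh login command, so that port 6006 on your local machine reaches port 6006 on the server.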
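Picking the latest run folder by hand can be automated. The sketch below returns the most recently modified sub-folder of runs/ and prints the matching tensorboard command. `latest_run_dir` is a hypothetical helper, the folder names in the demo are made up, and using the modification time as a proxy for "latest date" is an assumption.

```python
import os
import tempfile
from pathlib import Path


def latest_run_dir(output_dir):
    """Return the most recently modified sub-folder of <output_dir>/runs, or None."""
    runs = Path(output_dir) / "runs"
    if not runs.is_dir():
        return None
    candidates = [p for p in runs.iterdir() if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Demo on a temporary directory with two made-up run folders.
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in (("run_old", 1_000_000_000), ("run_new", 1_700_000_000)):
        run = Path(tmp) / "runs" / name
        run.mkdir(parents=True)
        os.utime(run, (mtime, mtime))  # set an explicit modification time
    latest = latest_run_dir(tmp)
    latest_name = latest.name
    print(f"tensorboard --logdir {latest}")
```

In real use you would call `latest_run_dir("weights/simple_math_4op")` and pass the result to tensorboard.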

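4. Loading the LoRA onto a local LLM and running inference

Pack up the output_dir from above and take it with you. Assuming the folder is weights/simple_math_4op and contains (at least) the two files adapter_model.bin and adapter_config.json, load it and run inference as follows: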

from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.generation.utils import GenerationConfig
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the original LLM
model_path = "baichuan-inc/Baichuan2-7B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).half().to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_path)

# Prompt: "What is visual generative art?"
messages = [{"role": "user", "content": "什么是视觉生成艺术"}]

# Answer from the original LLM, before the LoRA is attached
output = model.chat(tokenizer, messages)
print(output)

# Attach your LoRA tool to the original LLM and ask again
model = PeftModel.from_pretrained(model, "weights/simple_math_4op").half().to(device)
output = model.chat(tokenizer, messages)
print(output)

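In theory, you can load several LoRA models by running model = PeftModel.from_pretrained(model, "weights/simple_math_4op").half() multiple times, and so combine the abilities of different tools. In practice, because assigning a separate weight to each set of LoRA weights is not yet supported, the results are often poor: one LoRA may override another, or previously learned abilities may be forgotten.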