Multi-GPU LLM Inference with the Accelerate Library

Large language models (LLMs) have revolutionized natural language processing. As these models grow in size and complexity, the computational demands of inference rise sharply, and harnessing multiple GPUs becomes essential to keep up.

This article walks through running inference in parallel across multiple GPUs. It covers an introduction to the Accelerate library, a simple approach with working code examples, and performance benchmarks across different GPU counts.

We will use several RTX 3090s to scale llama2-7b inference across multiple GPUs.

A Basic Example

We start with a minimal example that demonstrates multi-GPU "message passing" with Accelerate.

    from accelerate import Accelerator
    from accelerate.utils import gather_object

    accelerator = Accelerator()

    # each GPU creates a string
    message=[ f"Hello this is GPU {accelerator.process_index}" ]

    # collect the messages from all GPUs
    messages=gather_object(message)

    # output the messages only on the main process with accelerator.print()
    accelerator.print(messages)

The output looks like this:

    ['Hello this is GPU 0', 'Hello this is GPU 1', 'Hello this is GPU 2', 'Hello this is GPU 3', 'Hello this is GPU 4']
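
Note that an Accelerate script is started once per GPU by the launcher rather than with plain python. Assuming the snippet above is saved as hello_accelerate.py (a filename chosen here purely for illustration), a five-GPU run would look something like this:

    accelerate launch --num_processes 5 hello_accelerate.py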

Multi-GPU Inference

Below is a simple, non-batched inference approach. The code stays short because the Accelerate library already does most of the heavy lifting for us:

    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from statistics import mean
    import torch, time, json

    accelerator = Accelerator()

    # 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all=[
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 10

    # load a base model and tokenizer
    model_path="models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # sync GPUs and start the timer
    accelerator.wait_for_everyone()
    start=time.time()

    # divide the prompt list onto the available GPUs
    with accelerator.split_between_processes(prompts_all) as prompts:
        # store output of generations in dict
        results=dict(outputs=[], num_tokens=0)

        # have each GPU do inference, prompt by prompt
        for prompt in prompts:
            prompt_tokenized=tokenizer(prompt, return_tensors="pt").to("cuda")
            output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=100)[0]

            # remove prompt from output
            output_tokenized=output_tokenized[len(prompt_tokenized["input_ids"][0]):]

            # store outputs and number of tokens in results{}
            results["outputs"].append( tokenizer.decode(output_tokenized) )
            results["num_tokens"] += len(output_tokenized)

        results=[ results ] # transform to list, otherwise gather_object() will not collect correctly

    # collect results from all the GPUs
    results_gathered=gather_object(results)

    if accelerator.is_main_process:
        timediff=time.time()-start
        num_tokens=sum([r["num_tokens"] for r in results_gathered ])

        print(f"tokens/sec: {num_tokens//timediff}, time {timediff}, total tokens {num_tokens}, total prompts {len(prompts_all)}")

Using multiple GPUs introduces some communication overhead: performance scales roughly linearly up to 4 GPUs and then plateaus in this particular setup. Of course, the numbers depend on many parameters, such as model size and quantization, prompt length, number of generated tokens, and sampling strategy, so we only discuss the general trend here (see the quick calculation after the numbers below).

1 GPU: 44 tokens/sec, time: 225.5s

2 GPUs: 88 tokens/sec, time: 112.9s

3 GPUs: 128 tokens/sec, time: 77.6s

4 GPUs: 137 tokens/sec, time: 72.7s

5 GPUs: 119 tokens/sec, time: 83.8s
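
To make the scaling behavior concrete, here is a small, purely illustrative script that recomputes speedup and parallel efficiency from the throughput numbers above (the values are copied from the benchmark; nothing else is assumed):

    # throughput in tokens/sec measured above, keyed by GPU count
    tps = {1: 44, 2: 88, 3: 128, 4: 137, 5: 119}

    base = tps[1]
    for n, t in tps.items():
        # speedup relative to a single GPU; parallel efficiency = speedup / n
        print(f"{n} GPU(s): speedup {t / base:.2f}x, efficiency {t / base / n:.0%}")

With 4 GPUs the speedup is about 3.1x (roughly 78% efficiency), and adding a fifth GPU actually lowers throughput in this setup.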

Batched Inference on Multiple GPUs

In the real world, we can use batched inference to speed things up. Batching reduces communication between the GPUs and accelerates inference. We only need to add a prepare_prompts function that feeds the model batches of prompts instead of one prompt at a time:

    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from statistics import mean
    import torch, time, json

    accelerator = Accelerator()

    def write_pretty_json(file_path, data):
        import json
        with open(file_path, "w") as write_file:
            json.dump(data, write_file, indent=4)

    # 10*10 Prompts. Source: https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all=[
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
        "The story so far: in the beginning, the universe was created.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "I write this sitting in the kitchen sink.",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
    ] * 10

    # load a base model and tokenizer
    model_path="models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokenizer.pad_token = tokenizer.eos_token

    # batch, left pad (for inference), and tokenize
    def prepare_prompts(prompts, tokenizer, batch_size=16):
        batches=[prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
        batches_tok=[]
        tokenizer.padding_side="left"
        for prompt_batch in batches:
            batches_tok.append(
                tokenizer(
                    prompt_batch,
                    return_tensors="pt",
                    padding='longest',
                    truncation=False,
                    pad_to_multiple_of=8,
                    add_special_tokens=False).to("cuda")
                )
        tokenizer.padding_side="right"
        return batches_tok

    # sync GPUs and start the timer
    accelerator.wait_for_everyone()
    start=time.time()

    # divide the prompt list onto the available GPUs
    with accelerator.split_between_processes(prompts_all) as prompts:
        results=dict(outputs=[], num_tokens=0)

        # have each GPU do inference in batches
        prompt_batches=prepare_prompts(prompts, tokenizer, batch_size=16)

        for prompts_tokenized in prompt_batches:
            outputs_tokenized=model.generate(**prompts_tokenized, max_new_tokens=100)

            # remove prompt from gen. tokens
            outputs_tokenized=[ tok_out[len(tok_in):]
                for tok_in, tok_out in zip(prompts_tokenized["input_ids"], outputs_tokenized) ]

            # count and decode gen. tokens
            num_tokens=sum([ len(t) for t in outputs_tokenized ])
            outputs=tokenizer.batch_decode(outputs_tokenized)

            # store in results{} to be gathered by accelerate
            results["outputs"].extend(outputs)
            results["num_tokens"] += num_tokens

        results=[ results ] # transform to list, otherwise gather_object() will not collect correctly

    # collect results from all the GPUs
    results_gathered=gather_object(results)

    if accelerator.is_main_process:
        timediff=time.time()-start
        num_tokens=sum([r["num_tokens"] for r in results_gathered ])

        print(f"tokens/sec: {num_tokens//timediff}, time elapsed: {timediff}, num_tokens {num_tokens}")
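
One caveat worth noting: because pad_token is set to eos_token here, shorter sequences in a batch are padded with EOS tokens during generation, so the decoded strings can contain those markers. If that matters for your use case, transformers' batch_decode accepts a skip_special_tokens flag; a drop-in replacement for the decode line above would be:

    # optional: strip padding/EOS special tokens from the decoded generations
    outputs=tokenizer.batch_decode(outputs_tokenized, skip_special_tokens=True)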

As the numbers below show, batching speeds things up considerably: on a single GPU alone, throughput jumps from 44 to 520 tokens/sec, roughly a 12x improvement.

1 GPU: 520 tokens/sec, time: 19.2s

2 GPUs: 900 tokens/sec, time: 11.1s

3 GPUs: 1205 tokens/sec, time: 8.2s

4 GPUs: 1655 tokens/sec, time: 6.0s

5 GPUs: 1658 tokens/sec, time: 6.0s

Summary

As of this writing, llama.cpp and ctransformers do not support multi-GPU inference. llama.cpp appears to have merged multi-GPU work in June, but I have not seen an official release confirming it, so for now I am treating multi-GPU as unsupported there. If anyone can confirm that it works, please leave a comment.

Hugging Face's Accelerate package gives us a very convenient way to use multiple GPUs. Inference across multiple GPUs can significantly improve throughput, but the communication overhead between GPUs grows noticeably as more GPUs are added.

https://avoid.overfit.cn/post/8210f640cae0404a88fd1c9028c6aabb

Author: Geronimo

