Deploying DeepSeek-R1 Across Multiple Ascend 910B Machines: A Hands-On Guide

Source: https://www.cnblogs.com/menkeyi/p/18739247 (2025/2/26 18:09:24)

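Environment Preparation

Hardware Configuration

  • Node information

    • node01: 10.197.61.1, 8 NPUs
    • node02: 10.197.61.2, 8 NPUs
    • node03: 10.197.61.3, 8 NPUs
    • node04: 10.197.61.4, 8 NPUs
  • Memory: about 773 GB of RAM per machine

Operating System

The operating system can be openEuler or Ubuntu; the required packages are largely the same on both.

  • CULinux Enterprise Edition release 3.0 (Telegram)

Deployment Steps

1. Configure the Base Environment

Run the following commands on every node so that all nodes share the same system environment. The commands use dnf, so they apply to RPM-based systems such as openEuler: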

dnf update -y
dnf install -y bzip2 xz tar p7zip rsync gcc gcc-c++ make kernel-devel elfutils-libelf-devel net-tools docker-ce docker-ce-cli containerd.io

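2. Install the Ascend 910B Driver

Download the latest driver and firmware packages, which are provided by Huawei. Then enter the driver directory and run the installer:

cd Ascend-hdk-910b-npu_24.1.0_linux-aarch64/
./install.sh query
./Ascend-hdk-910b-npu-firmware_7.5.0.3.220.run --full

After the installation completes, reboot the machine: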

reboot

3. Create the Data Directory

Create the data directory on every node:

mkdir -p /data

4. Configure the NPU Network

Configure the network settings of each NPU on every node, and make sure no NPU IP address is used twice across the cluster. The commands below use node01 as the example:

hccn_tool -i 0 -ip -s address 10.197.90.10 netmask 255.255.255.0
hccn_tool -i 0 -gateway -s gateway 10.197.90.254
hccn_tool -i 0 -tls -s enable 0
hccn_tool -i 1 -ip -s address 10.197.90.11 netmask 255.255.255.0
hccn_tool -i 1 -gateway -s gateway 10.197.90.254
hccn_tool -i 1 -tls -s enable 0
# Repeat the commands above for the remaining NPUs (2 through 7)
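Typing 24 commands per node by hand is error-prone, so the repetition can be generated instead. The following is a minimal sketch in Python. It assumes the addressing scheme implied by rank_table.json in this guide (last octet = node number × 10 + device ID); the function name hccn_commands and its parameters are illustrative and not part of any Huawei tool.

```python
def hccn_commands(node_index, npus_per_node=8,
                  subnet="10.197.90", netmask="255.255.255.0",
                  gateway="10.197.90.254"):
    """Return the hccn_tool commands for every NPU on one node.

    node_index is 1-based (node01 -> 1). The last octet is ASSUMED to be
    node_index * 10 + device_id, matching the device_ip values that
    rank_table.json uses in this guide.
    """
    commands = []
    for device_id in range(npus_per_node):
        ip = f"{subnet}.{node_index * 10 + device_id}"
        commands.append(
            f"hccn_tool -i {device_id} -ip -s address {ip} netmask {netmask}"
        )
        commands.append(
            f"hccn_tool -i {device_id} -gateway -s gateway {gateway}"
        )
        commands.append(f"hccn_tool -i {device_id} -tls -s enable 0")
    return commands
```

Calling print("\n".join(hccn_commands(2))) prints the commands for node02. Review the output before pasting it into a root shell on the host.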

5. Pull and Start the Docker Image

Pull the MindIE Docker image that is used to serve DeepSeek-R1:

docker pull swr.cn-southwest-2.myhuaweicloud.com/ei-mindie/mindie:2.0.T3-800I-A2-py311-openeuler24.03-lts

Start the container:

docker run -itd --privileged --name=deepseek-r1 --net=host \
  --shm-size 500g \
  --device=/dev/davinci0 \
  --device=/dev/davinci1 \
  --device=/dev/davinci2 \
  --device=/dev/davinci3 \
  --device=/dev/davinci4 \
  --device=/dev/davinci5 \
  --device=/dev/davinci6 \
  --device=/dev/davinci7 \
  --device=/dev/davinci_manager \
  --device=/dev/hisi_hdc \
  --device /dev/devmm_svm \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/bin/hccn_tool:/usr/bin/hccn_tool \
  -v /usr/local/sbin:/usr/local/sbin \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /var/log/npu/:/usr/slog \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  -v /data:/workspace \
  swr.cn-southwest-2.myhuaweicloud.com/ei-mindie/mindie:2.0.T3-800I-A2-py311-openeuler24.03-lts \
  bash

Enter the container (it was named deepseek-r1 in the previous command):

docker exec -it deepseek-r1 bash
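The docker run command is long mostly because of the repeated --device and -v flags. As a sketch, the same argument list can be assembled in Python, which makes the NPU count and the host data directory easy to vary. The helper name docker_run_args and its parameters are illustrative; the devices, mounts, and image tag are the ones used in this guide.

```python
import shlex

IMAGE = ("swr.cn-southwest-2.myhuaweicloud.com/ei-mindie/mindie:"
         "2.0.T3-800I-A2-py311-openeuler24.03-lts")


def docker_run_args(name="deepseek-r1", npu_count=8, shm_size="500g",
                    host_data_dir="/data", image=IMAGE):
    """Build the docker run argument list used in this guide."""
    # One /dev/davinciN per NPU, plus the three management devices.
    devices = [f"/dev/davinci{i}" for i in range(npu_count)]
    devices += ["/dev/davinci_manager", "/dev/hisi_hdc", "/dev/devmm_svm"]

    volumes = [
        "/usr/local/dcmi:/usr/local/dcmi",
        "/usr/bin/hccn_tool:/usr/bin/hccn_tool",
        "/usr/local/sbin:/usr/local/sbin",
        "/usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi",
        "/usr/local/Ascend/driver:/usr/local/Ascend/driver",
        "/usr/local/Ascend/firmware:/usr/local/Ascend/firmware",
        "/etc/hccn.conf:/etc/hccn.conf",
        "/var/log/npu/:/usr/slog",
        "/sys/fs/cgroup:/sys/fs/cgroup:ro",
        f"{host_data_dir}:/workspace",  # host data dir appears as /workspace
    ]

    args = ["docker", "run", "-itd", "--privileged", f"--name={name}",
            "--net=host", "--shm-size", shm_size]
    args += [f"--device={device}" for device in devices]
    for volume in volumes:
        args += ["-v", volume]
    args += [image, "bash"]
    return args
```

shlex.join(docker_run_args()) renders the list as a single shell command for inspection, and the list itself can be passed to subprocess.run on the host.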

6. Configure Environment Variables

Set the following environment variables inside the container. On each node, MIES_CONTAINER_IP must be set to that node's own IP address:

export ATB_LLM_HCCL_ENABLE=1
export ATB_LLM_COMM_BACKEND="hccl"
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MIES_CONTAINER_IP=10.197.61.1  # change this on each node
export RANKTABLEFILE=/workspace/rank_table.json
export OMP_NUM_THREADS=1
export NPU_MEMORY_FRACTION=0.95
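Only MIES_CONTAINER_IP differs between nodes, which makes it easy to forget when copying the block from one machine to the next. A small sketch that renders the export lines per node is shown below; the node-name-to-IP mapping comes from the hardware list in this guide, while NODE_IPS, COMMON_ENV, and export_lines are illustrative names.

```python
import shlex

# Node IPs from the hardware configuration section of this guide.
NODE_IPS = {
    "node01": "10.197.61.1",
    "node02": "10.197.61.2",
    "node03": "10.197.61.3",
    "node04": "10.197.61.4",
}

# Variables that are identical on every node.
COMMON_ENV = {
    "ATB_LLM_HCCL_ENABLE": "1",
    "ATB_LLM_COMM_BACKEND": "hccl",
    "HCCL_CONNECT_TIMEOUT": "7200",
    "HCCL_EXEC_TIMEOUT": "0",
    "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:True",
    "RANKTABLEFILE": "/workspace/rank_table.json",
    "OMP_NUM_THREADS": "1",
    "NPU_MEMORY_FRACTION": "0.95",
}


def export_lines(node_name):
    """Return shell export lines for one node, with its own container IP."""
    env = dict(COMMON_ENV, MIES_CONTAINER_IP=NODE_IPS[node_name])
    return [f"export {key}={shlex.quote(value)}" for key, value in env.items()]
```

Writing "\n".join(export_lines("node03")) to a file and sourcing it inside the container on node03 gives the same result as typing the exports by hand.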

7. Create the Cluster Configuration File

Create rank_table.json in the /data directory (mounted as /workspace inside the container) to describe the multi-node, multi-NPU cluster:

{
  "version": "1.0",
  "server_count": "4",
  "server_list": [
    {
      "server_id": "10.197.61.1",
      "container_ip": "10.197.61.1",
      "device": [
        {"device_id": "0", "device_ip": "10.197.90.10", "rank_id": "0"},
        {"device_id": "1", "device_ip": "10.197.90.11", "rank_id": "1"},
        {"device_id": "2", "device_ip": "10.197.90.12", "rank_id": "2"},
        {"device_id": "3", "device_ip": "10.197.90.13", "rank_id": "3"},
        {"device_id": "4", "device_ip": "10.197.90.14", "rank_id": "4"},
        {"device_id": "5", "device_ip": "10.197.90.15", "rank_id": "5"},
        {"device_id": "6", "device_ip": "10.197.90.16", "rank_id": "6"},
        {"device_id": "7", "device_ip": "10.197.90.17", "rank_id": "7"}
      ]
    },
    {
      "server_id": "10.197.61.2",
      "container_ip": "10.197.61.2",
      "device": [
        {"device_id": "0", "device_ip": "10.197.90.20", "rank_id": "8"},
        {"device_id": "1", "device_ip": "10.197.90.21", "rank_id": "9"},
        {"device_id": "2", "device_ip": "10.197.90.22", "rank_id": "10"},
        {"device_id": "3", "device_ip": "10.197.90.23", "rank_id": "11"},
        {"device_id": "4", "device_ip": "10.197.90.24", "rank_id": "12"},
        {"device_id": "5", "device_ip": "10.197.90.25", "rank_id": "13"},
        {"device_id": "6", "device_ip": "10.197.90.26", "rank_id": "14"},
        {"device_id": "7", "device_ip": "10.197.90.27", "rank_id": "15"}
      ]
    },
    {
      "server_id": "10.197.61.3",
      "container_ip": "10.197.61.3",
      "device": [
        {"device_id": "0", "device_ip": "10.197.90.30", "rank_id": "16"},
        {"device_id": "1", "device_ip": "10.197.90.31", "rank_id": "17"},
        {"device_id": "2", "device_ip": "10.197.90.32", "rank_id": "18"},
        {"device_id": "3", "device_ip": "10.197.90.33", "rank_id": "19"},
        {"device_id": "4", "device_ip": "10.197.90.34", "rank_id": "20"},
        {"device_id": "5", "device_ip": "10.197.90.35", "rank_id": "21"},
        {"device_id": "6", "device_ip": "10.197.90.36", "rank_id": "22"},
        {"device_id": "7", "device_ip": "10.197.90.37", "rank_id": "23"}
      ]
    },
    {
      "server_id": "10.197.61.4",
      "container_ip": "10.197.61.4",
      "device": [
        {"device_id": "0", "device_ip": "10.197.90.40", "rank_id": "24"},
        {"device_id": "1", "device_ip": "10.197.90.41", "rank_id": "25"},
        {"device_id": "2", "device_ip": "10.197.90.42", "rank_id": "26"},
        {"device_id": "3", "device_ip": "10.197.90.43", "rank_id": "27"},
        {"device_id": "4", "device_ip": "10.197.90.44", "rank_id": "28"},
        {"device_id": "5", "device_ip": "10.197.90.45", "rank_id": "29"},
        {"device_id": "6", "device_ip": "10.197.90.46", "rank_id": "30"},
        {"device_id": "7", "device_ip": "10.197.90.47", "rank_id": "31"}
      ]
    }
  ],
  "status": "completed"
}
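The rank table is fully regular: rank IDs run from 0 to 31 across the four nodes, and each device IP follows from the node number and device ID. A minimal Python sketch that regenerates the same structure is shown below. The function name build_rank_table is illustrative, and the device IP scheme (last octet = node number × 10 + device ID) is inferred from the values above rather than required by HCCL.

```python
import json


def build_rank_table(node_ips, npus_per_node=8, device_subnet="10.197.90"):
    """Build a rank table dict with the same layout as the file above."""
    server_list = []
    rank = 0  # global rank, increases across nodes
    for node_index, node_ip in enumerate(node_ips, start=1):
        devices = []
        for device_id in range(npus_per_node):
            devices.append({
                "device_id": str(device_id),
                # ASSUMED scheme: last octet = node_index * 10 + device_id
                "device_ip": f"{device_subnet}.{node_index * 10 + device_id}",
                "rank_id": str(rank),
            })
            rank += 1
        server_list.append({
            "server_id": node_ip,
            "container_ip": node_ip,
            "device": devices,
        })
    return {
        "version": "1.0",
        "server_count": str(len(node_ips)),
        "server_list": server_list,
        "status": "completed",
    }


NODES = ["10.197.61.1", "10.197.61.2", "10.197.61.3", "10.197.61.4"]
RANK_TABLE_JSON = json.dumps(build_rank_table(NODES), indent=2)
```

Writing RANK_TABLE_JSON to /data/rank_table.json on each node reproduces the file shown in this step. Every numeric field is emitted as a string, as in the original.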

8. Configure the Inference Framework

Create config.json in the /data directory with the parameters for the inference framework:

{
  "Version": "1.0.0",
  "LogConfig": {
    "logLevel": "Info",
    "logFileSize": 20,
    "logFileNum": 20,
    "logPath": "logs/mindie-server.log"
  },
  "ServerConfig": {
    "ipAddress": "10.197.61.1",
    "managementIpAddress": "10.197.61.1",
    "port": 1025,
    "managementPort": 1026,
    "metricsPort": 1027,
    "allowAllZeroIpListening": false,
    "maxLinkNum": 500,
    "httpsEnabled": false,
    "fullTextEnabled": false,
    "tlsCaPath": "security/ca/",
    "tlsCaFile": ["ca.pem"],
    "tlsCert": "security/certs/server.pem",
    "tlsPk": "security/keys/server.key.pem",
    "tlsPkPwd": "security/pass/key_pwd.txt",
    "tlsCrlPath": "security/certs/",
    "tlsCrlFiles": ["server_crl.pem"],
    "managementTlsCaFile": ["management_ca.pem"],
    "managementTlsCert": "security/certs/management/server.pem",
    "managementTlsPk": "security/keys/management/server.key.pem",
    "managementTlsPkPwd": "security/pass/management/key_pwd.txt",
    "managementTlsCrlPath": "security/management/certs/",
    "managementTlsCrlFiles": ["server_crl.pem"],
    "kmcKsfMaster": "tools/pmt/master/ksfa",
    "kmcKsfStandby": "tools/pmt/standby/ksfb",
    "inferMode": "standard",
    "interCommTLSEnabled": false,
    "interCommPort": 1121,
    "interCommTlsCaPath": "security/grpc/ca/",
    "interCommTlsCaFiles": ["ca.pem"],
    "interCommTlsCert": "security/grpc/certs/server.pem",
    "interCommPk": "security/grpc/keys/server.key.pem",
    "interCommPkPwd": "security/grpc/pass/key_pwd.txt",
    "interCommTlsCrlPath": "security/grpc/certs/",
    "interCommTlsCrlFiles": ["server_crl.pem"],
    "openAiSupport": "vllm"
  },
  "BackendConfig": {
    "backendName": "mindieservice_llm_engine",
    "modelInstanceNumber": 1,
    "npuDeviceIds": [[0,1,2,3,4,5,6,7]],
    "tokenizerProcessNumber": 8,
    "multiNodesInferEnabled": true,
    "multiNodesInferPort": 1120,
    "interNodeTLSEnabled": false,
    "interNodeTlsCaPath": "security/grpc/ca/",
    "interNodeTlsCaFiles": ["ca.pem"],
    "interNodeTlsCert": "security/grpc/certs/server.pem",
    "interNodeTlsPk": "security/grpc/keys/server.key.pem",
    "interNodeTlsPkPwd": "security/grpc/pass/mindie_server_key_pwd.txt",
    "interNodeTlsCrlPath": "security/grpc/certs/",
    "interNodeTlsCrlFiles": ["server_crl.pem"],
    "interNodeKmcKsfMaster": "tools/pmt/master/ksfa",
    "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb",
    "ModelDeployConfig": {
      "maxSeqLen": 20000,
      "maxInputTokenLen": 20000,
      "truncation": true,
      "ModelConfig": [
        {
          "modelInstanceType": "Standard",
          "modelName": "deepseekr1",
          "modelWeightPath": "/workspace/DeepSeek-R1-bf16-mind/",
          "worldSize": 8,
          "cpuMemSize": 5,
          "npuMemSize": -1,
          "backendType": "atb",
          "trustRemoteCode": false
        }
      ]
    },
    "ScheduleConfig": {
      "templateType": "Standard",
      "templateName": "Standard_LLM",
      "cacheBlockSize": 128,
      "maxPrefillBatchSize": 8,
      "maxPrefillTokens": 2024,
      "prefillTimeMsPerReq": 150,
      "prefillPolicyType": 0,
      "decodeTimeMsPerReq": 50,
      "decodePolicyType": 0,
      "maxBatchSize": 8,
      "maxIterTimes": 2024,
      "maxPreemptCount": 0,
      "supportSelectBatch": false,
      "maxQueueDelayMicroseconds": 5000
    }
  }
}

Inside the container, copy config.json over the one in the MindIE service configuration directory:

cp /workspace/config.json /usr/local/Ascend/mindie/latest/mindie-service/conf/
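A config file this large is easy to get subtly wrong, so a quick sanity check before starting the service can save a restart cycle. The sketch below checks a few relationships that are visible in the configuration above: the listening ports differ, npuDeviceIds has one device list per model instance, each list is as long as worldSize, and the weight path lives under /workspace (where /data is mounted). These are illustrative checks chosen for this guide, not MindIE's own validation rules, and check_config is a hypothetical helper.

```python
def check_config(config):
    """Return a list of problems found in a parsed config.json dict."""
    problems = []
    server = config["ServerConfig"]
    backend = config["BackendConfig"]

    # All listening ports should be different from one another.
    ports = {
        "port": server["port"],
        "managementPort": server["managementPort"],
        "metricsPort": server["metricsPort"],
        "interCommPort": server["interCommPort"],
        "multiNodesInferPort": backend["multiNodesInferPort"],
    }
    if len(set(ports.values())) != len(ports):
        problems.append(f"ports are not distinct: {ports}")

    # One list of device IDs per model instance.
    device_groups = backend["npuDeviceIds"]
    if len(device_groups) != backend["modelInstanceNumber"]:
        problems.append("npuDeviceIds needs one list per model instance")

    # ASSUMPTION: the first ModelConfig entry describes every instance.
    model = backend["ModelDeployConfig"]["ModelConfig"][0]
    for device_ids in device_groups:
        if len(device_ids) != model["worldSize"]:
            problems.append(
                f"worldSize {model['worldSize']} does not match "
                f"{len(device_ids)} device IDs"
            )

    # /data on the host is mounted at /workspace in the container.
    if not model["modelWeightPath"].startswith("/workspace/"):
        problems.append("modelWeightPath is outside /workspace")
    return problems
```

Loading the file with json.load and printing check_config(config) should produce an empty list for the configuration in this step.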

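9. Copy the Model Files

Copy the DeepSeek-R1 model files into /data and make sure the directory is named DeepSeek-R1-bf16-mind, so that it matches modelWeightPath in config.json (/workspace/DeepSeek-R1-bf16-mind/ inside the container).

10. Start the Service

Start the service on every node, one after another: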

nohup /usr/local/Ascend/mindie/latest/mindie-service/bin/mindieservice_daemon > /workspace/output.log 2>&1 &
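Because the daemon runs in the background, the command returns immediately and says nothing about whether the service is ready. One simple way to find out is to poll the service port from config.json (1025) until it accepts TCP connections. The sketch below does this with the standard library only; wait_for_port and its default timeout values are illustrative, and an open port only shows that the server is listening, not that inference works.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=1800.0, interval_s=5.0):
    """Poll host:port until a TCP connection succeeds or the timeout passes.

    Returns True once the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: the service is not listening yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, wait_for_port("10.197.61.1", 1025) blocks until node01 is listening or the timeout passes. If it returns False, /workspace/output.log is the first place to look.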

11. Test the API

Run the following command to check that the API responds correctly:

curl -X POST http://10.197.61.1:1025/v1/chat/completions \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseekr1", "messages": [{"role": "user", "content": "你好"}], "max_tokens": 20, "top_p": 0.95}'
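The same request can be sent from Python, which is convenient when the check has to be scripted. The sketch below builds the request with urllib from the standard library, using the URL, headers, and body of the curl command. The helper names are illustrative, and first_reply assumes the response follows the OpenAI-style chat completion layout (choices[0].message.content), which the /v1/chat/completions path suggests but this guide does not show.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, model="deepseekr1",
                       max_tokens=20, top_p=0.95):
    """Build the POST request sent by the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text, ASSUMING an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send it, pass build_chat_request("http://10.197.61.1:1025", "你好") to urllib.request.urlopen and hand the bytes from response.read() to first_reply.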

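Summary

With the steps above, DeepSeek-R1 was deployed on four Ascend 910B servers in a multi-node, multi-NPU configuration. We hope this guide is a useful reference for developers with similar needs.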
