背景
网上教程老旧,不适用。
详细步骤
1、安装slurm
sudo apt install slurm-wlm slurm-wlm-doc -y
检查是否安装成功:
slurmd --version
如果得到slurm-wlm 23.11.4
,表明安装成功。
2、配置slurm。
使用命令:
sudo vi /etc/slurm/slurm.conf
在其中输入以下内容:
ClusterName=cool [自定义集群名称]
ControlMachine=master
#ControlAddr=
#BackupController=
#BackupAddr=
#
MailProg=/usr/bin/s-nail
SlurmUser=slurm
#SlurmdUser=slurm
SlurmctldPort=6817SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SelectType=select/linear
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGINGSlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODESPartitionName=CPU Nodes=master Default=NO MaxTime=INFINITE State=UP
#NodeName=master State=UNKNOWN
NodeName=master Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
其中,要修改以下参数,请勿和上述配置完全一样;
ControlMachine=你的主机名,查看方法hostname
PartitionName=队列名称,可以自己起,比如改为CPU
Nodes=你的主机名,查看方法hostname
NodeName=你的主机名,查看方法hostname
Sockets=你服务器cpu的个数,查看方法cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
CoresPerSocket=每个cpu的核数,查看方法cat /proc/cpuinfo| grep "cpu cores"| uniq
ThreadsPerCore填写方法:
运行下面的脚本;
#!/bin/bash
cpunum=`cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l`
echo "CPU 个数: $cpunum";
cpuhx=`cat /proc/cpuinfo | grep "cores" | uniq | awk -F":" '{print $2}'`
echo "CPU 核心数:$cpuhx" ;
cpuxc=`cat /proc/cpuinfo | grep "processor" | wc -l`
echo "CPU 线程数:$cpuxc" ;if [[ `expr $cpunum\*$[cpuhx*2] ` -eq $cpuxc ]];
thenecho "开启了超线程"
elseecho "未开启超线程"
fi
如果开启了超线程填2,否则填1.
3、创建文件夹。使用以下命令,创建所需的文件夹:
sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/spool/slurmctld
sudo chown -R slurm:slurm /var/spool/slurmd
sudo chown -R slurm:slurm /var/spool/slurmctld
sudo chmod -R 755 /var/spool/slurmd
sudo chmod -R 755 /var/spool/slurmctld
4、启动slurm
sudo systemctl enable slurmctld --now
sudo systemctl enable slurmd --now
5、确保节点状态初始化
sudo scontrol update NodeName=ubuntuseerver State=RESUME
6、测试是否成功
srun --partition=CPU --time=00:01:00 --ntasks=1 hostname
如果输出主机名则证明成功。
报错处理
1、如果在启动服务的时候报错,重复执行以下内容;
sudo chmod -R 755 /var/spool/slurmd
sudo chmod -R 755 /var/spool/slurmctld
然后重新启动服务
sudo systemctl restart slurmd
sudo systemctl restart slurmctld
其他报错,欢迎联系作者询问。
备注
不同Ubuntu可能有所不同,本文适用于Ubuntu 24.04
参考资料
https://wxyhgk.com/article/ubuntu-slurm