Following up on my earlier post on building VASP 6.5.0 with the Intel compiler suite, this time I try building VASP for AMD CPUs and NVIDIA GPUs, on a certain supercomputer.
(AMD EPYC has lots of cores. Yes!)
(The build environment is complicated to configure. Slightly less Yes...)
Server hardware/software overview:
- CPU: dual AMD EPYC-Milan 7713 (64 cores/CPU, 128 cores/node)
- GPU: NVIDIA A100 40 GB
- OS: RHEL 8.4
- Software environment: HPE Cray
```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC-Milan Processor
Stepping:            1
CPU MHz:             1996.250
BogoMIPS:            3992.50
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
Flags:               ...

$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
```
Default Cray environment:
```
$ module list
Currently Loaded Modulefiles:
 1) craype-x86-rome           5) cce/13.0.2           9) cray-libsci/21.08.1.2
 2) libfabric/1.11.0.4.125    6) craype/2.7.15       10) cray-pals/1.1.6
 3) craype-network-ofi        7) cray-dsmml/0.2.2    11) PrgEnv-cray/8.3.3
 4) perftools-base/22.04.0    8) cray-mpich/8.1.15
```
Important notes:
- In the HPE Cray environment, every Fortran compiler is wrapped as `ftn`; the C compiler is wrapped as `cc` (lowercase) and the C++ compiler as `CC` (uppercase). A quick check follows below.
- Each hardware/software stack is modularized; the programming-environment modules are named `PrgEnv-xxx`, where xxx can be cray, intel, gnu, aocc, or nvhpc. The default is `PrgEnv-cray`.
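A quick way to confirm which real compiler a wrapper currently resolves to (a sketch; the version strings reported depend on the loaded PrgEnv):

```sh
$ module swap PrgEnv-cray PrgEnv-intel   # pick a programming environment
$ ftn --version   # should now report the Intel Fortran compiler
$ cc --version    # the wrapped C compiler
$ CC --version    # the wrapped C++ compiler
```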
1. AMD CPU + Intel oneAPI version
The build environment follows the guides provided by the supercomputer administrators.
1.1 Loading the build environment
- Compiler: Intel oneAPI 2024.0
- Math library: MKL from Intel oneAPI 2024.0
- MPI: Cray MPICH 8.1.15
- I/O enhancement: Intel-compiled HDF5 (parallel, version 1.12.1.1)
```sh
$ module swap PrgEnv-cray PrgEnv-intel
$ module swap craype-x86-rome craype-x86-milan
$ module load mkl/2024.0
$ module load cray-hdf5-parallel
$ module rm cray-libsci
$ module list
Currently Loaded Modulefiles:
 1) craype-x86-milan          5) intel/2024.0         9) cray-pals/1.1.6
 2) libfabric/1.11.0.4.125    6) craype/2.7.15       10) PrgEnv-intel/8.3.3
 3) craype-network-ofi        7) cray-dsmml/0.2.2    11) mkl/2024.0
 4) perftools-base/22.04.0    8) cray-mpich/8.1.15   12) cray-hdf5-parallel/1.12.1.1
$ cp arch/makefile.include.oneapi_omp makefile.include
```
1.2 Modifying makefile.include
The template I copied is not the `.aocc` one but `.oneapi`. I have not tested how VASP 6.5.0 built with older compilers (oneAPI <= 2023 or Parallel Studio XE) performs at runtime; readers can try that themselves. The general expectation is: new software goes with new compilers.
I use the oneAPI + OpenMP combination, `arch/makefile.include.oneapi_omp`. The main changes are:
- Line 2: I changed the value of `-DHOST` to `AMDIFC` (optional).
- Line 8: add `-Duse_bse_te \` to enable support for BSE triplet excitations (optional).
- Lines 15-16 (line numbers depend on your copy of the file): in the values of the Fortran compiler `FC` and linker `FCL`, replace `mpiifort -fc=ifx` with `ftn` (mandatory).
- In those same two lines, add `-diag-disable=10448` to suppress the deprecation warning for Intel® Fortran Compiler Classic (ifort) (optional); see the Intel® Fortran Compiler Release Notes: "Support Removed - Intel® Fortran Compiler Classic (ifort) is now discontinued in oneAPI 2025 release."
- Line 29: change the value of `CC_LIB` to `cc`, the wrapped C compiler of the HPE Cray environment.
- Line 37: change the value of `CXX_PARS` to `CC`, the wrapped C++ compiler of the HPE Cray environment.
- Line 48: comment out the line `VASP_TARGET_CPU ?= -xHOST`, or change it to `VASP_TARGET_CPU ?= -march=core-avx2` as on line 49 (mandatory). This comes down to instruction-set differences between AMD and Intel CPUs; see "Problem of installation of vasp632 with intel oneapi compiler".
- Lines 60-63: uncomment to enable HDF5 support (optional). The HDF5 must be built with the same compiler family used to build VASP, allowing backward compatibility; otherwise the build fails. For example, GCC-built HDF5 + oneAPI-built VASP fails, whereas HDF5 built with an older oneAPI + VASP built with a newer oneAPI works. Also, the HDF5 install root must be exposed through the `HDF5_ROOT` environment variable, or you must edit in the correct path by hand (a quick compatibility check follows after the makefile below).
```make
# Default precompiler options (revised from arch/makefile.include.oneapi_omp)
CPP_OPTIONS = -DHOST=\"AMDIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf \
              -D_OPENMP

CPP         = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC          = ftn -qopenmp -diag-disable=10448
FCL         = ftn -diag-disable=10448

FREE        = -free -names lowercase

FFLAGS      = -assume byterecl -w

OFLAG       = -O2
OFLAG_IN    = $(OFLAG)
DEBUG       = -O0

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = cc   #icx
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = CC   #icpx
LLIBS       = -lstdc++

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
#VASP_TARGET_CPU ?= -xHOST
#VASP_TARGET_CPU ?= -march=core-avx2
#FFLAGS += $(VASP_TARGET_CPU)

# Intel MKL (FFTW, BLAS, LAPACK, and scaLAPACK)
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL        += -qmkl
MKLROOT    ?= /path/to/your/mkl/installation
LLIBS      += -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS        =-I$(MKLROOT)/include/fftw

# HDF5-support (optional but strongly recommended, and mandatory for some features)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT  ?= /path/to/your/hdf5/installation
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# For the fftlib library (hardly any benefit in combination with MKL's FFTs)
#FCL         = mpiifort fftlib.o -qmkl
#CXX_FFTLIB  = icpc -qopenmp -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS       += fftlib

# For machine learning library vaspml (experimental)
#CPP_OPTIONS += -Dlibvaspml
#CPP_OPTIONS += -DVASPML_USE_CBLAS
#CPP_OPTIONS += -DVASPML_USE_MKL
#CPP_OPTIONS += -DVASPML_DEBUG_LEVEL=3
#CXX_ML      = mpiicpc -cxx=icpx -qopenmp
#CXXFLAGS_ML = -O3 -std=c++17 -Wall
#INCLUDE_ML  =
```
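Before compiling, it is worth confirming that the loaded HDF5 really was built with a matching compiler family. A sketch, assuming the module sets `HDF5_ROOT` and that the install keeps the standard `libhdf5.settings` file:

```sh
$ echo $HDF5_ROOT                         # must point at the HDF5 install root
$ grep -i 'fortran compiler' $HDF5_ROOT/lib/libhdf5.settings   # which compiler built it?
```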
1.3 Compiling
Compile on a login node; each user is limited to 4 cores there.
Make sure to add `DEPS=1` so that make tracks the file dependencies; without it, a parallel build fails.
```sh
$ make DEPS=1 -j4 all
......
$ ls bin/
vasp_gam  vasp_ncl  vasp_std
```
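As a quick sanity check of the CPU build (a sketch; assumes dynamic linking, which `-qmkl` uses by default, though Cray wrappers may link statically depending on `CRAYPE_LINK_TYPE`):

```sh
$ ldd bin/vasp_std | grep -iE 'mkl|hdf5'   # the MKL and HDF5 runtime libraries should appear
```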
2. AMD CPU + NVIDIA A100 GPU version
The build environment follows the guides provided by the supercomputer administrators.
Note: you need to build on a node that has GPU hardware and drivers (i.e., where the `nvidia-smi` command works). Otherwise, compilation fails with `libcuda.so.1 not found` once it reaches code that requires the GPU. Workaround: start the build on a CPU node, and after it errors out, log in to a GPU node and resume the build there; this saves some precious machine time.
2.1 Loading the build environment
- Compiler suite: NVHPC 23.7
- CUDA: 11.8
- Math library: MKL from Intel oneAPI 2024.0
- MPI: Cray MPICH 8.1.15
- I/O enhancement: HDF5 built with the matching NVHPC (parallel, version 1.12.1)
```sh
$ module swap PrgEnv-cray PrgEnv-nvhpc
$ module swap craype-x86-rome craype-x86-milan
$ module load craype-accel-nvidia80
$ module swap nvhpc nvhpc/23.7
$ module swap cuda cuda/11.8.0
$ module rm cray-libsci   # cray-libsci may interfere with the math libraries
$ module load hdf5/1.12.1-nvhpc
$ module load mkl/2024.0
$ module list
Currently Loaded Modulefiles:
 1) craype-x86-milan          6) craype/2.7.15       11) cuda/11.8.0
 2) libfabric/1.11.0.4.125    7) cray-dsmml/0.2.2    12) craype-accel-nvidia80
 3) craype-network-ofi        8) cray-mpich/8.1.15   13) hdf5/1.12.1-nvhpc
 4) perftools-base/22.04.0    9) cray-pals/1.1.6     14) mkl/2024.0
 5) nvhpc/23.7               10) PrgEnv-nvhpc/8.3.3
$ cp arch/makefile.include.nvhpc_ompi_mkl_omp_acc makefile.include
```
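Before touching the makefile, you can confirm that the wrappers now resolve to the intended NVHPC release (a sketch):

```sh
$ nvfortran --version   # should report 23.7
$ which nvfortran       # the path the makefile below uses to derive NVROOT
$ ftn --version         # the Cray wrapper should also forward to nvfortran now
```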
2.2 Modifying makefile.include
Copy the template `arch/makefile.include.nvhpc_ompi_mkl_omp_acc`. The main changes are:
- Line 2: I changed the value of `-DHOST` to `LinuxNVGPU` (optional).
- Line 8: add `-Duse_bse_te \` to enable support for BSE triplet excitations (optional).
- Lines 21-23 (line numbers depend on your copy of the file): change the C compiler `CC` from `mpicc` to `cc`; in the values of the Fortran compiler `FC` and linker `FCL`, replace `mpif90` with `ftn`; then adjust `-gpu=` to your GPU architecture and CUDA version (mandatory).

`-gpu=` specifies the physical GPU architecture and the CUDA version. My GPU is an A100, i.e. the Ampere architecture, code `cc80`, and the CUDA version is 11.8:

- Pascal: cc60 (e.g., Tesla P100, GTX 1080)
- Volta: cc70 (e.g., Tesla V100)
- Turing: cc75 (e.g., RTX 2080)
- Ampere: cc80 (e.g., A100, RTX 3080)

So for me it is `-gpu=cc80,cuda11.8`.
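If you are unsure of your GPU's compute capability, you can query it on a GPU node. A sketch: `nvidia-smi --query-gpu=compute_cap` needs a reasonably recent driver, and `nvaccelinfo` ships with NVHPC:

```sh
$ nvidia-smi --query-gpu=name,compute_cap --format=csv   # e.g. "NVIDIA A100-SXM4-40GB, 8.0" -> cc80
$ nvaccelinfo | grep -i 'default target'                 # NVHPC's view of the local GPU
```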
- Line 50: change `nvc++` to `CC`.
- About MKL, pick one of two schemes (scheme 2 is sketched after this list):
  - Scheme 1: comment out the existing `MKLLIBS` and the `LLIBS_MKL` below it; use a single line instead: `LLIBS_MKL = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64`
  - Scheme 2: uncomment `#MKLLIBS = -Mmkl`; on the line below it, change `MKLLIBS =` to `MKLLIBS +=` (adding the plus sign); then, in `LLIBS_MKL = -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(MKLLIBS)`, change `-lmkl_blacs_openmpi_lp64` to `-lmkl_blacs_intelmpi_lp64` (openmpi -> intelmpi). Without this change you may hit an undefined-reference error: `libmkl_blacs_openmpi_lp64.so: undefined reference`
- Lines 105-108: uncomment to enable HDF5 support (optional).
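For reference, scheme 2 would end up looking like this (a sketch assembled from the steps above; my makefile below uses scheme 1):

```make
MKLLIBS     = -Mmkl
MKLLIBS    += -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -pgf90libs -mp -lpthread -lm -ldl
LLIBS_MKL   = -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(MKLLIBS)
```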
```make
# Default precompiler options (revised from arch/makefile.include.nvhpc_ompi_mkl_omp_acc)
CPP_OPTIONS = -DHOST=\"LinuxNVGPU\" \
              -DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dqd_emulate \
              -Dfock_dblbuf \
              -D_OPENMP \
              -DACC_OFFLOAD \
              -DNVCUDA \
              -DUSENCCL

CPP         = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)

# N.B.: you might need to change the cuda-version here
# to one that comes with your NVIDIA-HPC SDK
CC          = cc  -acc -gpu=cc80,cuda11.8 -mp
FC          = ftn -acc -gpu=cc80,cuda11.8 -mp
FCL         = ftn -acc -gpu=cc80,cuda11.8 -mp -c++libs

FREE        = -Mfree

FFLAGS      = -Mbackslash -Mlarge_arrays

OFLAG       = -fast

DEBUG       = -Mfree -O0 -traceback

LLIBS       = -cudalib=cublas,cusolver,cufft,nccl -cuda

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o minimax_dependence.o wave_window.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = $(CC)
CFLAGS_LIB  = -O -w
FFLAGS_LIB  = -O1 -Mfixed
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = CC --no_warnings   #nvc++ --no_warnings

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp host
FFLAGS     += $(VASP_TARGET_CPU)

# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')

# If the above fails, then NVROOT needs to be set manually
#NVHPC      ?= /opt/nvidia/hpc_sdk
#NVVERSION   = 21.11
#NVROOT      = $(NVHPC)/Linux_x86_64/$(NVVERSION)

## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
#OFLAG_IN   = -fast -Mwarperf
#SOURCE_IN  := nonlr.o

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd
INCS       += -I$(QD)/include/qd

# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
MKLROOT    ?= /path/to/your/mkl/installation
#MKLLIBS     = -Mmkl
#MKLLIBS    += -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -pgf90libs -mp -lpthread -lm -ldl
LLIBS_MKL   = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS       += -I$(MKLROOT)/include/fftw

# If you want to use scaLAPACK from MKL
#LLIBS_MKL   = -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(MKLLIBS)

# Use a separate scaLAPACK installation (optional but recommended in combination with OpenMPI)
# Comment out the two lines below if you want to use scaLAPACK from MKL instead
#SCALAPACK_ROOT ?= /path/to/your/scalapack/installation
#LLIBS_MKL   = -L$(SCALAPACK_ROOT)/lib -lscalapack $(MKLLIBS)

LLIBS      += $(LLIBS_MKL)
INCS       += -I$(MKLROOT)/include/fftw

# Use cusolvermp (optional)
# supported as of NVHPC-SDK 24.1 (and needs CUDA-11.8)
#CPP_OPTIONS+= -DCUSOLVERMP -DCUBLASMP
#LLIBS      += -cudalib=cusolvermp,cublasmp -lnvhpcwrapcal

# HDF5-support (optional but strongly recommended, and mandatory for some features)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT  ?= /path/to/your/hdf5/installation
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# For the fftlib library (hardly any benefit for the OpenACC GPU port, especially in combination with MKL's FFTs)
#CPP_OPTIONS+= -Dsysv
#FCL        += fftlib.o
#CXX_FFTLIB  = nvc++ -mp --no_warnings -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS       += fftlib
#LLIBS      += -ldl

# For machine learning library vaspml (experimental)
#CPP_OPTIONS += -Dlibvaspml
#CPP_OPTIONS += -DVASPML_USE_CBLAS
#CPP_OPTIONS += -DVASPML_DEBUG_LEVEL=3
#CXX_ML      = mpic++ -mp
#CXXFLAGS_ML = -O3 -std=c++17 -Wall -Wextra
#INCLUDE_ML  =
```
2.3 Compiling
Start the build on the login node; once it fails with `libcuda.so.1 not found`, request a GPU node, reload the build environment, and resume the build.
```sh
$ make DEPS=1 -j4 all
...   # fails with libcuda.so.1 not found
$ qsub -I ...   # request an interactive job on a GPU node
$ # reload the build environment here (sketch below), then check the GPU:
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   41C    P0             55W /  400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
```
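The interactive job starts a fresh shell, so the build environment from section 2.1 has to be reloaded before resuming; a sketch repeating the same modules:

```sh
$ module swap PrgEnv-cray PrgEnv-nvhpc
$ module swap craype-x86-rome craype-x86-milan
$ module load craype-accel-nvidia80
$ module swap nvhpc nvhpc/23.7
$ module swap cuda cuda/11.8.0
$ module rm cray-libsci
$ module load hdf5/1.12.1-nvhpc mkl/2024.0
```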
```sh
$ make DEPS=1 -j16 all
...   # possible errors (see the next section)
```
2.4 Fixing build errors
With OpenACC (if `CPP_OPTIONS` in makefile.include contains `-D_OPENACC`) or NVCUDA (if it contains `-DNVCUDA`), the build can fail with errors around `MPIX_Query_cuda_support`. The error message reports a file and line number, usually pointing into the following files (the same for gam/std/ncl): `./build/{gam,std,ncl}/{openacc,nvcuda}.f90`
Fix:
Open the corresponding `.F` file in `./build/{gam,std,ncl}/` at the reported line (`vim +LINENO`); note that the error names the preprocessed `.f90` file, but the edit belongs in the `.F` source it is generated from. Comment out the following four lines (prepend a `!`):
```fortran
! INTERFACE
!    INTEGER(c_int) FUNCTION MPIX_Query_cuda_support() BIND(C, name="MPIX_Query_cuda_support")
!    END FUNCTION
! END INTERFACE
```
Then, just below, change `CUDA_AWARE_SUPPORT = MPIX_Query_cuda_support() == 1` to `CUDA_AWARE_SUPPORT = .TRUE.` and resume the build.
Reference: Error: undefined reference to `MPIX_Query_cuda_support'
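If you would rather not edit each file by hand, the key replacement can be scripted. A hedged sketch (bash brace expansion), assuming the build directories already exist and the source line matches the text quoted above; adjust the pattern to the exact spacing in your copy. The now-unreferenced INTERFACE block can even stay, since an unused BIND(C) interface does not by itself generate a linker reference:

```sh
# replace the runtime query with a hard-coded .TRUE. in every affected source
for f in build/{gam,std,ncl}/openacc.F build/{gam,std,ncl}/nvcuda.F; do
  [ -f "$f" ] || continue
  sed -i 's/CUDA_AWARE_SUPPORT = MPIX_Query_cuda_support() == 1/CUDA_AWARE_SUPPORT = .TRUE./' "$f"
done
```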
2.5 Testing
SCF on monolayer graphene.
INCAR:
```
SYSTEM = graphene
ISTART = 0; ICHARG = 2
ENCUT = 520
ISIF = 3
ISMEAR = -5 ; SIGMA = 0.05
ALGO = Fast
# NPAR = 3
#########
EDIFF = 1E-7
PREC = Accurate
EDIFFG = -0.01
#########
#ISPIN = 2
#MAGMOM =
LCHARG = .TRUE.
LWAVE = .TRUE.
LORBIT = 11
LREAL = .FALSE.
#########
SYMPREC = 1E-4
ISYM = 1
NELM = 200
#########
NSW = 0
POTIM = 0.5
IBRION = -1
#########
#VDW=DFT-D2
#LVDW = .TRUE.
#IVDW = 1
```
KPOINTS:
```
K-POINTS
0
Gamma-Centered
25 25 1
0 0 0
```
POSCAR (note: the lattice matrix carries small floating-point errors; this is for testing only):
```
graphene
1.00000000000000
 2.4677557588200547  0.0000000001951262 -0.0000000000000000
-1.2338785942720587  2.1371404153443971 -0.0000000000000000
 0.0000000000000000  0.0000000000000000 14.9975103391044442
C
2
Direct
 0.3333328829999971  0.6666671669999999  0.2000000060000033
 0.6666671540000024  0.3333328579999986  0.2000000060000033
```
```sh
$ export OMP_NUM_THREADS=16   # number of CPU cores
$ mpirun -np 1 --cpu-bind depth -d $OMP_NUM_THREADS vasp_std | tee vasp_run.out
 running    1 mpi-ranks, with   16 threads/rank, on    1 nodes
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 Offloading initialized ...  1 GPUs detected
 vasp.6.5.0 16Dec24 (build ?? 2025 ??) complex
 POSCAR found type information on POSCAR C
 POSCAR found :  1 types and  2 ions
 Reading from existing POTCAR
 scaLAPACK will be used selectively (only on CPU)
 Reading from existing POTCAR
 LDA part: xc-table for (Slater+PW92), standard interpolation
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 WAVECAR not read
 entering main loop

$ head OUTCAR
 vasp.6.5.0 16Dec24 (build ??) complex
 executed on LinuxNVGPU date 2025 ??
 running    1 mpi-ranks, with   16 threads/rank, on    1 nodes
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on NCORE=   1 cores,    1 groups
 Offloading initialized ...  1 GPUs detected

$ tail -14 OUTCAR
     General timing and accounting informations for this job:
     ========================================================
            Total CPU time used (sec):       26.150
                      User time (sec):       25.026
                    System time (sec):        1.125
                   Elapsed time (sec):       25.703
             Maximum memory used (kb):     1377240.
             Average memory used (kb):          N/A
                    Minor page faults:       263714
                    Major page faults:            0
          Voluntary context switches:        18839
```
VASP successfully detects the GPU and runs the SCF calculation on it.
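For production runs, the same launch line can go into a batch script. A sketch for a PBS-style scheduler (the site uses `qsub`); the job name, resource-selection syntax, and walltime are assumptions you should adapt to your site:

```sh
#!/bin/bash
#PBS -N vasp_gpu_test
#PBS -l select=1:ncpus=16:ngpus=1   # hypothetical resource request; adjust to your site
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# reload the build environment from section 2.1 here (module swap/load ...)

export OMP_NUM_THREADS=16
mpirun -np 1 --cpu-bind depth -d $OMP_NUM_THREADS vasp_std | tee vasp_run.out
```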
3. Closing remarks
Note: renaming the Fortran/C/C++ compilers to ftn/cc/CC as above applies only to HPE Cray supercomputer environments.
To add a module environment for the new build, you can follow my earlier post (VASP 6.5.0 + Intel CPU: compiling and adding a module environment); just make sure the required environment dependencies are set up and loaded.
That is it for this post. When I find time, I will write up the build of VASP 6.5.0 + Intel CPU + NVIDIA A40 GPU on our own cluster.
Please credit the source when reposting.
Comments and exchanges are welcome.
PS: Please do not DM me or reply asking for the VASP source code; I will not respond to such requests. Remember that VASP is commercial software. Thanks 😃