Following up on my earlier post on building VASP 6.5.0 with the Intel compiler suite, this time I try building VASP for AMD CPUs and NVIDIA GPUs, on a certain supercomputer.
(AMD EPYC has lots of cores. Yes!)
(The build environment is complicated to configure. Slightly less Yes...)
Server hardware/software overview:
- CPU: dual AMD EPYC-Milan 7713 (64 cores/CPU, 128 cores/node)
- GPU: NVIDIA A100 40 GB
- OS: RHEL 8.4
- Software environment: HPE Cray
```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  1
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC-Milan Processor
Stepping:            1
CPU MHz:             1996.250
BogoMIPS:            3992.50
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-127
Flags:               ...

$ cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)
```
Default Cray environment:
```
$ module list
Currently Loaded Modulefiles:
 1) craype-x86-rome           5) cce/13.0.2           9) cray-libsci/21.08.1.2
 2) libfabric/1.11.0.4.125    6) craype/2.7.15       10) cray-pals/1.1.6
 3) craype-network-ofi        7) cray-dsmml/0.2.2    11) PrgEnv-cray/8.3.3
 4) perftools-base/22.04.0    8) cray-mpich/8.1.15
```
Important notes:
- In the HPE Cray environment, every Fortran compiler is wrapped as `ftn`; the C compiler is wrapped as `cc` (lowercase) and the C++ compiler as `CC` (uppercase). A quick check follows below.
- Each hardware/software stack is modularized; the programming-environment modules are named `PrgEnv-xxx`, where xxx can be cray, intel, gnu, aocc, or nvhpc. The default is `PrgEnv-cray`.
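A quick way to confirm which real compiler a wrapper currently resolves to (a sketch; the version strings reported depend on the loaded PrgEnv):

```sh
$ module swap PrgEnv-cray PrgEnv-intel   # pick a programming environment
$ ftn --version   # should now report the Intel Fortran compiler
$ cc --version    # the wrapped C compiler
$ CC --version    # the wrapped C++ compiler
```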
1. AMD CPU + Intel oneAPI version
The build environment follows the guides provided by the supercomputer administrators.
1.1 Loading the build environment
- Compiler: Intel oneAPI 2024.0
- Math library: MKL from Intel oneAPI 2024.0
- MPI: Cray MPICH 8.1.15
- I/O enhancement: Intel-compiled HDF5 (parallel, version 1.12.1.1)
```sh
$ module swap PrgEnv-cray PrgEnv-intel
$ module swap craype-x86-rome craype-x86-milan
$ module load mkl/2024.0
$ module load cray-hdf5-parallel
$ module rm cray-libsci
$ module list
Currently Loaded Modulefiles:
 1) craype-x86-milan          5) intel/2024.0         9) cray-pals/1.1.6
 2) libfabric/1.11.0.4.125    6) craype/2.7.15       10) PrgEnv-intel/8.3.3
 3) craype-network-ofi        7) cray-dsmml/0.2.2    11) mkl/2024.0
 4) perftools-base/22.04.0    8) cray-mpich/8.1.15   12) cray-hdf5-parallel/1.12.1.1
$ cp arch/makefile.include.oneapi_omp makefile.include
```
1.2 Modifying makefile.include
The template I copied is not the `.aocc` one but `.oneapi`. I have not tested how VASP 6.5.0 built with older compilers (oneAPI <= 2023 or Parallel Studio XE) performs at runtime; readers can try that themselves. The general expectation is: new software goes with new compilers.
I use the oneAPI + OpenMP combination, `arch/makefile.include.oneapi_omp`. The main changes are:
- Line 2: I changed the value of `-DHOST` to `AMDIFC` (optional).
- Line 8: add `-Duse_bse_te \` to enable support for BSE triplet excitations (optional).
- Lines 15-16 (line numbers depend on your copy of the file): in the values of the Fortran compiler `FC` and linker `FCL`, replace `mpiifort -fc=ifx` with `ftn` (mandatory).
- In those same two lines, add `-diag-disable=10448` to suppress the deprecation warning for Intel® Fortran Compiler Classic (ifort) (optional); see the Intel® Fortran Compiler Release Notes: "Support Removed - Intel® Fortran Compiler Classic (ifort) is now discontinued in oneAPI 2025 release."
- Line 29: change the value of `CC_LIB` to `cc`, the wrapped C compiler of the HPE Cray environment.
- Line 37: change the value of `CXX_PARS` to `CC`, the wrapped C++ compiler of the HPE Cray environment.
- Line 48: comment out the line `VASP_TARGET_CPU ?= -xHOST`, or change it to `VASP_TARGET_CPU ?= -march=core-avx2` as on line 49 (mandatory). This comes down to instruction-set differences between AMD and Intel CPUs; see "Problem of installation of vasp632 with intel oneapi compiler".
- Lines 60-63: uncomment to enable HDF5 support (optional). The HDF5 must be built with the same compiler family used to build VASP, allowing backward compatibility; otherwise the build fails. For example, GCC-built HDF5 + oneAPI-built VASP fails, whereas HDF5 built with an older oneAPI + VASP built with a newer oneAPI works. Also, the HDF5 install root must be exposed through the `HDF5_ROOT` environment variable, or you must edit in the correct path by hand (a quick compatibility check follows after the makefile below).
```make
# Default precompiler options (revised from arch/makefile.include.oneapi_omp)
CPP_OPTIONS = -DHOST=\"AMDIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf \
              -D_OPENMP

CPP         = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC          = ftn -qopenmp -diag-disable=10448
FCL         = ftn -diag-disable=10448

FREE        = -free -names lowercase

FFLAGS      = -assume byterecl -w

OFLAG       = -O2
OFLAG_IN    = $(OFLAG)
DEBUG       = -O0

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = cc   #icx
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = CC   #icpx
LLIBS       = -lstdc++

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
#VASP_TARGET_CPU ?= -xHOST
#VASP_TARGET_CPU ?= -march=core-avx2
#FFLAGS += $(VASP_TARGET_CPU)

# Intel MKL (FFTW, BLAS, LAPACK, and scaLAPACK)
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL        += -qmkl
MKLROOT    ?= /path/to/your/mkl/installation
LLIBS      += -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS        =-I$(MKLROOT)/include/fftw

# HDF5-support (optional but strongly recommended, and mandatory for some features)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT  ?= /path/to/your/hdf5/installation
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# For the fftlib library (hardly any benefit in combination with MKL's FFTs)
#FCL         = mpiifort fftlib.o -qmkl
#CXX_FFTLIB  = icpc -qopenmp -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS       += fftlib

# For machine learning library vaspml (experimental)
#CPP_OPTIONS += -Dlibvaspml
#CPP_OPTIONS += -DVASPML_USE_CBLAS
#CPP_OPTIONS += -DVASPML_USE_MKL
#CPP_OPTIONS += -DVASPML_DEBUG_LEVEL=3
#CXX_ML      = mpiicpc -cxx=icpx -qopenmp
#CXXFLAGS_ML = -O3 -std=c++17 -Wall
#INCLUDE_ML  =
```
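Before compiling, it is worth confirming that the loaded HDF5 really was built with a matching compiler family. A sketch, assuming the module sets `HDF5_ROOT` and that the install keeps the standard `libhdf5.settings` file:

```sh
$ echo $HDF5_ROOT                         # must point at the HDF5 install root
$ grep -i 'fortran compiler' $HDF5_ROOT/lib/libhdf5.settings   # which compiler built it?
```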
1.3 Compiling
Compile on a login node; each user is limited to 4 cores there.
Make sure to add `DEPS=1` so that make tracks the file dependencies; without it, a parallel build fails.
```sh
$ make DEPS=1 -j4 all
......
$ ls bin/
vasp_gam  vasp_ncl  vasp_std
```
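As a quick sanity check of the CPU build (a sketch; assumes dynamic linking, which `-qmkl` uses by default, though Cray wrappers may link statically depending on `CRAYPE_LINK_TYPE`):

```sh
$ ldd bin/vasp_std | grep -iE 'mkl|hdf5'   # the MKL and HDF5 runtime libraries should appear
```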
2. AMD CPU + NVIDIA A100 GPU version
The build environment follows the guides provided by the supercomputer administrators.
Note: you need to build on a node that has GPU hardware and drivers (i.e., where the `nvidia-smi` command works). Otherwise, compilation fails with `libcuda.so.1 not found` once it reaches code that requires the GPU. Workaround: start the build on a CPU node, and after it errors out, log in to a GPU node and resume the build there; this saves some precious machine time.
2.1 Loading the build environment
- Compiler suite: NVHPC 23.7
- CUDA: 11.8
- Math library: MKL from Intel oneAPI 2024.0
- MPI: Cray MPICH 8.1.15
- I/O enhancement: HDF5 built with the matching NVHPC (parallel, version 1.12.1)
```sh
$ module swap PrgEnv-cray PrgEnv-nvhpc
$ module swap craype-x86-rome craype-x86-milan
$ module load craype-accel-nvidia80
$ module swap nvhpc nvhpc/23.7
$ module swap cuda cuda/11.8.0
$ module rm cray-libsci   # cray-libsci may interfere with the math libraries
$ module load hdf5/1.12.1-nvhpc
$ module load mkl/2024.0
$ module list
Currently Loaded Modulefiles:
 1) craype-x86-milan          6) craype/2.7.15       11) cuda/11.8.0
 2) libfabric/1.11.0.4.125    7) cray-dsmml/0.2.2    12) craype-accel-nvidia80
 3) craype-network-ofi        8) cray-mpich/8.1.15   13) hdf5/1.12.1-nvhpc
 4) perftools-base/22.04.0    9) cray-pals/1.1.6     14) mkl/2024.0
 5) nvhpc/23.7               10) PrgEnv-nvhpc/8.3.3
$ cp arch/makefile.include.nvhpc_ompi_mkl_omp_acc makefile.include
```
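Before touching the makefile, you can confirm that the wrappers now resolve to the intended NVHPC release (a sketch):

```sh
$ nvfortran --version   # should report 23.7
$ which nvfortran       # the path the makefile below uses to derive NVROOT
$ ftn --version         # the Cray wrapper should also forward to nvfortran now
```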
2.2 Modifying makefile.include
Copy the template `arch/makefile.include.nvhpc_ompi_mkl_omp_acc`. The main changes are:
- Line 2: I changed the value of `-DHOST` to `LinuxNVGPU` (optional).
- Line 8: add `-Duse_bse_te \` to enable support for BSE triplet excitations (optional).
- Lines 21-23 (line numbers depend on your copy of the file): change the C compiler `CC` from `mpicc` to `cc`; in the values of the Fortran compiler `FC` and linker `FCL`, replace `mpif90` with `ftn`; then adjust `-gpu=` to your GPU architecture and CUDA version (mandatory).

`-gpu=` specifies the physical GPU architecture and the CUDA version. My GPU is an A100, i.e. the Ampere architecture, code `cc80`, and the CUDA version is 11.8:

- Pascal: cc60 (e.g., Tesla P100, GTX 1080)
- Volta: cc70 (e.g., Tesla V100)
- Turing: cc75 (e.g., RTX 2080)
- Ampere: cc80 (e.g., A100, RTX 3080)

So for me it is `-gpu=cc80,cuda11.8`.
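If you are unsure of your GPU's compute capability, you can query it on a GPU node. A sketch: `nvidia-smi --query-gpu=compute_cap` needs a reasonably recent driver, and `nvaccelinfo` ships with NVHPC:

```sh
$ nvidia-smi --query-gpu=name,compute_cap --format=csv   # e.g. "NVIDIA A100-SXM4-40GB, 8.0" -> cc80
$ nvaccelinfo | grep -i 'default target'                 # NVHPC's view of the local GPU
```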
- Line 50: change `nvc++` to `CC`.
- About MKL, pick one of two schemes (scheme 2 is sketched after this list):
  - Scheme 1: comment out the existing `MKLLIBS` and the `LLIBS_MKL` below it; use a single line instead: `LLIBS_MKL = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64`
  - Scheme 2: uncomment `#MKLLIBS = -Mmkl`; on the line below it, change `MKLLIBS =` to `MKLLIBS +=` (adding the plus sign); then, in `LLIBS_MKL = -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(MKLLIBS)`, change `-lmkl_blacs_openmpi_lp64` to `-lmkl_blacs_intelmpi_lp64` (openmpi -> intelmpi). Without this change you may hit an undefined-reference error: `libmkl_blacs_openmpi_lp64.so: undefined reference`
- Lines 105-108: uncomment to enable HDF5 support (optional).
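For reference, scheme 2 would end up looking like this (a sketch assembled from the steps above; my makefile below uses scheme 1):

```make
MKLLIBS     = -Mmkl
MKLLIBS    += -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -pgf90libs -mp -lpthread -lm -ldl
LLIBS_MKL   = -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 $(MKLLIBS)
```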
```make
# Default precompiler options (revised from arch/makefile.include.nvhpc_ompi_mkl_omp_acc)
CPP_OPTIONS = -DHOST=\"LinuxNVGPU\" \
              -DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dqd_emulate \
              -Dfock_dblbuf \
              -D_OPENMP \
              -DACC_OFFLOAD \
              -DNVCUDA \
              -DUSENCCL

CPP         = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX) > $*$(SUFFIX)

# N.B.: you might need to change the cuda-version here
# to one that comes with your NVIDIA-HPC SDK
CC          = cc  -acc -gpu=cc80,cuda11.8 -mp
FC          = ftn -acc -gpu=cc80,cuda11.8 -mp
FCL         = ftn -acc -gpu=cc80,cuda11.8 -mp -c++libs

FREE        = -Mfree

FFLAGS      = -Mbackslash -Mlarge_arrays

OFLAG       = -fast

DEBUG       = -Mfree -O0 -traceback

LLIBS       = -cudalib=cublas,cusolver,cufft,nccl -cuda

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o minimax_dependence.o wave_window.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = $(CC)
CFLAGS_LIB  = -O -w
FFLAGS_LIB  = -O1 -Mfixed
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = CC --no_warnings   #nvc++ --no_warnings

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp host
FFLAGS     += $(VASP_TARGET_CPU)

# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')

# If the above fails, then NVROOT needs to be set manually
#NVHPC      ?= /opt/nvidia/hpc_sdk
#NVVERSION   = 21.11
#NVROOT      = $(NVHPC)/Linux_x86_64/$(NVVERSION)

## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
#OFLAG_IN   = -fast -Mwarperf
#SOURCE_IN  := nonlr.o

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd
INCS       += -I$(QD)/include/qd

# Intel MKL for FFTW, BLAS, LAPACK, and scaLAPACK
MKLROOT    ?= /path/to/your/mkl/installation
#MKLLIBS     = -Mmkl
#MKLLIBS    += -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -pgf90libs -mp -lpthread -lm -ldl
LLIBS_MKL   = -Mmkl -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS       += -I$(MKLROOT)/include/fftw

# If you want to use scaLAPACK from MKL
#LLIBS_MKL   = -L$(MKLROOT)/lib -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 $(MKLLIBS)

# Use a separate scaLAPACK installation (optional but recommended in combination with OpenMPI)
# Comment out the two lines below if you want to use scaLAPACK from MKL instead
#SCALAPACK_ROOT ?= /path/to/your/scalapack/installation
#LLIBS_MKL   = -L$(SCALAPACK_ROOT)/lib -lscalapack $(MKLLIBS)

LLIBS      += $(LLIBS_MKL)
INCS       += -I$(MKLROOT)/include/fftw

# Use cusolvermp (optional)
# supported as of NVHPC-SDK 24.1 (and needs CUDA-11.8)
#CPP_OPTIONS+= -DCUSOLVERMP -DCUBLASMP
#LLIBS      += -cudalib=cusolvermp,cublasmp -lnvhpcwrapcal

# HDF5-support (optional but strongly recommended, and mandatory for some features)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT  ?= /path/to/your/hdf5/installation
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
#CPP_OPTIONS    += -DVASP2WANNIER90
#WANNIER90_ROOT ?= /path/to/your/wannier90/installation
#LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

# For the fftlib library (hardly any benefit for the OpenACC GPU port, especially in combination with MKL's FFTs)
#CPP_OPTIONS+= -Dsysv
#FCL        += fftlib.o
#CXX_FFTLIB  = nvc++ -mp --no_warnings -std=c++11 -DFFTLIB_USE_MKL -DFFTLIB_THREADSAFE
#INCS_FFTLIB = -I./include -I$(MKLROOT)/include/fftw
#LIBS       += fftlib
#LLIBS      += -ldl

# For machine learning library vaspml (experimental)
#CPP_OPTIONS += -Dlibvaspml
#CPP_OPTIONS += -DVASPML_USE_CBLAS
#CPP_OPTIONS += -DVASPML_DEBUG_LEVEL=3
#CXX_ML      = mpic++ -mp
#CXXFLAGS_ML = -O3 -std=c++17 -Wall -Wextra
#INCLUDE_ML  =
```
2.3 Compiling
Start the build on the login node; once it fails with `libcuda.so.1 not found`, request a GPU node, reload the build environment, and resume the build.
```sh
$ make DEPS=1 -j4 all
...   # fails with libcuda.so.1 not found
$ qsub -I ...   # request an interactive job on a GPU node
$ # reload the build environment here (sketch below), then check the GPU:
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:41:00.0 Off |                    0 |
| N/A   41C    P0             55W /  400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|========================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
```
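The interactive job starts a fresh shell, so the build environment from section 2.1 has to be reloaded before resuming; a sketch repeating the same modules:

```sh
$ module swap PrgEnv-cray PrgEnv-nvhpc
$ module swap craype-x86-rome craype-x86-milan
$ module load craype-accel-nvidia80
$ module swap nvhpc nvhpc/23.7
$ module swap cuda cuda/11.8.0
$ module rm cray-libsci
$ module load hdf5/1.12.1-nvhpc mkl/2024.0
```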
```sh
$ make DEPS=1 -j16 all
...   # possible errors (see the next section)
```
2.4 Fixing build errors
With OpenACC (if `CPP_OPTIONS` in makefile.include contains `-D_OPENACC`) or NVCUDA (if it contains `-DNVCUDA`), the build can fail with errors around `MPIX_Query_cuda_support`. The error message reports a file and line number, usually pointing into the following files (the same for gam/std/ncl): `./build/{gam,std,ncl}/{openacc,nvcuda}.f90`
Fix:
Open the corresponding `.F` file in `./build/{gam,std,ncl}/` at the reported line (`vim +LINENO`); note that the error names the preprocessed `.f90` file, but the edit belongs in the `.F` source it is generated from. Comment out the following four lines (prepend a `!`):
```fortran
! INTERFACE
!    INTEGER(c_int) FUNCTION MPIX_Query_cuda_support() BIND(C, name="MPIX_Query_cuda_support")
!    END FUNCTION
! END INTERFACE
```
Then, just below, change `CUDA_AWARE_SUPPORT = MPIX_Query_cuda_support() == 1` to `CUDA_AWARE_SUPPORT = .TRUE.` and resume the build.
Reference: Error: undefined reference to `MPIX_Query_cuda_support'
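If you would rather not edit each file by hand, the key replacement can be scripted. A hedged sketch (bash brace expansion), assuming the build directories already exist and the source line matches the text quoted above; adjust the pattern to the exact spacing in your copy. The now-unreferenced INTERFACE block can even stay, since an unused BIND(C) interface does not by itself generate a linker reference:

```sh
# replace the runtime query with a hard-coded .TRUE. in every affected source
for f in build/{gam,std,ncl}/openacc.F build/{gam,std,ncl}/nvcuda.F; do
  [ -f "$f" ] || continue
  sed -i 's/CUDA_AWARE_SUPPORT = MPIX_Query_cuda_support() == 1/CUDA_AWARE_SUPPORT = .TRUE./' "$f"
done
```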
2.5 Testing
SCF on monolayer graphene.
INCAR:
```
SYSTEM = graphene
ISTART = 0; ICHARG = 2
ENCUT = 520
ISIF = 3
ISMEAR = -5 ; SIGMA = 0.05
ALGO = Fast
# NPAR = 3
#########
EDIFF = 1E-7
PREC = Accurate
EDIFFG = -0.01
#########
#ISPIN = 2
#MAGMOM =
LCHARG = .TRUE.
LWAVE = .TRUE.
LORBIT = 11
LREAL = .FALSE.
#########
SYMPREC = 1E-4
ISYM = 1
NELM = 200
#########
NSW = 0
POTIM = 0.5
IBRION = -1
#########
#VDW=DFT-D2
#LVDW = .TRUE.
#IVDW = 1
```
KPOINTS:
```
K-POINTS
0
Gamma-Centered
25 25 1
0 0 0
```
POSCAR (note: the lattice matrix carries small floating-point errors; this is for testing only):
```
graphene
1.00000000000000
 2.4677557588200547  0.0000000001951262 -0.0000000000000000
-1.2338785942720587  2.1371404153443971 -0.0000000000000000
 0.0000000000000000  0.0000000000000000 14.9975103391044442
C
2
Direct
 0.3333328829999971  0.6666671669999999  0.2000000060000033
 0.6666671540000024  0.3333328579999986  0.2000000060000033
```
```sh
$ export OMP_NUM_THREADS=16   # number of CPU cores
$ mpirun -np 1 --cpu-bind depth -d $OMP_NUM_THREADS vasp_std | tee vasp_run.out
 running    1 mpi-ranks, with   16 threads/rank, on    1 nodes
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on    1 cores,    1 groups
 Offloading initialized ...  1 GPUs detected
 vasp.6.5.0 16Dec24 (build ?? 2025 ??) complex
 POSCAR found type information on POSCAR C
 POSCAR found :  1 types and  2 ions
 Reading from existing POTCAR
 scaLAPACK will be used selectively (only on CPU)
 Reading from existing POTCAR
 LDA part: xc-table for (Slater+PW92), standard interpolation
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ... GRIDC
 FFT: planning ... GRID_SOFT
 FFT: planning ... GRID
 WAVECAR not read
 entering main loop

$ head OUTCAR
 vasp.6.5.0 16Dec24 (build ??) complex
 executed on LinuxNVGPU date 2025 ??
 running    1 mpi-ranks, with   16 threads/rank, on    1 nodes
 distrk:  each k-point on    1 cores,    1 groups
 distr:  one band on NCORE=   1 cores,    1 groups
 Offloading initialized ...  1 GPUs detected

$ tail -14 OUTCAR
     General timing and accounting informations for this job:
     ========================================================
            Total CPU time used (sec):       26.150
                      User time (sec):       25.026
                    System time (sec):        1.125
                   Elapsed time (sec):       25.703
             Maximum memory used (kb):     1377240.
             Average memory used (kb):          N/A
                    Minor page faults:       263714
                    Major page faults:            0
          Voluntary context switches:        18839
```
VASP successfully detects the GPU and runs the SCF calculation on it.
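For production runs, the same launch line can go into a batch script. A sketch for a PBS-style scheduler (the site uses `qsub`); the job name, resource-selection syntax, and walltime are assumptions you should adapt to your site:

```sh
#!/bin/bash
#PBS -N vasp_gpu_test
#PBS -l select=1:ncpus=16:ngpus=1   # hypothetical resource request; adjust to your site
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# reload the build environment from section 2.1 here (module swap/load ...)

export OMP_NUM_THREADS=16
mpirun -np 1 --cpu-bind depth -d $OMP_NUM_THREADS vasp_std | tee vasp_run.out
```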
3. Closing remarks
Note: renaming the Fortran/C/C++ compilers to ftn/cc/CC as above applies only to HPE Cray supercomputer environments.
To add a module environment for the new build, you can follow my earlier post (VASP 6.5.0 + Intel CPU: compiling and adding a module environment); just make sure the required environment dependencies are set up and loaded.
That is it for this post. When I find time, I will write up the build of VASP 6.5.0 + Intel CPU + NVIDIA A40 GPU on our own cluster.
Please credit the source when reposting.
Comments and exchanges are welcome.
PS: Please do not DM me or reply asking for the VASP source code; I will not respond to such requests. Remember that VASP is commercial software. Thanks 😃