PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS
These notes are based on the GTC Silicon Valley 2019 talk of the same name and describe how CUTLASS 1.3 implements efficient matrix multiplication on Volta Tensor Cores. The main content covers three points:
- the mma.sync instruction introduced in CUDA 10.1;
- 128-bit accesses along the Global Memory -> Shared Memory -> register file (RF) path;
- conflict-free transposition in Shared Memory.
Double buffering is not covered.
mma
Both the slides and the source code follow a bottom-up approach: the tiling strategy at each level is derived from the hardware specification. A Volta Tensor Core natively computes a 4x4x4 matrix multiply-accumulate, and one HMMA.884.F16.F16 instruction uses two Tensor Cores over two passes.
The mma instruction wrapped by CUTLASS computes an m16n16k4 matrix multiplication.
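For reference, here is a minimal sketch of issuing the underlying PTX instruction directly from CUDA C++ (CUTLASS wraps this in its own fragment and iterator abstractions rather than exposing it like this). Following the PTX fragment layout for mma.m8n8k4 with .f16, each thread supplies four halves of A and four halves of B, packed as two 32-bit registers each, and owns eight halves of the accumulator.

```cuda
#include <cuda_fp16.h>

// Minimal sketch of the Volta warp-synchronous MMA (sm_70, f16 accumulate).
// Operand packing per thread: a[2] and b[2] each hold 4 halves (2 x .f16x2),
// acc[4] holds this thread's 8-half accumulator fragment.
__device__ __forceinline__
void mma_m8n8k4_f16(const unsigned (&a)[2], const unsigned (&b)[2],
                    unsigned (&acc)[4]) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
  asm volatile(
      "mma.sync.aligned.m8n8k4.col.row.f16.f16.f16.f16 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6,%7}, {%0,%1,%2,%3};\n"
      : "+r"(acc[0]), "+r"(acc[1]), "+r"(acc[2]), "+r"(acc[3])
      : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]));
#endif
}
```

On sm_70 this lowers to the HMMA.884.F16.F16 sequence mentioned above; the col/row layout qualifiers shown here are just one of the valid combinations.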
See the description in Modeling Deep Learning Accelerator Enabled GPUs: within a warp, every four consecutive threads form a threadgroup, and two threadgroups make up an octet. Each octet computes one Quad Pair (QP) serially, and data is reused when computing different QPs, as shown in the figure below:
The figure below shows the correspondence between threads and data within QP0.
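That grouping can be written down directly. The sketch below is purely illustrative (the struct and function names are mine) and assumes the quad-pair definition from the PTX ISA for mma.m8n8k4: lanes i..i+3 together with lanes i+16..i+19, for i in {0, 4, 8, 12}, form one quad pair.

```cuda
// Illustrative mapping from a lane id to its threadgroup and quad pair (QP).
// Four consecutive lanes form a threadgroup; threadgroup t and threadgroup
// t + 4 together form one octet, i.e. quad pair t (t in 0..3).
struct LaneGroup {
  int threadgroup;  // 0..7
  int quad_pair;    // 0..3 (octet index)
};

__device__ __forceinline__ LaneGroup classify_lane(int lane_id) {
  LaneGroup g;
  g.threadgroup = lane_id / 4;
  g.quad_pair   = g.threadgroup % 4;  // lanes {i..i+3, i+16..i+19}
  return g;
}
```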
Permuted Shared Memory Tiles
For a column-major matrix A in global memory, each thread loads eight elements, so a warp loads an m64k4 tile. However, with the thread-to-data mapping described above, writing this tile to Shared Memory directly would cause bank conflicts when threads later read it. CUTLASS therefore uses a conflict-free shared memory permutation to implement the transpose.
The accompanying figures show the permuted placement for the second, third, and fourth groups of threads.
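The exact permutation used by CUTLASS 1.3's Volta884 iterators is not reproduced here; the sketch below only illustrates the general idea with a generic XOR swizzle. The 16-byte slot a thread writes is permuted by its row index, so a later column-wise (transposed) read of eight consecutive rows (aligned to a multiple of 8) touches eight distinct 128-bit slots and is bank-conflict free.

```cuda
#include <cuda_fp16.h>

// NOT the exact CUTLASS 1.3 Volta884 layout -- a generic XOR-swizzle sketch
// of the same idea. The tile is viewed as ROWS x 8 slots of 8 halves
// (128 bits) each; one row of slots spans all 32 shared memory banks.
template <int ROWS>
struct SwizzledTile {
  uint4 data[ROWS][8];  // 8 halves per uint4 slot

  __device__ void store(int row, int col, uint4 v) {
    data[row][col ^ (row & 7)] = v;    // permuted placement
  }
  __device__ uint4 load(int row, int col) const {
    return data[row][col ^ (row & 7)]; // apply the same permutation on read
  }
};
```

A threadblock tile would then be declared as, for example, `__shared__ SwizzledTile<64> tile_A;`, with each thread storing the 128-bit vector it fetched from global memory.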
Pointer Offsets For Permuted Shared Memory
Volta884Multiplicand defines the iterators for the multiplicands (A and B):
- TileLoadIterator: repeatedly loads data from Global Memory into registers;
- Volta884ThreadblockMultiplicandStoreIterator: handles the permuted Shared Memory placement; Volta884ThreadblockMultiplicandStoreIterator::ThreadOffset corresponds to the figure below;
- Volta884WarpMultiplicandLoadIterator: loads the data back from Shared Memory.
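Outside of the CUTLASS class hierarchy, the data flow these three iterators implement can be pictured as the plain-CUDA staging step below. The function and parameter names are hypothetical, and SwizzledTile is the sketch from the previous section, not a CUTLASS type: a 128-bit global load into registers, a permuted 128-bit store to Shared Memory, and a 128-bit load back for the warp-level MMA.

```cuda
#include <cuda_fp16.h>

// Hypothetical staging step mirroring the three iterator roles above:
//   (1) 128-bit load from Global Memory into registers  (TileLoadIterator)
//   (2) permuted 128-bit store into Shared Memory        (...StoreIterator)
//   (3) 128-bit load back from Shared Memory for the MMA (...LoadIterator)
template <int ROWS>
__device__ uint4 stage_fragment(const __half* __restrict__ gmem_ptr,  // 16B aligned
                                SwizzledTile<ROWS>& smem,
                                int store_row, int store_col,
                                int load_row,  int load_col) {
  // (1) Global Memory -> registers: one 128-bit (8 x half) vector per thread.
  uint4 frag = *reinterpret_cast<const uint4*>(gmem_ptr);

  // (2) Registers -> Shared Memory through the conflict-free permutation.
  smem.store(store_row, store_col, frag);
  __syncthreads();

  // (3) Shared Memory -> registers; the load coordinates follow the warp's
  //     MMA thread mapping rather than the store pattern.
  return smem.load(load_row, load_col);
}
```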
Conflict-Free Shared Memory Loads
Loading data from Shared Memory into thread registers again takes four steps. The threads access the same data in the first two steps, so in total a 32x4 tile of A is loaded; each element in Shared Memory is consumed twice.
Spatially Interleaved
As noted earlier, each thread reads eight elements from Shared Memory, but an mma instruction consumes only four elements per thread, so the computed outputs are spatially interleaved. By tiling A and B further, a single load can feed four computations.
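As an illustration (reusing the mma_m8n8k4_f16 sketch from above, not CUTLASS's actual fragment types): a thread's eight-half A fragment and eight-half B fragment, each obtained from a single 128-bit shared memory load, split into two four-half operands apiece and feed 2 x 2 = 4 mma issues.

```cuda
// Illustrative only: one 128-bit load per operand feeds four interleaved
// MMAs. The accumulators acc[i][j] must be initialized (e.g. to zero)
// before the main loop.
__device__ void mma_from_one_load(const uint4& a8, const uint4& b8,
                                  unsigned (&acc)[2][2][4]) {
  const unsigned a_frag[2][2] = {{a8.x, a8.y}, {a8.z, a8.w}};  // 4 halves each
  const unsigned b_frag[2][2] = {{b8.x, b8.y}, {b8.z, b8.w}};
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      mma_m8n8k4_f16(a_frag[i], b_frag[j], acc[i][j]);  // sketched earlier
}
```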
References:
- [DOC] Where does cutlass’ detailed GEMM kernel? #526
- Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
- Modeling Deep Learning Accelerator Enabled GPUs
- gpgpu-sim_distribution
- 理解Tensor Core
- Flexible Performant GEMM Kernels on GPUs
- CUDA Tensor Core编程
- PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS
- The NVIDIA Titan V Deep Learning Deep Dive: It’s All About The Tensor Cores
- 9.7.13.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
- Numerical Behavior of NVIDIA Tensor Cores
- CUDA Ampere Tensor Core HGEMM 矩阵乘法优化笔记 —— Up To 131 TFLOPS!
- If we have two or four memory requests by a warp, do they need coalesced access/contiguity? #328
- Do bank conflicts increase when using more shared memory?
- How does parameter computeType affect the computation?
- 2.1.10. GEMM Algorithms Numerical Behavior
- cuBLAS的使用
- RAFT在Knowhere上的一些评估测试[1]
- cudnn-frontend/tree/main/samples/samples/conv_sample.cpp
- Is a union in C++ actually a class?
- A Generalized Micro-kernel Abstraction for GPU Linear Algebra
- Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs
- Double-buffering in shared memory, details? #227
- Efficient GEMM in CUDA
- Thread synchronization with syncwarp
- Using CUDA Warp-Level Primitives
- CUDA微架构与指令集(3)-SASS指令集分类
- VOLTA Architecture and performance optimization
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Determining registers holding the data after executing LDG.E.128