1. Understanding the hardware
1.1 Software-to-hardware mapping
1.2 Threads do not necessarily execute at the same time
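- For example, a GPU with only 13 SMs of 128 cores each has 13 × 128 = 1,664 cores. If we launch 1 million threads, they cannot all run at once; the hardware has to execute them in batches.
- Threads are scheduled in warps of 32, and a partly filled warp still occupies a whole warp. We therefore prefer to make the x dimension of the block a multiple of 32, so that no warp lanes are wasted.
- Printing the warp ID: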
    #include <stdio.h>
    #include <stdlib.h>

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    __global__ void print_details_of_warps()
    {
        // Global thread index for a 2D grid of 1D blocks.
        int gid = blockIdx.y * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x;
        // Warp index within the block (32 threads per warp).
        int warp_id = threadIdx.x / 32;
        // Linear block index within the grid.
        int gbid = blockIdx.y * gridDim.x + blockIdx.x;

        printf("tid : %d, bid.x : %d, bid.y : %d, gid : %d, warp_id : %d, gbid : %d \n",
               threadIdx.x, blockIdx.x, blockIdx.y, gid, warp_id, gbid);
    }

    int main(int argc, char** argv)
    {
        // 42 threads per block: two warps, the second only partly filled.
        dim3 block_size(42);
        dim3 grid_size(2, 2);

        print_details_of_warps<<<grid_size, block_size>>>();
        cudaDeviceSynchronize();

        cudaDeviceReset();
        return EXIT_SUCCESS;
    }
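The cost of a block size that is not a multiple of 32 can be worked out with a little arithmetic. The sketch below is only illustrative: it assumes a warp size of 32, and the names kWarpSize, warps_per_block, idle_lanes and print_warp_usage are made up for this example rather than taken from the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: 32 threads per warp, as on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Warps allocated for a 1D block of `block_size` threads (rounded up).
__host__ __device__ inline int warps_per_block(int block_size)
{
    return (block_size + kWarpSize - 1) / kWarpSize;
}

// Lanes in the last warp that are allocated but have no thread to run.
__host__ __device__ inline int idle_lanes(int block_size)
{
    return warps_per_block(block_size) * kWarpSize - block_size;
}

// Hypothetical helper: prints the warp usage of one block size.
void print_warp_usage(int block_size)
{
    printf("block_size : %d, warps : %d, idle lanes : %d\n",
           block_size, warps_per_block(block_size), idle_lanes(block_size));
}
```

For the block size of 42 used in the listing above, print_warp_usage(42) reports 2 warps and 22 idle lanes; for a block size of 64 it reports 2 warps and 0 idle lanes.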
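- Note that a conditional statement does not always cause warp divergence: divergence only occurs when threads of the same warp take different branches. With a block size of 64, two warps are allocated, so a condition that splits the threads along the warp boundary keeps each warp on a single path (this is a place where execution speed can be optimized).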
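The difference can be sketched with two kernels that write the same kind of values but branch on different conditions. This is a minimal illustration, assuming a block size that is a multiple of 32 (64 here); the kernel names and the host helper run_branch_demo are hypothetical. Note also that the compiler may turn branches this short into predicated instructions, so the measured difference can be smaller than the code suggests.

```cuda
#include <cuda_runtime.h>

// Branch on the warp index: all 32 threads of a warp take the same path,
// so no warp diverges.
__global__ void branch_per_warp(float *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = gid / 32;
    if (warp_id % 2 == 0)
        out[gid] = 100.0f;
    else
        out[gid] = 200.0f;
}

// Branch on thread parity: even and odd threads of the same warp take
// different paths, so every warp diverges.
__global__ void branch_per_thread(float *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid % 2 == 0)
        out[gid] = 100.0f;
    else
        out[gid] = 200.0f;
}

// Hypothetical host helper: launches one block of 64 threads (2 warps) and
// copies the 64 results into `host_out`. Returns true on success.
bool run_branch_demo(bool per_warp, float *host_out)
{
    const int n = 64;
    float *dev_out = nullptr;
    if (cudaMalloc(&dev_out, n * sizeof(float)) != cudaSuccess)
        return false;

    if (per_warp)
        branch_per_warp<<<1, n>>>(dev_out);
    else
        branch_per_thread<<<1, n>>>(dev_out);

    // cudaMemcpy blocks the host until the kernel above has finished.
    cudaError_t err = cudaMemcpy(host_out, dev_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess;
}
```

With branch_per_warp, threads 0 to 31 all write 100 and threads 32 to 63 all write 200; with branch_per_thread, the two values alternate inside every warp.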
- Computing code efficiency
- A grid corresponds to a kernel function
- For a kernel function:
  - a thread runs on a core
  - a block runs on an SM
  - a grid runs on a device
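The hardware side of this mapping can be inspected at run time. The following sketch queries device 0 through cudaGetDeviceProperties and prints the limits that the thread, block and grid levels run into; the function name print_device_limits is made up for this example, and device index 0 is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the limits of device 0 that matter for the
// thread/block/grid mapping. Returns false if the device cannot be queried.
bool print_device_limits()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    printf("device                : %s\n", prop.name);
    printf("SM count              : %d\n", prop.multiProcessorCount);
    printf("warp size             : %d\n", prop.warpSize);
    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM    : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max grid size (x,y,z) : %d, %d, %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return true;
}
```

Calling print_device_limits() from main is enough to see how many SMs are available to receive blocks and how many threads a single block may hold.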
- Data processed by the same block should sit as close together in memory as possible.
- cudaDeviceSynchronize() blocks the CPU until all previously issued CUDA calls have completed.
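- cudaMemcpy() is a blocking call: on the default stream it implicitly waits for previously launched work to finish before it copies, much as if cudaDeviceSynchronize() had been called first.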
- cudaMemcpyAsync() does not synchronize implicitly; it is used with streams to build pipelines.
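A stream-based version of the usual copy-in, compute, copy-out sequence might look like the sketch below. It assumes the host buffer is pinned memory (allocated with cudaMallocHost), since only then can the copies run truly asynchronously; the kernel add_one and the helper add_one_async are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>

// Toy kernel: adds 1 to every element.
__global__ void add_one(int *data, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        data[gid] += 1;
}

// Hypothetical helper: copies `n` ints to the device, increments them and
// copies them back, all queued on one stream. `pinned_host` is assumed to
// come from cudaMallocHost. Returns true if the stream finished without error.
bool add_one_async(int *pinned_host, int n)
{
    cudaStream_t stream;
    int *dev = nullptr;
    size_t bytes = n * sizeof(int);

    if (cudaStreamCreate(&stream) != cudaSuccess)
        return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return false;
    }

    // The three operations are queued in order on the stream; the host can
    // continue before they have finished.
    cudaMemcpyAsync(dev, pinned_host, bytes, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 127) / 128, 128, 0, stream>>>(dev, n);
    cudaMemcpyAsync(pinned_host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host only for this stream, right before the result is needed.
    cudaError_t err = cudaStreamSynchronize(stream);

    cudaFree(dev);
    cudaStreamDestroy(stream);
    return err == cudaSuccess;
}
```

A pipeline would split the data into chunks and give each chunk its own stream, so that the copy of one chunk can overlap with the kernel of another.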
- Optimizing matrix multiplication
- In memory a matrix is laid out linearly, with its rows stored one after another.
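With this row-major layout, element (row, col) of a matrix with width columns sits at offset row * width + col. The sketch below uses that formula in a naive matrix multiplication kernel, meant as a baseline to optimize rather than as an optimized version. It assumes square n x n matrices; the names row_major_index, matmul_naive and matmul are made up for this example, and the 16 x 16 block shape is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// Offset of element (row, col) in a row-major matrix with `width` columns.
__host__ __device__ inline int row_major_index(int row, int col, int width)
{
    return row * width + col;
}

// Naive C = A * B for square n x n row-major matrices,
// one thread per element of C.
__global__ void matmul_naive(const float *a, const float *b, float *c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n)
        return;

    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += a[row_major_index(row, k, n)] * b[row_major_index(k, col, n)];
    c[row_major_index(row, col, n)] = sum;
}

// Hypothetical host wrapper: a, b and c are host arrays of n * n floats.
// Returns true if the result was copied back successfully.
bool matmul(const float *a, const float *b, float *c, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess &&
              cudaMalloc(&db, bytes) == cudaSuccess &&
              cudaMalloc(&dc, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

        dim3 block(16, 16);
        dim3 grid((n + 15) / 16, (n + 15) / 16);
        matmul_naive<<<grid, block>>>(da, db, dc, n);

        // Blocking copy: waits for the kernel before reading the result.
        ok = cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree on a null pointer does nothing, so this is safe after a
    // failed allocation.
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return ok;
}
```

Because threadIdx.x is mapped to the column, neighboring threads of a warp read neighboring elements of B and write neighboring elements of C, which is the kind of data locality within a block that the notes above ask for.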