CUDA Programming

References

https://nyu-cds.github.io/python-gpu/02-cuda/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Thread execution: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture

The difference between concurrent and parallel execution: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution

https://blog.csdn.net/java_zero2one/article/details/51477791

GPU Architecture

GPUs emphasize system throughput rather than complex control logic, so they integrate far more compute units while supporting a smaller set of instruction types.

The core computational unit, which includes control, arithmetic, registers and typically some cache, is replicated some number of times and connected to memory via a network.

CUDA's highest-level abstractions:

  • a hierarchy of thread groups
  • shared memories
  • barrier synchronization
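A minimal sketch touching all three abstractions at once (the kernel name and tile size are assumptions for illustration, not from the original text): each thread block cooperatively reverses a tile of the input through shared memory.

```cuda
// Assumes a 1-D launch with exactly 256 threads per block.
__global__ void reverseTile(const float *in, float *out) {
    __shared__ float tile[256];          // shared memory, one copy per block
    int t = threadIdx.x;                 // position inside the thread block
    int g = blockIdx.x * blockDim.x + t; // position inside the grid (thread hierarchy)

    tile[t] = in[g];
    __syncthreads();                     // barrier: wait until the whole block has written

    out[g] = tile[blockDim.x - 1 - t];   // safe: every tile element exists by now
}

// Launched with a grid of thread blocks, e.g.:
// reverseTile<<<numBlocks, 256>>>(d_in, d_out);
```

Without the `__syncthreads()` barrier, a thread could read a tile slot that another thread has not written yet, which is exactly the hazard the barrier abstraction exists to prevent.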

Differences between CPU and GPU

Threading resources

Execution pipelines on host systems can support a limited number of concurrent threads. For example, servers that have two 32-core processors can run only 64 threads concurrently (or a small multiple of that if the CPUs support simultaneous multithreading). By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 2048 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C++ Programming Guide). On GPUs with 80 multiprocessors, this leads to more than 160,000 concurrently active threads.
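The arithmetic above (threads per SM × number of SMs) can be reproduced for whatever device is present; this host-side sketch uses the CUDA runtime's `cudaDeviceProp` fields:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 and compute its maximum number of concurrently
// active threads, mirroring the 80 SMs x 2048 threads example.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int concurrent = prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount;
    printf("%d SMs x %d threads/SM = %d concurrently active threads\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, concurrent);
    return 0;
}
```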

Threads

Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads (how should "wait" be understood here? => likely the warp is stalled waiting on data, i.e. on input/output), it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution (once a thread is resident on the GPU, it holds its dedicated resources until it finishes). In short, CPU cores are designed to minimize latency for a small number of threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.

How CUDA threads are mapped

How CUDA threads execute

Part 1: thread block assignment

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

  • Each kernel corresponds to one thread grid, composed of multiple thread blocks; at kernel launch, the blocks are distributed across different SMs.
  • Threads within the same block execute concurrently on one SM (parallel => different tasks running on different processing units; concurrent => multiple tasks sharing one processing unit, taking turns on it).
  • Multiple blocks may run on the same SM.
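The launch side of this can be sketched as follows (names and sizes are illustrative assumptions): one kernel launch creates a grid, and the runtime enumerates its blocks and hands them to SMs with spare capacity.

```cuda
// Standard element-wise kernel: each thread handles one array element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index in the grid
    if (i < n) c[i] = a[i] + b[i];
}

void launch(const float *a, const float *b, float *c, int n) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock; // ceil(n/256)
    // Any of these blocks may land on any SM, several may share one SM,
    // and the code must not assume which SM runs a block or in what order.
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
}
```

This independence between blocks is what makes the grid scalable: the same launch runs unchanged on a GPU with 2 SMs or 80.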

Part 2: thread block execution

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.

When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

1. The SM creates, schedules, and executes threads in units of warps, each consisting of 32 threads. Threads in the same warp start at the same program address but keep their own instruction address counter and register state; so although every thread runs the same program and runs independently, its actual execution path may differ.

2. The SM partitions each thread block into warps (32 threads each) and schedules the warps via a warp scheduler.

3. At the start, all threads in a warp execute the same instruction; during execution, a data-dependent branch may cause some threads' paths to diverge. The warp then executes every taken branch path in turn, temporarily disabling the threads that are not on the current path.
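The points above can be made concrete with a small sketch (kernel name assumed for illustration): within a block, a thread's warp is determined purely by its linear thread ID, and a data-dependent branch splits the warp's lanes.

```cuda
__global__ void divergenceDemo(int *out) {
    int tid    = threadIdx.x;  // linear ID within the block (1-D case)
    int warpId = tid / 32;     // consecutive IDs 0..31 form warp 0, 32..63 warp 1, ...
    int lane   = tid % 32;     // this thread's position within its warp

    // Data-dependent branch: lanes 0..15 take one path, lanes 16..31 the other.
    // The warp executes both paths in sequence, masking off inactive lanes,
    // so throughput for this warp is halved; other warps are unaffected.
    if (lane < 16)
        out[tid] = warpId * 2;
    else
        out[tid] = warpId * 2 + 1;
}
```

A branch on `warpId` instead of `lane` would cost nothing extra, since every thread of a given warp would then take the same path.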

Open question: in CNN-style workloads, how are the underlying GPU threads actually allocated and executed?

CUDA Memory Management
