We are leading readers through the English-language CUDA C Programming Guide. Today is day 27, and we begin covering performance. We hope that over the next 73 days you will learn CUDA from the original text while building a habit of reading English.
This post is about 985 words; estimated reading time: 15 minutes.
Note: many fundamental concepts have come up recently, so our annotations are unusually detailed. Please read them carefully.
At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor.
As described in Hardware Multithreading, a GPU multiprocessor relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects a warp that is ready to execute its next instruction, if any, and issues the instruction to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency, and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely "hidden". The number of instructions required to hide a latency of L clock cycles depends on the respective throughputs of these instructions (see Arithmetic Instructions for the throughputs of various arithmetic instructions). Assuming maximum throughput for all instructions, it is: 8L for devices of compute capability 3.x since a multiprocessor issues a pair of instructions per warp over one clock cycle for four warps at a time, as mentioned in Compute Capability 3.x.
For devices of compute capability 3.x, the eight instructions issued every cycle are four pairs for four different warps, each pair being for the same warp.
The most common reason a warp is not ready to execute its next instruction is that the instruction's input operands are not available yet.
If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In the case of a back-to-back register dependency (i.e., some input operand is written by the previous instruction), the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions for different warps during that time. Execution time varies depending on the instruction, but it is typically about 11 clock cycles for devices of compute capability 3.x, which translates to 44 warps for devices of compute capability 3.x (assuming that warps execute instructions with maximum throughput, otherwise fewer warps are needed). This is also assuming enough instruction-level parallelism so that schedulers are always able to issue pairs of instructions for each warp.
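A quick sanity check on where the 44 comes from (our own arithmetic, not spelled out in the guide): hiding L = 11 cycles at an issue rate of 8 instructions per cycle requires 8 × 11 = 88 instructions in flight, and since the schedulers issue them as pairs (2 per warp at a time), that takes 88 / 2 = 44 warps.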
If some input operand resides in off-chip memory, the latency is much higher: 200 to 400 clock cycles for devices of compute capability 3.x. The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e., arithmetic instructions most of the time) to the number of instructions with off-chip memory operands is low (this ratio is commonly called the arithmetic intensity of the program). For example, assume this ratio is 30 and that the latencies are 300 cycles on devices of compute capability 3.x. Then about 40 warps are required for devices of compute capability 3.x (with the same assumptions as in the previous paragraph).
Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence (Memory Fence Functions) or synchronization point (Synchronization Functions). A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block to complete execution of instructions prior to the synchronization point. Having multiple resident blocks per multiprocessor can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization points.
The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration of the call (Execution Configuration), the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading. Register and shared memory usage are reported by the compiler when compiling with the --ptxas-options=-v option.
The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory and the amount of dynamically allocated shared memory.
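As a concrete illustration, here is a minimal sketch (our own hypothetical kernel, not from the guide) that uses both kinds of shared memory; the block's total footprint is the sum of the two:

    // Kernel using both static and dynamic shared memory.
    __global__ void sharedDemo(float *out)
    {
        __shared__ float staticBuf[256];       // static: 256 * 4 = 1024 bytes, fixed at compile time
        extern __shared__ float dynamicBuf[];  // dynamic: size supplied at launch

        int i = threadIdx.x;
        staticBuf[i] = (float)i;
        dynamicBuf[i] = staticBuf[i] * 2.0f;
        __syncthreads();
        out[i] = dynamicBuf[i];
    }

    // Host-side launch: the third execution-configuration parameter is the
    // number of bytes of dynamic shared memory per block, so the block's
    // total is 1024 (static) + 1024 (dynamic) bytes.
    void launch(float *out)
    {
        sharedDemo<<<1, 256, 256 * sizeof(float)>>>(out);
    }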
The number of registers used by a kernel can have a significant impact on the number of resident warps. For example, for devices of compute capability 6.x, if a kernel uses 64 registers and each block has 512 threads and requires very little shared memory, then two blocks (i.e., 32 warps) can reside on the multiprocessor since they require 2x512x64 registers, which exactly matches the number of registers available on the multiprocessor. But as soon as the kernel uses one more register, only one block (i.e., 16 warps) can be resident since two blocks would require 2x512x65 registers, which are more registers than are available on the multiprocessor. Therefore, the compiler attempts to minimize register usage while keeping register spilling (see Device Memory Accesses) and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.
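For instance, here is a hedged sketch of the two ways to control register usage mentioned above (the kernel and the bound values are illustrative, not from the guide):

    // Hint to the compiler: this kernel is launched with at most 512 threads
    // per block and we want at least 2 blocks resident per SM, which bounds
    // how many registers the compiler may use per thread.
    __global__ void __launch_bounds__(512, 2)
    scaleKernel(float *data, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= factor;
    }

    // Alternatively, cap register usage for the whole file on the command line,
    // and ask ptxas to report per-kernel register/shared memory usage:
    //   nvcc -maxrregcount=64 --ptxas-options=-v kernel.cu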
Each double variable and each long long variable uses two registers.
The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended. Applications can also parameterize execution configurations based on register file size and shared memory size, which depends on the compute capability of the device, as well as on the number of multiprocessors and memory bandwidth of the device, all of which can be queried using the runtime (see reference manual).
The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible.
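Below is a small sketch (illustrative, not from the guide) of querying those device limits at runtime and rounding a candidate block size down to a multiple of the warp size:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("SMs: %d, registers/block: %d, shared mem/block: %zu bytes\n",
               prop.multiProcessorCount, prop.regsPerBlock,
               (size_t)prop.sharedMemPerBlock);

        // Round a candidate block size down to a multiple of the warp size
        // (32 on all current devices) so no warp is under-populated.
        int candidate = 200;
        int blockSize = (candidate / prop.warpSize) * prop.warpSize;  // -> 192
        printf("chosen block size: %d\n", blockSize);
        return 0;
    }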
Notes and commentary on this installment:
This section explains the main factor in GPU performance: how to extract as much performance as possible from the SMs. It introduces latency, TLP, and ILP, along with several standalone facts about how warps behave on an SM, and it illustrates them with one particular compute capability, 3.x.

In a GPU system (a GPU server, say), every component matters, but the GPU usually matters most; and within the GPU every unit matters, but this section focuses on the SM (besides the SMs, the other important units in a GPU are the DMA engines (copy engines), the L2 cache, the memory controllers, and the device memory itself).

Note that compute capability 3.x, also known as Kepler, is a very unusual generation. It is completely different from Fermi before it and from Maxwell/Pascal and later generations after it. So feel free to read or skip the 3.x material in this section: the section contains some errors, and Kepler is so unusual that, unlike lessons learned on any other compute capability, most of what you learn on it does not transfer; that knowledge is useful only for that one generation. For that reason I will reorganize this section a little and highlight the points about 3.x that deserve attention. Kepler was, in fact, a failed generation.

We have said before that you should parallelize across as many components as possible, even mentioning SSDs, host memory, and PCIe transfers. Today we focus on the GPU itself; this is, after all, the CUDA manual, not a general systems-optimization guide.

As we said many days ago, GPUs gain their performance from massive numbers of threads. Back then we used the GTX 1080 as an example: 20 SMs, each able to hold up to 2048 resident threads, for over 40,000 threads genuinely executing at once. All of these threads run on the SMs, and to fully exploit an SM, this section makes the following points: (1) use lots of threads.
As described in Hardware Multithreading, a GPU multiprocessor relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps.
As the earlier Hardware Multithreading section said, a GPU SM relies on TLP (thread-level parallelism, i.e., many threads executing together) to keep its various functional units busy, so the utilization of those units is directly tied to the number of resident threads/warps. This framing is a useful reference point: every cycle, each of the SM's one or more schedulers picks, from the resident warps, one or more that are in a ready state and issues one or two instructions from each to the corresponding functional units inside the SM. The more resident warps there are, the better the odds of finding instructions that can be issued and executed, and the better the functional units perform. That is why "utilization is directly linked to the number of resident warps." This brute-force piling-up of threads and warps is the primary way today's GPUs execute in parallel (the section later describes a second way). Why must so many threads be crammed onto one SM, when a CPU core (recall our earlier observation that, roughly speaking, a CPU core and a GPU SM sit at the same level of the hierarchy) gets by with one thread, or at most two? Because the GPU is designed for maximum throughput, while the CPU is designed for minimum latency.
This section brings up the notion of back-to-back issue. If an instruction stream (think of one warp, or one CPU thread) contains two instructions where the second immediately needs the result of the first, the CPU is designed to expose that result as fast as possible, whatever the cost in overall performance or power, so that the first instruction's result can be fed to the second as soon as possible. Hence a CPU can often issue back to back: two mutually dependent instructions can be issued and executed in immediate succession. The GPU refuses to pay that price. It has far too many SMs and SPs; doing what the CPU does would push power consumption beyond anything imaginable, so the GPU must instead live with a fairly long latency.
The text notes that many Kepler instructions have a latency of about 11 cycles. That means once an instruction in a warp has executed, anything that wants its result must wait 11 cycles. The GPU saves the power that immediately exposing results would cost, but it cannot afford to lose overall performance (throughput/utilization), so what does it do? This is where the GPU's signature technique, TLP, comes in. When one thread must stall, say for those 11 cycles, other threads step in during that window and execute their own instructions; when the second thread (more precisely, the second warp, since a warp moves as a unit) stalls too, a third warp steps in, and so on. By the time a pile of threads/warps has stalled, the first one is ready again and can resume execution. By constantly switching threads this way, the GPU does away with the transistors a CPU must dedicate to exposing results quickly, saving power without sacrificing overall performance, because there are always plenty of warps on hand: the moment one cannot proceed, the next one runs. That is TLP. It also explains why a GPU only performs well once you put a great many threads on it, and why a small handful of threads cannot exploit its performance.
(2) The section also presents something less commonly discussed, called ILP (instruction-level parallelism). It resembles the TLP thread-switching above in that it also hides intervening latency, but the details differ. In the scenario from point (1), a warp stalls waiting on the result of some instruction. Besides switching sideways to other threads, is there another way out? Yes: simply avoid needing that result, and execution can continue.

For example, suppose a kernel accumulates 10 numbers in a loop (assume all 10 are floats already in registers; ignore everything else, such as loads from device memory). One way to write it:

    s = s + a0;
    s = s + a1;
    s = s + a2;
    ...
    s = s + a9;

Every line depends on the previous line's running sum, so until the previous result arrives (e.g., after an 11-cycle wait) the warp is stuck and can only switch sideways to other warps (TLP). But if I write (see the runnable sketch after this note):

    sa = sa + a0;
    sb = sb + a1;
    sa = sa + a2;
    sb = sb + a3;
    ...
    sa = sa + a8;
    sb = sb + a9;
    s = sa + sb;

using sa and sb as two intermediate sums for two addition chains, I have effectively halved the latency: the even lines do not need the odd lines' results and can proceed immediately. (At the end I need one extra statement, s = sa + sb, to combine the two partial results.) Where the warp previously needed 11 other warps to switch to, it now effectively needs only 5-6 (11 / 2 = 5.5). Do you see why? Each line is now followed by one unrelated computation that does not need its result right away, so the resident-warp requirement is cut in half: a kernel written this way needs only 5-6 companion warps per warp on the SM. In other words, whatever latency-hiding effect previously required, say, N resident warps on the SM can now be had with N/2 warps, at the same performance. This is why we said earlier that you do not always need to stack warps to the limit; with a little technique, 100% occupancy is unnecessary. If a given performance level in this example used to take 78% occupancy, you can now get identical performance at 39%.

Reading this, you may ask: why go to the trouble of writing two dependency/addition chains? Writing it the straightforward way is easiest to understand, and the hardware's TLP handles the rest, so why bother with ILP? The answer is that fewer threads/warps means more resources per thread/warp. Some code (matrix multiplication, for example) needs a great many registers per thread/warp; launch too many warps and each one gets very few registers, so even though occupancy goes up, performance suffers because each warp is starved of resources. The end of this section touches on this point briefly.

Kepler in particular has so many execution units per SM that without ILP, TLP switching alone can hardly feed them. One Kepler SM has 192 SPs in groups of 32, i.e., 6 groups, plus a group of LSUs (load/store units) and a shared memory unit, an outrageous pile of execution units. Fermi before it had only 1 or 2 groups of SPs; Maxwell, and Pascal 6.1, have only 128 SPs in 4 groups; 6.0 and Volta cards have just 64 SPs in 2 groups. So Kepler's 8 groups of execution units are genuinely hard to feed; even at maximum TLP (100% occupancy) they still go hungry, and at that point ILP is mandatory. ILP is parallelism between consecutive independent instructions within a single thread. So ILP was a painful rite of passage for everyone who had to use Kepler, nowhere near as optional as this section's light touch suggests: on Kepler, not using ILP simply did not work; it was not merely an auxiliary to the TLP mainstay as this section implies. Fortunately Kepler is now history. In most tests and applications, a 192-SP Kepler SM performs about the same as a 128-SP Maxwell/Pascal SM at the same clock; NV effectively wasted 64 SPs, because you could not feed them effectively.
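To make the two-accumulator idea concrete, here is a runnable sketch of it as an actual kernel (the kernel name and data layout are our own invention, not from the guide):

    // Each thread sums 10 consecutive floats. The sa chain and the sb chain
    // are independent, so the hardware can overlap them in the pipeline:
    // that overlap within one thread is ILP.
    __global__ void sum10_ilp(const float *a, float *out)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int base = tid * 10;

        float sa = 0.0f, sb = 0.0f;
        sa += a[base + 0];  sb += a[base + 1];
        sa += a[base + 2];  sb += a[base + 3];
        sa += a[base + 4];  sb += a[base + 5];
        sa += a[base + 6];  sb += a[base + 7];
        sa += a[base + 8];  sb += a[base + 9];

        out[tid] = sa + sb;  // combine the two partial sums at the end
    }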
NV naturally did not dwell on Kepler's shortcomings, but after Maxwell came out, NV published a Maxwell tuning guide (equally applicable to Pascal) that made a point of praising Maxwell on exactly these fronts: needing ILP and mandatory dual issue (Kepler has 4 schedulers and at least 8 groups of execution units) just to utilize the device is now history. Maxwell lowered latencies, so most instructions take 6 cycles or fewer (versus Kepler's 11), roughly halving both the need for ILP and the amount of TLP coverage required (where 100% occupancy used to be needed, about 50% is generally enough on Maxwell+), and the SPs can be fully utilized with single issue alone. That benefits application code with single long dependency chains. So much of this section is unnecessary on Maxwell and later, although we still sometimes need to consider it, especially when chasing peak performance. The rest of the section is straightforward and worth a read.

Maxwell+ is also a fairly balanced compromise between performance and power. If you do not care about power, consider GCN: most GCN instructions can issue back to back, and once its built-in 4-cycle natural covering is taken into account, most instructions behave as if they had 0-cycle latency, which is outrageous. GCN can run the single-chain accumulation above directly, advancing one instruction per issue, even if there is only 1 warp (wavefront) on the SP array. (Maxwell needs at least 6 warps per SP group, or equivalently 6 unrelated intervening instructions' worth of ILP.)

Readers who want to go further should look at Scott Gray's wiki and code on GitHub, which show how, on Maxwell+, you can reach 99.5%+ of theoretical peak performance at very low occupancy/TLP (a bit over 10% occupancy).
The link is https://github.com/NervanaSystems/maxas (maxas by Scott Gray). That code is the art of ILP and well worth studying. Open it and take a look: a trailing 1 in the first column (the control codes) means the next instruction can issue on the very next cycle without any waiting. Through ILP, this code eliminates Maxwell's 6-cycle latency. It has been tested on both Maxwell and Pascal (6.1) and works without modification, which is also why we often say Pascal is really just a process shrink of Maxwell (28nm to 16nm): the vast majority of microarchitecture test code produces exactly the same results on both, suggesting they are essentially one and the same design.
On the subject of matrix multiplication (sgemm): interested readers really should study this code, since sgemm is a hurdle no CUDA user can avoid. Note that its performance beat the cuBLAS of its day, and NV later absorbed the code into cuBLAS. (Scott eventually moved to Intel, but that is another story.) And on Kepler, as discussed above, ILP was absolutely essential.
If anything is unclear, please leave a comment below this post,
or start a thread on our technical forum, bbs.gpuworld.cn.