今天继续讲解异步并发执行中的Streams:
Two commands from different streams cannot run concurrently【同时地】 if any one of the following operations is issued in-between them by the host thread
【 两个不同流中的命令不能同时执行,如果host线程在这两个命令中间发布了下面任意操作】:
· a page-locked host memory allocation,【分配page-locked内存】
· a device memory allocation,【分配显存】
· a device memory set,【指普通的memset()函数的cuda版本: cudaMemset,一般用来初始化或者显存清零之类的用途】
· a memory copy between two addresses to the same device memory,【从两个其他地址到相同显存地址的复制操作】
· any CUDA command to the NULL stream,【任何对默认流发布的命令】
· a switch between the L1/shared memory configurations described in Compute Capability 3.x and Compute Capability 7.x.【这计算能力3.X和7.x上进行L1 / shared memory的大小切换配置】
For devices that support concurrent kernel execution【内核并发执行】 and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
· Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
· Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
· All independent operations should be issued before dependent operations,
· Synchronization of any kind should be delayed as long as possible.
The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution , concurrent kernel execution , and/or concurrent data transfers.
For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed. If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].
On devices that do support concurrent data transfers, the two streams of the code sample of Creation and Destruction do overlap: The memory copy from host to device issued to stream[1] overlaps with the memory copy from device to host issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the device supports overlap of data transfer and kernel execution). However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. If the code is rewritten as above, the kernel executions overlap (assuming the device supports concurrent kernel execution) since the second kernel launch is issued to stream[1] before the memory copy from device to host is issued to stream[0]. In that case however, the memory copy from device to host issued to stream[0] only overlaps with the last thread blocks of the kernel launch issued to stream[1] as per Implicit Synchronization, which can represent only a small portion of the total execution time of the kernel.
本文备注/经验分享:
Two commands from different streams cannot run concurrently【同时地】 if any one of the following operations is issued in-between them by the host thread:
For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete: 字面意思是:对于支持并发kernel执行的,同时计算能力小于等于3.0的设备(即Fermi和初代Kepler---请注意这CUDA 9个时候已经放弃了Fermi支持了,这里应该改成,仅对于初代Kepler(3.0)才好),需要查询或者等待(依赖)某流中的之前的某kernel完成状态的任何操作:
(1)该操作必须等待之前的CUDA Context中的所有流中的所有操作都开始执行后,才能开始执行;
(2)该操作将阻止之后的当前Context中的所有流中的所有操作执行,直到该操作如前所说的,所依赖的某kernel完成执行,或者查询结果返回(操作未完成)。
但是实际中,老卡上的第二点是不对的。主要是老卡只有一个物理上的Kernel Execution Queue, 和2个DMA Queues(Device -> Host 和 Host -> Device),导致了很多情况下原本能并发执行的操作不能并发执行。但是什么操作是所谓的“需要查询或者等待(依赖)某流中的之前的某kernel完成状态”的操作?
显然常见的只有Async结尾的cudaMemcpy*()函数,
以及,应当附加上cudaStreamQuery()
广义的说还有cudaMemcpy*()无async的同步版本和各种分配函数之类的,但这种就包含的广了。
和这里的这段英文说的不同的是,根据实际经验,在老卡(Fermi和计算能力3.0)上使用cudaStreamQuery,非但不像手册这段说的,会可能阻止多种操作的并发性,反而可能会增加老卡上的并发执行效果。(从老卡+Profiler的时间轴上能很容易看到这点)。 好在从计算能力3.5开始(例如K40?),Maxwell, Pascal这些,都具有Hyper-Q了。不存在这些种种限制了。用户也不用学习各种命令发布技巧了。新点的卡任何一种(无论深度,广度,还是用户自己随心所欲的任何一种发布方式),只要逻辑上能并行的,资源也允许的,卡就能给你并行,非常给力。
The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution , concurrent kernel execution , and/or concurrent data transfers. overlap指的是 执行的操作在时间上重叠(同时执行),比如这个图:
这个重叠比较多。 一共启动了6个kernel
两个流之间的执行重叠程度,取决于每个流中的命令发布顺序(特别对于无Hyper-Q的卡,这个很重要。例如手册说过的深度优先和广度优先这两种顺序),取决于是否设备支持数据传输和kernel执行重叠,取决于(设备是否支持)并发kernel执行,和/或(取决于)并发数据传输。(计算能力5.0(包含)一下的双向传输需要专业卡,计算能力5.2(包含)家用卡也支持数据双向传输(双Copy Engines)。双向原本是专业卡的特性,现在都开放),类似的一些TCC才能用远程桌面或者服务中使用CUDA,现在家用卡也可以了。很多以前的特性需要专业卡,现在都开放了。 类似的,以前NVENC需要买license才能用。现在NV家用卡开放编码能力,限两路同时编码。目前NV还有的常见限制是专业卡的double,ECC,编码以及虚拟化。(Titan系列算是准专业卡,连Jetson Tx2也有ECC哟)
However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. 然后,因计算能力3.0或者更低的设备上的隐式同步问题,(多个)kernel之间的执行可能不能重叠,因为第二个流stream[1]中的kernel启动命令,是在第一个流中stream[0]中的D->H传输命令发布以后,这样它将阻塞,直到第一个流stream[0]中的第一个kernel执行完成以后(才能开始执行)。老卡有很多限制的。发布命令给多个流,需要注意顺序。多种问题。3.5+的卡无任何问题,只要是多流,逻辑上应该并发的,资源允许的情况下就会并发。而不管一些隐晦的限制条件。不过现在的GPU卡都至少5.0以上了
有不明白的地方,请在本文后留言
或者在我们的技术论坛bbs.gpuworld.cn上发帖