在加速应用程序的过程中,我有一个非常简单的内核,它进行类型转换,如下所示:
__global__ void UChar2FloatKernel(float *out, unsigned char *in, int nElem){
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
if(i<nElem)
out[i] = (float) in[i];
}
全局内存访问是合并的,据我理解,使用共享内存也是没有好处的,因为没有对同一内存进行多次读取。有没有人知道是否有任何优化可以执行,以加快这个内核。输入和输出数据已经在设备上,因此不需要主机到设备内存副本。
发布于 2014-01-24 19:30:37
在这样的代码上可以执行的最大的优化是使用驻留线程并增加每个线程执行的事务数。虽然CUDA块调度模型非常轻量级,但它并不是免费的,而启动许多包含只执行单个内存负载和单内存存储的线程的块将产生大量的块调度开销。因此,只要启动尽可能多的块,就可以“填充”GPU的所有SM,并让每个线程做更多的工作。
第二个明显的优化是切换到128个字节的内存事务以处理负载,这将使您获得实际的带宽利用率增益。在费米或开普勒GPU上,这不会像第一代和第二代硬件那样提供更大的性能提升。
将此作为一个简单的基准:
__global__
void UChar2FloatKernel(float *out, unsigned char *in, int nElem)
{
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
if(i<nElem)
out[i] = (float) in[i];
}
__global__
void UChar2FloatKernel2(float *out,
const unsigned char *in,
int nElem)
{
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
for(; i<nElem; i+=gridDim.x*blockDim.x) {
out[i] = (float) in[i];
}
}
__global__
void UChar2FloatKernel3(float4 *out,
const uchar4 *in,
int nElem)
{
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
for(; i<nElem; i+=gridDim.x*blockDim.x) {
uchar4 ival = in[i]; // 32 bit load
float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
out[i] = oval; // 128 bit store
}
}
int main(void)
{
const int n = 2 << 20;
unsigned char *a = new unsigned char[n];
for(int i=0; i<n; i++) {
a[i] = i%255;
}
unsigned char *a_;
cudaMalloc((void **)&a_, sizeof(unsigned char) * size_t(n));
float *b_;
cudaMalloc((void **)&b_, sizeof(float) * size_t(n));
cudaMemset(b_, 0, sizeof(float) * size_t(n)); // warmup
for(int i=0; i<5; i++)
{
dim3 blocksize(512);
dim3 griddize(n/512);
UChar2FloatKernel<<<griddize, blocksize>>>(b_, a_, n);
}
for(int i=0; i<5; i++)
{
dim3 blocksize(512);
dim3 griddize(8); // 4 blocks per SM
UChar2FloatKernel2<<<griddize, blocksize>>>(b_, a_, n);
}
for(int i=0; i<5; i++)
{
dim3 blocksize(512);
dim3 griddize(8); // 4 blocks per SM
UChar2FloatKernel3<<<griddize, blocksize>>>((float4*)b_, (uchar4*)a_, n/4);
}
cudaDeviceReset();
return 0;
}
给我一个小费米装置:
>nvcc -m32 -Xptxas="-v" -arch=sm_21 cast.cu
cast.cu
tmpxft_000014c4_00000000-5_cast.cudafe1.gpu
tmpxft_000014c4_00000000-10_cast.cudafe2.gpu
cast.cu
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z18UChar2FloatKernel2PfPKhi' for 'sm_2
1'
ptxas : info : Function properties for _Z18UChar2FloatKernel2PfPKhi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 5 registers, 44 bytes cmem[0]
ptxas : info : Compiling entry function '_Z18UChar2FloatKernel3P6float4PK6uchar4
i' for 'sm_21'
ptxas : info : Function properties for _Z18UChar2FloatKernel3P6float4PK6uchar4i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 8 registers, 44 bytes cmem[0]
ptxas : info : Compiling entry function '_Z17UChar2FloatKernelPfPhi' for 'sm_21'
ptxas : info : Function properties for _Z17UChar2FloatKernelPfPhi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 3 registers, 44 bytes cmem[0]
tmpxft_000014c4_00000000-5_cast.cudafe1.cpp
tmpxft_000014c4_00000000-15_cast.ii
>nvprof a.exe
======== NVPROF is profiling a.exe...
======== Command: a.exe
======== Profiling result:
Time(%) Time Calls Avg Min Max Name
40.20 6.61ms 5 1.32ms 1.32ms 1.32ms UChar2FloatKernel(float*, unsigned char*, int)
29.43 4.84ms 5 968.32us 966.53us 969.46us UChar2FloatKernel2(float*, unsigned char const *, int)
26.35 4.33ms 5 867.00us 866.26us 868.10us UChar2FloatKernel3(float4*, uchar4 const *, int)
4.02 661.34us 1 661.34us 661.34us 661.34us [CUDA memset]
在后两个内核中,与4096块相比,只使用8个块可以大大提高速度,这证实了每线程多个工作项是提高这种内存绑定、低指令计数内核性能的最佳方法。
发布于 2014-01-25 10:46:57
这里是一个cpu版本的函数和4个gpu内核。3内核来自@talonmies答案,我添加了kernel2,它只使用向量数据类型。
// cpu version for comparison
void UChar2Float(unsigned char *a, float *b, const int n){
for(int i=0;i<n;i++)
b[i] = (float)a[i];
}
__global__ void UChar2FloatKernel1(float *out, const unsigned char *in, int nElem){
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
if(i<nElem) out[i] = (float) in[i];
}
__global__ void UChar2FloatKernel2(float4 *out, const uchar4 *in, int nElem){
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
if(i<nElem) {
uchar4 ival = in[i]; // 32 bit load
float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
out[i] = oval; // 128 bit store
}
}
__global__ void UChar2FloatKernel3(float *out, const unsigned char *in, int nElem) {
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
for(; i<nElem; i+=gridDim.x*blockDim.x)
{
out[i] = (float) in[i];
}
}
__global__ void UChar2FloatKernel4(float4 *out, const uchar4 *in, int nElem) {
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
for(; i<nElem; i+=gridDim.x*blockDim.x)
{
uchar4 ival = in[i]; // 32 bit load
float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
out[i] = oval; // 128 bit store
}
}
在我的Geforce GT 640上,下面是计时结果:
simpleKernel (cpu): 0.101463 seconds.
simpleKernel 1 (gpu): 0.007845 seconds.
simpleKernel 2 (gpu): 0.004914 seconds.
simpleKernel 3 (gpu): 0.005461 seconds.
simpleKernel 4 (gpu): 0.005461 seconds.
因此,我们可以看到,kernel2,它只使用向量类型,是赢家。我已经对(32 * 1024 * 768)元素进行了这些测试。nvprof输出也如下所示:
Time(%) Time Calls Avg Min Max Name
91.68% 442.45ms 4 110.61ms 107.43ms 119.51ms [CUDA memcpy DtoH]
3.76% 18.125ms 1 18.125ms 18.125ms 18.125ms [CUDA memcpy HtoD]
1.43% 6.8959ms 1 6.8959ms 6.8959ms 6.8959ms UChar2FloatKernel1(float*, unsigned char const *, int)
1.10% 5.3315ms 1 5.3315ms 5.3315ms 5.3315ms UChar2FloatKernel3(float*, unsigned char const *, int)
1.04% 5.0184ms 1 5.0184ms 5.0184ms 5.0184ms UChar2FloatKernel4(float4*, uchar4 const *, int)
0.99% 4.7816ms 1 4.7816ms 4.7816ms 4.7816ms UChar2FloatKernel2(float4*, uchar4 const *, int)
发布于 2014-01-24 17:09:26
您可以使用const __restrict__
限定符来修饰输入数组,该限定符通知编译器数据是只读的,而不是任何其他指针的别名。这样,编译器将检测到访问是统一的,并可以使用一个只读缓存(常量缓存或在计算能力>=3.5上称为纹理缓存的只读数据缓存)来优化访问。
还可以使用__restrict__
限定符修饰输出数组,以建议编译器进行其他优化。
最后,DarkZeros的建议值得遵循。
https://stackoverflow.com/questions/21332040
复制相似问题