CUDA, more threads for same work = Longer run time despite better occupancy, Why?
- by zenna
I encountered a strange problem where increasing my occupancy by increasing the number of threads reduced performance.
I created the following program to illustrate the problem:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
__global__ void less_threads(float * d_out) {
int num_inliers;
for (int j=0;j<800;++j) {
//Do 12 computations
num_inliers += threadIdx.x*1;
num_inliers += threadIdx.x*2;
num_inliers += threadIdx.x*3;
num_inliers += threadIdx.x*4;
num_inliers += threadIdx.x*5;
num_inliers += threadIdx.x*6;
num_inliers += threadIdx.x*7;
num_inliers += threadIdx.x*8;
num_inliers += threadIdx.x*9;
num_inliers += threadIdx.x*10;
num_inliers += threadIdx.x*11;
num_inliers += threadIdx.x*12;
}
if (threadIdx.x == -1)
d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers;
}
__global__ void more_threads(float *d_out) {
int num_inliers;
for (int j=0;j<800;++j) {
// Do 4 computations
num_inliers += threadIdx.x*1;
num_inliers += threadIdx.x*2;
num_inliers += threadIdx.x*3;
num_inliers += threadIdx.x*4;
}
if (threadIdx.x == -1)
d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers;
}
int main(int argc, char* argv[]) {
float *d_out = NULL;
cudaMalloc((void**)&d_out,sizeof(float)*25000);
more_threads<<<780,128>>>(d_out);
less_threads<<<780,32>>>(d_out);
return 0;
}
Note both kernels should do the same amount of work in total, the (if threadIdx.x == -1 is a trick to stop the compiler optimising everything out and leaving an empty kernel). The work should be the same as more_threads is using 4 times as many threads but with each thread doing 4 times less work.
Significant results form the profiler results are as followsL:
more_threads: GPU runtime = 1474 us,reg per thread = 6,occupancy=1,branch=83746,divergent_branch = 26,instructions = 584065,gst request=1084552
less_threads: GPU runtime = 921 us,reg per thread = 14,occupancy=0.25,branch=20956,divergent_branch = 26,instructions = 312663,gst request=677381
As I said previously, the run time of the kernel using more threads is longer, this could be due to the increased number of instructions.
Why are there more instructions?
Why is there any branching, let alone divergent branching, considering there is no conditional code?
Why are there any gst requests when there is no global memory access?
What is going on here!
Thanks