CUDA, more threads for same work = Longer run time despite better occupancy, Why?

Posted by zenna on Stack Overflow See other posts from Stack Overflow or by zenna
Published on 2010-03-15T18:11:20Z Indexed on 2010/03/16 1:39 UTC
Read the original article Hit count: 331

Filed under:
|
|
|

I encountered a strange problem where increasing my occupancy by increasing the number of threads reduced performance.

I created the following program to illustrate the problem:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void less_threads(float * d_out) {
    int num_inliers;
    for (int j=0;j<800;++j) {
        //Do 12 computations
        num_inliers += threadIdx.x*1;
        num_inliers += threadIdx.x*2;
        num_inliers += threadIdx.x*3;
        num_inliers += threadIdx.x*4;
        num_inliers += threadIdx.x*5;
        num_inliers += threadIdx.x*6;
        num_inliers += threadIdx.x*7;
        num_inliers += threadIdx.x*8;
        num_inliers += threadIdx.x*9;
        num_inliers += threadIdx.x*10;
        num_inliers += threadIdx.x*11;
        num_inliers += threadIdx.x*12;
    }

    if (threadIdx.x == -1)
        d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers;
}

__global__ void more_threads(float *d_out) {
    int num_inliers;
    for (int j=0;j<800;++j) {
        // Do 4 computations
        num_inliers += threadIdx.x*1;
        num_inliers += threadIdx.x*2;
        num_inliers += threadIdx.x*3;
        num_inliers += threadIdx.x*4;
    }
    if (threadIdx.x == -1)
        d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers;
}

int main(int argc, char* argv[]) {
    float *d_out = NULL;
    cudaMalloc((void**)&d_out,sizeof(float)*25000);
    more_threads<<<780,128>>>(d_out);
    less_threads<<<780,32>>>(d_out);

    return 0;
}

Note both kernels should do the same amount of work in total, the (if threadIdx.x == -1 is a trick to stop the compiler optimising everything out and leaving an empty kernel). The work should be the same as more_threads is using 4 times as many threads but with each thread doing 4 times less work.

Significant results form the profiler results are as followsL:

more_threads: GPU runtime = 1474 us,reg per thread = 6,occupancy=1,branch=83746,divergent_branch = 26,instructions = 584065,gst request=1084552

less_threads: GPU runtime = 921 us,reg per thread = 14,occupancy=0.25,branch=20956,divergent_branch = 26,instructions = 312663,gst request=677381

As I said previously, the run time of the kernel using more threads is longer, this could be due to the increased number of instructions.

Why are there more instructions?

Why is there any branching, let alone divergent branching, considering there is no conditional code?

Why are there any gst requests when there is no global memory access?

What is going on here!

Thanks

© Stack Overflow or respective owner

Related posts about cuda

Related posts about Performance