CUDA - multiple kernels to compute a single value

Posted by Roger on Stack Overflow
Published on 2011-03-13T23:09:40Z
Hey, I'm trying to write a kernel that essentially does the following C code:

 float sum = 0.0;
 for(int i = 0; i < N; i++){
   sum += valueArray[i]*valueArray[i];
 }
 sum = sum / N;

At the moment I have this inside my kernel, but it is not giving correct values.

 int i0 = blockIdx.x * blockDim.x + threadIdx.x;

 for (int i = i0; i < N; i += blockDim.x * gridDim.x) {
     *d_sum += d_valueArray[i] * d_valueArray[i];
 }

 *d_sum = __fdividef(*d_sum, N);

The code used to call the kernel is

  kernelName<<<64,128>>>(N, d_valueArray, d_sum);
  cudaMemcpy(&sum, d_sum, sizeof(float) , cudaMemcpyDeviceToHost);

I think that each thread is calculating a partial sum, but the final divide statement is not taking into account the accumulated value from all of the threads. Is every thread producing its own final value for d_sum?

Does anyone know how I could go about doing this in an efficient way? Maybe by using shared memory between threads? I'm very new to GPU programming. Cheers
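For reference, here is a minimal sketch of the shared-memory reduction idea mentioned above. The kernel name `sumSquares` and the partial-sum buffer `d_partial` are hypothetical, and it assumes `blockDim.x` is a power of two; each block writes one partial sum, which the host (or a second kernel) then combines and divides by N:

```cuda
// Sketch: each thread accumulates a private sum over a grid-stride loop,
// then each block reduces those sums in shared memory and writes one
// partial result. d_partial must hold gridDim.x floats.
__global__ void sumSquares(int N, const float *d_valueArray, float *d_partial)
{
    extern __shared__ float s[];          // size = blockDim.x floats
    int tid = threadIdx.x;
    int i0  = blockIdx.x * blockDim.x + threadIdx.x;

    // Private accumulation: no races, no global atomics.
    float local = 0.0f;
    for (int i = i0; i < N; i += blockDim.x * gridDim.x)
        local += d_valueArray[i] * d_valueArray[i];
    s[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        d_partial[blockIdx.x] = s[0];     // one value per block
}
```

The launch would pass the shared-memory size as the third configuration parameter, e.g. `sumSquares<<<64, 128, 128 * sizeof(float)>>>(N, d_valueArray, d_partial);`, after which the host copies back the 64 partial sums, adds them, and divides by N.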

© Stack Overflow or respective owner
