CUDA - multiple kernels to compute a single value
- by Roger
Hey, I'm trying to write a kernel that essentially does the following in C:
float sum = 0.0;
for(int i = 0; i < N; i++){
sum += valueArray[i]*valueArray[i];
}
sum = sum / N;
At the moment I have this inside my kernel, but it is not giving correct values:
int i0 = blockIdx.x * blockDim.x + threadIdx.x;
for(int i=i0; i<N; i += blockDim.x*gridDim.x){
*d_sum += d_valueArray[i]*d_valueArray[i];
}
*d_sum= __fdividef(*d_sum, N);
The code used to call the kernel is:
kernelName<<<64,128>>>(N, d_valueArray, d_sum);
cudaMemcpy(&sum, d_sum, sizeof(float) , cudaMemcpyDeviceToHost);
I think that each thread is calculating a partial sum, but the final divide statement is not taking into account the accumulated value from all of the threads. Is every thread producing its own final value for d_sum?
Does anyone know how I could go about doing this in an efficient way? Maybe using shared memory between threads? I'm very new to GPU programming. Cheers
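For reference, here is a minimal sketch of one common way to approach this. It is not necessarily the best fit for every case. In the kernel body above, every thread reads and writes *d_sum with no synchronization, and every thread also performs the divide, so the result depends on timing. The sketch below has each thread accumulate into a private variable and each block combine its threads' values in shared memory. One thread per block then adds the block total to *d_sum with atomicAdd. The divide is done once on the host after the copy back. The names sumOfSquares and meanOfSquares are hypothetical. The sketch assumes N > 0, a block size that is a power of two, a device that supports atomicAdd on float (compute capability 2.0 or later), and it omits error checking for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel name. Accumulates the sum of squares into *d_sum,
// which must be zeroed before launch.
__global__ void sumOfSquares(int n, const float *d_valueArray, float *d_sum)
{
    extern __shared__ float partial[];  // one slot per thread in this block

    // Grid-stride loop: each thread accumulates into a private variable,
    // so no two threads write to the same location here.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        local += d_valueArray[i] * d_valueArray[i];
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // One atomic update per block combines the block totals safely.
    if (threadIdx.x == 0) {
        atomicAdd(d_sum, partial[0]);
    }
}

// Hypothetical host helper: returns the mean of the squared values.
// Assumes values is non-empty; CUDA error checking is omitted.
float meanOfSquares(const std::vector<float> &values)
{
    const int n = static_cast<int>(values.size());
    float *d_valueArray = nullptr;
    float *d_sum = nullptr;
    cudaMalloc(&d_valueArray, n * sizeof(float));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_valueArray, values.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(float));  // accumulator must start at zero

    const int blocks = 64;
    const int threads = 128;  // power of two, as the reduction assumes
    sumOfSquares<<<blocks, threads, threads * sizeof(float)>>>(
        n, d_valueArray, d_sum);

    float sum = 0.0f;
    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_valueArray);
    cudaFree(d_sum);

    return sum / n;  // the divide happens exactly once, on the host
}
```

The launch configuration matches the <<<64,128>>> used in the question. The third launch parameter sizes the shared array. Because atomicAdd orders the block totals differently from run to run, the floating-point result can vary slightly in the last bits for general input.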