How to optimize Conway's game of life for CUDA?

Posted by nlight on Stack Overflow
Published on 2011-01-02T22:50:56Z Indexed on 2011/01/02 22:53 UTC

I've written this CUDA kernel for Conway's Game of Life:

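    // inputTex is the 2D float texture bound to the input CUDA array.
    __global__ void gameOfLife(float* returnBuffer, int width, int height) {
        // Signed indices, so that x-1 and y-1 do not wrap around to a huge value at 0.
        int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;
        float p = tex2D(inputTex, x, y);
        // Sum the eight neighbors (each cell is stored as 0 or 1).
        float neighbors = 0;
        neighbors += tex2D(inputTex, x+1, y);
        neighbors += tex2D(inputTex, x-1, y);
        neighbors += tex2D(inputTex, x, y+1);
        neighbors += tex2D(inputTex, x, y-1);
        neighbors += tex2D(inputTex, x+1, y+1);
        neighbors += tex2D(inputTex, x-1, y-1);
        neighbors += tex2D(inputTex, x-1, y+1);
        neighbors += tex2D(inputTex, x+1, y-1);
        __syncthreads(); // block-wide barrier (this kernel uses no shared memory)
        float final = 0;
        if(neighbors < 2) final = 0;        // underpopulation
        else if(neighbors > 3) final = 0;   // overpopulation
        else if(p != 0) final = 1;          // live cell with 2 or 3 neighbors survives
        else if(neighbors == 3) final = 1;  // dead cell with exactly 3 neighbors is born
        __syncthreads();
        // Only threads inside the grid write, in case the launch overshoots the edges.
        if(x < width && y < height)
            returnBuffer[x + y*width] = final;
    }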

I am looking for errors and possible optimizations. Parallel programming is quite new to me, and I am not sure I am doing it right.

The rest of the app is:

The input array is memcpy-ed to a 2D texture, inputTex, which is stored in a CUDA array. The output is memcpy-ed from global memory back to the host and then dealt with there.
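For reference, the host-side flow described above could look roughly like the sketch below. It is not my actual code: it uses a texture object (cudaTextureObject_t) passed as a kernel argument rather than the global inputTex texture, because recent CUDA toolkits no longer support the older texture reference API. The names gameOfLifeTexObj and stepOnce, the 16x16 block size, and the border addressing mode (cells outside the grid read as 0) are all assumptions made for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for this sketch: same rule as gameOfLife, but the
// texture is passed in as a texture object instead of a global reference.
__global__ void gameOfLifeTexObj(cudaTextureObject_t tex, float* out,
                                 int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads outside the grid do nothing

    // +0.5f addresses the texel center (unnormalized coordinates, point filtering).
    float p = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    float neighbors = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx != 0 || dy != 0)
                neighbors += tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);

    // Born with exactly 3 neighbors, survives with 2 or 3.
    bool alive = (neighbors == 3.0f) || (p != 0.0f && neighbors == 2.0f);
    out[x + y * width] = alive ? 1.0f : 0.0f;
}

// Hypothetical helper: runs one generation and returns the result on the host.
// Error checking of the CUDA calls is omitted to keep the sketch short.
std::vector<float> stepOnce(const std::vector<float>& host, int width, int height) {
    // 1. Copy the input into a CUDA array.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, host.data(),
                        width * sizeof(float),   // source pitch in bytes
                        width * sizeof(float),   // row width in bytes
                        height, cudaMemcpyHostToDevice);

    // 2. Create a texture object that reads from that array.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeBorder;  // assumption: outside the grid reads 0
    td.addressMode[1] = cudaAddressModeBorder;
    td.filterMode = cudaFilterModePoint;        // no interpolation between cells
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    // 3. Launch one thread per cell, rounding the grid up to cover the edges.
    float* dOut = nullptr;
    cudaMalloc(&dOut, host.size() * sizeof(float));
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    gameOfLifeTexObj<<<grid, block>>>(tex, dOut, width, height);

    // 4. Copy the output from global memory back to the host.
    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), dOut, host.size() * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(dOut);
    return result;
}
```

Calling stepOnce(cells, width, height) on a row-major array of 0/1 floats should give the next generation in the same layout. Allocating and freeing on every step is only done here to keep the sketch self-contained.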

As you can see, each thread deals with a single pixel. I am unsure whether that is the fastest way, as some sources suggest doing a row or more per thread. If I understand correctly, NVIDIA themselves say that the more threads, the better. I would love advice on this from someone with practical experience.
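To make the two options concrete, here is a minimal sketch of a kernel where the amount of work per thread depends only on the launch configuration. It is an illustration, not a claim about which is faster: the names gameOfLifeStrided and stepStrided are hypothetical, cells are stored as unsigned char in plain global memory instead of a float texture, and cells outside the grid are assumed dead. With a grid that covers the whole board every thread handles one cell; with a smaller grid each thread loops over several cells, so both variants can be timed with the same code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical variant: grid-stride loops in x and y, so one thread may
// process one cell or many, depending on how many threads are launched.
__global__ void gameOfLifeStrided(const unsigned char* in, unsigned char* out,
                                  int width, int height) {
    for (int y = blockIdx.y * blockDim.y + threadIdx.y; y < height;
         y += blockDim.y * gridDim.y) {
        for (int x = blockIdx.x * blockDim.x + threadIdx.x; x < width;
             x += blockDim.x * gridDim.x) {
            int neighbors = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;  // skip the cell itself
                    int nx = x + dx;
                    int ny = y + dy;
                    // Assumption: anything outside the board counts as dead.
                    if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                        neighbors += in[ny * width + nx];
                }
            }
            unsigned char alive = in[y * width + x];
            // Born with exactly 3 neighbors, survives with 2 or 3.
            out[y * width + x] =
                (neighbors == 3 || (alive && neighbors == 2)) ? 1 : 0;
        }
    }
}

// Hypothetical host helper: one generation with a caller-chosen launch shape.
// Error checking of the CUDA calls is omitted to keep the sketch short.
std::vector<unsigned char> stepStrided(const std::vector<unsigned char>& in,
                                       int width, int height,
                                       dim3 grid, dim3 block) {
    size_t bytes = in.size() * sizeof(unsigned char);
    unsigned char* dIn = nullptr;
    unsigned char* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    gameOfLifeStrided<<<grid, block>>>(dIn, dOut, width, height);

    std::vector<unsigned char> result(in.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

For example, stepStrided(cells, w, h, dim3((w + 15) / 16, (h + 15) / 16), dim3(16, 16)) runs one thread per cell, while passing a smaller grid makes each thread cover several cells. Both launches should produce the same board, so the only thing left to compare is the measured time on the target GPU.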

© Stack Overflow or respective owner
