How to optimize Conway's game of life for CUDA?

Posted by nlight on Stack Overflow See other posts from Stack Overflow or by nlight
Published on 2011-01-02T22:50:56Z Indexed on 2011/01/02 22:53 UTC
Read the original article Hit count: 297

Filed under:

GPGPU

I've written this CUDA kernel for Conway's game of life:

global void gameOfLife(float* returnBuffer, int width, int height) { unsigned int x = blockIdx.x*blockDim.x + threadIdx.x; unsigned int y = blockIdx.y*blockDim.y + threadIdx.y; float p = tex2D(inputTex, x, y); float neighbors = 0; neighbors += tex2D(inputTex, x+1, y); neighbors += tex2D(inputTex, x-1, y); neighbors += tex2D(inputTex, x, y+1); neighbors += tex2D(inputTex, x, y-1); neighbors += tex2D(inputTex, x+1, y+1); neighbors += tex2D(inputTex, x-1, y-1); neighbors += tex2D(inputTex, x-1, y+1); neighbors += tex2D(inputTex, x+1, y-1); __syncthreads(); float final = 0; if(neighbors < 2) final = 0; else if(neighbors > 3) final = 0; else if(p != 0) final = 1; else if(neighbors == 3) final = 1; __syncthreads(); returnBuffer[x + y*width] = final; }

I am looking for errors/optimizations. Parallel programming is quite new to me and I am not sure if I get how to do it right.

The rest of the app is:

Memcpy input array to a 2d texture inputTex stored in a CUDA array. Output is memcpy-ed from global memory to host and then dealt with.

As you can see a thread deals with a single pixel. I am unsure if that is the fastest way as some sources suggest doing a row or more per thread. If I understand correctly NVidia themselves say that the more threads, the better. I would love advice on this on someone with practical experience.

Related posts about cuda

CUDA Driver API vs. CUDA runtime

as seen on Stack Overflow - Search for 'Stack Overflow'
When writing CUDA applications, you can either work at the driver level or at the runtime level as illustrated on this image (The libraries are CUFFT and CUBLAS for advanced math): I assume the tradeoff between the two are increased performance for the low-evel API but at the cost of increased… >>> More
Updating a Cuda 4.0 project to Cuda 4.2

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a VS2010 project that was tested with CUDA 4.0, today I installed CUDA 4.2 and I want to update this project, the problem is that when I try to run the project it asks me for cudart32_40_17.dll, but since this is CUDA 4.2 I only have on my folders (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4… >>> More
How to solve CUDA crash when run CUDA example fluidsGL?

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I use ubuntu 12.04 64 bits with GTX560Ti. I install CUDA by following instruction: wget http: //developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/cudatoolkit_4.2.9_lin ux_64_ubuntu11.04.run wget http: //developer.download.nvidia.com/compute/cuda/4_2/rel/drivers/devdriver_4… >>> More
Context migration in CUDA.NET

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm currently using CUDA.NET library by GASS. I need to initialize cuda arrays (actually cublas vectors, but it doesn't matters) in one CPU thread and use them in other CPU thread. But CUDA context which holding all initialized arrays and loaded functions, can be attached to only one CPU thread. There… >>> More
CUDA on GeForce 8600GT

as seen on Super User - Search for 'Super User'
I have got the cuda driver, toolkit and sdk installed in Ubuntu 10.04. I'm using nVidia Geforce 8600 GT card. Official website says my card is CUDA supported. But on running the deviceQuery that comes with the cuda sdk, I'm getting the following output. ./deviceQuery Starting... CUDA Device Query… >>> More

Developer IT