CUDA small kernel 2D convolution - how to do it
- by paulAl
I've been experimenting with CUDA kernels for days, trying to perform a fast 2D convolution between a 500x500 image (though the dimensions may vary) and a very small 2D kernel (a 2D Laplacian kernel, so a 3x3 kernel, which is too small to gain much of an advantage from all the CUDA threads).
I created a classic CPU implementation (two nested for loops, as straightforward as you would expect) and then started writing CUDA kernels.
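To be concrete, my CPU version is essentially the following (a minimal sketch with placeholder names, the image border skipped, and the standard 3x3 Laplacian coefficients):

```c
// Sketch of the CPU reference: 3x3 Laplacian, border pixels skipped for simplicity.
void convolve_cpu(const float *in, float *out, int width, int height)
{
    const float k[3][3] = { { 0.f,  1.f, 0.f },
                            { 1.f, -4.f, 1.f },
                            { 0.f,  1.f, 0.f } };

    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            float sum = 0.f;
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx)
                    sum += k[ky + 1][kx + 1] * in[(y + ky) * width + (x + kx)];
            out[y * width + x] = sum;
        }
    }
}
```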
After a few disappointing attempts to perform a faster convolution, I ended up with this code:
http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section). It basically has each 16x16 thread block load all the convolution data it needs into shared memory and then perform the convolution.
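For reference, my shared-memory attempt follows roughly this structure (a simplified sketch of that approach, not the exact code from the page; the tile size, halo loading, and border clamping are my approximations):

```cuda
// Sketch of the shared-memory approach: 16x16 thread block, 3x3 kernel in constant memory.
#define TILE   16
#define RADIUS 1

__constant__ float d_kernel[3][3]; // filled from the host with cudaMemcpyToSymbol

__global__ void convolve_shared(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperatively load the tile plus its halo, clamping reads at the image borders.
    for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE) {
        for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
            int gx = (int)blockIdx.x * TILE + dx - RADIUS;
            int gy = (int)blockIdx.y * TILE + dy - RADIUS;
            gx = min(max(gx, 0), width - 1);
            gy = min(max(gy, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    }
    __syncthreads();

    // Each thread convolves its own pixel from shared memory.
    if (x < width && y < height) {
        float sum = 0.f;
        for (int ky = -RADIUS; ky <= RADIUS; ++ky)
            for (int kx = -RADIUS; kx <= RADIUS; ++kx)
                sum += d_kernel[ky + RADIUS][kx + RADIUS] *
                       tile[threadIdx.y + RADIUS + ky][threadIdx.x + RADIUS + kx];
        out[y * width + x] = sum;
    }
}
```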
No luck: the CPU is still a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is efficient with large kernel sizes.
Whether or not you read everything I wrote, my question is:
how can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?