My OpenCL kernel is slower on faster hardware.. But why?
- by matdumsa
Hi folks,
As I was finishing coding my project for a multicore programming class I came up upon something really weird I wanted to discuss with you.
We were asked to create any program that would show significant improvement in being programmed for a multi-core platform. I’ve decided to try and code something on the GPU to try out OpenCL. I’ve chosen the matrix convolution problem since I’m quite familiar with it (I’ve parallelized it before with open_mpi with great speedup for large images).
So here it is, I select a large GIF file (2.5 MB) [2816X2112] and I run the sequential version (original code) and I get an average of 15.3 seconds.
I then run the new OpenCL version I just wrote on my MBP integrated GeForce 9400M and I get timings of 1.26s in average.. So far so good, it’s a speedup of 12X!!
But now I go in my energy saver panel to turn on the “Graphic Performance Mode” That mode turns off the GeForce 9400M and turns on the Geforce 9600M GT my system has. Apple says this card is twice as fast as the integrated one.
Guess what, my timing using the kick-ass graphic card are 3.2 seconds in average… My 9600M GT seems to be more than two times slower than the 9400M..
For those of you that are OpenCL inclined, I copy all data to remote buffers before starting, so the actual computation doesn’t require roundtrip to main ram. Also, I let OpenCL determine the optimal local-worksize as I’ve read they’ve done a pretty good implementation at figuring that parameter out..
Anyone has a clue?
edit: full source code with makefiles here http://www.mathieusavard.info/convolution.zip
cd gimage
make
cd ../clconvolute
make
put a large input.gif in clconvolute and run it to see results