Optimize code performance when odd/even threads are doing different things in CUDA
- by Orion Nebula
Hi all!
I have two large vectors and I am trying to do a kind of pairwise element multiplication. Each even-numbered element of the first vector (counting from 0, as in the code below) is multiplied by the next, odd-numbered element of the second vector, and each odd-numbered element of the first vector is multiplied by the preceding, even-numbered element of the second vector.
Example (positions written 1-based here, so V1(1) is index 0 in the code):
vector 1 is V1(1) V1(2) V1(3) V1(4)
vector 2 is V2(1) V2(2) V2(3) V2(4)
The products I need are:
V1(1) * V2(2)
V1(3) * V2(4)
V1(2) * V2(1)
V1(4) * V2(3)
I have written CUDA code to do this (Pds holds the elements of the first vector in shared memory, Nds holds the second vector, and tx is the thread index):
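// Instead of using % 2, I test the lowest bit to decide whether tx is odd or even (intended to be faster).
if ((tx & 0x0001) == 0x0000)
    Nds[tx+1] = Pds[tx] * Nds[tx+1];   // even thread: partner is the next element
else
    Nds[tx-1] = Pds[tx] * Nds[tx-1];   // odd thread: partner is the preceding element
__syncthreads();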
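For reference, here is one branch-free way the same update could be written, as a rough sketch rather than a tested solution. It relies on the observation that the partner index in both branches is just tx with its lowest bit flipped, i.e. tx ^ 1. The kernel name pairMultiply, the host helper runPairMultiply, the BLOCK_SIZE value, the load/store from global memory, and the requirement that the vector length is even are all assumptions added for illustration; they are not part of my actual code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; it must be even so that a pair never straddles two blocks.
#define BLOCK_SIZE 256

// Hypothetical kernel. Thread tx multiplies its own element of the first vector
// by the partner element (index tx ^ 1) of the second vector and stores the
// product at the partner index, which is what the if/else version does.
// Assumes n is even, so the partner of any in-range thread is also in range.
__global__ void pairMultiply(const float* v1, float* v2, int n)
{
    __shared__ float Pds[BLOCK_SIZE];
    __shared__ float Nds[BLOCK_SIZE];

    int tx  = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tx;

    if (gid < n) {
        Pds[tx] = v1[gid];
        Nds[tx] = v2[gid];
    }
    __syncthreads();

    // Flipping the lowest bit gives tx + 1 for even tx and tx - 1 for odd tx.
    int partner = tx ^ 1;
    if (gid < n) {
        Nds[partner] = Pds[tx] * Nds[partner];
    }
    __syncthreads();

    if (gid < n) {
        v2[gid] = Nds[tx];
    }
}

// Hypothetical host helper: copies the inputs to the device, runs the kernel,
// and returns the updated second vector. Error checking is omitted for brevity.
std::vector<float> runPairMultiply(const std::vector<float>& v1,
                                   const std::vector<float>& v2)
{
    int n = static_cast<int>(v1.size());
    size_t bytes = n * sizeof(float);
    float *d1 = nullptr, *d2 = nullptr;
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMemcpy(d1, v1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, v2.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    pairMultiply<<<blocks, BLOCK_SIZE>>>(d1, d2, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d2, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d1);
    cudaFree(d2);
    return out;
}

int main()
{
    const int n = 8;
    std::vector<float> v1(n), v2(n);
    for (int i = 0; i < n; ++i) {
        v1[i] = static_cast<float>(i + 1);
        v2[i] = static_cast<float>(10 * (i + 1));
    }

    std::vector<float> out = runPairMultiply(v1, v2);

    // Position i is written by thread i ^ 1, so it should hold v1[i ^ 1] * v2[i].
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (out[i] != v1[i ^ 1] * v2[i]) ok = false;
    }
    printf("%s\n", ok ? "match" : "MISMATCH");
    return ok ? 0 : 1;
}
```

Each thread still reads and writes only its partner's slot in Nds, so no two threads touch the same location between the two barriers. I have not measured whether this is actually faster than the if/else version.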
Is there any way to further accelerate this code or avoid the divergence?
Thanks