Why is rvalue write in shared memory array serialised?
- by CJM
I'm using CUDA 4.0 on a GPU with compute capability 2.1. One of my device functions is the following:
__device__ void test(int n, int* itemp) // itemp is a shared memory pointer
{
    const int tid = threadIdx.x;
    const int bdim = blockDim.x;
    int i, j, k;
    bool flag = 0;

    itemp[tid] = 0;
    for (i = tid; i < n; i += bdim)
    {
        // { code that produces some values of "flag" }
    }
    itemp[tid] = flag;
}
Each thread checks some conditions and produces a 0/1 flag, then writes that flag to the tid-th location of a shared int array. The write statement "itemp[tid] = flag;" gets serialized, although "itemp[tid] = 0;" does not. This causes a large performance penalty that should not be there, and I would like to avoid it. Please help.
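For anyone who wants to reproduce the access pattern, here is a minimal self-contained sketch. The loop body is elided in the function above, so the `data` argument, the "any negative element" condition, the `kernel` wrapper, and the launch configuration are all hypothetical stand-ins, not the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of test(): the extra `data` argument and the
// "negative element" check stand in for the elided flag-producing code.
__device__ void test(int n, const int* data, int* itemp)
{
    const int tid  = threadIdx.x;
    const int bdim = blockDim.x;
    bool flag = false;

    itemp[tid] = 0;
    for (int i = tid; i < n; i += bdim)
    {
        if (data[i] < 0)      // assumed condition, for illustration only
            flag = true;
    }
    itemp[tid] = flag;        // the store in question: one int per thread
}

// Hypothetical wrapper kernel: one block, dynamically sized shared array.
__global__ void kernel(int n, const int* data, int* out)
{
    extern __shared__ int itemp[];
    test(n, data, itemp);
    __syncthreads();
    out[threadIdx.x] = itemp[threadIdx.x];   // copy flags out for inspection
}

int main()
{
    const int n = 1024;
    const int threads = 128;

    int h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1;
    h_data[5] = -1;           // only thread 5 visits index 5

    int* d_data = 0;
    int* d_out  = 0;
    cudaMalloc((void**)&d_data, n * sizeof(int));
    cudaMalloc((void**)&d_out, threads * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter sizes the shared array: one int per thread.
    kernel<<<1, threads, threads * sizeof(int)>>>(n, d_data, d_out);

    int h_out[threads];
    cudaMemcpy(h_out, d_out, threads * sizeof(int), cudaMemcpyDeviceToHost);

    int raised = 0;
    for (int t = 0; t < threads; ++t) raised += h_out[t];
    printf("flags raised: %d\n", raised);

    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```

With this input only thread 5 sees a negative element, so the program should report a single raised flag. Since consecutive threads write consecutive 32-bit words, profiling this sketch with and without the final store may help show whether the cost comes from the store itself or from the loop that feeds it.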