Read vector into CUDA shared memory
- by Ben
I am new to CUDA and GPU programming. I need each thread in my block to use a vector of length ndim, so I thought I would load it into shared memory like this:
extern __shared__ float smem[];
...
if (threadIdx.x == 0) {
    for (int d = 0; d < ndim; ++d) {
        smem[d] = vector[d];
    }
}
__syncthreads();
...
This works fine. However, it seems wasteful that a single thread should do all the loading, so I changed the code to
if (threadIdx.x < ndim) {
    smem[threadIdx.x] = vector[threadIdx.x];
}
__syncthreads();
which does not work: it gives different results than the version above, even when ndim << blockDim.x. Why?
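For context, here is a minimal self-contained version of what I am trying to do (the kernel name `useVector`, the output array `out`, and the per-thread sum are placeholders I made up for the repro; I assume `vector` lives in global device memory and the dynamic shared-memory size is passed as the third launch parameter):

```cuda
#include <cuda_runtime.h>

__global__ void useVector(const float* vector, float* out, int ndim) {
    extern __shared__ float smem[];

    // Cooperative load: each of the first ndim threads copies one element.
    if (threadIdx.x < ndim) {
        smem[threadIdx.x] = vector[threadIdx.x];
    }
    __syncthreads();  // no thread reads smem until the load has finished

    // Every thread then uses the shared copy (here: a dummy sum).
    float acc = 0.0f;
    for (int d = 0; d < ndim; ++d) {
        acc += smem[d];
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Launched with the shared allocation sized at runtime:
// useVector<<<numBlocks, blockSize, ndim * sizeof(float)>>>(d_vec, d_out, ndim);
```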