How to properly cast a global memory array using the uint4 vector in CUDA to increase memory throughput?
- by charis
There are generally two techniques for increasing global memory throughput in a CUDA kernel: coalescing memory accesses and accessing words of at least 4 bytes. With the first technique, accesses to the same memory segment by threads of the same half-warp are coalesced into fewer transactions, while by accessing words of at least 4 bytes the size of this memory segment is effectively increased from 32 bytes to 128 bytes.
To access 16-byte instead of 1-byte words when unsigned chars are stored in global memory, the array is commonly cast to a pointer to the uint4 vector type:
uint4 *text4 = ( uint4 * ) d_text;
uint4 var = text4[i];
In order to extract the 16 chars from var, I am currently using bitwise operations. For example:
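As far as I understand, this cast is only valid if d_text is 16-byte aligned (pointers returned by cudaMalloc are), and only the complete 16-byte words of the array can be read as uint4, so any leftover bytes need a scalar path. A minimal host-side C++ sketch of those two checks follows. The names is_aligned16, VectorSplit and split_for_vec16 are hypothetical and not part of CUDA or of my kernel:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` sits on a 16-byte boundary,
// which a 16-byte (uint4-sized) load requires.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper type: how many full 16-byte words fit in a buffer,
// and how many bytes are left over for a scalar tail loop.
struct VectorSplit {
    std::size_t full_words;
    std::size_t tail_bytes;
};

inline VectorSplit split_for_vec16(std::size_t n_bytes) {
    return { n_bytes / 16, n_bytes % 16 };
}
```

For a 100-byte text, for instance, this gives 6 full words that can be indexed through text4 and 4 trailing bytes that still have to be read as unsigned char.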
s_array[j * 16 + 0] = var.x & 0x000000FF;
s_array[j * 16 + 1] = (var.x >> 8) & 0x000000FF;
s_array[j * 16 + 2] = (var.x >> 16) & 0x000000FF;
s_array[j * 16 + 3] = (var.x >> 24) & 0x000000FF;
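The same pattern repeats for var.y, var.z and var.w. Written as a loop, the full extraction might look like the host-side C++ sketch below. Word4 is a hypothetical stand-in for uint4 so that the code compiles without the CUDA headers, and unpack16 is an illustrative name. In the kernel, out would correspond to &s_array[j * 16].

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's uint4 (four 32-bit words).
struct Word4 {
    std::uint32_t x, y, z, w;
};

// Write the 16 bytes of `var` to out[0..15], lowest byte of x first,
// using the same shift-and-mask operations as the snippet above.
inline void unpack16(const Word4& var, unsigned char* out) {
    assert(out != nullptr);
    const std::uint32_t words[4] = { var.x, var.y, var.z, var.w };
    for (int c = 0; c < 4; ++c) {        // component: x, y, z, w
        for (int b = 0; b < 4; ++b) {    // byte within the component
            out[c * 4 + b] =
                static_cast<unsigned char>((words[c] >> (8 * b)) & 0xFFu);
        }
    }
}
```

Because the GPU is little-endian, taking the lowest byte of x first should reproduce the chars in the order they were stored in d_text.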
My question is, is it possible to recast var (or for that matter *text4) to unsigned char in order to avoid the additional overhead of the bitwise operations?