Time gaps between host clEnqueue_xxx calls

Posted by dialer on Stack Overflow, 2012-06-17.

Consider these OpenCL calls (3 memcpy DtoH, 4313 cl_float elements each):

clEnqueueReadBuffer(CommandQueue, SpectrumAbsMem, CL_FALSE, 0, SpectrumMemSize, SpectrumAbs, 0, NULL, NULL);
clEnqueueReadBuffer(CommandQueue, SpectrumReMem, CL_FALSE, 0, SpectrumMemSize, SpectrumRe, 0, NULL, NULL);
clEnqueueReadBuffer(CommandQueue, SpectrumImMem, CL_FALSE, 0, SpectrumMemSize, SpectrumIm, 0, NULL, NULL);

When I analyze these with the NVIDIA Visual Profiler, the actual memcpy operation only takes about 8 µs, but there is a significant gap of around 130 µs after each one. I'm already issuing the reads as non-blocking (the CL_FALSE blocking_read argument). When I instead use a single read of three times the size, the whole transfer is much faster.
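
For what it's worth, the same breakdown should be obtainable without the visual profiler by attaching an event to each read and querying its profiling timestamps. A minimal sketch, assuming CommandQueue was created with CL_QUEUE_PROFILING_ENABLE and that <stdio.h> and the OpenCL headers are included (the readEvents name is mine, not from the original code):

cl_event readEvents[3]; /* illustrative name; one event per read */
clEnqueueReadBuffer(CommandQueue, SpectrumAbsMem, CL_FALSE, 0, SpectrumMemSize, SpectrumAbs, 0, NULL, &readEvents[0]);
clEnqueueReadBuffer(CommandQueue, SpectrumReMem,  CL_FALSE, 0, SpectrumMemSize, SpectrumRe,  0, NULL, &readEvents[1]);
clEnqueueReadBuffer(CommandQueue, SpectrumImMem,  CL_FALSE, 0, SpectrumMemSize, SpectrumIm,  0, NULL, &readEvents[2]);
clWaitForEvents(3, readEvents);

for (int i = 0; i < 3; ++i) {
    cl_ulong queued, submitted, started, ended;
    clGetEventProfilingInfo(readEvents[i], CL_PROFILING_COMMAND_QUEUED, sizeof queued,    &queued,    NULL);
    clGetEventProfilingInfo(readEvents[i], CL_PROFILING_COMMAND_SUBMIT, sizeof submitted, &submitted, NULL);
    clGetEventProfilingInfo(readEvents[i], CL_PROFILING_COMMAND_START,  sizeof started,   &started,   NULL);
    clGetEventProfilingInfo(readEvents[i], CL_PROFILING_COMMAND_END,    sizeof ended,     &ended,     NULL);
    /* queued->submit and submit->start are launch/queueing overhead; start->end is the copy itself */
    printf("read %d: queue %llu ns, submit %llu ns, copy %llu ns\n", i,
           (unsigned long long)(submitted - queued),
           (unsigned long long)(started - submitted),
           (unsigned long long)(ended - started));
    clReleaseEvent(readEvents[i]);
}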

Why is the gap between the individual memcpy operations so large, when the gap between the kernel execution (immediately before these three reads) and the first memcpy is only 7 µs? Can I get rid of it, or do I need to accumulate more data before starting a transfer? If so, is there a convenient way to combine multiple arrays into a single contiguous block of memory while still having a separate cl_mem object as a device memory pointer to each section?
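
If contiguity is the goal, one option (available since OpenCL 1.1) is to allocate a single device buffer and carve per-section cl_mem handles out of it with clCreateSubBuffer; the kernels keep their separate arguments, while the host reads everything back in one transfer. A rough sketch, assuming a context variable named Context and a combined host array SpectrumAll (both illustrative), and noting that each sub-buffer origin must be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, so SpectrumMemSize may need padding:

cl_int err;
/* One contiguous allocation for all three spectra (Context is assumed to be the cl_context in use). */
cl_mem SpectrumAllMem = clCreateBuffer(Context, CL_MEM_READ_WRITE, 3 * SpectrumMemSize, NULL, &err);

/* Each region must start at an offset aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN. */
cl_buffer_region absRegion = { 0 * SpectrumMemSize, SpectrumMemSize };
cl_buffer_region reRegion  = { 1 * SpectrumMemSize, SpectrumMemSize };
cl_buffer_region imRegion  = { 2 * SpectrumMemSize, SpectrumMemSize };

cl_mem SpectrumAbsMem = clCreateSubBuffer(SpectrumAllMem, CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &absRegion, &err);
cl_mem SpectrumReMem  = clCreateSubBuffer(SpectrumAllMem, CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &reRegion,  &err);
cl_mem SpectrumImMem  = clCreateSubBuffer(SpectrumAllMem, CL_MEM_READ_WRITE, CL_BUFFER_CREATE_TYPE_REGION, &imRegion,  &err);

/* Kernels still take the per-section cl_mem objects; the host fetches all three sections at once. */
clEnqueueReadBuffer(CommandQueue, SpectrumAllMem, CL_FALSE, 0, 3 * SpectrumMemSize, SpectrumAll, 0, NULL, NULL);

This keeps a separate device pointer per section for the kernels while collapsing the three small transfers into one, so the per-call overhead is paid only once.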
