Time gaps between host clEnqueue_xxx calls
Posted by dialer on Stack Overflow
Published on 2012-06-17T15:13:06Z
Consider these OpenCL calls (three device-to-host memcpy operations of 4313 cl_float elements each):
clEnqueueReadBuffer(CommandQueue, SpectrumAbsMem, CL_FALSE, 0, SpectrumMemSize, SpectrumAbs, 0, NULL, NULL);
clEnqueueReadBuffer(CommandQueue, SpectrumReMem, CL_FALSE, 0, SpectrumMemSize, SpectrumRe, 0, NULL, NULL);
clEnqueueReadBuffer(CommandQueue, SpectrumImMem, CL_FALSE, 0, SpectrumMemSize, SpectrumIm, 0, NULL, NULL);
When I analyze these with the NVIDIA Visual Profiler, I see that the actual memcpy operation takes only 8 us, but there is a significant gap of around 130 us after each memcpy. I'm already using the supposedly asynchronous variant (CL_FALSE as the blocking_read argument). When I use a single operation with three times the size, it is much faster.
Why is the time gap between the actual memcpy operations so large, whereas the gap between the kernel execution (immediately before these three operations) and the first memcpy is only 7 us? Can I get rid of it, or do I need to accumulate more data before starting a memcpy? If so, is there a convenient way to combine multiple arrays into a single contiguous block of memory, but still have a separate cl_mem object that acts as a device memory pointer to each section?
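One possible direction for the last question, offered as a sketch and not as a confirmed fix for the gaps: OpenCL 1.1 has clCreateSubBuffer, which returns a cl_mem object referring to a region of an existing buffer. The region's origin has to be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN value (reported in bits by clGetDeviceInfo), so the three sections cannot simply be packed at arbitrary offsets. The C below only computes such a layout on the host. The names region_t, align_up and pack_regions are hypothetical helpers, and the 128-byte alignment used in the usage note is an assumed example value, not something measured on the device in question.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SPECTRA 3 /* SpectrumAbs, SpectrumRe, SpectrumIm */

/* Hypothetical helper type. Its two fields mirror cl_buffer_region
   {origin, size}, the struct that clCreateSubBuffer takes together with
   CL_BUFFER_CREATE_TYPE_REGION. */
typedef struct {
    size_t origin; /* byte offset of the section inside the parent buffer */
    size_t size;   /* byte size of the section */
} region_t;

/* Round n up to the next multiple of align. align must be a power of two. */
size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Place `count` equally sized sections one after another, each starting on
   an align_bytes boundary. align_bytes is assumed to be
   CL_DEVICE_MEM_BASE_ADDR_ALIGN divided by 8 (the device reports bits).
   Returns the number of bytes the parent buffer must hold. */
size_t pack_regions(region_t *regions, size_t count,
                    size_t section_bytes, size_t align_bytes)
{
    size_t stride = align_up(section_bytes, align_bytes);
    for (size_t i = 0; i < count; ++i) {
        regions[i].origin = i * stride;
        regions[i].size = section_bytes;
    }
    return count == 0 ? 0 : (count - 1) * stride + section_bytes;
}
```

Calling pack_regions(regions, NUM_SPECTRA, 4313 * sizeof(cl_float), 128) would give one origin/size pair per spectrum. Each pair could then be copied into a cl_buffer_region and passed to clCreateSubBuffer on a parent buffer of the returned total size, so the kernel still sees three separate cl_mem objects while a single clEnqueueReadBuffer on the parent fetches all of them. The host side would then need one contiguous array laid out with the same stride. Whether this actually removes the 130 us gaps is untested here.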
© Stack Overflow or respective owner