=> home
This is another subject that is complicated to explain. But is crucial to understand to write efficient code. Especially on embedded systems and integrated GPUs. I explained this with my coworkers recently. I figure to write it down as a reminder. This post mostly references Intel's article on minimizing memory copy[1] and AMD's ROCm OpenCL optimization guide[2].
=> [1]: Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics | [2]: ROCm documentation: OpenCL Optimization
When we first learn OpenCL. We learn that we allocate buffers using the clCreateBuffer
function. Normally the function call looks like this. clCreateBuffer(ctx, CL_MEM_READ_WRITE, some_size, NULL, NULL)
. The CL_MEM_READ_WRITE
flag tells the underlying OpenCL runtime that the GPU want to be able to read and write to said buffer. Occasionally, a read or write only buffer is useful. And the runtime can optimize memory access base on that. Anyway, notice there's more flags avaliable in the doc[3]. Namely CL_MEM_USE_HOST_PTR
, CL_MEM_ALLOC_HOST_PTR
and CL_MEM_COPY_HOST_PTR
.
=> khronos.org: clCreateBuffer
These flags comes with cryptic descriptions in the documentation. In English, they control where the OpenCL buffer is located. By default the buffer lives on the VRAM. Thus the following code likely causes a DMA transfer from the host to the GPU. Which can be slow at times as it involves system operations to allocate page-aligned memory then a DMA request. malloc()
is hated because how slow it is even though it cached. A DMA is worse that there's no way to cache it. According to AMD's document. The buffer read/write calls can only provide about 2/3 of peak performance.
cl_mem buffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, some_size, NULL, NULL); // This can be slow!! Context switch is slow! clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, some_size, some_data, 0, NULL, NULL);
## Optimizing OpenCL data transfer on integrated GPUs
cl_mem buffer = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, some_size, NULL, NULL); // This is low latency because buffer lives on the host memory void* mapped = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_WRITE, 0, some_size, 0, NULL, NULL, &err); memcpy(mapped, some_data, some_size); // Low latency because buffer is on the host memory clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL); // Now OpenCL maps the host memory into the GPU. One system op for 2 operations. clEnqueueNDRangeKernel(queue, kernel, ...);
This is faster becaue 2 reasons. 1. The buffer now lives on somewhere the CPU can access. Thus mapping is basically free. And 2, OpenCL maps the CPU memory to GPU during kernel execution (an async operation). We can hide that latency unlike in plane buffer copy. In practice I've seen a streaming heavy application use 25% less CPU (dropping from 200% to 150% overall) by just changing the allocation flags and use mapping.
The above benifit only works on integrated GPUs (or integrated DSP/FPGA). On discrete devices, where the device does not share the host memory, the above is likely not an optimization. As now we force the discrete GPU to DMA during kernel execution. Which is at least an order of magnitude slower compared to VRAM.
No, I haven't seen it in the wild. Most likely because even on integrated GPUs, there are still dedicated chunks of memory for the GPU (split configured by BIOS/UEFI). If you take a look at Windows task manager. It has a shared and dedicated memory section for the GPU. The shared reagon is usually half of avaliable memory for the syetm. And is used if CL_MEM_ALLOC_HOST_PTR
is applied. Otherwise the dedicated reagon.
No. This is an optimization that vendors and devlopers can take advantage of. But it's not a requirement. Nor a part of the OpenCL standard. It has to be measured by testing the performance.
Yes. CL_DEVICE_HOST_UNIFIED_MEMORY
returns CL_TRUE
if the device shares the host memory. Again, it's not a guarantee that zero copy is enabled. It's just a hint.
text/gemini
This content has been proxied by September (ba2dc).