I finally got the OrangePi 5 Plus board I ordered. I was going to mess around with the NPU using the rknn2 SDK. However, their matrix multiplication is broken (it segfaults) and the core SDK is closed source, so there is nothing I can do besides report the bug and wait for them to fix it. In the meantime I decided to mess around with OpenCL. I had previously tried OpenCL on the same RK3588 on the Mixtile Blade 3 demo system, and it worked fine. However, the Ubuntu image I used on the OrangePi 5 Plus board doesn't have OpenCL installed. Some googling turned up the solution, but I needed some extra troubleshooting to get it working. Hence this post.
The Ubuntu image I used
=> ubuntu-rockchip = Ubuntu 22.04 for Rockchip RK3588 devices
Assuming you have successfully booted into your system, you need to download the ARM Mali GPU blob from Rockchip's repository and put it into /usr/lib/, as well as the firmware for the GPU (into /lib/firmware) if it is not already installed.
cd /usr/lib
sudo wget https://github.com/JeffyCN/mirrors/raw/libmali/lib/aarch64-linux-gnu/libmali-valhall-g610-g6p0-x11-wayland-gbm.so
cd /lib/firmware
sudo wget https://github.com/JeffyCN/mirrors/raw/libmali/firmware/g610/mali_csffw.bin
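A download that fails quietly can leave an HTML error page where the blob should be, and the OpenCL loader will then silently skip it. The snippet below is a minimal sketch of a sanity check, assuming head and od are available; is_elf is a hypothetical helper name, not part of any package. It only checks the four ELF magic bytes, not the architecture.

```shell
#!/bin/sh
# is_elf FILE: succeed if FILE starts with the ELF magic bytes 7f 45 4c 46.
# Hypothetical helper; it does not verify that the file is an aarch64 library.
is_elf() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}
```

For example, is_elf /usr/lib/libmali-valhall-g610-g6p0-x11-wayland-gbm.so || echo "download looks broken" prints the warning only when the file is not an ELF binary.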
Now we need the OpenCL ICD loader. The one in apt will suffice. Then, since we manually downloaded the Mali GPU blob, we need to register it with the loader by adding an ICD config file that points at it.
sudo apt install mesa-opencl-icd clinfo
sudo mkdir -p /etc/OpenCL/vendors
echo "/usr/lib/libmali-valhall-g610-g6p0-x11-wayland-gbm.so" | sudo tee /etc/OpenCL/vendors/mali.icd
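A typo in the .icd file is easy to make and hard to spot, because the loader just ignores entries it cannot open. As a rough sketch (check_icd_dir is a hypothetical helper name), the function below walks a vendors directory, reads the first line of each .icd file, and reports whether an absolute path in it exists. Bare library names, which the dynamic loader resolves itself, are only listed and not checked.

```shell
#!/bin/sh
# check_icd_dir [DIR]: report each .icd file in DIR and the library it names.
# Hypothetical helper; DIR defaults to the directory used in this post.
check_icd_dir() {
    dir="${1:-/etc/OpenCL/vendors}"
    status=0
    for icd in "$dir"/*.icd; do
        [ -e "$icd" ] || continue      # the glob matched nothing
        lib=""
        # The first line of an ICD file holds the library path or name.
        IFS= read -r lib < "$icd" || true
        case "$lib" in
            /*)
                if [ -e "$lib" ]; then
                    echo "ok      $icd -> $lib"
                else
                    echo "missing $icd -> $lib"
                    status=1
                fi
                ;;
            *)
                # Bare name: left to the dynamic loader, not checked here.
                echo "name    $icd -> $lib"
                ;;
        esac
    done
    return "$status"
}
```

Running check_icd_dir with no arguments inspects /etc/OpenCL/vendors and returns non-zero if any absolute path is missing.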
Finally, we need to satisfy the dependencies of the Mali OpenCL library.
sudo apt install libxcb-dri2-0 libxcb-dri3-0 libwayland-client0 libwayland-server0 libx11-xcb1
Now you can run clinfo to see if OpenCL is working.
❯ clinfo
Number of platforms  1
  Platform Name  ARM Platform
  Platform Vendor  ARM
  Platform Version  OpenCL 2.1 v1.g6p0-01eac0.2819f9d4dbe0b5a2f89c835d8484f9cd
  Platform Profile  FULL_PROFILE
  Platform Extensions  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Extensions function suffix  ARM
  Platform Host timer resolution  1ns

  Platform Name  ARM Platform
Number of devices  1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
  Device Name  Mali-LODX r0p0
  Device Vendor  ARM
  Device Vendor ID  0xa8670000
  Device Version  OpenCL 2.1 v1.g6p0-01eac0.2819f9d4dbe0b5a2f89c835d8484f9cd
  Device UUID  000067a8-0100-0000-0000-000000000000
  Driver UUID  1e0cb80a-4d25-a21f-2c18-f7de010f1315
  Valid Device LUID  No
  Device LUID  0000-000000000000
  Device Node Mask  0
  Device Numeric Version  0x801000 (2.1.0)
  Driver Version  2.1
  Device OpenCL C Version  OpenCL C 2.0 v1.g6p0-01eac0.2819f9d4dbe0b5a2f89c835d8484f9cd
  Device C++ for OpenCL Numeric Version  0x400000 (1.0.0)
  Device Type  GPU
  Device Profile  FULL_PROFILE
  Device Available  Yes
  Compiler Available  Yes
  Linker Available  Yes
  Max compute units  4
  Available core IDs  0, 2, 16, 18
  Max clock frequency  1000MHz
  Device Partition  (core)
    Max number of sub-devices  0
    Supported partition types  None
    Supported affinity domains  (n/a)
  Max work item dimensions  3
  Max work item sizes  1024x1024x1024
  Max work group size  1024
  Preferred work group size multiple (kernel)  16
  Max sub-groups per work group  64
  Preferred / native vector sizes
    char  16 / 4
    short  8 / 2
    int  4 / 1
    long  2 / 1
    half  8 / 2 (cl_khr_fp16)
    float  4 / 1
    double  0 / 0 (n/a)
  Half-precision Floating-point support  (cl_khr_fp16)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
  Single-precision Floating-point support  (core)
    Denormals  Yes
    Infinity and NANs  Yes
    Round to nearest  Yes
    Round to zero  Yes
    Round to infinity  Yes
    IEEE754-2008 fused multiply-add  Yes
    Support is emulated in software  No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support  (n/a)
  Address bits  64, Little-Endian
  Global memory size  16471937024 (15.34GiB)
  Error Correction support  No
  Max memory allocation  16471937024 (15.34GiB)
  Unified memory for Host and Device  Yes
  Shared Virtual Memory (SVM) capabilities  (core)
    Coarse-grained buffer sharing  Yes
    Fine-grained buffer sharing  No
    Fine-grained system sharing  No
    Atomics  No
  Minimum alignment for any data type  128 bytes
  Alignment of base address  1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM  0 bytes
    Global  0 bytes
    Local  0 bytes
  Max size for global variable  65536 (64KiB)
  Preferred total size of global vars  0
  Global Memory cache type  Read/Write
  Global Memory cache size  1048576 (1024KiB)
  Global Memory cache line size  64 bytes
  Image support  Yes
    Max number of samplers per kernel  16
    Max size for 1D images from buffer  65536 pixels
    Max 1D or 2D image array size  2048 images
    Base address alignment for 2D image buffers  32 bytes
    Pitch alignment for 2D image buffers  64 pixels
    Max 2D image size  65536x65536 pixels
    Max 3D image size  65536x65536x65536 pixels
    Max number of read image args  128
    Max number of write image args  64
    Max number of read/write image args  64
  Max number of pipe args  16
  Max active pipe reservations  1
  Max pipe packet size  1024
  Local memory type  Global
  Local memory size  32768 (32KiB)
  Max number of constant args  128
  Max constant buffer size  16471937024 (15.34GiB)
  Max size of kernel argument  1024
  Queue properties (on host)
    Out-of-order execution  Yes
    Profiling  Yes
  Queue properties (on device)
    Out-of-order execution  Yes
    Profiling  Yes
    Preferred size  2097152 (2MiB)
    Max size  16777216 (16MiB)
  Max queues on device  1
  Max events on device  1024
  Prefer user sync for interop  No
  Profiling timer resolution  1000ns
  Execution capabilities
    Run OpenCL kernels  Yes
    Run native kernels  No
    Sub-group independent forward progress  Yes
    IL version  SPIR-V_1.0
    ILs with version  SPIR-V 0x400000 (1.0.0)
  printf() buffer size  1048576 (1024KiB)
  Built-in kernels  (n/a)
  Built-in kernels with version  (n/a)
  Device Extensions  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Device Extensions with Version
    cl_khr_global_int32_base_atomics  0x400000 (1.0.0)
    cl_khr_global_int32_extended_atomics  0x400000 (1.0.0)
    cl_khr_local_int32_base_atomics  0x400000 (1.0.0)
    cl_khr_local_int32_extended_atomics  0x400000 (1.0.0)
    cl_khr_byte_addressable_store  0x400000 (1.0.0)
    cl_khr_3d_image_writes  0x400000 (1.0.0)
    cl_khr_int64_base_atomics  0x400000 (1.0.0)
    cl_khr_int64_extended_atomics  0x400000 (1.0.0)
    cl_khr_fp16  0x400000 (1.0.0)
    cl_khr_icd  0x400000 (1.0.0)
    cl_khr_egl_image  0x400000 (1.0.0)
    cl_khr_image2d_from_buffer  0x400000 (1.0.0)
    cl_khr_depth_images  0x400000 (1.0.0)
    cl_khr_subgroups  0x400000 (1.0.0)
    cl_khr_subgroup_extended_types  0x400000 (1.0.0)
    cl_khr_subgroup_non_uniform_vote  0x400000 (1.0.0)
    cl_khr_subgroup_ballot  0x400000 (1.0.0)
    cl_khr_il_program  0x400000 (1.0.0)
    cl_khr_priority_hints  0x400000 (1.0.0)
    cl_khr_create_command_queue  0x400000 (1.0.0)
    cl_khr_spirv_no_integer_wrap_decoration  0x400000 (1.0.0)
    cl_khr_extended_versioning  0x400000 (1.0.0)
    cl_khr_device_uuid  0x400000 (1.0.0)
    cl_arm_core_id  0x400000 (1.0.0)
    cl_arm_printf  0x400000 (1.0.0)
    cl_arm_non_uniform_work_group_size  0x400000 (1.0.0)
    cl_arm_import_memory  0x400000 (1.0.0)
    cl_arm_import_memory_dma_buf  0x400000 (1.0.0)
    cl_arm_import_memory_host  0x400000 (1.0.0)
    cl_arm_integer_dot_product_int8  0x400000 (1.0.0)
    cl_arm_integer_dot_product_accumulate_int8  0x400000 (1.0.0)
    cl_arm_integer_dot_product_accumulate_saturate_int8  0x400000 (1.0.0)
    cl_arm_scheduling_controls  0x3000 (0.3.0)
    cl_arm_controlled_kernel_termination  0x400000 (1.0.0)
    cl_ext_cxx_for_opencl  0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)  Success [ARM]
  clCreateContext(NULL, ...) [default]  Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name  ARM Platform
    Device Name  Mali-LODX r0p0

ICD loader properties
  ICD loader Name  OpenCL ICD Loader
  ICD loader Vendor  OCL Icd free software
  ICD loader Version  2.2.14
  ICD loader Profile  OpenCL 3.0
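The full clinfo dump is long, and the only thing that matters at this point is the platform count on its first line. A small sketch for checking it from a script follows; platform_count and opencl_ok are hypothetical helper names, and both read clinfo output on stdin so they make no assumption about how clinfo is invoked.

```shell
#!/bin/sh
# platform_count: print the value of the "Number of platforms" line.
# Hypothetical helper; reads clinfo output from stdin.
platform_count() {
    awk '/^Number of platforms/ { print $NF; exit }'
}

# opencl_ok: succeed only if at least one platform is reported.
opencl_ok() {
    n=$(platform_count)
    [ "${n:-0}" -gt 0 ]
}
```

Usage would be along the lines of clinfo | opencl_ok && echo "OpenCL platform found".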
If you still see 0 platforms, check whether some dependencies are missing using the ldd command. The output should look like this and should not contain any "not found" entries:
❯ ldd /usr/lib/libmali-valhall-g610-g6p0-x11-wayland-gbm.so
    linux-vdso.so.1 (0x0000ffffa9bc5000)
    libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffffa24f0000)
    libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa24d0000)
    libdrm.so.2 => /lib/aarch64-linux-gnu/libdrm.so.2 (0x0000ffffa24a0000)
    libwayland-client.so.0 => /lib/aarch64-linux-gnu/libwayland-client.so.0 (0x0000ffffa2480000)
    libwayland-server.so.0 => /lib/aarch64-linux-gnu/libwayland-server.so.0 (0x0000ffffa2450000)
    libX11.so.6 => /lib/aarch64-linux-gnu/libX11.so.6 (0x0000ffffa2300000)
    libX11-xcb.so.1 => /lib/aarch64-linux-gnu/libX11-xcb.so.1 (0x0000ffffa22e0000)
    libxcb.so.1 => /lib/aarch64-linux-gnu/libxcb.so.1 (0x0000ffffa22a0000)
    libxcb-dri2.so.0 => /lib/aarch64-linux-gnu/libxcb-dri2.so.0 (0x0000ffffa2280000)
    libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffa2050000)
    libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffa1fb0000)
    libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffa1e00000)
    libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffa1dd0000)
    /lib/ld-linux-aarch64.so.1 (0x0000ffffa9b8d000)
    libffi.so.8 => /lib/aarch64-linux-gnu/libffi.so.8 (0x0000ffffa1db0000)
    libXau.so.6 => /lib/aarch64-linux-gnu/libXau.so.6 (0x0000ffffa1d90000)
    libXdmcp.so.6 => /lib/aarch64-linux-gnu/libXdmcp.so.6 (0x0000ffffa1d70000)
    libbsd.so.0 => /lib/aarch64-linux-gnu/libbsd.so.0 (0x0000ffffa1d40000)
    libmd.so.0 => /lib/aarch64-linux-gnu/libmd.so.0 (0x0000ffffa1d20000)
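With this many lines it is easy to miss a single unresolved entry. The filter below is a minimal sketch (missing_libs is a hypothetical helper name) that reads ldd output on stdin and prints only the library names whose resolution is "not found".

```shell
#!/bin/sh
# missing_libs: print the name of every library that ldd could not resolve.
# Hypothetical helper; expects lines of the form "libfoo.so.1 => not found".
missing_libs() {
    awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}
```

For example, ldd /usr/lib/libmali-valhall-g610-g6p0-x11-wayland-gbm.so | missing_libs prints nothing when every dependency is satisfied; each name it does print is a library that still needs a package installed.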
Memory bandwidth on the Orange Pi 5 Plus seems to be lower than on the Mixtile Blade 3. However, driver improvements seem to have increased compute throughput slightly, and even fixed a few cases of surprisingly low throughput. You can find the clpeak benchmark results for the Mixtile Blade 3 in my older post.
=> Mixtile Blade 3 (RK3588) OpenCL performance
❯ ./clpeak

Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 23.41
      float2  : 25.46
      float4  : 25.79
      float8  : 12.82
      float16 : 12.65

    Single-precision compute (GFLOPS)
      float   : 447.88
      float2  : 477.25
      float4  : 472.24
      float8  : 440.71
      float16 : 415.58

    Half-precision compute (GFLOPS)
      half   : 447.98
      half2  : 888.15
      half4  : 920.44
      half8  : 896.00
      half16 : 855.34

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 126.91
      int2  : 127.34
      int4  : 126.90
      int8  : 125.22
      int16 : 125.58

    Integer compute Fast 24bit (GIOPS)
      int   : 126.86
      int2  : 127.27
      int4  : 126.79
      int8  : 125.33
      int16 : 125.67

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 7.15
      enqueueReadBuffer               : 8.13
      enqueueWriteBuffer non-blocking : 7.21
      enqueueReadBuffer non-blocking  : 8.22
      enqueueMapBuffer(for read)      : 61.56
        memcpy from mapped ptr        : 9.44
      enqueueUnmap(after write)       : 63.74
        memcpy to mapped ptr          : 9.39

    Kernel launch latency : 26.27 us
FluidX3D is a fluid dynamics simulation software. This benchmark takes a long time to run on the Orange Pi 5 Plus. While it runs, the SoC stays at around 44°C with my baseline cooling setup. I bet it would thermal throttle without a fan, and I could do better if I got a small fan and blew air directly onto the heatsink.
❯ ./make.sh # F32
.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.7  |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
|----------------.------------------------------------------------------------|
| Device ID    0 | Mali-LODX r0p0                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Mali-LODX r0p0                                             |
| Device Vendor  | ARM                                                        |
| Device Driver  | 2.1                                                        |
| OpenCL Version | OpenCL C 2.0 v1.g6p0-01eac0.2819f9d4dbe0b5a2f89c835d8484f9cd |
| Compute Units  | 4 at 1000 MHz (32 cores, 0.064 TFLOPs/s)                   |
| Memory, Cache  | 15708 MB, 1024 KB global / 32 KB local                     |
| Buffer Limits  | 15708 MB global, 16085876 KB constant                      |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                     D3Q19 SRT (FP32/FP32) |
| Memory Usage    |                                CPU 272 MB, GPU 1x 1488 MB |
| Max Alloc Size  |                                                   1216 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|      43 |      7 GB/s |         3 |          9999 90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 43                                                     |

❯ ./make.sh # F16S
.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.7  |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
|----------------.------------------------------------------------------------|
| Device ID    0 | Mali-LODX r0p0                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Mali-LODX r0p0                                             |
| Device Vendor  | ARM                                                        |
| Device Driver  | 2.1                                                        |
| OpenCL Version | OpenCL C 2.0 v1.g6p0-01eac0.2819f9d4dbe0b5a2f89c835d8484f9cd |
| Compute Units  | 4 at 1000 MHz (32 cores, 0.064 TFLOPs/s)                   |
| Memory, Cache  | 15708 MB, 1024 KB global / 32 KB local                     |
| Buffer Limits  | 15708 MB global, 16085876 KB constant                      |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16S) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|      59 |      5 GB/s |         4 |          9999 90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 59                                                     |

❯ ./make.sh # F16C
.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.7  |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
|----------------.------------------------------------------------------------|
| Device ID    0 | Mali-LODX r0p0                                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Mali-LODX r0p0                                             |
| Device Vendor  | ARM                                                        |
| Device Driver  | 2.1                                                        |
| OpenCL Version | OpenCL C 2.0 v1.g6p0-01eac0.2819f9d4dbe0b5a2f89c835d8484f9cd |
| Compute Units  | 4 at 1000 MHz (32 cores, 0.064 TFLOPs/s)                   |
| Memory, Cache  | 15708 MB, 1024 KB global / 32 KB local                     |
| Buffer Limits  | 15708 MB global, 16085876 KB constant                      |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
|-----------------.-----------------------------------------------------------|
| Grid Resolution |                                256 x 256 x 256 = 16777216 |
| Grid Domains    |                                             1 x 1 x 1 = 1 |
| LBM Type        |                                    D3Q19 SRT (FP32/FP16C) |
| Memory Usage    |                                 CPU 272 MB, GPU 1x 880 MB |
| Max Alloc Size  |                                                    608 MB |
| Time Steps      |                                                        10 |
| Kin. Viscosity  |                                                1.00000000 |
| Relaxation Time |                                                3.50000000 |
| Reynolds Number |                                                  Re < 148 |
|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|      19 |      1 GB/s |         1 |          9999 90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 19                                                     |
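The Bandwidth column in these tables tracks the MLUPs figure: each lattice cell update moves a fixed number of bytes, so bandwidth is roughly MLUPs/s times bytes per cell. The sketch below redoes that arithmetic. The per-cell figures of 153 bytes for FP32/FP32 and 77 bytes for the FP32/FP16 modes are my assumption, taken from memory of the FluidX3D documentation for D3Q19 and not from the output above, and mlups_to_gbs is a hypothetical helper name.

```shell
#!/bin/sh
# mlups_to_gbs MLUPS BYTES_PER_CELL: estimate memory traffic in GB/s.
# MLUPs/s times bytes per cell update gives MB/s; divide by 1000 for GB/s.
# Hypothetical helper; the bytes-per-cell values passed in are assumptions.
mlups_to_gbs() {
    awk -v mlups="$1" -v bytes="$2" \
        'BEGIN { printf "%.1f\n", mlups * bytes / 1000 }'
}
```

Under those assumptions, mlups_to_gbs 43 153 gives 6.6, mlups_to_gbs 59 77 gives 4.5 and mlups_to_gbs 19 77 gives 1.5, which is in line with the 7, 5 and 1 GB/s that FluidX3D printed, and well below the roughly 25 GB/s that clpeak measured for plain global memory reads.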