Mixtile Blade 3[1] is an interesting dev board. It runs on the RK3588 SoC, the successor to the RK3399, and the same chip a whole lot of other boards use, including the QuartzPro64[2], ITX-3588J[3] and Rock Pi 5[4]. The 16GB Blade 3 is priced at $369, much more expensive than the Rock Pi 5 at $189 and the expected price of the QuartzPro64 at ~$300.
The Mixtile Blade 3, however, has a trick up its sleeve. It can network directly with other Blade 3s over PCIe, at up to what I assume to be 32Gbps (4GB/s), and there is a custom cluster case that houses 4 nodes in a single box. The vendor has helpfully set up a demo machine so customers can log in and try the board before purchase. So I took the liberty of running some OpenCL benchmarks. I need some numbers too, to decide if this is a good board.
=> [1]: Mixtile blade 3 | [2]: Pine64 QuartzPro64 | [3]: Firefly ITX-3588J | [4]: Rock Pi 5
root@blade3:~/clpeak/build# free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       520Mi        13Gi        33Mi       1.3Gi        14Gi
Swap:            0B          0B          0B
root@blade3:~/clpeak/build# ./clpeak
Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz
    Global memory bandwidth (GBPS)
      float   : 24.90
      float2  : 26.84
      float4  : 27.19
      float8  : 13.56
      float16 : 13.11
    Single-precision compute (GFLOPS)
      float   : 248.73
      float2  : 470.16
      float4  : 466.81
      float8  : 435.33
      float16 : 411.15
    Half-precision compute (GFLOPS)
      half   : 441.93
      half2  : 878.47
      half4  : 909.91
      half8  : 886.29
      half16 : 845.66
    No double precision support! Skipped
    Integer compute (GIOPS)
      int   : 125.12
      int2  : 125.74
      int4  : 125.19
      int8  : 123.79
      int16 : 124.38
    Integer compute Fast 24bit (GIOPS)
      int   : 125.30
      int2  : 125.82
      int4  : 125.16
      int8  : 123.82
      int16 : 124.39
    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.15
      enqueueReadBuffer               : 9.28
      enqueueWriteBuffer non-blocking : 8.13
      enqueueReadBuffer non-blocking  : 9.29
      enqueueMapBuffer(for read)      : 64.52
        memcpy from mapped ptr        : 10.29
      enqueueUnmap(after write)       : 65.33
        memcpy to mapped ptr          : 10.20
    Kernel launch latency : 59.75 us
Some interesting things to see in the above result:
Here's the result from mixbench[5], which tells the same story.
=> [5]: mixbench - benchmark tool for evaluating GPUs on mixed operational intensity kernels
root@blade3:~/mixbench/mixbench-opencl/build# ./mixbench-ocl-ro
mixbench-ocl/read-only (v0.04)
Use "-h" argument to see available options
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '5'.
------------------------ Device specifications ------------------------
Platform: ARM Platform
Device: Mali-LODX r0p0/ARM
Driver version: 2.1
Address bits: 64
GPU clock rate: 1000 MHz
Total global mem: 15699 MB
Max allowed buffer: 15699 MB
OpenCL version: OpenCL 2.1 v1.g6p0-01eac0.efb75e2978d783a80fe78be1bfb0efc1
Total CUs: 4
-----------------------------------------------------------------------
Buffer size: 256MB
Workgroup size: 256
Elements per workitem: 8
Workitem fusion degree: 4
Workitem stride: NDRange
Buffer allocation: Device allocated
Timer: CL event based
Warning: Double precision computations are not supported
Loading kernel source file...
Precompilation of kernels... [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]
----------------------------------------------------------------------------- CSV data -----------------------------------------------------------------------------
Experiment ID, Single Precision ops,,,, Double precision ops,,,, Half precision ops,,,, Integer operations,,,
Compute iters, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Iops/byte, ex.time, GIOPS, GB/sec
0, 0.250, 7.66, 4.38, 17.53, 0.125, 0.00, inf, inf, 0.500, 7.63, 8.79, 17.59, 0.250, 7.64, 4.39, 17.56
1, 0.750, 7.55, 13.33, 17.77, 0.375, 0.00, inf, inf, 1.500, 7.55, 26.68, 17.79, 0.750, 11.85, 8.49, 11.32
2, 1.250, 7.60, 22.08, 17.66, 0.625, 0.00, inf, inf, 2.500, 7.58, 44.25, 17.70, 1.250, 7.51, 22.34, 17.87
3, 1.750, 7.56, 31.06, 17.75, 0.875, 0.00, inf, inf, 3.500, 7.56, 62.14, 17.75, 1.750, 7.74, 30.33, 17.33
4, 2.250, 7.54, 40.06, 17.81, 1.125, 0.00, inf, inf, 4.500, 7.58, 79.65, 17.70, 2.250, 8.84, 34.16, 15.18
5, 2.750, 7.55, 48.89, 17.78, 1.375, 0.00, inf, inf, 5.500, 7.57, 97.47, 17.72, 2.750, 10.28, 35.92, 13.06
6, 3.250, 7.56, 57.73, 17.76, 1.625, 0.00, inf, inf, 6.500, 7.64, 114.24, 17.58, 3.250, 11.79, 37.00, 11.38
7, 3.750, 7.67, 65.66, 17.51, 1.875, 0.00, inf, inf, 7.500, 7.70, 130.74, 17.43, 3.750, 9.21, 54.63, 14.57
8, 4.250, 5.21, 109.44, 25.75, 2.125, 0.00, inf, inf, 8.500, 5.20, 219.21, 25.79, 4.250, 5.17, 110.28, 25.95
9, 4.750, 5.20, 122.64, 25.82, 2.375, 0.00, inf, inf, 9.500, 5.22, 244.17, 25.70, 4.750, 5.51, 115.64, 24.35
10, 5.250, 5.19, 135.72, 25.85, 2.625, 0.00, inf, inf, 10.500, 5.20, 270.86, 25.80, 5.250, 5.99, 117.68, 22.42
11, 5.750, 5.21, 148.07, 25.75, 2.875, 0.00, inf, inf, 11.500, 5.21, 296.32, 25.77, 5.750, 6.47, 119.23, 20.74
12, 6.250, 5.22, 160.78, 25.72, 3.125, 0.00, inf, inf, 12.500, 5.21, 321.74, 25.74, 6.250, 6.99, 120.01, 19.20
13, 6.750, 5.20, 174.38, 25.83, 3.375, 0.00, inf, inf, 13.500, 5.18, 349.77, 25.91, 6.750, 7.49, 120.88, 17.91
14, 7.250, 5.21, 186.84, 25.77, 3.625, 0.00, inf, inf, 14.500, 5.21, 373.61, 25.77, 7.250, 8.02, 121.30, 16.73
15, 7.750, 5.20, 200.21, 25.83, 3.875, 0.00, inf, inf, 15.500, 5.19, 400.58, 25.84, 7.750, 8.63, 120.60, 15.56
16, 8.250, 5.19, 213.29, 25.85, 4.125, 0.00, inf, inf, 16.500, 5.20, 426.15, 25.83, 8.250, 9.15, 121.05, 14.67
17, 8.750, 5.20, 226.04, 25.83, 4.375, 0.00, inf, inf, 17.500, 5.21, 450.53, 25.74, 8.750, 9.66, 121.51, 13.89
18, 9.250, 5.18, 239.84, 25.93, 4.625, 0.00, inf, inf, 18.500, 5.19, 478.80, 25.88, 9.250, 10.19, 121.84, 13.17
20, 10.250, 5.17, 266.06, 25.96, 5.125, 0.00, inf, inf, 20.500, 5.19, 530.25, 25.87, 10.250, 11.24, 122.37, 11.94
22, 11.250, 5.18, 291.23, 25.89, 5.625, 0.00, inf, inf, 22.500, 5.19, 581.48, 25.84, 11.250, 12.29, 122.86, 10.92
24, 12.250, 5.21, 315.78, 25.78, 6.125, 0.00, inf, inf, 24.500, 5.19, 633.22, 25.85, 12.250, 13.33, 123.32, 10.07
28, 14.250, 5.27, 362.84, 25.46, 7.125, 0.00, inf, inf, 28.500, 5.37, 712.89, 25.01, 14.250, 15.44, 123.85, 8.69
32, 16.250, 5.63, 387.39, 23.84, 8.125, 0.00, inf, inf, 32.500, 5.77, 755.59, 23.25, 16.250, 17.54, 124.32, 7.65
40, 20.250, 6.61, 410.87, 20.29, 10.125, 0.00, inf, inf, 40.500, 6.78, 801.87, 19.80, 20.250, 21.76, 124.90, 6.17
48, 24.250, 7.68, 423.68, 17.47, 12.125, 0.00, inf, inf, 48.500, 7.82, 831.97, 17.15, 24.250, 25.99, 125.23, 5.16
56, 28.250, 8.74, 433.81, 15.36, 14.125, 0.00, inf, inf, 56.500, 8.89, 852.87, 15.10, 28.250, 30.23, 125.42, 4.44
64, 32.250, 9.80, 441.61, 13.69, 16.125, 0.00, inf, inf, 64.500, 9.91, 873.93, 13.55, 32.250, 34.48, 125.53, 3.89
80, 40.250, 11.92, 453.21, 11.26, 20.125, 0.00, inf, inf, 80.500, 12.06, 895.56, 11.13, 40.250, 43.00, 125.64, 3.12
96, 48.250, 14.07, 460.37, 9.54, 24.125, 0.00, inf, inf, 96.500, 14.16, 914.98, 9.48, 48.250, 51.59, 125.54, 2.60
128, 64.250, 18.36, 469.77, 7.31, 32.125, 0.00, inf, inf, 128.500, 18.40, 937.59, 7.30, 64.250, 94.20, 91.55, 1.42
192, 96.250, 108.46, 119.11, 1.24, 48.125, 0.00, inf, inf, 192.500, 114.73, 225.20, 1.17, 96.250, 140.42, 92.00, 0.96
256, 128.250, 144.26, 119.32, 0.93, 64.125, 0.00, inf, inf, 256.500, 153.10, 224.86, 0.88, 128.250, 186.91, 92.09, 0.72
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
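One way to read the mixbench table is through a simple roofline model: attainable throughput is the smaller of the compute peak and memory bandwidth times operational intensity. The sketch below is only my reading of the numbers, not something mixbench produces. The two peaks are assumptions taken from the best clpeak results earlier in this post, not vendor specifications, and the sample points are copied from the single-precision columns above.

```python
# Roofline sketch for the Blade 3 GPU.
# ASSUMPTION: the peaks are the best clpeak results from this post, not vendor specs.
PEAK_GFLOPS = 470.16  # best single-precision result (float2)
PEAK_GBPS = 27.19     # best global memory bandwidth result (float4)


def attainable_gflops(flops_per_byte, peak_gflops=PEAK_GFLOPS, peak_gbps=PEAK_GBPS):
    """Upper bound on GFLOPS: bandwidth-limited or compute-limited, whichever is lower."""
    return min(peak_gflops, peak_gbps * flops_per_byte)


def ridge_point(peak_gflops=PEAK_GFLOPS, peak_gbps=PEAK_GBPS):
    """Operational intensity (flops/byte) where the two limits meet."""
    return peak_gflops / peak_gbps


# (flops/byte, measured GFLOPS) pairs from the mixbench single-precision columns.
measured = [
    (0.250, 4.38),
    (4.250, 109.44),
    (8.250, 213.29),
    (16.250, 387.39),
    (32.250, 441.61),
    (64.250, 469.77),
]

if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.1f} flops/byte")
    for intensity, gflops in measured:
        bound = attainable_gflops(intensity)
        print(f"{intensity:7.3f} flops/byte: measured {gflops:7.2f}, bound {bound:7.2f}")
```

With these assumed peaks the ridge point lands at roughly 17 flops/byte, which lines up with where the GB/sec column starts to fall off in the table. The last two rows (192 and 256 compute iterations) drop far below the bound, and this simple model does not explain that.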
I tried to get smallptGPU to work, but it ended up needing X11 to run. I could install a dummy X11 on their demo system, but decided I'm not going that far on someone else's system that is provided for free. From the numbers above, one 16GB Mixtile Blade 3 board is, GPU-wise, about 1/3 of an Nvidia Xavier board, at about 1/3 of the price but with much more connectivity.
I am very excited about the prospect of a next-generation PinePhone Pro with an RK3588: up to 32GB RAM, 8 cores and a very nice integrated GPU. A future revision of the Mixtile board with DDR5 memory would be amazing. GPU computing wise, this board is interesting. Normally we expect integer throughput to be equal to or half of floating-point throughput (1:1 or 1:2). But on the RK3588's Mali GPU it is about 1:4 (roughly 125 GIOPS against 470 GFLOPS). This makes the board unsuitable for applications like crypto mining, hash cracking, and some scientific computation. To be fair, this is a mobile SoC, so these are not its intended use.
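The integer to floating-point ratio can be checked directly against the clpeak run earlier in this post. This is a minimal sketch: the dictionaries are hand-copied from that output, and the helper names are my own, not part of clpeak.

```python
# clpeak results for the Blade 3, copied by hand from the run earlier in this post.
FP32_GFLOPS = {"float": 248.73, "float2": 470.16, "float4": 466.81,
               "float8": 435.33, "float16": 411.15}
FP16_GFLOPS = {"half": 441.93, "half2": 878.47, "half4": 909.91,
               "half8": 886.29, "half16": 845.66}
INT_GIOPS = {"int": 125.12, "int2": 125.74, "int4": 125.19,
             "int8": 123.79, "int16": 124.38}


def peak(results):
    """Best result across all vector widths."""
    return max(results.values())


def int_to_float_ratio(int_results, float_results):
    """Return N such that integer : floating-point throughput is roughly 1 : N."""
    return peak(float_results) / peak(int_results)


if __name__ == "__main__":
    print(f"int : fp32 = 1 : {int_to_float_ratio(INT_GIOPS, FP32_GFLOPS):.2f}")
    print(f"int : fp16 = 1 : {int_to_float_ratio(INT_GIOPS, FP16_GFLOPS):.2f}")
```

Using the best result per data type, integer to single-precision comes out a little under 1:4, and integer to half-precision is nearly twice as lopsided.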
All in all, not too shabby. 470 GFLOPS on an embedded system? That's faster than my laptop's Intel HD 620 integrated GPU at 380 GFLOPS. I'd also assume 26GB/s is the memory bandwidth available to the GPU. This is definitely one of the nicest boards I've ever seen. I guess it's suitable for a hyper-converged ARM cluster. Besides the nice CPU and GPU, each node is equipped with a 6TOPS NPU for AI inference, a SATA port (needs a breakout cable) for storage, two 2.5Gb Ethernet ports and, like I said above, two 8G PCIe links that can act as direct network interfaces.
I want one or more of these boards to add to my home lab. My current environment is a cluster of my HoneyComb LX2K and a Raspberry Pi. The RPi is more of a monitor in case I need direct access to the HoneyComb's BMC. These Blade 3s could add quite a lot of computing power. With the NPUs I think I can start adding BERT to my Gemini search engine. I could also run SALSA on the GPU, reducing some search time. Still, the upgrade is quite expensive for some gain, all while I'm neither out of storage space nor running into CPU limits.
I do want and need more nodes for my ongoing project using OpenDHT as a Web3 base system (man, I hate that term; I'd call it decentralized applications). With high-speed networking, it'll be easier to spot data races and scalability issues in development. I also need a backup plan in case my current server goes down for good. A Blade 3 with a SATA SSD should be up to the task.
If someone were to donate a pile of Mixtile Blade 3s, I'd be very happy to run them as a cluster and start a business. IDK, some cheap serverless service that I can rent out, allowing AI inference via API calls and load balancing. Or some sponsored Web3 research.
I also tested the GPU performance on my server with Mesa's Clover driver. It's known to be a bit slow, but I'm surprised by how slow it is. 4GB/s VRAM bandwidth. Wow.... I need to upgrade to ROCm instead of using Clover. (Oh, the GPU is an RX 560. It's the same as the RX 570, but with a different core configuration.)
❯ ./clpeak
Platform: Clover
  Device: AMD Radeon RX 570 Series (polaris10, LLVM 14.0.6, DRM 3.40, 5.10.35-00001-g107b6c90afff)
    Driver version  : 22.1.3 (Linux ARM64)
    Compute units   : 32
    Clock frequency : 1244 MHz
    Global memory bandwidth (GBPS)
      float   : 3.90
      float2  : 3.90
      float4  : 3.90
      float8  : 3.74
      float16 : 3.01
    Single-precision compute (GFLOPS)
      float   : 2517.36
      float2  : 2515.34
      float4  : 2511.42
      float8  : 2502.04
      float16 : 2492.74
    No half precision support! Skipped
    Double-precision compute (GFLOPS)
      double   : 316.87
      double2  : 316.49
      double4  : 316.32
      double8  : 315.44
      double16 : 314.53
    Integer compute (GIOPS)
      int   : 1010.15
      int2  : 1009.38
      int4  : 1007.81
      int8  : 1004.66
      int16 : 1007.57
    Integer compute Fast 24bit (GIOPS)
      int   : 4853.92
      int2  : 4568.43
      int4  : 4517.10
      int8  : 4538.71
      int16 : 4337.04
    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 4.12
      enqueueReadBuffer               : 4.01
      enqueueWriteBuffer non-blocking : 4.14
      enqueueReadBuffer non-blocking  : 4.11
      enqueueMapBuffer(for read)      : 10275.04
        memcpy from mapped ptr        : 4.14
      enqueueUnmap(after write)       : 8411.19
        memcpy to mapped ptr          : 4.07
    Kernel launch latency : 183.67 us
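To compare the two clpeak runs without reading numbers off by eye, a small parser helps. This is a rough sketch that assumes clpeak's plain-text layout: a section header ending in a unit such as (GBPS), followed by "name : value" lines. The function names and the shortened excerpts are mine, and the regexes are only checked against the output shown in this post.

```python
import re

# ASSUMPTION: section headers end with a unit in parentheses, entries are "name : number".
SECTION = re.compile(r"^(.*\S)\s*\((GBPS|GFLOPS|GIOPS)\)$")
ENTRY = re.compile(r"^(.*?\S)\s*:\s*([0-9]+(?:\.[0-9]+)?)$")


def parse_clpeak(text):
    """Return {section name: {test name: value}} from clpeak's text output."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        section = SECTION.match(line)
        if section:
            current = results.setdefault(section.group(1), {})
            continue
        entry = ENTRY.match(line)
        # Entries before the first section (e.g. "Compute units") are skipped.
        if entry and current is not None:
            current[entry.group(1)] = float(entry.group(2))
    return results


def best(results, section):
    """Best value in a section across all vector widths."""
    return max(results[section].values())


# Shortened excerpts of the two runs in this post.
BLADE3 = """
    Compute units   : 4
    Global memory bandwidth (GBPS)
      float   : 24.90
      float4  : 27.19
    Single-precision compute (GFLOPS)
      float2  : 470.16
    No double precision support! Skipped
"""

CLOVER_RX570 = """
    Compute units   : 32
    Global memory bandwidth (GBPS)
      float   : 3.90
    Single-precision compute (GFLOPS)
      float   : 2517.36
"""

if __name__ == "__main__":
    blade, clover = parse_clpeak(BLADE3), parse_clpeak(CLOVER_RX570)
    bw = best(blade, "Global memory bandwidth") / best(clover, "Global memory bandwidth")
    fp = best(clover, "Single-precision compute") / best(blade, "Single-precision compute")
    print(f"Blade 3 has {bw:.1f}x the measured memory bandwidth")
    print(f"RX 570 on Clover has {fp:.1f}x the measured FP32 throughput")
```

On these excerpts the Blade 3 shows about 7 times the measured memory bandwidth of the Clover setup, while the discrete card still has a bit over 5 times the single-precision throughput.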