Back in December I got some Mellanox ConnectX-6 Dx cards, and now I've finally gotten around to playing with them. I got them because I was interested in two features: the hardware TLS offload and the integrated switch function.
#networking #mellanox #tls #homelab
=> More information about this toot | More toots from electronic_eel@treehouse.systems
I hoped that TLS offloading would increase throughput, or at least keep throughput the same while reducing CPU load. But when you look at the throughput graph, the TLS offloading (nginx+hw) is completely useless for small transfers. I'd have needed log charts to show this properly: 8.7 MByte/s total for 500 clients repeatedly requesting a 10 kByte file, while the regular userspace-only nginx does 323 MByte/s for the same load. Even with 100 kByte requests it is still useless (83 MByte/s).
It only becomes useful somewhere in the region between 1 and 10 MByte file size.
While offloading TLS to the kernel (kTLS) has some setup cost, it pays off from shortly after 100 kByte; offloading the transmission to the network card, by contrast, seems to be much slower. Since the CPU is nearly idle during this time, it looks like setting up the hardware offload is implemented inefficiently somewhere.
=> View attached media (throughput graphs)
=> More information about this toot | More toots from electronic_eel@treehouse.systems
Another problem I found is that the TLS hardware offload only seems to support TLS 1.2 ciphers. The datasheet from Mellanox/Nvidia claims "AES-GCM 128/256-bit key" and doesn't give more details.
It worked with the TLS 1.2 cipher ECDHE-RSA-AES256-GCM-SHA384. But as soon as I switched to TLS 1.3 and tried to use, for example, TLS_AES_256_GCM_SHA384, the kernel didn't use the hardware offload anymore. I'm not a crypto expert, but I'd say that encrypting the actual data, once the TLS session has been set up, should be the same for both, so it could be a kernel issue.
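For completeness, a minimal sketch of the nginx settings to stay on the cipher that did get offloaded (KTLS in nginx needs a reasonably recent nginx built against an OpenSSL with kTLS support):

```
# sketch: pin nginx to TLS 1.2 with the AES-GCM cipher that the NIC offloaded;
# with the TLS 1.3 equivalent the kernel fell back to software kTLS
ssl_protocols TLSv1.2;
ssl_ciphers ECDHE-RSA-AES256-GCM-SHA384;
ssl_conf_command Options KTLS;
```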
=> More information about this toot | More toots from electronic_eel@treehouse.systems
Setting up the hardware offload was totally easy and painless:
ssl_conf_command Options KTLS; in nginx.conf, and then ethtool -K <interface> tls-hw-tx-offload on and ethtool -K <interface> tls-hw-rx-offload on for both NICs of the bond.
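Spelled out as commands, that's roughly the following (the interface names are placeholders for the two ports backing the bond):

```
# enable NIC TLS offload on both physical ports of the bond
# (enp65s0f0 / enp65s0f1 are placeholder interface names)
ethtool -K enp65s0f0 tls-hw-tx-offload on tls-hw-rx-offload on
ethtool -K enp65s0f1 tls-hw-tx-offload on tls-hw-rx-offload on
```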
An easy way to verify it is to look at /proc/net/tls_stat.
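If the offload is actually used, the device counters there should go up while the software counters stay flat (counter names from the kernel's TLS statistics; the exact set depends on the kernel version):

```
# check that connections take the hardware path:
# TlsTxDevice / TlsRxDevice should increase, TlsTxSw / TlsRxSw should not
cat /proc/net/tls_stat
```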
A website with some helpful info is https://delthas.fr/blog/2023/kernel-tls/, although the info there that only AES128 works seems to be outdated, as I got AES256 to work without problems as long as I stayed on TLS 1.2.
=> More information about this toot | More toots from electronic_eel@treehouse.systems
As the server I used an Epyc 7232P (8-core, Zen 2) with kernel 6.7.9 (ELRepo ml), the stock in-kernel mlx5 driver, nginx 1.22.1, and otherwise a stock Rocky Linux 9.3 install.
The client is a much faster Ryzen 7950X3D (16-core, Zen 4). I set it up like this to really benchmark the server side and not the client.
Both were equipped with a ConnectX-6 Dx card and connected with two LACP-bonded 25 GBit/s links (xmit hash layer3+4), for a theoretical bandwidth of 50 GBit/s. I did not use any TLS offloading on the client, since it is much faster than the server and never really got warm during the tests.
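For reference, that bond corresponds to roughly this iproute2 sketch (placeholder interface names and address; on Rocky it would normally be set up through NetworkManager instead):

```
# sketch: 802.3ad (LACP) bond with layer3+4 transmit hashing
ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4
ip link set enp65s0f0 down
ip link set enp65s0f0 master bond0
ip link set enp65s0f1 down
ip link set enp65s0f1 master bond0
ip link set bond0 up
ip addr add 192.0.2.10/24 dev bond0    # placeholder address
```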
On the server I used static files on disk, with sendfile on, TLS session caching off, and socket reuse on.
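In nginx terms that boils down to roughly the following sketch (with reuseport standing in for the socket reuse setting; certificate directives omitted):

```
# sketch of the relevant server-side nginx settings
http {
    sendfile on;

    server {
        listen 443 ssl reuseport;   # socket reuse via SO_REUSEPORT
        root /srv/www;              # static test files on disk
        ssl_session_cache off;      # TLS session caching off
        ssl_conf_command Options KTLS;
        # ssl_certificate / ssl_certificate_key omitted
    }
}
```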
On the client I used wrk with --header "Connection: Close" to simulate many users each downloading a single file, rather than a few users downloading a lot.
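A typical client invocation looked roughly like this (thread count, duration, and URL are placeholders; the 500 connections match what I used for the graphs):

```
# 500 concurrent connections, forced to reconnect for every request
wrk -t 16 -c 500 -d 60s --header "Connection: Close" https://server.test/file-10k.bin
```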
=> More information about this toot | More toots from electronic_eel@treehouse.systems
So I guess that with the way they implemented the TLS offload, it only makes sense when you have long-running connections that transfer many megabytes or gigabytes of data. Your regular webserver or fedi instance being hit by many clients downloading a picture doesn't benefit.
But maybe I made some mistake in benchmarking, or their closed-source drivers are much better?
Let's hope the integrated switch function is more promising. It seems I'll have to do some more reading on that; it is based on Open vSwitch, which I haven't used before.
=> More information about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel How many concurrent connections were you using?
Maybe this offload was designed for video streaming use cases. They're often the folks who are prepared to pay for features like this.
=> More information about this toot | More toots from FenTiger@mastodon.social
@FenTiger I used 500 concurrent connections divided up into several threads. I played with this number, from a few dozen to thousands, but it didn't make a notable difference, especially for the "bad" case with small requests.
Yes, video streaming could be an application that would benefit from the feature. But I suspected that CDNs and reverse proxies would also be a common application for these cards, and they would usually also need to deal with many smaller requests.
=> More information about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel Yes, I'd have thought so too.
I don't have direct experience with Mellanox HW, but I've seen what happens inside similar devices from ... shall we say, certain other vendors ... and let's just say that the "control plane" side of things, where the host passes the established connection over to the NIC, can often leave a little to be desired performance-wise.
=> More information about this toot | More toots from FenTiger@mastodon.social