Back in December I got some Mellanox ConnectX-6 Dx cards, and now I've finally gotten around to playing with them. I got them because I was interested in two features: the TLS/crypto offload and the integrated switch function.
[#]networking #mellanox #tls #homelab
=> More information about this toot | More toots from electronic_eel@treehouse.systems
I hoped that TLS offloading would increase throughput, or at least keep throughput the same while reducing CPU load. But when you look at the throughput graph, the TLS offloading (nginx+hw) is completely useless for small transfers. I'd have needed log charts to show this better: 8.7 MBytes/s total for 500 clients repeatedly requesting a 10 kByte file. The regular userspace-only nginx does 323 MBytes/s for the same load. Even with 100 kByte requests it is still useless (83 MBytes/s).
It only becomes useful somewhere in the region of 1 to 10 MBytes file size.
While offloading TLS to the kernel (kTLS) has some setup cost, it pays off shortly above 100 kBytes; offloading the encryption to the network card, on the other hand, seems to be much slower. Since the CPU is nearly idle during this time, it looks like setting up the offload is implemented inefficiently somehow.
=> View attached media | View attached media
=> More information about this toot | More toots from electronic_eel@treehouse.systems
Another problem I found is that the TLS hardware offload only seems to support TLS 1.2 ciphers. The datasheet from Mellanox/Nvidia claims "AES-GCM 128/256-bit key" and doesn't give more details.
It worked with the TLS 1.2 cipher ECDHE-RSA-AES256-GCM-SHA384. But as soon as I switched to TLS 1.3 and tried, for example, TLS_AES_256_GCM_SHA384, the kernel didn't use the hardware offload anymore. I'm not a crypto expert, but I'd say that encrypting the actual data, once the TLS session has been set up, should be the same for both, so it could be a kernel issue.
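For reference, pinning nginx to the combination that did get offloaded looks roughly like this (a minimal sketch; only ssl_protocols and ssl_ciphers matter here, the rest of the server block is assumed):

```
# restrict the vhost to the offloadable combination (sketch)
ssl_protocols TLSv1.2;                      # TLS 1.3 suites fell back to software in these tests
ssl_ciphers   ECDHE-RSA-AES256-GCM-SHA384;  # the cipher that was offloaded
ssl_prefer_server_ciphers on;
```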
=> More information about this toot | More toots from electronic_eel@treehouse.systems
Setting up the hardware offload was totally easy and painless:
ssl_conf_command Options KTLS; in nginx.conf, and then ethtool -K tls-hw-tx-offload on; ethtool -K tls-hw-rx-offload on; for both NICs of the bond.
An easy way to verify is looking at /proc/net/tls_stat.
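Spelled out, the whole setup amounts to something like this (a sketch; ens1f0np0/ens1f1np1 are placeholder names for the two bond members):

```
# nginx.conf: let nginx hand the TLS session keys to the kernel (kTLS)
ssl_conf_command Options KTLS;

# enable the NIC TLS offload on both members of the bond
# (interface names are placeholders)
ethtool -K ens1f0np0 tls-hw-tx-offload on tls-hw-rx-offload on
ethtool -K ens1f1np1 tls-hw-tx-offload on tls-hw-rx-offload on

# verify: the device Tx/Rx counters should increase while traffic flows
cat /proc/net/tls_stat
```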
A website with some helpful info is https://delthas.fr/blog/2023/kernel-tls/, although the note there that only AES128 works seems to be outdated, as I got AES256 to work without problems as long as I stayed on TLS 1.2.
=> More information about this toot | More toots from electronic_eel@treehouse.systems
As the server I used an Epyc 7232P (8-core, Zen 2) with kernel 6.7.9 (ELRepo ml), the stock in-kernel mlx5 driver, and nginx 1.22.1; the rest is a stock Rocky Linux 9.3.
The client is a much faster Ryzen 7950X3D (16-core, Zen 4). I set it up like this to really benchmark the server side and not the client.
Both were equipped with a ConnectX-6 Dx card and connected with two LACP-bonded 25 GBit/s links (xmit hash layer3+4), for a theoretical bandwidth of 50 GBit/s. I did not use any TLS offloading on the client since it is much faster than the server and never really got warm during the tests.
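With plain iproute2, the bond described above would look roughly like this (a sketch; interface names are placeholders, and a distro network manager would normally set this up instead):

```
# LACP bond over the two 25G ports (interface names are placeholders)
ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4 miimon 100
ip link set ens1f0np0 down && ip link set ens1f0np0 master bond0
ip link set ens1f1np1 down && ip link set ens1f1np1 master bond0
ip link set bond0 up
```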
I used static files on disk on the server, sendfile on, TLS session caching off, socket reuse on.
On the client I used wrk with --header "Connection: Close" to simulate many users each downloading a single file, rather than a few users downloading a lot.
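The client invocation was along these lines (a sketch; thread count, duration, hostname and file name are placeholders, only the 500 connections and the Connection: Close header come from the description above):

```
# 500 concurrent connections, one request per connection thanks to Connection: Close
# (thread count, duration, host and file name are placeholders)
wrk -t 16 -c 500 -d 60s --header "Connection: Close" \
    https://server.example/test-10k.bin
```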
=> More information about this toot | More toots from electronic_eel@treehouse.systems
So I guess that, the way they implemented the TLS offload, it only makes sense when you have long-running connections transferring many megabytes or gigabytes of data. So your regular webserver or fedi instance being hit by many clients downloading a picture doesn't benefit.
But maybe I made some mistake in benchmarking, or their closed-source drivers are much better?
Let's hope the integrated switch function is more promising. It seems I'll have to do some more reading on this; it is based on Open vSwitch, which I haven't used before.
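For later reference, the usual recipe for the embedded switch seems to be putting the eswitch into switchdev mode and letting Open vSwitch offload its flows, roughly like this (a sketch; the PCI address and interface name are placeholders):

```
# switch the ConnectX eswitch into switchdev mode (PCI address is a placeholder)
devlink dev eswitch set pci/0000:41:00.0 mode switchdev

# enable TC flower hardware offload on the uplink (interface name is a placeholder)
ethtool -K ens1f0np0 hw-tc-offload on

# let Open vSwitch push datapath flows down to the NIC
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch
```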
=> More information about this toot | More toots from electronic_eel@treehouse.systems
After experimenting with the (underwhelming) performance of the HTTPS offload on my Mellanox/Nvidia ConnectX-6 Dx smart NIC last week, I had one more idea for how the crypto offload could be useful to me: encrypting NFS on a fileserver.
My test was one client doing random reads over NFS with fio. Since the test with the many HTTPS clients last week didn't perform well, probably because setting up the TLS sessions is slow, I now did a test with just one client and one TCP connection.
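The fio job was roughly of this shape (a sketch; block size, file size, queue depth and the mount path are placeholders, only the single client doing random reads is from the description):

```
# random reads against a file on the NFS mount (all sizes/paths are placeholders)
fio --name=nfs-randread --directory=/mnt/nfs --rw=randread \
    --bs=128k --size=8g --ioengine=libaio --direct=1 \
    --iodepth=16 --numjobs=1 --runtime=60 --time_based
```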
As you can see in the chart, offloading NFS encryption with the new RPC-with-TLS mode works and is even the fastest of the encrypted NFS options on this server hardware - but it won't even saturate a 10G link and is far slower than the unencrypted variant.
As the card can also do IPsec crypto offloading, I also tried running the unencrypted NFS through an IPsec tunnel for authentication and encryption, but its performance is totally useless.
So unfortunately I'm still underwhelmed with the performance of the crypto offload.
[#]networking #nfs #mellanox
=> More information about this toot | More toots from electronic_eel@treehouse.systems
Getting the NFS RPC-with-TLS encryption offloaded was a bit more tricky:
Either kernel kTLS doesn't currently support offloading TLS 1.3 at all, or at least the Mellanox driver doesn't support it yet. The code in the mlx5 kernel driver makes it clear that it currently supports only TLS 1.2.
But NFS RPC-with-TLS is very new code, so it is designed to work only with TLS 1.3.
I had to make some dirty hacks to tlshd (the userspace daemon that initiates the TLS connection before handing it off to the kernel) to get it to work, forcing TLS 1.2 and a matching cipher on both the client and the server.
So this is probably not something you want to run in prod until the mlx5 driver gains TLS 1.3 offloading.
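For reference, the RPC-with-TLS plumbing itself (leaving the TLS 1.2 hack aside) boils down to the xprtsec option on both ends, roughly like this (a sketch; export path, network and hostname are placeholders, and it assumes a recent kernel plus nfs-utils/ktls-utils with tlshd):

```
# /etc/exports on the server: require TLS for this export (path/network are placeholders)
/srv/export  192.0.2.0/24(rw,xprtsec=tls)

# on client and server: tlshd does the TLS handshake on behalf of the kernel
systemctl enable --now tlshd

# on the client: mount with a TLS-protected transport
mount -t nfs -o vers=4.2,xprtsec=tls server.example:/srv/export /mnt/nfs
```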
=> More information about this toot | More toots from electronic_eel@treehouse.systems
Since I now had just one TCP connection, LACP-bundling my two links became useless: with xmit_hash_policy=layer3+4 all packets of a single connection go over the same link. So 25 GBit/s, or roughly 3.1 GBytes/s, was the theoretical limit.
I could see several nfsd kernel threads being used and spread over different cores, so the NFS part benefits from multiple cores. But the kTLS part probably doesn't, because all the data is stuffed into one TCP connection in the end. Maybe there is some path for future optimization? NFS RPC-with-TLS is very new code, so I have some hope that its speed will improve in the future.
=> More information about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel do you have hardware RSS queues configured? and does nfsd support RSS-aware pinned threads? this massively improved my CPU/IPC overheads when I was doing similar stuff in Samba.
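(For context, the hardware RSS queue count and spreading are usually configured with ethtool, roughly like this; interface name and queue count are placeholders:)

```
# use 8 combined hardware queues and spread RSS over all of them
# (interface name and queue count are placeholders)
ethtool -L ens1f0np0 combined 8
ethtool -X ens1f0np0 equal 8

# inspect the resulting channels and indirection table
ethtool -l ens1f0np0
ethtool -x ens1f0np0
```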
=> More information about this toot | More toots from gsuberland@chaos.social
@electronic_eel I do wish there was a standard extension (e.g. at the ethernet frame layer) for providing link aggregation hash overrides, though, so protocols like nfsd and samba could send some sort of link index in each frame and all the links along the way would distribute accordingly, ignoring the default hashing. right now the closest thing is SMB Multi-Channel but that involves linking two systems on two subnets which causes all sorts of other annoyances.
=> More information about this toot | More toots from gsuberland@chaos.social
@gsuberland @electronic_eel What I wish there was is a standard for doing byte/block-level striping for link aggregation, like how four 10G lanes are bonded to make 40G, or four 25G lanes to make 100G.
You'd need to have the PHYs tightly synchronized to enable line-code-level striping like that, but it'd be so much more performant than 802.3ad.
=> More information about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @gsuberland @electronic_eel we don't talk about QSFP+ in non-cussing circles; 4*25 is fine though, as it has FEC per lane.
=> More information about this toot | More toots from RichiH@chaos.social
@azonenberg @gsuberland @electronic_eel (I know you're using QSFP+ on your short hauls, but you're also using multimode; you like living on a knife's edge)
=> More information about this toot | More toots from RichiH@chaos.social
@RichiH @gsuberland @electronic_eel No, I know my cable plant and my infrastructure and what it's capable of.
BER on all of my links is excellent; I've moved terabytes of data without a single FCS error logged on any of the links I checked since the switch was last rebooted (last fall).
=> More information about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @RichiH @gsuberland yeah, if it is properly installed, multimode is fine for in-house stuff. All new things I install use singlemode, but I see no reason to replace a solid multimode run yet.
=> More information about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel @RichiH @gsuberland Also the larger core diameter of MMF makes it more tolerant to dust or scratches on the fiber faces.
=> More information about this toot | More toots from azonenberg@ioc.exchange
@electronic_eel @RichiH @gsuberland (this was a significant design consideration for a cable plant that was going to be terminated in keystone boxes cut into drywall, the risk of particulate contamination was nontrivial)
=> More information about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @electronic_eel @gsuberland yah, we had that discussion. In a very hands-on, high-introspection, short-range, high-knowledge, small-scale situation it's fine. I do disagree that "resilience to scratches" is a design goal in fiber scenarios, though.
Yet, it feels to me like FR4 likely feels to you.
=> More information about this toot | More toots from RichiH@chaos.social