Ancestors

Written by Electronic Eel on 2024-03-11 at 22:03

Back in December I got some Mellanox ConnectX-6 Dx cards, and now I finally got around to playing with them. I got them because I was interested in two features:

#networking #mellanox #tls #homelab

=> More informations about this toot | More toots from electronic_eel@treehouse.systems

Written by Electronic Eel on 2024-03-11 at 22:06

I hoped that TLS offloading would increase throughput, or at least keep throughput the same while reducing CPU load. But when you look at the throughput graph, the TLS offloading (nginx+hw) is completely useless for small transfers. I'd have needed log charts to show this better: 8.7 MBytes/s total for 500 clients repeatedly requesting a 10 kByte file. The regular userspace-only nginx can do 323 MBytes/s for the same load. Even with 100 kByte requests the offload is still useless (83 MBytes/s).

It only becomes useful in the region somewhere between 1 and 10 MBytes file size.

Offloading TLS to the kernel (kTLS) has some setup cost, but it pays off from shortly after 100 kBytes. Offloading the transmission to the network card, in contrast, seems to be much slower. Since the CPU is nearly idle during this time, it seems like setting up the offload is somehow implemented inefficiently.
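
One way to read these numbers is to convert throughput into completed requests per second. The following is only a rough sketch: it assumes decimal units (1 MByte = 1000 kBytes) and that every request opens a fresh TLS connection; the function and variable names are made up for illustration.

```python
# Convert the measured throughput into requests per second.
# Assumption: 1 MByte = 1000 kBytes (decimal units).

def requests_per_second(throughput_mbytes_s: float, file_kbytes: float) -> float:
    """Completed requests per second for a given throughput and file size."""
    return throughput_mbytes_s * 1000 / file_kbytes

# (label, throughput in MBytes/s, file size in kBytes), numbers from the text above
measurements = [
    ("nginx+hw, 10 kBytes", 8.7, 10),
    ("nginx userspace, 10 kBytes", 323, 10),
    ("nginx+hw, 100 kBytes", 83, 100),
]

for label, mbytes_s, kbytes in measurements:
    print(f"{label}: {requests_per_second(mbytes_s, kbytes):.0f} requests/s")
```

Under these assumptions the hardware offload ends up at roughly 850 requests/s for both file sizes, which would fit a fixed per-connection setup cost rather than a bandwidth limit.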

Written by Electronic Eel on 2024-03-11 at 22:06

Another problem I found is that the TLS hardware offload only seems to support TLS 1.2 ciphers. The datasheet from Mellanox/Nvidia claims "AES-GCM 128/256-bit key" and doesn't give more details.

It worked with the TLS 1.2 cipher ECDHE-RSA-AES256-GCM-SHA384. But as soon as I switched to TLS 1.3 and tried to use for example TLS_AES_256_GCM_SHA384, the kernel didn't use the hardware offload anymore. I'm not a crypto expert, but I'd say that encrypting the actual data after setting up the TLS session once should be the same for both. So it could be a kernel issue.
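
To check from the client side what a server actually negotiates, a test client can be pinned to TLS 1.2 and the cipher that worked. This is a minimal sketch using Python's standard `ssl` module; the helper names `tls12_only_context` and `negotiated` are hypothetical, and a self-signed test server would additionally need its certificate loaded with `load_verify_locations`.

```python
import socket
import ssl

def tls12_only_context(cipher: str = "ECDHE-RSA-AES256-GCM-SHA384") -> ssl.SSLContext:
    """Client context that can only negotiate TLS 1.2 with the given cipher."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    ctx.set_ciphers(cipher)  # OpenSSL cipher string, applies to TLS 1.2 and below
    ctx.load_default_certs()  # trust the system CA store
    return ctx

def negotiated(host: str, port: int = 443) -> tuple:
    """Connect once and return (TLS version, cipher name) as negotiated."""
    ctx = tls12_only_context()
    with socket.create_connection((host, port), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.cipher()[0]

ctx = tls12_only_context()
print(ctx.minimum_version, ctx.maximum_version)
```

Calling `negotiated("your.server.example")` against the test server and then looking at the server's offload counters would show whether that exact version and cipher combination ends up on the NIC.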

Written by Electronic Eel on 2024-03-11 at 22:07

Setting up the hardware offload was totally easy and painless:

ssl_conf_command Options KTLS; in nginx.conf and then ethtool -K <dev> tls-hw-tx-offload on; ethtool -K <dev> tls-hw-rx-offload on; for both NICs of the bond (<dev> being a placeholder for each NIC's interface name).

An easy way to verify is looking at /proc/net/tls_stat.
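
A small script can make that check repeatable. The sketch below parses the "name value" lines of /proc/net/tls_stat and compares the `TlsTxSw` and `TlsTxDevice` counters, which the kernel TLS documentation describes as TX sessions opened with host and NIC cryptography respectively. The helper names are hypothetical.

```python
from pathlib import Path

def parse_tls_stat(text: str) -> dict[str, int]:
    """Parse 'Name value' lines as found in /proc/net/tls_stat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def offload_summary(counters: dict[str, int]) -> str:
    """Compare sessions handled in software with sessions offloaded to the NIC."""
    sw = counters.get("TlsTxSw", 0)
    hw = counters.get("TlsTxDevice", 0)
    return f"TX sessions opened: {sw} software kTLS, {hw} NIC offload"

if __name__ == "__main__":
    stat_file = Path("/proc/net/tls_stat")
    if stat_file.exists():  # only present when the tls kernel module is loaded
        print(offload_summary(parse_tls_stat(stat_file.read_text())))
    else:
        print("no /proc/net/tls_stat found")
```

Running it before and after a benchmark run shows whether new sessions were counted under the device counter or silently fell back to software.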

A website with some helpful info is https://delthas.fr/blog/2023/kernel-tls/ , although its claim that only AES128 works seems to be outdated, as I got AES256 to work without problems as long as I stayed with TLS 1.2.

Written by Electronic Eel on 2024-03-11 at 22:07

As the server I used an Epyc 7232P (8-core, Zen 2) with kernel 6.7.9 (elrepo ml), the stock kernel mlx5 driver and nginx 1.22.1; the rest is a stock Rocky Linux 9.3.

The client is a much faster Ryzen 7950X3D (16-core, Zen 4). I set it up like this to really benchmark the server side and not the client.

Both were equipped with a ConnectX-6 Dx card and connected with 2 LACP-bonded 25 GBit/s links, xmit hash layer3+4, so a theoretical bandwidth of 50 GBit/s. I did not use any TLS offloading on the client since it is much faster than the server and never really got warm during the tests.

I used static files on disk on the server, sendfile on, TLS session caching off, socket reuse on.

On the client I used wrk with --header "Connection: Close" to simulate many users each downloading a single file, rather than a few users downloading a lot.

Written by Electronic Eel on 2024-03-11 at 22:23

So I guess that, the way they implemented the TLS offload, it only makes sense when you have long-running connections that transfer many megabytes or gigabytes of data. Your regular webserver or fedi instance being hit by many clients downloading a picture doesn't profit.

But maybe I made some mistake in benchmarking or their closed source drivers are much better?

Let's hope the integrated switch function is more promising. It seems I'll have to do some more reading on this, since it is based on Open vSwitch, which I haven't used before.

Written by Electronic Eel on 2024-03-17 at 19:13

After experimenting with the (underwhelming) performance of HTTPS offload of my Mellanox/Nvidia ConnectX-6 Dx smart NIC last week, I had one more idea how the crypto offload could be useful to me: encrypting NFS on a fileserver.

My test was one client doing random reads over NFS with fio. Since last week's test with many HTTPS clients performed poorly, probably because setting up the TLS session is slow, I now ran a test with just one client and one TCP connection.

As you can see in the chart, offloading NFS encryption with the new RPC-with-TLS mode works and is even the fastest NFS option on this server hardware - but it won't even saturate a 10G link and is far slower than the unencrypted variant.

As the card is also able to do IPsec crypto offloading, I also tried running the unencrypted NFS through an IPsec tunnel for auth and encryption, but the performance of that setup is totally useless.

So unfortunately I'm still underwhelmed with the performance of the crypto offload.

#networking #nfs #mellanox

Written by Electronic Eel on 2024-03-17 at 19:13

Getting the NFS RPC-with-TLS encryption offloaded was a bit more tricky:

Either the kernel ktls doesn't currently support offloading TLS 1.3 at all or at least the Mellanox driver doesn't support it yet. The code in the mlx5 kernel driver makes it clear that it currently just supports TLS 1.2.

But NFS RPC-with-TLS is very new code, so it is only designed to work with TLS 1.3.

I had to make some dirty hacks to tlshd (the userspace daemon that initiates the TLS connection before handing it off to the kernel) to get it to work by forcing TLS 1.2 and a matching cipher on the client and server.

So this is probably not something you want to run in prod until the mlx5 driver gains TLS 1.3 offloading.

Written by Electronic Eel on 2024-03-17 at 19:13

Since I now had just one TCP connection, LACP-bundling my two links became useless, because xmit_hash_policy=layer3+4 results in all packets being sent over the same link. So 25 GBit/s, or roughly 3.1 GBytes/s, was the theoretical limit.
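
Why a single connection is stuck on one link can be sketched with a simplified layer3+4 hash. The formula below follows the one given in older Linux bonding documentation; current kernels compute the flow hash differently in detail, so treat it as a model only. The IP addresses and ports are made up.

```python
import ipaddress

def layer3_4_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int, links: int = 2) -> int:
    """Simplified layer3+4 transmit hash: pick a link index for one flow."""
    sip = int(ipaddress.ip_address(src_ip))
    dip = int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ ((sip ^ dip) & 0xFFFF)) % links

# One NFS connection: every packet shares the same 4-tuple, hence the same link.
nfs_flow = ("10.0.0.1", "10.0.0.2", 50000, 2049)
single = {layer3_4_link(*nfs_flow) for _ in range(1000)}

# Many client connections differ in source port and spread over both links.
many = {layer3_4_link("10.0.0.1", "10.0.0.2", port, 443) for port in range(50000, 50010)}

print(len(single), "link used by the single flow,", len(many), "links used by many flows")

link_bytes_per_s = 25e9 / 8  # one 25 GBit/s link expressed in bytes/s
print(link_bytes_per_s / 1e9, "GBytes/s per link")
```

The hash depends only on addresses and ports, so nothing about the amount of data in a flow can move it to the second link.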

I could see several nfsd kernel threads being used and spread over different cores, so the NFS part profits from multiple cores. The kTLS part probably doesn't, because all the data is stuffed into one TCP connection in the end. Maybe there is some path for future optimization? NFS RPC-with-TLS is very new code, so I have some hope that its speed will improve in the future.

Written by Graham Sutherland / Polynomial on 2024-03-17 at 19:19

@electronic_eel do you have hardware RSS queues configured? and does nfsd support RSS-aware pinned threads? this massively improved my CPU/IPC overheads when I was doing similar stuff in Samba.

=> More informations about this toot | More toots from gsuberland@chaos.social

Toot

Written by Graham Sutherland / Polynomial on 2024-03-17 at 19:25

@electronic_eel I do wish there was a standard extension (e.g. at the ethernet frame layer) for providing link aggregation hash overrides, though, so protocols like nfsd and samba could send some sort of link index in each frame and all the links along the way would distribute accordingly, ignoring the default hashing. right now the closest thing is SMB Multi-Channel but that involves linking two systems on two subnets which causes all sorts of other annoyances.

Descendants

Written by Andrew Zonenberg on 2024-03-17 at 19:29

@gsuberland @electronic_eel What I wish there was is a standard for doing byte/block level striping for link aggregation. Like how four 10G lanes are bonded to make 40G or four 25G to make a 100G.

You'd need to have the PHYs tightly synchronized to enable line-code-level striping like that but it'd be so much more performant than 802.3ad.

=> More informations about this toot | More toots from azonenberg@ioc.exchange

Written by Andrew Zonenberg on 2024-03-17 at 19:30

@gsuberland @electronic_eel In particular being able to stripe a single flow across two links and actually get double throughput.

Written by Electronic Eel on 2024-03-17 at 19:33

@azonenberg @gsuberland yeah that would be really nice to have.

I guess it would only work with two slots on one NIC or switch, since it would need proper sync between the links. But such a limitation would be something I could live with.

Written by Andrew Zonenberg on 2024-03-17 at 19:34

@electronic_eel @gsuberland Yeah exactly. Being able to bond 2/4 adjacent ports in a standard way would be really nice.

Written by Joel Michael on 2024-03-17 at 23:03

@azonenberg @electronic_eel @gsuberland hm I’ll have to dig into the spec, but does it actually need to be synchronised? I mean, you could just spray and pray, and let TCP deal with any out of order frames. The only reason LACP does N-way hashing to send a single flow down a single link is to avoid out of order frames. You can do channel-bonding without LACP - Cisco calls it EtherChannel, and it’s what you get when you set “mode on” on a channel-group instead of “mode active” or “mode passive” which starts LACP.

=> More informations about this toot | More toots from jpm@aus.social

Written by Andrew Zonenberg on 2024-03-17 at 23:05

@jpm @electronic_eel @gsuberland No, I'm talking about striping bytes (or well, line code symbols), not packets. Which is how 40/100G do it.

Written by Andrew Zonenberg on 2024-03-17 at 23:06

@jpm @electronic_eel @gsuberland As in it would simultaneously use multiple physical interfaces to send single packets. Like multilane PCIe.
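
The idea of striping a single packet over several lanes can be sketched as follows. This is a toy model only: the 8-byte block size and the function names are arbitrary choices for illustration, and real multi-lane PHYs work on line-code blocks with alignment markers rather than raw bytes.

```python
def stripe(frame: bytes, lanes: int, block: int = 8) -> list[list[bytes]]:
    """Deal fixed-size blocks of one frame round-robin onto the lanes."""
    blocks = [frame[i:i + block] for i in range(0, len(frame), block)]
    return [blocks[lane::lanes] for lane in range(lanes)]

def reassemble(per_lane: list[list[bytes]]) -> bytes:
    """Interleave the lane streams again; relies on the lanes staying in lockstep."""
    lanes = len(per_lane)
    total = sum(len(lane) for lane in per_lane)
    return b"".join(per_lane[i % lanes][i // lanes] for i in range(total))

frame = bytes(range(100))           # stand-in for one Ethernet frame
per_lane = stripe(frame, lanes=2)   # each lane carries about half of the frame
assert reassemble(per_lane) == frame
print([sum(len(b) for b in lane) for lane in per_lane], "bytes per lane")
```

Because every frame is split across all lanes, a single flow gets the combined bandwidth, but the receiver can only interleave correctly if the lanes are tightly synchronized.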

Written by Joel Michael on 2024-03-18 at 00:39

@azonenberg @electronic_eel @gsuberland oof ok I’ll definitely need to read the 802.3 spec properly

Written by Electronic Eel on 2024-03-17 at 23:06

@jpm @azonenberg @gsuberland if you let TCP deal with out-of-order packets, performance will suffer seriously. So unless you are able to convince your application to utilize multiple TCP sessions, any kind of bonding short of the symbol striping @azonenberg is talking about is far inferior.

Written by Richard "RichiH" Hartmann on 2024-03-17 at 20:10

@azonenberg @gsuberland @electronic_eel we don't talk about QSFP+ in non-cussing circles; 4*25 is fine though, as it has FEC per lane.

=> More informations about this toot | More toots from RichiH@chaos.social

Written by Richard "RichiH" Hartmann on 2024-03-17 at 20:14

@azonenberg @gsuberland @electronic_eel (I know you're using QSFP+ on your short hauls, but you're also using multimode; you like living on a knife's edge)

Written by Andrew Zonenberg on 2024-03-17 at 20:17

@RichiH @gsuberland @electronic_eel No, I know my cable plant and my infrastructure and what it's capable of.

BER on all of my links is excellent, I've moved terabytes of data without a single FCS error logged on any of the ones I checked since the switch was last rebooted (last fall).

Written by Electronic Eel on 2024-03-17 at 20:21

@azonenberg @RichiH @gsuberland yeah, if it is properly installed, multimode is fine for inhouse stuff. All new things I install use singlemode, but I see no reason to replace a solid multimode run yet.

Written by Andrew Zonenberg on 2024-03-17 at 20:41

@electronic_eel @RichiH @gsuberland Also the larger core diameter of MMF makes it more tolerant to dust or scratches on the fiber faces.

Written by Andrew Zonenberg on 2024-03-18 at 07:12

@electronic_eel @RichiH @gsuberland (this was a significant design consideration for a cable plant that was going to be terminated in keystone boxes cut into drywall, the risk of particulate contamination was nontrivial)

Written by Richard "RichiH" Hartmann on 2024-03-18 at 07:39

@azonenberg @electronic_eel @gsuberland yah, we had that discussion. In a very hands on, high introspection, short range, high knowledge, small scale situation it's fine. I do disagree that "resilience to scratches" is a design goal in fiber scenarios though.

Yet, it feels to me like FR4 likely feels to you.

Written by Electronic Eel on 2024-03-17 at 19:30

@gsuberland I have played with enforcing round-robin bonding in the past. Even when the machines are directly connected, I couldn't get any real speed improvement when having just one TCP connection. It seems even the small timing deviations between the links throw off the TCP window scaling etc.

So when I need one fast connection, I have resorted to upgrading the whole link to 40G or 100G instead of bonding. Bonding with LACP is still good for redundancy, especially when combined with MLAG on two different switches.

Written by Graham Sutherland / Polynomial on 2024-03-17 at 19:31

@electronic_eel yup, I came out with the same answer. it's why I'm going 40G here.
