Back in December I got some Mellanox ConnectX-6 Dx cards, and now I finally got around to playing with them. I got them because I was interested in two features:
[#]networking #mellanox #tls #homelab
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
I hoped that TLS offloading would increase throughput, or at least keep throughput the same but reduce CPU load. But when you look at the throughput graph, the TLS offloading (nginx+hw) is completely useless for small transfers. I'd have needed log-scale charts to show this properly: it reaches just 8.7 MBytes/s in total for 500 clients repeatedly requesting a file of 10 kBytes. The regular userspace-only nginx can do 323 MBytes/s for the same load. Even with 100 kBytes requests it is still useless (83 MBytes/s).
It only becomes useful in the region somewhere between 1 and 10 MBytes file size.
Offloading TLS to the kernel (kTLS) has some setup cost, but it pays off from shortly after 100 kBytes. Offloading the transmission to the network card seems to be much slower. Since the CPU is nearly idle during this time, it seems like setting up the offload is somehow implemented inefficiently.
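A quick back-of-the-envelope conversion of the throughput figures above into requests per second may make the setup cost easier to see. This is only a sketch: it assumes decimal units (1 MByte = 1000 kBytes), ignores HTTP and TLS overhead, and the function name is made up for illustration.

```python
def requests_per_second(throughput_mbytes_s: float, file_kbytes: float) -> float:
    """Convert aggregate throughput into completed requests per second.

    Assumes decimal units (1 MByte = 1000 kBytes) and ignores protocol overhead.
    """
    return throughput_mbytes_s * 1000.0 / file_kbytes


# Figures quoted in the toot above.
hw_10k = requests_per_second(8.7, 10)    # nginx + hardware offload, 10 kByte files
sw_10k = requests_per_second(323, 10)    # userspace-only nginx, 10 kByte files
hw_100k = requests_per_second(83, 100)   # nginx + hardware offload, 100 kByte files

print(f"hw offload, 10 kBytes:  {hw_10k:.0f} req/s")
print(f"userspace,  10 kBytes:  {sw_10k:.0f} req/s")
print(f"hw offload, 100 kBytes: {hw_100k:.0f} req/s")
```

Under these assumptions the offloaded setup completes a similar number of requests per second (somewhat below 900) for both file sizes, which would be consistent with a roughly fixed per-connection cost dominating the small transfers.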
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Another problem I found is that the TLS hardware offload only seems to support TLS 1.2 ciphers. The datasheet from Mellanox/Nvidia claims "AES-GCM 128/256-bit key" and doesn't give more details.
It worked with the TLS 1.2 cipher ECDHE-RSA-AES256-GCM-SHA384. But as soon as I switched to TLS 1.3 and tried to use for example TLS_AES_256_GCM_SHA384, the kernel didn't use the hardware offload anymore. I'm not a crypto expert, but I'd say that encrypting the actual data after setting up the TLS session once should be the same for both. So it could be a kernel issue.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Setting up the hardware offload was totally easy and painless:
I put ssl_conf_command Options KTLS; into nginx.conf and then ran ethtool -K <dev> tls-hw-tx-offload on; ethtool -K <dev> tls-hw-rx-offload on; for both NICs of the bond (<dev> being the interface name).
An easy way to verify is looking at /proc/net/tls_stat.
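The check against /proc/net/tls_stat could be scripted roughly like this. It is a minimal sketch, assuming the counter names from the kernel TLS documentation (TlsTxDevice, TlsRxDevice and so on); the helper names parse_tls_stat and offload_in_use are made up for illustration.

```python
from pathlib import Path


def parse_tls_stat(text: str) -> dict[str, int]:
    """Parse the 'Name <whitespace> value' lines of /proc/net/tls_stat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def offload_in_use(counters: dict[str, int]) -> bool:
    """True if at least one TLS session was handed to a NIC (TX or RX)."""
    return counters.get("TlsTxDevice", 0) > 0 or counters.get("TlsRxDevice", 0) > 0


if __name__ == "__main__":
    stat_file = Path("/proc/net/tls_stat")
    if stat_file.exists():  # the file may be absent, e.g. without the tls module
        stats = parse_tls_stat(stat_file.read_text())
        for name, value in sorted(stats.items()):
            print(f"{name:24} {value}")
        print("hardware offload seen:", offload_in_use(stats))
    else:
        print("/proc/net/tls_stat not found")
```

Running this before and after a benchmark and comparing the TlsTxSw and TlsTxDevice counters should show whether new sessions ended up in software kTLS or on the card.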
A website with some helpful info is https://delthas.fr/blog/2023/kernel-tls/ , although the info that only AES128 works seems to be outdated as I got AES256 to work without problems as long as I stayed in TLS1.2.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
As server I used an Epyc 7232P (8-core, Zen 2) with kernel 6.7.9 (elrepo ml), stock kernel mlx5 driver, nginx 1.22.1 and the rest is a stock Rocky 9.3 linux.
The client is a much faster Ryzen 7950X3D (16-core, Zen 4). I set it up like this to really benchmark the server side and not the client.
Both were equipped with a ConnectX-6 Dx card and connected with 2 LACP-bonded 25 GBit/s links, xmit hash layer3+4, so a theoretical bandwidth of 50 GBit/s. I did not use any TLS offloading on the client since it is much faster than the server and never really got warm during the tests.
I used static files on disk on the server, sendfile on, TLS session caching off, socket reuse on.
On the client I used wrk with --header "Connection: Close" to simulate many users downloading a single file and not a few downloading much.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
So I guess the way they implemented the TLS offload, it only makes sense when you have long-running connections that are used to transfer many Mega- or Gigabytes of data. So your regular webserver or fedi instance being hit by many clients downloading a picture doesn't profit.
But maybe I made some mistake in benchmarking or their closed source drivers are much better?
Let's hope the integrated switch function is more promising. I'll have to do some more reading on this it seems, it is based on Open vSwitch which I haven't used before.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
After experimenting with the (underwhelming) performance of HTTPS offload of my Mellanox/Nvidia ConnectX-6 Dx smart NIC last week, I had one more idea how the crypto offload could be useful to me: encrypting NFS on a fileserver.
My test was one client doing random reads over NFS with fio. Since the test with the many HTTPS clients last week didn't perform well probably due to setting up the TLS session being slow, I now did a test with just one client and one TCP connection.
As you can see in the chart, offloading NFS encryption with the new RPC-with-TLS mode works and is even the fastest NFS option on this server hardware - but it won't even saturate a 10G link and is far slower than the unencrypted variant.
As the card is also able to do IPSec crypto offloading I also tried running the unencrypted NFS through an IPSec tunnel for auth and encryption, but the performance of it is totally useless.
So unfortunately I'm still underwhelmed with the performance of the crypto offload.
[#]networking #nfs #mellanox
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Getting the NFS RPC-with-TLS encryption offloaded was a bit more tricky:
Either the kernel ktls doesn't currently support offloading TLS 1.3 at all or at least the Mellanox driver doesn't support it yet. The code in the mlx5 kernel driver makes it clear that it currently just supports TLS 1.2.
But NFS RPC-with-TLS is very new code, so it is only designed to work with TLS 1.3.
I had to make some dirty hacks to tlshd (the userspace daemon that initiates the TLS connection before handing it off to the kernel) to get it to work by forcing TLS 1.2 and a matching cipher on the client and server.
So this is probably not something you want to run in prod until the mlx5 driver gains TLS 1.3 offloading.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Since I now had just one TCP connection, LACP-bundling my two links became useless since the xmit_hash_policy=layer3+4 results in all packets being sent over the same link. So 25 GBit/s or roughly 3.1 GBytes/s was the theoretical limit.
I could see several nfsd kernel threads being used and being spread over different cores. So the NFS part profits from multiple cores. But probably the ktls part doesn't because all the data is stuffed into one TCP connection in the end. Maybe there is some path for future optimization? NFS RPC-with-TLS is very new code, so I have some hope that its speed will improve in the future.
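Why a single TCP connection stays on one link can be illustrated with a small sketch of the layer3+4 hash. It follows the simplified formula from the kernel bonding documentation, which is only an approximation of what the driver really computes; the IP addresses and port numbers are made-up examples.

```python
import ipaddress


def layer34_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 n_links: int = 2) -> int:
    """Pick a bond member for a flow (simplified layer3+4 policy)."""
    h = (src_port << 16) | dst_port
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % n_links


# One TCP connection: the 4-tuple never changes, so the link never changes.
single = {layer34_link("192.0.2.10", "192.0.2.20", 40000, 2049) for _ in range(1000)}
print("links used by one connection:", single)

# Several connections from different source ports can land on different links.
many = {layer34_link("192.0.2.10", "192.0.2.20", port, 2049)
        for port in range(40000, 40008)}
print("links used by eight connections:", many)

# Line rate of a single 25 GBit/s link in GBytes/s (decimal units).
print("single-link limit:", 25e9 / 8 / 1e9, "GBytes/s")
```

The hash depends only on addresses and ports, so every packet of a given connection maps to the same bond member, and only additional connections can make use of the second link.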
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
My guess is that the reason for the limited speed with the crypto offloading of my ConnectX-6 Dx is that the IC on the network card runs into its limit in encryption speed for one connection or TLS flow. Combine that with the time-consuming setup of each connection/TLS flow, and the usefulness of the whole idea gets smaller and smaller.
Or is this something they improved with the next gen ConnectX-7? I haven't seen Mellanox/Nvidia post any figures about crypto offload speed...
Does anybody know of some figures or has tested it? Maybe @manawyrm ?
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Since I was already benchmarking this topic, I thought about what the best way to improve encrypted fileserver performance would be.
So I replaced my slowish Epyc 7232P (8-core, Zen 2) with another Ryzen 7950X3D (16-core, Zen 4) like on my client.
As you can see investing in a faster CPU gives much better results than the crypto offloading and either NFS RPC-with-TLS or Samba become reasonable options.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Thanks to @gsuberland for pointing me to his blog post about multichannel configuration on Samba. This triggered me to research whether there is something similar for NFS, and there is: the nconnect= mount option.
And as you can see in the chart below it makes some difference...
In this scenario the hw crypto offload gets you quite near to the unencrypted performance. Also the bonded channels are properly utilized too. The only issue is that your application on the client has to send multiple requests in parallel to make use of this.
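What "multiple requests in parallel" can look like on the client side is sketched below: a file is read in chunks from several threads at once. This is only an illustration; the chunk size, the worker count and the function names are assumptions, and whether the reads are actually spread over the nconnect connections is up to the kernel NFS client.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _read_at(fd: int, length: int, offset: int) -> bytes:
    """Read up to `length` bytes at `offset`, retrying after short reads."""
    buf = bytearray()
    while len(buf) < length:
        part = os.pread(fd, length - len(buf), offset + len(buf))
        if not part:  # end of file
            break
        buf += part
    return bytes(buf)


def parallel_read(path: str, chunk_size: int = 1 << 20, workers: int = 8) -> bytes:
    """Read a whole file by issuing chunk-sized reads from several threads."""
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # os.pread takes an explicit offset, so all threads can share one fd.
            chunks = pool.map(lambda off: _read_at(fd, chunk_size, off), offsets)
            return b"".join(chunks)
    finally:
        os.close(fd)
```

A call like parallel_read("/mnt/nfs/bigfile") (hypothetical mount point) keeps several read requests in flight at the same time, which is the kind of access pattern fio generates with a higher queue depth or several jobs.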
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@gsuberland I tried to get this to work with samba too, but it seems like the samba multichannel implementation sticks to just one TCP connection per network card/link (two in my case), so it doesn't scale by available cores and the performance isn't really boosted in comparison to the regular singlechannel config.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
I'm continuing with my tests of the Mellanox/Nvidia ConnectX-6 Dx cards I got.
Over the last days I tested the integrated switching capabilities, called ASAP² by Mellanox. So what is it good for? Providing fast network access to virtual machines.
The conventional methods for doing this for KVM-based VMs on Linux are a kernel bridge device, MacVTap or regular routing. But as you can see in my benchmark graph, these methods have quite severe performance limits.
The alternative is Single Root I/O Virtualization (SR-IOV). A network adapter with this capability (nearly all server adapters offer this today) splits out several "virtual function" PCIe devices. Then the virtual function devices are made directly accessible to the VMs with the IOMMU of the CPU. It is faster because now the CPU doesn't have to do context switching between the VM and host.
While SR-IOV is available for quite some time now (it was first introduced around 2008), the implementations often had a few downsides:
ports and SR-IOV virtual functions together. This switch can be controlled with the kernel
switchdev interface and it is able to apply complex switching rules.
As you can see in the light-blue line in my benchmark graph, it is able to do LACP bonding of two physical ports and apply a layer3+4 xmit_hash_policy to utilize both bonded ports. So a VM is hooked up with just one virtual function and doesn't have to care about bonding at all. If either port of the bond is disconnected, the other one is used (I tested this to be really sure it is supported).
This is quite a good feature and something I haven't seen from other vendors.
[#]networking #mellanox #homelab #virtualization
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
Controlling the switch is done with the kernel switchdev driver and the devlink and tc tools. Basic rules like VLAN-tagging are supported of course, but you can also do more complex things like L3 routing and routing based on TCP port numbers. So you could for example take one IPv4 and divide it among several VMs based on port numbers.
tc and devlink are the more barebones interface to this. Their manual suggests using Open vSwitch to manage this. What it does is quite clever: it implements a quite capable software switch with the OpenFlow rule language, a management process and its own small database backend. Packets are sent to this software switch first and (slowly) switched in software according to the rules you set.
When this first packet is forwarded, the management process also calculates the minimal rules that were necessary to forward this packet and subsequent similar ones. Then it creates a tc rule to offload this to hardware, so the following packets are switched purely in hardware. This ensures that only the rules that are actually used right now are configured on the switch ASIC, reducing bloat on the ASIC and improving switching speed.
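The "first packet in software, the rest in hardware" idea can be modelled with a toy flow cache. This is purely a conceptual sketch and does not use any Open vSwitch, OpenFlow or tc API; the class, the rule format and the VM names are invented for illustration.

```python
class ToySwitch:
    """Toy model: slow rule lookup on a miss, cached entry for later packets."""

    def __init__(self, rules):
        self.rules = rules  # ordered list of (match_dict, out_port)
        # Only the fields that some rule looks at are relevant for caching.
        self.fields = sorted({k for match, _ in rules for k in match})
        self.hw_cache = {}  # stands in for the entries offloaded to the NIC
        self.slow_path_lookups = 0

    def forward(self, packet: dict):
        key = tuple(packet.get(f) for f in self.fields)
        if key in self.hw_cache:           # fast path: entry already offloaded
            return self.hw_cache[key]
        self.slow_path_lookups += 1        # first packet: full software lookup
        for match, out_port in self.rules:
            if all(packet.get(k) == v for k, v in match.items()):
                self.hw_cache[key] = out_port  # install the offload entry
                return out_port
        return None                        # no rule matched: drop


# Dividing one address among VMs by TCP destination port.
switch = ToySwitch([({"dst_port": 443}, "vm-web"), ({"dst_port": 25}, "vm-mail")])
for src_port in (50000, 50001, 50002):
    print(switch.forward({"dst_port": 443, "src_port": src_port}))
print("slow path lookups:", switch.slow_path_lookups)
print("offloaded entries:", len(switch.hw_cache))
```

In this model only flows that actually occur end up in the cache, and packets that differ only in fields no rule cares about (here the source port) share one cached entry.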
The downside is that Open vSwitch and OpenFlow introduce an extra layer and complexity that has to be managed and understood. There seems to be an Ansible collection to manage Open vSwitch, but I didn't see an easy way to use it to manage complex OpenFlow rules. But maybe I missed it because I just had a short look at it.
[#]openvswitch
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
While researching this I found out that Intel also offers something similar with their "eSwitch" feature of their E810 cards. Since I had such a card on hand I also tried it out:
It offers a switchdev driver and rules just like the Mellanox card and you can use it with or without Open vSwitch to control the connection to your VMs and virtual functions. They also offer advanced rules and actions like VLAN-tagging and filtering on port numbers etc. The performance is even a few single-digit MBytes/s better than on the Mellanox card.
But there is one important limitation: virtual functions are tightly bound to one physical port. You can't bond the physical ports together or add them both into one big bridge. The driver complains and errors out when you try to do that. They also explain this limitation in their readme.
While they claim that their cards are highly flexible and can be reconfigured with firmware (they call it "DDP"), I'm not sure if this limitation is something that they can work around with software/gateware in the future or if it is a hard limitation of their ASIC.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
So all in all I think the ConnectX-6 is a good card to use when you want to set up a virtualization server that is hooked up to more than just a gigabit port. When you have two LACP-bonded ports for increasing bandwidth and/or reliability, the internal switch is quite a unique feature you really want. If you just have one upstream port, the cheaper Intel E810 could also fit.
I plan to use one card in a server I want to put in colocation. But one thing I think I have to figure out first is how to best manage the additional software complexity of either devlink/tc or Open vSwitch. This is such a core part of a VM setup that it really must be reliable and you have to feel confident in the solution you choose.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel Haven't worked with the HW crypto acceleration yet, sorry :(
I'm mostly pushing customers packets into VMs, application layer is their problem 😺
=> More informations about this toot | More toots from manawyrm@chaos.social
@manawyrm wouldn't the customers be able to use the crypto offloading via their mlx5 virtual function NICs? Or do you not expose those to the customers to be more flexible regarding which hw you use?
But thanks for your reply anyway.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel They would be able to use them through VFs, but we can‘t use those as we (sadly) have many different NICs across the fleet and also because the network setup isn‘t just plain Ethernet.
=> More informations about this toot | More toots from manawyrm@chaos.social
@electronic_eel do you have hardware RSS queues configured? and does nfsd support RSS-aware pinned threads? this massively improved my CPU/IPC overheads when I was doing similar stuff in Samba.
=> More informations about this toot | More toots from gsuberland@chaos.social
@gsuberland I haven't done any tuning yet and just used default settings. So this could indeed be an area for further experimentation.
Thanks for the hint!
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel I wrote some stuff about it for samba, might be useful for figuring out nfsd
https://codeinsecurity.wordpress.com/2020/05/18/setting-up-smb-multi-channel-between-freenas-or-any-bsd-linux-and-windows-for-20gbps-transfers/
=> More informations about this toot | More toots from gsuberland@chaos.social
@electronic_eel I do wish there was a standard extension (e.g. at the ethernet frame layer) for providing link aggregation hash overrides, though, so protocols like nfsd and samba could send some sort of link index in each frame and all the links along the way would distribute accordingly, ignoring the default hashing. right now the closest thing is SMB Multi-Channel but that involves linking two systems on two subnets which causes all sorts of other annoyances.
=> More informations about this toot | More toots from gsuberland@chaos.social
@gsuberland @electronic_eel What i wish there was is a standard for doing byte/block level striping for link aggregation. Like how four 10G lanes are bonded to make 40G or four 25G to make a 100G.
You'd need to have the PHYs tightly synchronized to enable line-code-level striping like that but it'd be so much more performant than 802.3ad.
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@gsuberland @electronic_eel In particular being able to stripe a single flow across two links and actually get double throughput.
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @gsuberland yeah that would be really nice to have.
I guess it would only work with two slots on one NIC or switch, since it would need proper sync between the links. But such a limitation would be something I could live with.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel @gsuberland Yeah exactly. Being able to bond 2/4 adjacent ports in a standard way would be really nice.
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @electronic_eel @gsuberland hm I’ll have to dig into the spec, but does it actually need to be synchronised? I mean, you could just spray and pray, and let TCP deal with any out of order frames. The only reason LACP does N-way hashing to send a single flow down a single link is to avoid out of order frames. You can do channel-bonding without LACP - Cisco calls it EtherChannel, and it’s what you get when you set “mode on” on a channel-group instead of “mode active” or “mode passive” which starts LACP.
=> More informations about this toot | More toots from jpm@aus.social
@jpm @electronic_eel @gsuberland No I'm talking about striping bytes (or well , line code symbols) not packets. Which is how 40/100G do it.
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@jpm @electronic_eel @gsuberland As in it would simultaneously use multiple physical interfaces to send single packets. Like multilane PCIe.
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @electronic_eel @gsuberland oof ok I’ll definitely need to read the 802.3 spec properly
=> More informations about this toot | More toots from jpm@aus.social
@jpm @azonenberg @gsuberland if you let TCP deal with out-of-order packets, performance will suffer seriously. So unless you are able to convince your application to utilize multiple TCP sessions, any kind of bonding short of the symbol striping @azonenberg is talking about is far inferior.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@azonenberg @gsuberland @electronic_eel we don't talk about QSFP+ in non-cussing circles; 4*25 is fine though, as it has FEC per lane.
=> More informations about this toot | More toots from RichiH@chaos.social
@azonenberg @gsuberland @electronic_eel (I know you're using QSFP+ on your short hauls, but you're also using multimode; you like living on a knife's edge)
=> More informations about this toot | More toots from RichiH@chaos.social
@RichiH @gsuberland @electronic_eel No, I know my cable plant and my infrastructure and what it's capable of.
BER on all of my links is excellent, I've moved terabytes of data without a single FCS error logged on any of the ones I checked since the switch was last rebooted (last fall).
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @RichiH @gsuberland yeah, if it is properly installed, multimode is fine for inhouse stuff. All new things I install use singlemode, but I see no reason to replace a solid multimode run yet.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel @RichiH @gsuberland Also the larger core diameter of MMF makes it more tolerant to dust or scratches on the fiber faces.
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@electronic_eel @RichiH @gsuberland (this was a significant design consideration for a cable plant that was going to be terminated in keystone boxes cut into drywall, the risk of particulate contamination was nontrivial)
=> More informations about this toot | More toots from azonenberg@ioc.exchange
@azonenberg @electronic_eel @gsuberland yah, we had that discussion. In a very hands on, high introspection, short range, high knowledge, small scale situation it's fine. I do disagree that "resilience to scratches" is a design goal in fiber scenarios though.
Yet, it feels to me like FR4 likely feels to you.
=> More informations about this toot | More toots from RichiH@chaos.social
@gsuberland I have played with enforcing round-robin bonding in the past. Even when the machines are directly connected, I couldn't get any real speed improvement when having just one TCP connection. It seems even the small timing deviations between the links throw off the TCP window scaling etc.
So when I need one fast connection, I have resorted to upgrading the whole link to 40G or 100G instead of bonding. Bonding with LACP is still good for redundancy, especially when combined with MLAG on two different switches.
=> More informations about this toot | More toots from electronic_eel@treehouse.systems
@electronic_eel yup, I came out with the same answer. it's why I'm going 40G here.
=> More informations about this toot | More toots from gsuberland@chaos.social