Ancestors

Toot

Written by AndresFreundTec on 2024-08-10 at 03:32

Uh. Huh. Is it expected that SMAP can lead to a noticeable reduction in speed of copying from kernel to userspace?

It seems to be tied to CPU caches to some degree.

Reading a large, cached file from the kernel, I get substantially higher throughput when aggressively reusing the "target" buffers. But there's very little difference if I boot with clearcpuid=smap.


Descendants

Written by Vlastimil Babka on 2024-08-10 at 06:50

@AndresFreundTec Does "aggressively reusing" mean there's a smaller pool of buffers, or a smaller buffer used across multiple calls vs. a single call? How large are the reads then? And no funny misalignment of the buffers, I suppose?


Written by AndresFreundTec on 2024-08-10 at 16:55

@vbabka The former, i.e. a small pool of userspace buffers used across multiple reads. Reads range from 8kB to 128kB in the "original" case. The pool of userspace buffers ranges from 256kB (always fast) to 16MB (always slow), but the point where things get slow seems to change between machines.

It's actually also visible when just doing one large read into a single userspace buffer. Somewhere between 1MB and 16MB, performance changes drastically.
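
For illustration, a minimal sketch of the access pattern described above (not the actual test program; the file path and sizes are placeholders): pread() the cached file into a pool of buffers cycled round-robin, where a small pool means the same destinations are reused aggressively and a multi-MB pool means the copies mostly land in "cold" destinations.

```
/*
 * Sketch only: read a cached file into a round-robin pool of buffers.
 * Pool size (in kB) is taken from argv[2]; per-read size is 128kB.
 * Path and sizes are placeholders, not the original benchmark.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/srv/fio/read"; /* placeholder */
	size_t bufsz = 128 * 1024;
	size_t poolsz = argc > 2 ? atoi(argv[2]) * 1024UL : 256 * 1024;
	int nbufs = poolsz / bufsz;
	if (nbufs < 1)
		nbufs = 1;

	char **bufs = malloc(nbufs * sizeof(*bufs));
	for (int i = 0; i < nbufs; i++)
		bufs[i] = malloc(bufsz);

	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Cycle through the pool; a small pool means heavy buffer reuse. */
	off_t off = 0;
	for (int i = 0;; i++) {
		ssize_t n = pread(fd, bufs[i % nbufs], bufsz, off);
		if (n <= 0)
			break;
		off += n;
	}
	close(fd);
	return 0;
}
```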

I found a way to repro it with fio:


Written by AndresFreundTec on 2024-08-10 at 17:03

@vbabka Repro it with fio without needing to reboot with clearcpuid=smap, that is.

io_uring's fixed buffers avoid the SMAP overhead:

```
numactl --physcpubind 5 --membind 0 \
  fio --directory /srv/fio --size=8GiB --pre_read=1 --gtod_reduce=1 --filename read --invalidate=0 --rw read --buffered 1 --ioengine io_uring \
    --name 1m-nofixed --bs=1M \
    --name 1m-fixed --stonewall --fixedbufs --bs=1M \
    --name 16m-nofixed --stonewall --bs=16M \
    --name 16m-fixed --stonewall --fixedbufs --bs=16M
```
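
For reference, a rough liburing sketch of what --fixedbufs corresponds to (an assumption about the mechanism, not fio's internals; path and size are placeholders): the buffer is registered with the kernel once via io_uring_register_buffers(), and reads are then issued as read-fixed requests against that registration.

```
/*
 * Sketch only: buffered reads using an io_uring "fixed" (registered) buffer.
 * The buffer is registered up front and referenced by index in read-fixed
 * requests, which is presumably what sidesteps the SMAP-related overhead
 * observed above. Path and size are placeholders.
 */
#include <fcntl.h>
#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	size_t bufsz = 1024 * 1024;	/* 1MB, matching the "1m" fio jobs */

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	iov.iov_base = malloc(bufsz);
	iov.iov_len = bufsz;

	/* Register the buffer once; this is what "fixed buffers" means. */
	if (io_uring_register_buffers(&ring, &iov, 1) < 0)
		return 1;

	int fd = open("/srv/fio/read", O_RDONLY);	/* placeholder path */
	if (fd < 0)
		return 1;

	off_t off = 0;
	for (;;) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;

		/* Read into registered buffer index 0. */
		io_uring_prep_read_fixed(sqe, fd, iov.iov_base, bufsz, off, 0);
		io_uring_submit(&ring);

		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		int res = cqe->res;
		io_uring_cqe_seen(&ring, cqe);
		if (res <= 0)
			break;
		off += res;
	}
	close(fd);
	return 0;
}
```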


Written by AndresFreundTec on 2024-08-10 at 17:07

@vbabka

2x Xeon 6442Y:

1M buffers, nofixed: 11.6GiB/s
1M buffers, fixed: 13.1GiB/s
16M buffers, nofixed: 6710MiB/s
16M buffers, fixed: 10.2GiB/s

Ryzen 7840U:

1M buffers, nofixed: 18.6GiB/s
1M buffers, fixed: 18.3GiB/s
16M buffers, nofixed: 6710MiB/s
16M buffers, fixed: 11.6GiB/s

(yes, the 16M nofixed result happened to be exactly the same on both machines)


Written by AndresFreundTec on 2024-08-10 at 17:13

@vbabka

The real issue of course isn't with fio, it's with postgres. For large sequential scans, postgres uses a small ring of buffers for each scan.

To a) have enough IOs in flight for AIO and b) allow multiple independently started scans to share the ring buffer, it'd be good to make that ring bigger than the current default of 256kB.

But the slowdown that starts somewhere after 1MB makes that unattractive.


Written by AndresFreundTec on 2024-08-10 at 17:47

@vbabka

2x Xeon 6442Y:

1M buffers, nofixed: 14.0GiB/s
1M buffers, fixed: 13.1GiB/s
16M buffers, nofixed: 10.3GiB/s
16M buffers, fixed: 10.2GiB/s

