Toots for AndresFreundTec@mastodon.social account

Written by AndresFreundTec on 2025-01-12 at 17:58

Cascade Lake has the added fun that using non-temporal stores on memory attached to another node is actually a lot faster than what's achievable on the local node:

nodebind: 0, membind: 0

1x_mm_stream_si128():  6.926 GB/s

nodebind: 0, membind: 1

1x_mm_stream_si128(): 20.693 GB/s
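
The toots don't show the benchmark code. Below is a minimal sketch of what a "1x" _mm_stream_si128() zeroing loop plausibly looks like. It issues one 16-byte non-temporal store per loop iteration and then calls _mm_sfence(), because streaming stores are weakly ordered. The name memzero_stream_si128 is hypothetical. The sketch assumes x86-64 with SSE2, a 16-byte-aligned buffer and a length that is a multiple of 16. The nodebind/membind labels presumably correspond to running the benchmark under something like numactl --cpunodebind=0 --membind=1, but that mapping is an assumption.

```c
#include <stddef.h>
#include <emmintrin.h> /* SSE2: _mm_stream_si128, _mm_setzero_si128; also provides _mm_sfence */

/* Hypothetical helper: zero `len` bytes at `dst` with ONE non-temporal
 * 16-byte store per loop iteration ("1x").
 * Assumes dst is 16-byte aligned and len is a multiple of 16. */
static void memzero_stream_si128(void *dst, size_t len)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *) dst;
    __m128i *end = (__m128i *) ((char *) dst + len);

    for (; p < end; p++)
        _mm_stream_si128(p, zero);   /* non-temporal: written around the cache */

    /* Streaming stores are weakly ordered; fence before other code relies on them. */
    _mm_sfence();
}
```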

Written by AndresFreundTec on 2025-01-12 at 17:53

I.e., on Cascade Lake non-temporal stores suck, and it's important to use more than one store per loop iteration. On Sapphire Rapids it's the other way round: non-temporal stores are fine, and two regular stores per iteration hurt a lot.
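
To make the "1x" vs "2x" distinction concrete, here is a sketch of two zeroing loops that use regular (cached) 32-byte AVX stores. The first issues one store per iteration and the second issues two. Both function names are hypothetical. The code assumes x86-64 with GCC or Clang, a 32-byte-aligned buffer, and a length that is a multiple of 64. The target("avx") attribute lets the code compile without -mavx, so callers should check __builtin_cpu_supports("avx") first.

```c
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: ONE 32-byte AVX store per loop iteration ("1x").
 * Assumes dst is 32-byte aligned and len is a multiple of 32. */
__attribute__((target("avx")))
static void memzero_store_1x(void *dst, size_t len)
{
    const __m256i zero = _mm256_setzero_si256();
    char *p = dst;

    for (size_t off = 0; off < len; off += 32)
        _mm256_store_si256((__m256i *) (p + off), zero);
}

/* Hypothetical helper: TWO 32-byte AVX stores per loop iteration ("2x"),
 * i.e. half as many iterations for the same amount of data.
 * Assumes dst is 32-byte aligned and len is a multiple of 64. */
__attribute__((target("avx")))
static void memzero_store_2x(void *dst, size_t len)
{
    const __m256i zero = _mm256_setzero_si256();
    char *p = dst;

    for (size_t off = 0; off < len; off += 64)
    {
        _mm256_store_si256((__m256i *) (p + off), zero);
        _mm256_store_si256((__m256i *) (p + off + 32), zero);
    }
}
```

Compilers are free to unroll the 1x loop or otherwise rewrite it. When benchmarking variants like these, it's worth checking the generated assembly.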

On both, ERMSB (the memzero_rep_movsb() variant) is considerably worse than the best alternative.
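
ERMSB ("Enhanced REP MOVSB/STOSB") is Intel's feature flag for fast rep movsb/rep stosb string operations. The benchmark labels its variant memzero_rep_movsb() but doesn't show the code. As a hedged stand-in, this sketch zeroes memory with rep stosb, the store-only sibling, which may differ from what was actually measured. The name memzero_rep_stosb is hypothetical. The sketch assumes x86-64 and GCC/Clang extended inline asm. It also relies on the ABI guarantee that the direction flag is clear on function entry.

```c
#include <stddef.h>

/* Hypothetical helper: zero `len` bytes with a single `rep stosb`.
 * x86-64 + GCC/Clang extended asm only. */
static void memzero_rep_stosb(void *dst, size_t len)
{
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(len)   /* RDI = destination, RCX = byte count */
                     : "a"(0)                 /* AL = byte value to store (0) */
                     : "memory");             /* the asm writes memory behind the compiler's back */
}
```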

Written by AndresFreundTec on 2025-01-12 at 17:53

2x Xeon Gold 5215 (Cascade Lake):

..

memzero_rep_movsb():  6.171 GB/s

1x_mm256_store_si256(): 7.100 GB/s

2x_mm256_store_si256():  10.685 GB/s

1x_mm256_stream_si256(): 6.988 GB/s

2x_mm256_stream_si256():  6.994 GB/s

...

2x Xeon Gold 6442Y (Sapphire Rapids):

memzero_rep_movsb():  11.155 GB/s

1x_mm256_store_si256():  28.981 GB/s

2x_mm256_store_si256():  10.034 GB/s

1x_mm256_stream_si256(): 29.154 GB/s

2x_mm256_stream_si256(): 29.155 GB/s
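
The following sketch shows one way GB/s figures like the ones above can be computed. It times repeated calls of a zeroing function with CLOCK_MONOTONIC and divides bytes by seconds. The names measure_gbps and memzero_memset are hypothetical. The sketch uses 1 GB = 10^9 bytes, and the toots don't say whether the original numbers are GB or GiB.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by the zeroing variants being compared. */
typedef void (*zero_fn)(void *dst, size_t len);

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}

/* Hypothetical harness: call fn(buf, len) `reps` times and return
 * throughput in GB/s, using 1 GB = 10^9 bytes. */
static double measure_gbps(zero_fn fn, void *buf, size_t len, int reps)
{
    fn(buf, len);                          /* warm-up pass, not timed */

    double start = now_seconds();
    for (int i = 0; i < reps; i++)
        fn(buf, len);
    double elapsed = now_seconds() - start;

    return (double) len * reps / elapsed / 1e9;
}

/* Hypothetical baseline: plain memset. */
static void memzero_memset(void *dst, size_t len)
{
    memset(dst, 0, len);
}
```

Any function with the zero_fn signature can be passed in. The warm-up call keeps first-touch page faults out of the timed loop. With MAP_POPULATE those faults are already paid at mmap time anyway.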

Written by AndresFreundTec on 2025-01-12 at 17:51

Sometimes I hate performance stuff.

How to quickly zero large amounts of memory differs rather vastly between CPU generations.

On both systems, both the CPU and the memory are bound to node 0, and the same mmap flags are used (MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE).
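
Here is a minimal sketch of allocating the benchmark buffer with those flags. The name alloc_zero_target is hypothetical. The fallback to regular pages is an addition so the sketch also runs on machines without reserved huge pages. Binding the CPU and memory to node 0 is assumed to happen outside this code (e.g. via numactl).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical allocator using the mmap flags listed above.
 * MAP_HUGETLB only succeeds if huge pages are available (e.g. reserved via
 * /proc/sys/vm/nr_hugepages). Otherwise this sketch retries with regular pages
 * and reports that through *used_hugetlb. MAP_POPULATE pre-faults the range,
 * presumably so that page faults don't land inside the timed zeroing loop. */
static void *alloc_zero_target(size_t len, int *used_hugetlb)
{
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE;

    void *p = mmap(NULL, len, prot, flags | MAP_HUGETLB, -1, 0);
    *used_hugetlb = (p != MAP_FAILED);

    if (p == MAP_FAILED)
        p = mmap(NULL, len, prot, flags, -1, 0);   /* fallback: regular pages */

    return p == MAP_FAILED ? NULL : p;
}
```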

Written by AndresFreundTec on 2025-01-02 at 22:23

@axboe Doing concurrent direct IO into 1GB anonymous huge pages hits severe contention within bio_set_pages_dirty().

In my case this makes IO into 1GB pages max out at ~18GB/s, whereas IO into 2MB huge pages reaches ~34GB/s (the hardware limit).

Is bio_set_pages_dirty() even needed when the IO targets anonymous huge pages?

https://gist.github.com/anarazel/304aa6b81d05feb3f4990b467d02dabc
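
For context, here is a sketch of the two ingredients involved: mapping anonymous memory backed by a specific huge page size, and issuing a direct-IO read into it. mmap() encodes a non-default huge page size as log2(size) shifted left by MAP_HUGE_SHIFT: 21 for 2MB and 30 for 1GB. The helpers alloc_huge and read_direct are hypothetical, and MAP_PRIVATE is an assumption. The toot's actual workload (many concurrent reads, see the linked gist) is not reproduced here.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26   /* value from the Linux UAPI headers */
#endif

/* Encode a huge page size for mmap(): 21 -> 2MB pages, 30 -> 1GB pages. */
static int huge_page_flag(int log2_size)
{
    return log2_size << MAP_HUGE_SHIFT;
}

/* Hypothetical helper: anonymous memory backed by huge pages of the requested
 * size (MAP_PRIVATE is an assumption). Returns NULL if no such pages are available. */
static void *alloc_huge(size_t len, int log2_size)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | huge_page_flag(log2_size),
                   -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: a single direct-IO read into buf. O_DIRECT bypasses the
 * page cache, so buf, len and off generally must be aligned to the device's
 * logical block size; huge pages trivially satisfy the buffer alignment. */
static ssize_t read_direct(const char *path, void *buf, size_t len, off_t off)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    ssize_t n = pread(fd, buf, len, off);
    close(fd);
    return n;
}
```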

Written by AndresFreundTec on 2024-11-26 at 17:16

Turns out

tc qdisc add dev lo root netem delay 70ms 10ms

makes things slower. Surprise.
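
A back-of-the-envelope estimate of why: netem delays every packet leaving the device it's attached to. On loopback, both the request and the reply leave via lo. Every request/response round trip therefore gains roughly twice the configured 70ms, with the 10ms acting as jitter around that mean. For a hypothetical test suite doing N sequential round trips over localhost:

$$
t_{\text{added}} \approx N \cdot 2 \cdot 70\,\text{ms}, \qquad N = 1000 \;\Rightarrow\; t_{\text{added}} \approx 140\,\text{s}
$$

tc qdisc del dev lo root removes the qdisc again.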

Written by AndresFreundTec on 2024-11-26 at 17:15

When you wonder why regression tests are suddenly slower and then, after a decent bit of investigating, remember that you recently added a 70ms delay to localhost connections while debugging something else...

Written by AndresFreundTec on 2024-10-30 at 00:46

Of course that's just when there's a new HW generation announced but not yet shipping.

Written by AndresFreundTec on 2024-10-30 at 00:45

The Mac mini I was using for Postgres CI died. I suspect the PSU. Too old for warranty. And I don't (yet) have the right screwdrivers to open it. I guess I could put it under a drill press... :/

Written by AndresFreundTec on 2024-10-29 at 18:05

(the context is Postgres CI)

Written by AndresFreundTec on 2024-10-29 at 18:05

Is there a way to get a license to run more than two macOS VMs on one host?

Written by AndresFreundTec on 2024-10-28 at 15:56

@twoscomplement IIRC SPARC was in practice also TSO (Total Store Order).

Written by AndresFreundTec on 2024-10-23 at 10:16

Just gave a talk about improving postgres on NUMA systems:

https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf

Written by AndresFreundTec on 2024-10-19 at 17:09

Does anybody have a recommendation for a decent read-heavy/only "OLTP-ish" database benchmark where the queries aren't just single-table point or range selects?

I'd like to showcase some scalability problems (and solutions) that I see with such queries, but right now they're mostly visible with queries I made up myself. It's easy enough to make up workloads that show almost anything, so I'd rather use something designed by someone else.

Written by AndresFreundTec on 2024-08-10 at 03:32

Uh. Huh. Is it expected that SMAP (Supervisor Mode Access Prevention) can noticeably reduce the speed of copying from kernel to userspace?

It seems to be tied to CPU caches to some degree.

Reading a large, cached file from the kernel, I get substantially higher throughput when aggressively reusing the "target" buffers. But there's very little difference if I boot with clearcpuid=smap.
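
Here is a minimal sketch of the two access patterns, assuming plain read() calls. The toot doesn't say which syscalls or buffer sizes were used. read_all is a hypothetical helper. Its only knob is how far consecutive reads spread across the target buffer. The code doesn't touch SMAP itself, which is toggled with the clearcpuid=smap boot parameter mentioned above.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read everything from fd in `chunk`-byte read() calls.
 * Consecutive reads land at consecutive offsets of buf, wrapping back to the
 * start once the next chunk would not fit:
 *  - buf_len == chunk: every read targets the same, cache-hot buffer;
 *  - buf_len much larger than the CPU caches: the kernel's copy to userspace
 *    mostly writes to memory that is not in cache.
 * Returns the number of bytes read, or -1 on error / invalid arguments. */
static long long read_all(int fd, char *buf, size_t buf_len, size_t chunk)
{
    if (chunk == 0 || chunk > buf_len)
        return -1;

    long long total = 0;
    size_t off = 0;

    for (;;)
    {
        if (off + chunk > buf_len)
            off = 0;                     /* wrap: reuse the buffer from the start */

        ssize_t n = read(fd, buf + off, chunk);
        if (n < 0)
            return -1;
        if (n == 0)
            return total;                /* EOF */

        total += n;
        off += chunk;
    }
}
```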

Written by AndresFreundTec on 2024-08-02 at 20:23

Hrmpf. Getting correctable AER errors on my new workstation when utilizing PCIe 5.0 NVMe storage.

Original URL
gemini://mastogem.picasoft.net/profile/109382857469797266