Sometimes I hate performance stuff.
How to quickly zero large amounts of memory differs rather vastly between CPU generations.
On both systems core and memory are bound to node 0, and the same mmap flags are used (MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE).
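For reference, the allocation being described looks roughly like this (a minimal sketch; the buffer size and helper name are mine, not from the benchmark):

```
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of the benchmark's buffer setup: a shared, anonymous,
 * huge-page-backed mapping, populated up front so page faults don't
 * show up in the measurement.  Requires huge pages to be configured;
 * the size is illustrative. */
static void *
alloc_bench_buffer(size_t size)
{
	void *p = mmap(NULL, size,
				   PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
				   -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```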
2x Xeon Gold 5215:
..
memzero_rep_movsb(): 6.171 GB/s
1x_mm256_store_si256(): 7.100 GB/s
2x_mm256_store_si256(): 10.685 GB/s
1x_mm256_stream_si256(): 6.988 GB/s
2x_mm256_stream_si256(): 6.994 GB/s
...
2x Xeon Gold 6442Y:
memzero_rep_movsb(): 11.155 GB/s
1x_mm256_store_si256(): 28.981 GB/s
2x_mm256_store_si256(): 10.034 GB/s
1x_mm256_stream_si256(): 29.154 GB/s
2x_mm256_stream_si256(): 29.155 GB/s
I.e., on Cascade Lake non-temporal stores suck and it's important to use more than one store per loop iteration. On Sapphire Rapids it's the other way round.
On both, ERMSB (rep movsb) is considerably worse than the alternatives.
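The store variants presumably look roughly like the sketch below (not the actual benchmark code; it assumes a 32-byte-aligned buffer whose length is a multiple of 64 bytes and a build with AVX2 enabled). The 1x/2x distinction is simply how many stores are issued per loop iteration; the stream variants use non-temporal stores that bypass the cache.

```
#include <immintrin.h>
#include <stddef.h>

/* One 32-byte regular store per iteration. */
static void
memzero_1x_store(char *p, size_t len)
{
	const __m256i zero = _mm256_setzero_si256();

	for (size_t i = 0; i < len; i += 32)
		_mm256_store_si256((__m256i *) (p + i), zero);
}

/* Two 32-byte regular stores per iteration. */
static void
memzero_2x_store(char *p, size_t len)
{
	const __m256i zero = _mm256_setzero_si256();

	for (size_t i = 0; i < len; i += 64)
	{
		_mm256_store_si256((__m256i *) (p + i), zero);
		_mm256_store_si256((__m256i *) (p + i + 32), zero);
	}
}

/* One 32-byte non-temporal (streaming) store per iteration; the sfence
 * makes the weakly-ordered stores globally visible before returning. */
static void
memzero_1x_stream(char *p, size_t len)
{
	const __m256i zero = _mm256_setzero_si256();

	for (size_t i = 0; i < len; i += 32)
		_mm256_stream_si256((__m256i *) (p + i), zero);
	_mm_sfence();
}
```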
Cascade Lake has the added fun that using non-temporal stores on memory on another node is actually a lot faster than what's achievable on the local node:
nodebind: 0, membind: 0
1x_mm_stream_si128(): 6.926 GB/s
nodebind: 0, membind: 1
1x_mm_stream_si128(): 20.693 GB/s
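The binding can be done with numactl or programmatically; here is a libnuma sketch of what "nodebind: 0, membind: 1" amounts to (my guess at the mechanism, not the author's code; link with -lnuma):

```
#include <numa.h>

/* Pin execution to cpu_node and restrict allocations to mem_node,
 * e.g. bind_cpu_and_memory(0, 1) for the remote-memory case above. */
static void
bind_cpu_and_memory(int cpu_node, int mem_node)
{
	struct bitmask *nodes = numa_allocate_nodemask();

	numa_run_on_node(cpu_node);
	numa_bitmask_setbit(nodes, mem_node);
	numa_set_membind(nodes);
	numa_bitmask_free(nodes);
}
```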