Ancestors

Written by AndresFreundTec on 2025-01-12 at 17:51

Sometimes I hate performance stuff.

How to quickly zero large amounts of memory differs rather vastly between CPU generations.

On both systems, core and memory are bound to node 0, and the same mmap flags are used (MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE).
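For readers who want to reproduce the setup, here is a minimal sketch of an allocation using the flags listed above. The helper name `alloc_test_buffer`, the `used_hugetlb` out-parameter, and the fallback to regular pages are assumptions for illustration and are not from the post. Binding to a node is assumed to happen outside the program (for example with `numactl --cpunodebind=0 --membind=0`).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS, MAP_HUGETLB, MAP_POPULATE on Linux */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `len` bytes with the flags named in the post.
 * `len` should be a multiple of the huge page size (commonly 2 MiB on x86-64).
 * Returns NULL on failure; *used_hugetlb reports whether huge pages were used. */
static void *alloc_test_buffer(size_t len, int *used_hugetlb)
{
    const int flags = MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE;

    assert(len > 0 && used_hugetlb != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags | MAP_HUGETLB, -1, 0);
    *used_hugetlb = (p != MAP_FAILED);

    if (p == MAP_FAILED) {
        /* Assumption: fall back to regular pages if no huge pages are reserved.
         * Results measured this way are not comparable to the numbers in the thread. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    }
    return p == MAP_FAILED ? NULL : p;
}
```

MAP_POPULATE pre-faults the mapping, so page faults should not be charged to whichever zeroing routine runs first.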

=> More informations about this toot | More toots from AndresFreundTec@mastodon.social

Written by AndresFreundTec on 2025-01-12 at 17:53

2x Xeon Gold 5215 (Cascade Lake):

memzero_rep_movsb(): 6.171 GB/s

1x_mm256_store_si256(): 7.100 GB/s

2x_mm256_store_si256(): 10.685 GB/s

1x_mm256_stream_si256(): 6.988 GB/s

2x_mm256_stream_si256(): 6.994 GB/s

...

2x Xeon Gold 6442Y (Sapphire Rapids):

memzero_rep_movsb(): 11.155 GB/s

1x_mm256_store_si256(): 28.981 GB/s

2x_mm256_store_si256(): 10.034 GB/s

1x_mm256_stream_si256(): 29.154 GB/s

2x_mm256_stream_si256(): 29.155 GB/s
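The thread does not show how the GB/s figures were obtained. As a rough sketch only, a number of this kind could be computed by timing repeated passes over a buffer. Everything here is an assumption: the names `zero_fn`, `zero_memset`, and `measure_gb_per_s` are hypothetical, `memset` is only a stand-in routine, and 1 GB is taken to mean 10^9 bytes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by all zeroing routines under test. */
typedef void (*zero_fn)(void *dst, size_t len);

/* Stand-in routine; swap in the variant to be measured. */
static void zero_memset(void *dst, size_t len) { memset(dst, 0, len); }

/* Zero `len` bytes `iters` times and return throughput in GB/s (10^9 bytes). */
static double measure_gb_per_s(zero_fn fn, void *buf, size_t len, int iters)
{
    struct timespec t0, t1;

    assert(fn != NULL && buf != NULL && len > 0 && iters > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        fn(buf, len);
        /* Compiler barrier so the stores cannot be optimized away. */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)len * (double)iters / secs / 1e9;
}
```

The buffer should be much larger than the last-level cache, otherwise the result reflects cache bandwidth rather than memory bandwidth.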

Toot

Written by AndresFreundTec on 2025-01-12 at 17:53

I.e., on Cascade Lake non-temporal stores suck and it's important to use more than one store per loop iteration. On Sapphire Rapids it's the other way round.

On both, ERMSB is considerably worse than the alternatives.
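To make the "1x" and "2x" labels concrete, here is a sketch of what such loops might look like with AVX intrinsics. The function names are hypothetical (the labels in the results are not valid C identifiers), the author's actual code is not shown in the thread, and the alignment and length requirements are assumptions of this sketch.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* "1x": one regular 32-byte store per loop iteration. */
__attribute__((target("avx")))
static void zero_1x_store(void *dst, size_t len)
{
    assert((uintptr_t)dst % 32 == 0 && len % 32 == 0);
    const __m256i z = _mm256_setzero_si256();
    char *p = dst;
    for (size_t off = 0; off < len; off += 32)
        _mm256_store_si256((__m256i *)(p + off), z);
}

/* "2x": two regular 32-byte stores (one full 64-byte line) per iteration. */
__attribute__((target("avx")))
static void zero_2x_store(void *dst, size_t len)
{
    assert((uintptr_t)dst % 32 == 0 && len % 64 == 0);
    const __m256i z = _mm256_setzero_si256();
    char *p = dst;
    for (size_t off = 0; off < len; off += 64) {
        _mm256_store_si256((__m256i *)(p + off), z);
        _mm256_store_si256((__m256i *)(p + off + 32), z);
    }
}

/* "1x" non-temporal: streaming stores that bypass the cache hierarchy. */
__attribute__((target("avx")))
static void zero_1x_stream(void *dst, size_t len)
{
    assert((uintptr_t)dst % 32 == 0 && len % 32 == 0);
    const __m256i z = _mm256_setzero_si256();
    char *p = dst;
    for (size_t off = 0; off < len; off += 32)
        _mm256_stream_si256((__m256i *)(p + off), z);
    _mm_sfence(); /* order the non-temporal stores before later stores */
}
```

The `target("avx")` attribute (GCC/Clang) lets the file compile without `-mavx`; callers should still check for AVX support at run time, for example with `__builtin_cpu_supports("avx")`.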

Descendants

Written by AndresFreundTec on 2025-01-12 at 17:58

Cascade Lake has the added fun that using non-temporal stores on memory on another node is actually a lot faster than what's achievable on the local node.

nodebind: 0, membind: 0

1x_mm_stream_si128(): 6.926 GB/s

nodebind: 0, membind: 1

1x_mm_stream_si128(): 20.693 GB/s

Written by Paul Khuong on 2025-01-12 at 19:12

@AndresFreundTec did you mean rep movsb or rep stosb?

=> More informations about this toot | More toots from pkhuong@discuss.systems

Written by Paul Khuong on 2025-01-12 at 19:14

@AndresFreundTec also, you might see a small speedup with AVX-512 or even MOVDIR, just to avoid partial line stores.

Written by AndresFreundTec on 2025-01-12 at 20:14

@pkhuong I did test AVX-512 too, but it wasn't different in an interesting way. No perf difference on Cascade Lake. On Sapphire Rapids, there's no difference with 1x, a bit faster with 2x/4x (but 2x/4x is much slower than 1x, to a degree that doesn't make any sense to me).

Written by AndresFreundTec on 2025-01-12 at 20:10

@pkhuong I tried both, same performance. Probably should have included stosb, not movsb, in the post.

Written by Paul Khuong on 2025-01-12 at 20:15

@AndresFreundTec they do very different things

Written by AndresFreundTec on 2025-01-13 at 15:16

@pkhuong But they can do very similar things: rep mov* copying from a zero buffer, rep stos* setting the memory to zero directly.
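The two approaches described in this reply can be sketched with GCC/Clang inline assembly on x86-64. The function names are hypothetical, and the sketch assumes the zero source buffer for the `rep movsb` variant is at least `len` bytes long, which the thread does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* rep stosb: store AL into [RDI], RCX times. Zeroes memory directly. */
static void zero_rep_stosb(void *dst, size_t len)
{
    assert(dst != NULL);
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(len)
                     : "a"(0)
                     : "memory");
}

/* rep movsb: copy RCX bytes from [RSI] to [RDI]. Zeroes memory only
 * because the source is assumed to hold at least `len` zero bytes. */
static void zero_rep_movsb(void *dst, const void *zero_src, size_t len)
{
    assert(dst != NULL && zero_src != NULL);
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(zero_src), "+c"(len)
                     :
                     : "memory");
}
```

The `rep movsb` variant also has to read the source buffer, so the two are only comparable when that buffer is small enough to stay in cache or the read traffic is accounted for.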

Proxy Information

Original URL: gemini://mastogem.picasoft.net/thread/113816660164027099