Sometimes I hate performance stuff.
How to quickly zero large amounts of memory differs rather vastly between CPU generations.
On both systems core and memory are bound to node 0, and the same mmap flags are used (MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE).
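For reference, the allocation being described looks roughly like this (a minimal sketch; the buffer size and helper name are mine, not from the benchmark):

```
#include <stddef.h>
#include <sys/mman.h>

/* Sketch of the benchmark's buffer setup: a shared, anonymous,
 * huge-page-backed mapping, populated up front so page faults don't
 * show up in the measurement.  Requires huge pages to be configured;
 * the size is illustrative. */
static void *
alloc_bench_buffer(size_t size)
{
	void *p = mmap(NULL, size,
				   PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
				   -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```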
2x Xeon Gold 5215:
..
memzero_rep_movsb(): 6.171 GB/s
1x_mm256_store_si256(): 7.100 GB/s
2x_mm256_store_si256(): 10.685 GB/s
1x_mm256_stream_si256(): 6.988 GB/s
2x_mm256_stream_si256(): 6.994 GB/s
...
2x Xeon Gold 6442Y:
memzero_rep_movsb(): 11.155 GB/s
1x_mm256_store_si256(): 28.981 GB/s
2x_mm256_store_si256(): 10.034 GB/s
1x_mm256_stream_si256(): 29.154 GB/s
2x_mm256_stream_si256(): 29.155 GB/s
I.e., on Cascade Lake non-temporal stores suck and it's important to use more than one store per loop iteration. On Sapphire Rapids it's the other way round.
On both, ERMSB (rep movsb) is considerably worse than the alternatives.
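The store variants presumably look roughly like the sketch below (not the actual benchmark code; it assumes a 32-byte-aligned buffer whose length is a multiple of 64 bytes and a build with AVX2 enabled). The 1x/2x distinction is simply how many stores are issued per loop iteration; the stream variants use non-temporal stores that bypass the cache.

```
#include <immintrin.h>
#include <stddef.h>

/* One 32-byte regular store per iteration. */
static void
memzero_1x_store(char *p, size_t len)
{
	const __m256i zero = _mm256_setzero_si256();

	for (size_t i = 0; i < len; i += 32)
		_mm256_store_si256((__m256i *) (p + i), zero);
}

/* Two 32-byte regular stores per iteration. */
static void
memzero_2x_store(char *p, size_t len)
{
	const __m256i zero = _mm256_setzero_si256();

	for (size_t i = 0; i < len; i += 64)
	{
		_mm256_store_si256((__m256i *) (p + i), zero);
		_mm256_store_si256((__m256i *) (p + i + 32), zero);
	}
}

/* One 32-byte non-temporal (streaming) store per iteration; the sfence
 * makes the weakly-ordered stores globally visible before returning. */
static void
memzero_1x_stream(char *p, size_t len)
{
	const __m256i zero = _mm256_setzero_si256();

	for (size_t i = 0; i < len; i += 32)
		_mm256_stream_si256((__m256i *) (p + i), zero);
	_mm_sfence();
}
```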
Cascade Lake has the added fun that using non-temporal stores on memory on another node is actually a lot faster than what's achievable on the local node:
nodebind: 0, membind: 0
1x_mm_stream_si128(): 6.926 GB/s
nodebind: 0, membind: 1
1x_mm_stream_si128(): 20.693 GB/s
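The binding can be done with numactl or programmatically; here is a libnuma sketch of what "nodebind: 0, membind: 1" amounts to (my guess at the mechanism, not the author's code; link with -lnuma):

```
#include <numa.h>

/* Pin execution to cpu_node and restrict allocations to mem_node,
 * e.g. bind_cpu_and_memory(0, 1) for the remote-memory case above. */
static void
bind_cpu_and_memory(int cpu_node, int mem_node)
{
	struct bitmask *nodes = numa_allocate_nodemask();

	numa_run_on_node(cpu_node);
	numa_bitmask_setbit(nodes, mem_node);
	numa_set_membind(nodes);
	numa_bitmask_free(nodes);
}
```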