The thing I often need the most but don't have is a private test machine with as many CPUs as possible so I can do meaningful performance testing. For example, right now I want to test some refcount improvements, but I lack a machine with enough CPUs to do that, which is really annoying.
=> More information about this toot | More toots from brauner@mastodon.social
@brauner How many CPUs is that?
=> More information about this toot | More toots from vegard@mastodon.social
@vegard north of 64
=> More information about this toot | More toots from brauner@mastodon.social
@brauner it looks like we are both looking into scalability of reference counters in the Linux kernel. In my case it's in the scheduler+mm subsystems: https://lore.kernel.org/lkml/20241002010205.1341915-1-mathieu.desnoyers@efficios.com/
Are there specific reference counters which you suspect to be bottlenecks?
=> More information about this toot | More toots from DesnoyersMa@discuss.systems
@DesnoyersMa I'm confused, is this a separate hazard pointer implementation from Boqun's?
=> More information about this toot | More toots from brauner@mastodon.social
@brauner Yes, this is a separate implementation. I've done a prototype implementation in userspace based on per-CPU HP slots, and then created a minimalistic port of that implementation to kernel space.
=> More information about this toot | More toots from DesnoyersMa@discuss.systems
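Since the thread never spells out the mechanism, here is a minimal userspace-flavoured sketch of the per-CPU hazard pointer slot idea described above, with one slot per CPU to match the "minimalistic" description. All names, the single-slot layout, and the busy-wait reclaim are assumptions for illustration, not Mathieu's actual implementation:

```c
/* Hypothetical sketch of per-CPU hazard pointer (HP) slots; one slot
 * per CPU, C11 atomics. Illustrative only, not the posted patch. */
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 512

struct hp_slot {
	_Atomic(void *) addr;		/* object this CPU is protecting */
} __attribute__((aligned(64)));		/* avoid false sharing */

static struct hp_slot hp_slots[NR_CPUS];

/* Reader: publish the pointer, then re-check that it is still the
 * current one; retry if the source changed underneath us. */
static void *hp_protect(int cpu, _Atomic(void *) *src)
{
	void *p;

	for (;;) {
		p = atomic_load_explicit(src, memory_order_relaxed);
		atomic_store_explicit(&hp_slots[cpu].addr, p,
				      memory_order_seq_cst);
		if (atomic_load_explicit(src, memory_order_seq_cst) == p)
			return p;	/* p cannot be freed while published */
	}
}

static void hp_unprotect(int cpu)
{
	atomic_store_explicit(&hp_slots[cpu].addr, NULL,
			      memory_order_release);
}

/* Updater: after unpublishing the object, scan every per-CPU slot and
 * wait until no CPU still holds a hazard pointer to it, then free. */
static void hp_wait_for_readers(void *p)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		while (atomic_load_explicit(&hp_slots[cpu].addr,
					    memory_order_seq_cst) == p)
			;	/* a real port would cpu_relax() or block */
}
```

The appeal over a plain refcount is that readers only write to their own cache line instead of contending on one shared counter; the cost moves to the updater, which must scan all per-CPU slots at reclaim time.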
@DesnoyersMa I'll try that on the big box I have, curious! Not about the mm side specifically, just the HP case in general for other uses.
=> More information about this toot | More toots from axboe@fosstodon.org
@axboe Let me know how it goes. Note that if you run into limitations with my minimalistic implementation, there are various ways it can be improved to cover more use cases (e.g. more hazard pointer slots per CPU, dynamically adjusting the per-CPU scan depth, scanning for HP ranges, ...). My approach is to enhance it only when use cases require it.
=> More information about this toot | More toots from DesnoyersMa@discuss.systems
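One possible shape for the "more hazard pointer slots per CPU" extension mentioned above, reusing NR_CPUS and the headers from the earlier sketch; again, the names and the fallback strategy are assumptions, not anything from the posted series:

```c
/* Hypothetical multi-slot variant: several HP slots per CPU, so one
 * CPU can protect several objects at once. Scanners walk
 * slot[0..scan_depth) on each CPU, and scan_depth could be adjusted
 * dynamically as suggested above. */
#define HP_SLOTS_PER_CPU 8

struct hp_cpu {
	_Atomic(void *) slot[HP_SLOTS_PER_CPU];
} __attribute__((aligned(64)));

static struct hp_cpu hp_cpus[NR_CPUS];

/* Claim the first free slot on this CPU; returns the slot index, or
 * -1 when all slots are busy (the caller would then fall back to,
 * e.g., a plain refcount). */
static int hp_acquire_slot(int cpu, void *p)
{
	for (int i = 0; i < HP_SLOTS_PER_CPU; i++) {
		void *expected = NULL;

		if (atomic_compare_exchange_strong(&hp_cpus[cpu].slot[i],
						   &expected, p))
			return i;
	}
	return -1;
}
```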
@DesnoyersMa Sure will do. It's a 512-thread box, I'll run 24/48/96/192/256/512/1024 threads and dump the numbers here for -git and -git + patched.
=> More information about this toot | More toots from axboe@fosstodon.org
@DesnoyersMa Here's the quick run, 48..2048 threads. System is a 2x 9754. Not sure this is what you expected, but it's 100% reproducible. Ran the tests twice on both, separate boots, and it's consistent. Test is context_switch1_threads -t.
=> More information about this toot | More toots from axboe@fosstodon.org
@axboe That's unexpected. I tested on an AMD EPYC 9654 96-Core Processor (2 sockets, 384 HW threads total) and got very different results. Perhaps we should share our kernel configs by email.
=> More information about this toot | More toots from DesnoyersMa@discuss.systems
@axboe scratch my previous comment. That's a 4.9x speedup (490%) for 192 threads??
=> More information about this toot | More toots from DesnoyersMa@discuss.systems
@DesnoyersMa Right, the Diff results show how much faster the patched kernel is compared to the stock one. So at 192 threads, that's a +390% speedup, or 4.9x as fast.
=> More information about this toot | More toots from axboe@fosstodon.org
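For anyone skimming the thread, the two figures quoted above relate by plain arithmetic; nothing new is measured here:

```
speedup     = patched_result / stock_result = 4.9x
improvement = (speedup - 1) * 100%          = +390%
```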
@brauner I've got a few bigger boxes that you're welcome to get an account on for testing. Maybe that'll help until you get one sourced?
=> More information about this toot | More toots from axboe@fosstodon.org