FWIW: That Geoff Langdale question about byte gathers in AVX2 led to @lemire pointing to a gather of 16-bit elements within a 128-bit register in simdjson, which uiCA predicts at less than 4 cycles/iteration (Skylake). Great example of super clever table lookups murdering computation.
https://github.com/simdjson/simdjson/blob/3c0d032dedcc3c87d4ef726a2f7a3c2a26a738b8/include/simdjson/westmere/simd.h#L119
=> More informations about this toot | More toots from mbr@mastodon.gamedev.place
@lemire Stripped down version:
https://gcc.godbolt.org/z/41qd4GsM3
=> More informations about this toot | More toots from mbr@mastodon.gamedev.place
@mbr @lemire "is this FPGA?"
=> More informations about this toot | More toots from lritter@mastodon.gamedev.place
@lritter @lemire I haven't tried deciphering 'thintable' but a comment by Geoff seems to indicate that it's "the method of four Russians".
=> More informations about this toot | More toots from mbr@mastodon.gamedev.place
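For readers who have not opened the links, here is a minimal scalar sketch of the general "table lookup instead of computation" idea discussed in this thread. It is not simdjson's code and does not reproduce its 'thintable' layout. The names `Entry`, `build_table`, `kTable`, `compress8` and `compress16` are hypothetical, and the convention that a set mask bit means "keep this byte" is an assumption made for this sketch only. The table maps every 8-bit mask to the positions of its set bits, so compacting the selected bytes becomes one lookup plus an indexed copy. In a SIMD version the indexed copy would typically be a single byte shuffle.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: the positions of the set bits of an 8-bit mask,
// in ascending order and padded with zeros, plus the number of set bits.
struct Entry {
    std::uint8_t idx[8];
    std::uint8_t count;
};

// Build all 256 entries once, up front.
static std::array<Entry, 256> build_table() {
    std::array<Entry, 256> t{};  // zero-initialised
    for (unsigned m = 0; m < 256; ++m) {
        for (unsigned bit = 0; bit < 8; ++bit) {
            if (m & (1u << bit)) {
                t[m].idx[t[m].count++] = static_cast<std::uint8_t>(bit);
            }
        }
    }
    return t;
}

static const std::array<Entry, 256> kTable = build_table();

// Copy the bytes of in[0..7] whose mask bit is set to the front of out.
// Assumption of this sketch: a set bit means "keep". Always writes 8 bytes
// to out (the tail is padding) and returns how many of them are meaningful.
std::size_t compress8(const std::uint8_t* in, std::uint8_t keep_mask,
                      std::uint8_t* out) {
    const Entry& e = kTable[keep_mask];
    for (unsigned i = 0; i < 8; ++i) {
        out[i] = in[e.idx[i]];  // with SIMD: one byte shuffle by e.idx
    }
    return e.count;
}

// 16 bytes handled as two 8-byte halves; out needs room for 16 bytes.
std::size_t compress16(const std::uint8_t* in, std::uint16_t keep_mask,
                       std::uint8_t* out) {
    std::size_t n = compress8(in, static_cast<std::uint8_t>(keep_mask & 0xFF), out);
    n += compress8(in + 8, static_cast<std::uint8_t>(keep_mask >> 8), out + n);
    return n;
}
```

Splitting the 16-bit mask into two 8-bit halves keeps the table at 256 entries instead of 65536, at the cost of a second lookup and a step that joins the two halves.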