
Thanks for the pointer. I read the video transcript and agree with their premise that indirect calls are slow. There are several ways to proceed from there. One could simply inline FastMemcpy into a larger block of code, basically hoisting the dispatch up until its overhead is low enough.
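To make "hoisting the dispatch" concrete, here is a minimal sketch in C. Everything in it is hypothetical: the `Record` type, the two copy variants and `cpu_has_wide_vectors()` are stand-ins I made up, and both variants are portable C so that it builds anywhere. In real code each batch kernel would be compiled for a different instruction set. The point is only the shape: the small copy is inlined into a batch kernel, and the one indirect call is paid per batch rather than per copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type: the workload is many small, fixed-size copies. */
typedef struct { unsigned char bytes[32]; } Record;

/* Two stand-in code paths. Both are portable here; in practice they would
   be built for different targets (say, baseline vs. a wider vector ISA). */
static inline void copy_one_bytewise(Record *dst, const Record *src) {
    for (size_t i = 0; i < sizeof dst->bytes; ++i) dst->bytes[i] = src->bytes[i];
}
static inline void copy_one_block(Record *dst, const Record *src) {
    memcpy(dst, src, sizeof *dst); /* fixed size: compilers typically expand this inline */
}

/* Each batch kernel inlines its small copy, so there is no per-record dispatch. */
static void copy_batch_bytewise(Record *dst, const Record *src, size_t n) {
    for (size_t i = 0; i < n; ++i) copy_one_bytewise(&dst[i], &src[i]);
}
static void copy_batch_block(Record *dst, const Record *src, size_t n) {
    for (size_t i = 0; i < n; ++i) copy_one_block(&dst[i], &src[i]);
}

typedef void (*BatchFn)(Record *, const Record *, size_t);

/* Hypothetical feature check; a real one would query CPUID or the OS. */
static int cpu_has_wide_vectors(void) { return 1; }

/* The dispatch is hoisted to here: one indirect call per batch. */
void copy_records(Record *dst, const Record *src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    BatchFn fn = cpu_has_wide_vectors() ? copy_batch_block : copy_batch_bytewise;
    fn(dst, src, n);
}
```

How far up to hoist is a judgment call: the larger the unit of work behind the function pointer, the less the indirect call matters, at the price of compiling more code once per target.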

Instead, what they end up doing is pessimizing memcpy: it is no longer inlined, it goes through an additional thunk call, and the cost of patching is deferred until your code is paged in (which could be in a performance- or latency-sensitive area). Their microbenchmark also does not prove a real-world benefit, i.e. that the thunks and patching actually cost less than what the dispatch saves. It falls into the usual trap of repeating something 100K times, which implies perfect branch prediction, something that would not be the case in normal runs.

Also, the detection logic is limited to rules known to the OS. That is certainly sufficient for detecting AVX-512, but probably harder for something like "is it an AVX-512 where compressstoreu or vpconflict are super slow". And it is certainly impossible to do something reasonable like "just measure how my code performs for several codepaths and pick the best", or "specialize my code for SVE-256 in Graviton3".
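For the "just measure and pick the best" case, a rough sketch of what application-level dispatch could look like in C. The candidate functions, the name `pick_fastest_copy`, the 4 KiB scratch buffers and the repetition count are all assumptions of mine for illustration, and timing with `clock()` is crude. A calibration loop like this also has the same perfect-prediction caveat as any tight microbenchmark, so a real version should time something closer to the actual workload.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*CopyFn)(void *dst, const void *src, size_t n);

/* Hypothetical candidates; real ones would be ISA-specific code paths. */
static void copy_bytewise(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; ++i) d[i] = s[i];
}
static void copy_libc(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}

static const CopyFn kCandidates[] = { copy_bytewise, copy_libc };
enum { kNumCandidates = sizeof kCandidates / sizeof kCandidates[0] };

/* Time `reps` copies of `n` bytes (n must fit the scratch buffers). */
static clock_t time_candidate(CopyFn fn, size_t n, int reps) {
    static unsigned char src[4096], dst[4096];
    assert(n <= sizeof src);
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        fn(dst, src, n);
        /* Read one result byte through a volatile so the copy is not elided. */
        volatile unsigned char sink = dst[n ? n - 1 : 0];
        (void)sink;
    }
    return clock() - start;
}

/* Run once at startup: returns whichever candidate measured fastest for size n. */
CopyFn pick_fastest_copy(size_t n) {
    CopyFn best = kCandidates[0];
    clock_t best_time = time_candidate(best, n, 10000);
    for (int i = 1; i < kNumCandidates; ++i) {
        clock_t t = time_candidate(kCandidates[i], n, 10000);
        if (t < best_time) { best_time = t; best = kCandidates[i]; }
    }
    return best;
}
```

The caller would store the returned pointer once and then use it behind a hoisted dispatch point. The decision is made by the application from its own measurements, which is exactly what a fixed set of OS-known rules cannot express.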

So, besides the portability issue, and actually pessimizing short functions (instead of just inlining them), this prevents you from doing several interesting kinds of dispatch. Caveat emptor.