|
|
|
|
|
by eesmith
1291 days ago
|
|
> ... contribute any additional value? I gather from the linked-to video that binary-load-time selection has better run-time performance than init-at-first-call run-time dispatch, and doesn't have the tradeoff between performance and security. |
|
Instead, what they end up doing is pessimizing memcpy so that it is not inlined, and even goes through another thunk call, and defers the cost of patching until your code is paged in (which could be in a performance or latency-sensitive area). Indeed their microbenchmark does not prove a real-world benefit, i.e. that the thunks and patching are actually less costly than the savings from dispatch. It falls into the usual trap of repeating something 100K times, which implies perfect prediction which would not be the case in normal runs.
Also, the detection logic is limited to rules known to the OS; certainly sufficient for detecting AVX-512, probably harder to do something like "is it an AVX-512 where compressstoreu or vpconflict are super slow". And certainly impossible to do something reasonable like "just measure how my code performs for several codepaths and pick the best", or "specialize my code for SVE-256 in Graviton3".
So, besides the portability issue, and actually pessimizing short functions (instead of just inlining them), this prevents you from doing several interesting kinds of dispatch. Caveat emptor.