| Author here. I published most of these in one batch two years ago, and this is a relatively short time for compilers and libraries to catch up (when Daniel Lemire publishes something, it also usually takes a few years before it makes its way to the standard libraries, and he is much more involved in the downstreaming process than I am). In my opinion, the main challenges are: 1. Lack of suitably narrow abstractions. E.g., my "binary search" implementation requires building a small static structure (6-7% of the original array size), and although it is extremely hard to imagine a practical use case where you get a sorted array but can't spend linear time on its preprocessing, it's technically not a drop-in replacement for std::lower_bound. 2. Backwards compatibility. E.g., std::set has a requirement that a node deletion should not invalidate other iterators, which makes it harder to implement it as a B-tree, which has to move a lot of keys around after a node merge/split. 3. Performance regressions. Sometimes a change can make a program 2x faster in most use cases but 1.5x slower in some specific one. If the hardware handling that use case was already at 90% capacity, it will now start to fail after the upgrade, while a 2x improvement on other use cases is just "nice" and doesn't offset it. 4. Vagueness of "better". There are about 10 blog posts on the internet now claiming they have designed the fastest hash table in the world—and every one of them is right because they are using different benchmarks tailored to their specific data set and their specific hardware. 5. Desire to implement things more generically in the middle-end of a compiler instead of the standard library, which is much harder to do. You don't want to hand code the optimal SIMD procedure for calculating the sum of an array for each CPU microarchitecture; you want the compiler to do it automatically for everything that resembles a simple "for" loop. This also leads to a diffusion of responsibility, with compiler people and library maintainers arguing over the appropriate place for an optimization to be implemented. 6. Lack of incentives. Most people who can implement these optimizations work for big tech and would look better in their performance review by contributing to their employer's library (e.g., Abseil for Google, Folly for Meta), or at least to a library with a less Kafkaesque review process like Boost, rather than the standard library. 7. Things still being in the research stage. For example, I recently discovered (but haven't published yet) a new GCD algorithm that seems to yield another ~2x improvement over binary GCD (~4x over std::gcd), and so the guy who recently pushed it in libc++ has in a certain sense wasted work. I haven't rerun benchmarks myself, but I believe some relatively decoupled parts of the STL have actually since been upgraded in some compilers (std::lower_bound is now branchless, std::gcd now uses binary GCD, std::accumulate and similar reductions now use instruction-level parallelism when they see it) although in all these cases I didn't discover but at most only popularized them. |
Great job on this book. Lots of great content that can be used as reference.