Hacker News new | ask | show | jobs
by nwallin 688 days ago
It's actually really important that uniform_int_distribution is implementation defined. The 'right' way to do it on one architecture is probably not the right way to do it on a different architecture.

For instance, Apple's new CPUs has very fast division. A convenient and useful tool to implement uniform_int_distribution relies on using modulo. So the implementation that runs on Apple's new CPUs ought to use the modulo instructions of the CPU.

On other architectures, the ISA might not even have a modulo instruction. In this case, it's very important that you don't try to emulate modulo in software; it's much better to rely other more complicated constructs to give a uniform distribution.

C++ is also expected to run on GPUs. NVIDIA's CUDA and AMD's HIP are both implementations of C++. (these implementations are non-compliant given the nature of GPUs, but both they and the C++ standard's committee have a shared goal of narrowing that gap) In general, std::uniform_int_distribution uses loops to eliminate redundancies; the 'happy path' has relatively easily predicted branches, but they can and do have instances where the branch is not easily predicted and will as often as not have to loop in order to complete. Doing this on a GPU might be multiple orders of magnitude slower than another method that's better suited for a GPU.

Overzealously dictating an implementation is why C++ ended up with a relatively bad hash table and very bad regex in the standard. It's a mistake that shouldn't be made again.

3 comments

But reproducibility is as important as performance for the vast majority of use cases, if these implementation-defined bits start to affect the observable outcomes. (That's why we define the required time complexity for many container-related functions but do not actually specify the exact algorithm; difference in Big-O time complexity is just large enough to be "observed".)

A common solution is to provide two versions of such features, one for the less reproducible but maximally performant version and another for common middle grounds that can be reproduced reasonably efficiently across many common platforms. In fact I believe `std::chrono` was designed in that way to sidestep many uncertainties in platform clock implementations.

> Overzealously dictating an implementation is why C++ ended up with a relatively bad hash table and very bad regex in the standard.

What parts of the standard dictate a particular regex implementation? IIRC the performance issues are usually blamed on ABI compatibility constraints rather than the standard making a fast(er) implementation impossible.

Nobody is using standard library for high-performant random number implementations.