Hacker News new | ask | show | jobs
by midjji 1879 days ago
Recently noted that mt19937, mt19937_64 are much faster than the standard c++ random generator. They are also possibly better with regards to number distribution, but the performance difference is gigantic on clang. The standard 32 bit std::default_random_generator is almost 40x slower than the 64 bit mt19937_64 one across the board, from -o1 to -O3 march... etc.

As all of them are also faster than old school rand, its worth upgrading for the performance increase if nothing else.

2 comments

I think both the PCG family and the xoroshiro family are faster still.

https://prng.di.unimi.it

https://www.pcg-random.org

ETA: https://github.com/lemire/testingRNG

Does anyone know what's the current consensus about PCG vs xoroshiro. I remember that the xoroshiro author had several complaints about PCG but I don't know ifhe is right or if he was just being salty.
There's a bunch of analysis written by the PCG folks bashing Xoroshiro as well [1]. I think that for most normal use cases with randomized state, Xoroshiro is quite good. It only requires a single 64b add which is amazingly efficient.

[1] https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html

MT is pretty slow. There are fast variants (SFMT), but why use that when can have much better and much faster rngs?

https://rurban.github.io/dieharder/QUALITY.html

I don't know about all the other RNGs in this benchmark, but MT prepares a dense/contiguous array of random numbers. The whole computation can be SIMDified (often even by the compiler), but extraction is (unless changed explicitly) a single copy of a double.

If RNG speed matters and you need a lot of random numbers in succession (and I assume these two assumptions correlate strongly) editing MT to directly pull 4 or even 8 random numbers at once (i.e. a `__m256 rand_m256()` interface) is a huge performance gain.

wyrand (the top spot of the linked benchmark), doesn't have this benefit. The computation can't be SIMDified and the extraction is always a single value.

So I would take these benchmark with a grain of salt and take a closer look at the specific situation for any application where RNG speed really matters.

Wouldn't a SIMDified implementation of other RNGs speed them up by a comparable factor, making MT relatively slower again?
If the next number depends on the previous number (like with wyrand and many others) it can't be simdified.

You could simdify computation with different seeds tho, which might be fine for many purposes. However at that point it's also a custom implementation with a different number sequence than the non-simd version, which isn't the case with a SIMDified MT.

Also xoshiro generators are enough cheaper than MT that even if you couldn't simdify it, you could keep 4 and get run the 4 together in SIMD to get 4 outputs way cheaper than 1 SIMD MT.
It looks like xorshiro was SIMDified in a paper linked in abother post in this tree.
Very useful! Thanks! Aren't there AES related intrinsics for some architectures that can be really fast for rng? How do they compare?