Hacker News new | ask | show | jobs
by stochastic_monk 2935 days ago
It's worth pointing out that his two favorite RNGs (xoroshiro128+/xorshift128+) both fail BigCrush. According to [0] and the associated github [1], for a statistically strong RNG which is still fast, AES-CTR or splitmix64/lehmer64 are probably your best bet, unless you have AVX512, in which case an SIMD-accelerated PCG is the way to go [2]. (The other methods cap out at 1 cycle per byte, while the AVX512 PCG is 1 cycle per 32-bit integer, 4x as fast as the fastest previously tested.) While I don't doubt it could be further accelerated, I've added STL compatibility, templated unrolling, and provided some extra utilities (including random access) in a package based off code from [1] (provided by Samuel Neves) which I now use in most of my projects, and which is available at [3].

[0] https://lemire.me/blog/2017/09/08/the-xorshift128-random-num...

[1] https://github.com/lemire/testingRNG

[2] https://lemire.me/blog/2018/06/07/vectorizing-random-number-...

[3] https://github.com/dnbaker/aesctr

4 comments

Not sure I think using AVX512 makes sense in an RNG unless you're already using those instructions for a lot of stuff. Using instructions from outside the current power bin will adjust the clock rate and frequency of the core, and doing so is very slow (measured in milliseconds -- voltage regulators need to adjust).

In general unless you have a lot of AVX512 code to run (several ms worth), you're usually better off avoiding those instructions IME :(. (The same is also true for many AVX2 instructions...)

(See https://en.wikichip.org/wiki/intel/frequency_behavior#Base.2... for some more info, as well as the Intel optimization guide).

The way the mentioned PRNGs "fail" is when testing just the lower bits (search for the occurrences of "lsb" in [1] above) and this may be important in your use cases or not. The same [1] claims in "Visual Summary" that the "cycles/byte" is 1 for various PRNGs but http://xoshiro.di.unimi.it/ seems to show that the reason splitmix64 is not preferred everywhere is that xoroshiro128+ is roughly two times faster than splitmix64 .

Regarding having lower bits poor statistically, it was known since forever that that is the case for the huge class of simple PRNGs (effectively all that are faster than the alternatives, unless maybe if there's some specialized instruction in the CPU), the question is if that is critical or not for your purposes. The author of xoroshiro128+ is of course aware of that issue, and he also writes:

"For general usage, one has to consider that its lowest bits have low linear complexity and will fail linearity tests; however, low linear complexity can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits (we computed a precise estimate of the linear complexity of the lowest bits)."

In short, if you don't know how you're going to use the PRNG and you don't have problems related to speed, sure, use the safest one. Note that "safety" is still differently understood in different use cases, e.g. take care to note that most of fast PRNGs still aren't cryptographically secure:

https://en.wikipedia.org/wiki/Cryptographically_secure_pseud...

and that sometimes even "standardized" "cryptographically secure" turn out to be something else, e.g. the subtitle on that wikipedia page:

"NSA kleptographic backdoor in the Dual_EC_DRBG PRNG"

Also other considerations come into play when you have some specific needs and you understand the consequences: then it's not only black-and-white "safe" v.s. "not safe." For some purposes (as the mentioned generation of the floating-point numbers in some use cases) speed matters enough to sacrifice some "perfectness."

> The same [1] claims in "Visual Summary" that the "cycles/byte" is 1 for various PRNGs but http://xoshiro.di.unimi.it/ seems to show that the reason splitmix64 is not preferred everywhere is that xoroshiro128+ is roughly two times faster than splitmix64 .

I have tested xoroshiro128+ vs splitmix64 in several procedural generation & simulation code bases in C and Swift. I could never confirm the numbers on http://xoshiro.di.unimi.it/. In fact, splitmix64 was slightly faster in all my tests with different optimizations enabled. I always assumed that's because its state only occupies a single register which certainly matters in practical applications (especially in C with its restricted calling conventions). I am not absolutely sure whether that was always the reason, though.

I use fast RNGs for kernel projections, FHT-accelerated JL transforms, and data generation for numerical experiments. I don’t need cryptographic security for these purposes.
And if you generate floating point numbers maybe you don’t have to worry about lsbs alone either? The “failed” tests drop away the bits that matter the most when floating point randomness is constructed.
I don’t know the details on which bits matter in the ziggurat algorithm, which is the one I use. Is this the case for all floating point random number generators?
The PRNGs you mentioned before all generate integers. The conversion from the integer PRNG to the floating point, and the conversion from one integer range to another range needed in the ziggurat algorithm both need to be done "right" to give correct results (I can imagine that even using splitmix64 the wrong implementation could be programmed by somebody not knowing what has to be done), so if you aren't sure about these steps you should surely check their quality yourself. If these are done right, I'd personally expect xoroshiro128+ could be "good enough" even when having that specific weakness (poorer quality when using only lsb bits) that you worried about. It's important, of course, not to drop the highest bits away.

And on another side, 2om3r questions the speed measurements of the author, and I think he has a point: the speed measurements should be made in the context of the real use, otherwise the compilers are able to "cheat" (optimize pieces of the code away) if the example is too simple.

If I recall correctly, Chacha20 is as fast as AES with AVX256, isn't it? For more speed, Chacha8 has yet to be broken. And I bet even Chacha2 would pass most statistical tests, although at that point it is not secure at all. Furthemore, we could ditch Chacha compatibility by skipping the de-interleaving step for more speed.

AES could likewise benefit from reduced rounds, but since its security margin is lower than Chacha, there's a chance it would perform a bit worse at the same quality level.

The linked code (https://github.com/lemire/testingRNG/blob/master/source/aesc...) uses Intel's AES instructions directly. Software AES would not be competitive with purpouse built PRNG algorithms speed-wise.
I never talked about software AES. I talked about AES-INI vs Chacha/AVX-256.

Chacha is very fast with vector instructions. Over 2.3GB per second on my core i5 skylake laptop.

I can't find a benchmark for chacha20 as a PRNG (I've only found benchmarks for salsa20...). Why don't you try the code from [1] and see how it compares to your chacha20 random number generator?
I have done such a benchmark when implementing my cryptographic library: https://monocypher.org/speed

Also look at BearSSL: https://www.bearssl.org/constanttime.html 2.4GB per second for AES-INI is comparable to my own measurements with AVS-256 Chacha20.

Chacha is slightly faster than Salsa, mostly because it removed some word shuffling Salsa needed for matrix transposition.

The commenter you're responding to doesn't need cryptographic security.
Even if you don't need security, stream ciphers are basically the gold standard when it comes to randomness.

Besides, they started it, talking about AES.

PCG is covered in sources [0-2] and my comment. It's great, but without AVX512, it's ~70% as fast as AES-CTR, lehmer64, splitmix64, and the xorshi*[0-9]\+plus families of RNGs, which in [1] were all roughly the same speed. In [2], Lemire's PCG implementation is even faster than his AVX512 version of xorshift128+.
The simple PCG example on the page outputs 32-bit numbers, using 64-bits of state and 64-bit arithmetic.

What is the recommended way to generate 64-bit numbers with PCG? Just generate two 32-bit numbers and stick them together? Or does that introduce bias or bad parformance?

Yes, it will introduce bias. Use the variants with larger internal state.

If you want to see why, you could play with small RNGs with 4 bits of state each, with 2-bit outputs. Then concatenate the outputs from each and check for uniformity.

For example, say the first RNG is given by the sequence (with the value of low 2 bits following the slash):

3/3, 13/1, 8/0, 11/3, 7/3, 15/3, 0/0, 2/2, 5/1, 10/2, 12/0, 9/1, 14/2, 1/1, 6/2, 4/0

and the second RNG is given by

12/0, 11/3, 15/3, 1/1, 10/2, 13/1, 4/0, 2/2, 14/2, 5/1, 8/0, 0/0, 9/1, 6/2, 7/3, 3/3.

Each have 16 unique states, and each of the four possible 2-bit outputs appear exactly 4 times in the output of a generator. So each generator is uniform.

Now create a 4-bit rng by concatenating 2-bit outputs from each generator: 3|0, 1|3, 0|3, 3|1, 3|2, 3|1, 0|0, 2|2, 1|2, 2|1, 0|0, 1|0, 2|1, 1|2, 2|3, 0|3.

This is all the 16 outputs we can get from the two generators with a period of 16 each, but you can already tell that some outputs appear more than once (0|0, 0|3, 1|2, 2|1, 3|1) and thus, obviously, there are others such as 0|1 or 0|2 that never appear!

For 4 bits of output, you really need a larger period. But even that does not guarantee uniformity when you're concatenating outputs from two independent RNGs. In fact the likelihood of getting uniformity by concatenating two random RNGs is practically nil.

On the other hand, for a single linear congruential generator, it is easy to guarantee uniformity by choosing the parameters according to the well known rules.

Sticking two 32 bit numbers (e.g. `(uint64_t(a) << 32) | uint64_t(b))` will not introduce bias.

IIRC, the PCG C++ distribution has a 64 bit variant (it uses 128 bit integers, which are implemented in software). I don't know if the performance is better or worse than calling the 32 bit variant twice.