PCG is covered in sources [0-2] and my comment. It's great, but without AVX512, it's ~70% as fast as AES-CTR, lehmer64, splitmix64, and the xorshi*[0-9]\+plus families of RNGs, which in [1] were all roughly the same speed. In [2], Lemire's PCG implementation is even faster than his AVX512 version of xorshift128+.
The simple PCG example on the page outputs 32-bit numbers, using 64 bits of state and 64-bit arithmetic.
What is the recommended way to generate 64-bit numbers with PCG? Just generate two 32-bit numbers and stick them together? Or does that introduce bias or bad performance?
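For context, here is a rough sketch of what a minimal 32-bit PCG of that shape looks like (64-bit LCG state, XSH-RR output step). It is written from memory of the project's minimal C code, so treat the multiplier and shift constants as something to check against the official source. The names `Pcg32` and `pcg32_next` are mine, and the official seeding routine is not shown.

```cpp
#include <cstdint>

// Sketch only: names are illustrative, constants should be verified
// against the official PCG source before relying on them.
struct Pcg32 {
    std::uint64_t state;  // 64 bits of LCG state
    std::uint64_t inc;    // stream selector; forced odd when used
};

inline std::uint32_t pcg32_next(Pcg32& rng) {
    const std::uint64_t old = rng.state;
    // Advance the underlying 64-bit linear congruential generator.
    rng.state = old * 6364136223846793005ULL + (rng.inc | 1u);
    // Output permutation: xorshift the high bits down to 32 bits,
    // then rotate right by an amount taken from the top 5 state bits.
    const std::uint32_t xorshifted =
        static_cast<std::uint32_t>(((old >> 18u) ^ old) >> 27u);
    const std::uint32_t rot = static_cast<std::uint32_t>(old >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((32u - rot) & 31u));
}
```

Each call does one 64-bit multiply-add and returns 32 bits, which is why getting 64 bits means either two calls or a wider-state variant.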
Yes, it will introduce bias. Use the variants with larger internal state.
If you want to see why, you could play with small RNGs with 4 bits of state each, with 2-bit outputs. Then concatenate the outputs from each and check for uniformity.
For example, say the first RNG is given by the sequence (with the value of low 2 bits following the slash):
Each generator has 16 unique states, and each of the four possible 2-bit outputs appears exactly 4 times in its output, so each generator is uniform on its own.
Now create a 4-bit RNG by concatenating the 2-bit outputs from each generator: 3|0, 1|3, 0|3, 3|1, 3|2, 3|1, 0|0, 2|2, 1|2, 2|1, 0|0, 1|0, 2|1, 1|2, 2|3, 0|3.
These are all 16 outputs we can get from the two generators with a period of 16 each, but you can already tell that some outputs appear more than once (0|0, 0|3, 1|2, 2|1, 3|1) and thus, obviously, there are others such as 0|1 or 0|2 that never appear!
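If you'd rather check that by machine than by eye, here is a small sketch that tallies the example. The two arrays are just the left and right halves of the concatenated list above; the names `kGenA`, `kGenB`, `histogram_2bit` and `histogram_concat` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// 2-bit outputs of the two generators, read off the concatenated list
// (left of '|' is generator A, right of '|' is generator B).
constexpr std::array<int, 16> kGenA = {3, 1, 0, 3, 3, 3, 0, 2,
                                       1, 2, 0, 1, 2, 1, 2, 0};
constexpr std::array<int, 16> kGenB = {0, 3, 3, 1, 2, 1, 0, 2,
                                       2, 1, 0, 0, 1, 2, 3, 3};

// Count how often each 2-bit value occurs in one generator's output.
std::array<int, 4> histogram_2bit(const std::array<int, 16>& out) {
    std::array<int, 4> counts{};
    for (int v : out) {
        assert(v >= 0 && v < 4);  // inputs must be 2-bit values
        ++counts[static_cast<std::size_t>(v)];
    }
    return counts;
}

// Count how often each 4-bit value (a << 2 | b) occurs when the two
// output streams are concatenated position by position.
std::array<int, 16> histogram_concat(const std::array<int, 16>& a,
                                     const std::array<int, 16>& b) {
    std::array<int, 16> counts{};
    for (std::size_t i = 0; i < a.size(); ++i) {
        ++counts[static_cast<std::size_t>((a[i] << 2) | b[i])];
    }
    return counts;
}
```

`histogram_2bit` gives a count of 4 in every bucket for both generators, while `histogram_concat(kGenA, kGenB)` has five buckets at 2 and five at 0 (0|1, 0|2, 1|1, 2|0 and 3|3 never show up).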
For 4 bits of output, you really need a larger period. But even that does not guarantee uniformity when you're concatenating outputs from two independent RNGs. In fact the likelihood of getting uniformity by concatenating two random RNGs is practically nil.
On the other hand, for a single linear congruential generator, it is easy to guarantee uniformity by choosing the parameters according to the well known rules.
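As a sketch of what those rules buy you, assuming they mean the Hull-Dobell conditions for a power-of-two modulus (increment odd, multiplier minus one divisible by 4): a 4-bit LCG with a = 5, c = 3 satisfies them and visits every state exactly once per period. The parameters and function names here are my own picks for illustration, not anything from the PCG code.

```cpp
#include <array>
#include <cstdint>

// One step of a 4-bit LCG: x -> (5*x + 3) mod 16.
// a = 5 and c = 3 are illustrative values chosen so that c is odd and
// a - 1 is divisible by 4 (Hull-Dobell conditions for modulus 16).
constexpr std::uint32_t lcg4_next(std::uint32_t x) {
    return (5u * x + 3u) & 0xFu;
}

// Walk 16 steps starting at `seed` and count visits to each 4-bit state.
std::array<int, 16> lcg4_state_counts(std::uint32_t seed) {
    std::array<int, 16> counts{};
    std::uint32_t x = seed & 0xFu;
    for (int i = 0; i < 16; ++i) {
        ++counts[x];
        x = lcg4_next(x);
    }
    return counts;
}
```

Whatever the seed, every bucket ends up with a count of exactly 1, so taking the full 4-bit state as the output is uniform over one period by construction.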
Sticking two 32-bit numbers together (e.g. `(uint64_t(a) << 32) | uint64_t(b)`) will not introduce bias.
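Spelled out as a helper, that could look like the sketch below. `next64_from_two` is a made-up name, and `next32` stands in for whatever 32-bit generator you are calling; neither comes from the PCG library.

```cpp
#include <cstdint>

// Build one 64-bit value from two consecutive 32-bit outputs.
// `next32` is a placeholder for any callable returning std::uint32_t.
template <typename Next32>
std::uint64_t next64_from_two(Next32&& next32) {
    // Draw in two separate statements. The operands of `|` may be
    // evaluated in either order, so calling the generator twice inside
    // one expression would leave it up to the compiler which draw
    // becomes the high half.
    const std::uint64_t hi = next32();
    const std::uint64_t lo = next32();
    return (hi << 32) | lo;
}
```

Widening to 64 bits before the shift matters too: shifting a 32-bit value left by 32 is undefined behavior in C++.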
IIRC, the PCG C++ distribution has a 64-bit variant (it uses 128-bit integers, which are implemented in software). I don't know if the performance is better or worse than calling the 32-bit variant twice.