Hacker News new | ask | show | jobs
by vrfcodf 3726 days ago
Here is an alternative:

Use a multiplicative linear congruential generator with a period 2^n , where n is ceil(log2(m)), where m is the size of the list.

Seed the generator, start generating values and drop all values that are larger than m. In the worst case you will have to drop half of the values, in the best case you will drop none. This depends on how close to a power of 2 m is.

The generator is very short, one line of code, and very fast: one multiplication, one addition, and one bitshift. Space complexity is O(1).

If your tweets are generated out from an integer value, then you can use this value to generate them. The tweets will be unique (assuming the tweet generator will generate unique tweets from unique values), and you will have m of them.

3 comments

This is very practical, but the result is not uniform; in fact it only returns a tiny fraction of the possible combinations.

There are "N choose k" combinations, which in the worst case (k = N/2) grows as ~ 2^N. The number of combinations quickly outstrips the number of possible states of the LCG, and most combinations will never be found. For example, we have a list of length 16 and wish to choose 8. We have a choice of 16 distinct seeds, so there are only 16 different combinations (out of 12870 possible) that we can get for any given LCG.

By randomly adding skips, can definitely get to all possible combinations even with only 16 seeds (since skips can be seeded from a much larger space), although there could be a speed implication.
Good point. You would get different sequences by choosing different constants for the generator though, but you couldn't cover all of them by a longshot.
This works, and is so much simpler.

It's not quite the same problem though; the algorithm described returns samples in sequential order. They mention performance reasons for this (C-f "tape sampling"). It also has the limitation that you need to know the sample size in advance.

MLCG is sequential. No value will be repeated within the period.
"sequential" in the sense that if returned values are i1, i2, ..., in, then 0 <= i1 < i2 < ... < in <= N.

A MLCG will return numbers all over the place; sorting them will take an additional O(n log n) time.

I see.
He means: outputs are sorted.
The generator is also highly predictable and suffers from defects that may affect the overall analysis.
This only matters in calculations where randomness is important, i.e. Monte Carlo method.

Nevertheless, if you don't use the least significant bits, and if the constants are carefully chosen, MLCG passes most of the hardest statistical tests. For example it passes all DIEHARD tests, and most of TESTU01.

In addition, the described algorithm "is on-line - that is, it requires no preprocessing and can generate each element of the sample in constant expected time." (Quoting from the original paper.) It needs only an upper bound on the number items, while your suggestion requires knowing the number beforehand.

You also need an MLCG over the range 0..2^64-1 to match the given algorithm. I believe that requires some extra code to handle potential overflow in a MLCG, so adds a few more lines to your estimate.