Hacker News new | ask | show | jobs
by alxbrun 4696 days ago
I'm surprised no answer mentions the Kolmogorov complexity [1]. There's a fascinating link between how a given series seems "probable" to humans and its Kolmogorov complexity.

[1] http://en.wikipedia.org/wiki/Kolmogorov_complexity

1 comments

Humans are notoriously bad at generating random numbers. [1] Very random strings are described by very complex programs.

The human mind, taken as an algorithm, might prefer short programs that are fast to execute and require little energy to run.

Vitanyi proved that highly random strings must include repetition and patterns. Very crudely (and because I don't quite get all the maths):

A string-generating program is more random when its output is less predictable. If you don't have to worry about predicting patterns and repetitions, this makes the rest of the program output more predictable. Thus, highly random generators should output patterns and repetitions too.

So, if we take the lottery numbers to be randomly generated (this is important), we must expect patterns like 1111111111 too.

We can approach Kolmogorov complexity (not computable) through compression. Basic Run-Length Encoding (RLE) of 1111111111 gives us (10,1). This is a very short and simple program and it seems that humans prefer such programs over more complex ones.

If we don't want to share the lottery with people who look for patterns in previous draws, we should generate a string, that, when added to the previous draws and compressed produces the longest output.

Let's say the previous two draws were 1111111111 and 22222222222, or compressed with RLE (10,1),(10,2). Humans who will optimize for patterns will produce strings that result in a shorter size. Let's say some humans play the numbers of latest draw (222222222). That will result in a compressed string of all 3 draws: (10,1),(20,2). This has the exact same length as the compressed string of only 2 draws. The previous draws and the prediction share patterns that allow the compressor to produce smaller strings.

Optimizing for maximum complexity would have you pick a draw like "2453160798" (not possible to compress with RLE). A string like "1234567890" also does not compress with RLE, but it easy to spot the simple, short program generating it (n+1..).

What happens when some humans increasingly pick numbers that appear random, but in fact are very short programs or carry cultural relevance (a range starting at _n_th decimal of Pi, validation key of pirated windows XP, dates of birth of my first two children, my lucky 100-digit number).

If our compression algorithms were perfect they would be able to generate the shortest program that describes another, bigger (or equal size) string. But compression is not perfect. Most compression algo's would struggle to compress a range of decimals of pi at n, as something smaller than a completely random number string. There is a simple pattern, but it won't be found in a reasonable time.

Luckily we can use search engine indexes as our compression algorithm! If a string like "12345" has 1 million hits and string "67890" has 1 million hits and both strings together have 500k hits, we can say that these strings often co-occur (probably in the form 1234567890).

If a certain date carries cultural significance (1970-01-01), it would have more hits/results, than a date that carries zero, to little, significance (2641-10-02). If 8 is your lucky number, chances are (Chinese) people local to you, also have picked this as their lucky number.

Then the key to picking a less predictable sequence is to pick a sequence that doesn't appear (a lot) in the search engine index. With a little luck we don't even have to get the hits for previous draws, as a large search engine index is likely to already contain these draws.

So if we want to optimize for not sharing with a large amount of people, and we assume that the lottery numbers are highly random: pick number ranges that do not compress well with the previous draws and pick number ranges that have a low result count in a search engine like Google.

[1] Tasked to pick a random number between 1 and 4, there is a bias to picking 3. When between 1 and 10, there is a bias to picking 7. Why this is? 3 and 7 may appear more special than 6 (23) or 4 (22). Also the phrasing of picking _between_ 1 and 4 might throw of humans to NOT pick 1 or 4. A randomrange program doesn't care for such human subtleties.

Random passwords are harder to remember for humans (this might be because the "mind program" to generate them is complex and requires more energy). Human passwords often contain dictionary words and commonly used patterns (which are faster and cheaper to regenerate).