by seccode 610 days ago
Also, I am testing ranges of digits other than the first 10,000. The problem with the other ranges is that the distribution of digits is highly imbalanced, and the model is not showing statistical significance there. Models have a harder time when the class distribution is not 50/50, so I think it's not quite fair to evaluate the model on these ranges.

So why do you think the first 10,000 digits are somewhat predictable?

2 comments

The distribution of digits is 'highly imbalanced' because that's what random distributions look like. I'll randomly select a digit from 0-9 10,000 times and show the distribution, then do the same with the first 10,000 digits of pi, then do the random selection again:

  >>> import random
  >>> from collections import Counter
  >>> ctr = Counter(random.choice(range(10)) for i in range(10_000))
  >>> for digit, count in ctr.most_common():
  ...   print(f"{digit}: {count}")
  ...
  2: 1039
  4: 1035
  0: 1031
  7: 1022
  3: 1008
  6: 998
  1: 976
  5: 973
  9: 963
  8: 955
  >>> pi_ctr = Counter(open("1-10000.txt").read().rstrip())
  >>> for digit, count in pi_ctr.most_common():
  ...   print(f"{digit}: {count}")
  ...
  5: 1046
  1: 1026
  2: 1021
  6: 1021
  9: 1014
  4: 1012
  3: 974
  7: 970
  0: 968
  8: 948
  >>> ctr = Counter(random.choice(range(10)) for i in range(10_000))
  >>> for digit, count in ctr.most_common(): print(f"{digit}: {count}")
  ...
  8: 1060
  2: 1048
  0: 1034
  4: 1026
  5: 1025
  3: 979
  7: 977
  6: 960
  1: 956
  9: 935
You can see that the distribution of pi's first 10,000 digits is what one should expect for a random distribution. If your method requires a 50/50 distribution then it cannot be used for this purpose.
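One way to make "what one should expect" concrete is a chi-square goodness-of-fit test against a uniform distribution. The sketch below reuses the pi digit counts printed above. The critical value 16.919 is the standard table value for 9 degrees of freedom at the 5% level, hard-coded here as an assumption to avoid depending on SciPy.

```python
# Digit counts for the first 10,000 digits of pi, copied from the output above.
pi_counts = {5: 1046, 1: 1026, 2: 1021, 6: 1021, 9: 1014,
             4: 1012, 3: 974, 7: 970, 0: 968, 8: 948}

def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts per category."""
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected
               for observed in counts.values())

# Table value for df = 9, alpha = 0.05 (assumed, not computed here).
CRITICAL_5PCT_DF9 = 16.919

stat = chi_square_uniform(pi_counts)
print(f"chi-square = {stat:.3f}")
print(f"reject uniformity at 5%: {stat > CRITICAL_5PCT_DF9}")
```

A statistic below the critical value means the counts are consistent with uniformly random digits at that significance level.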

Also, you are thinking about it wrong. The first 10,000 digits of pi are perfectly predictable.

I'm not predicting the number; I'm predicting number%2==0. The model predicted better than the distribution probability.
It doesn't really matter. There are 4970 even digits and 5030 odd digits in the first 10,000. Predicting all odds gives you a better-than-even chance of being right.
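The even/odd totals can be checked from the digit counts posted earlier in the thread. This is a small sketch, not anyone's actual evaluation code: it sums the counts by parity, reports the accuracy of always guessing the majority class, and adds a z-score for how far the even count sits from a fair 50/50 split (the z-score is an extra illustration, not something from the thread).

```python
import math

# Digit counts for the first 10,000 digits of pi, as posted in the thread.
pi_counts = {5: 1046, 1: 1026, 2: 1021, 6: 1021, 9: 1014,
             4: 1012, 3: 974, 7: 970, 0: 968, 8: 948}

even = sum(count for digit, count in pi_counts.items() if digit % 2 == 0)
odd = sum(count for digit, count in pi_counts.items() if digit % 2 == 1)
total = even + odd

# Accuracy of a "model" that always guesses the more common parity.
baseline = max(even, odd) / total

# Standard deviations between the even count and a fair-coin expectation.
z = (even - total / 2) / math.sqrt(total * 0.25)

print(f"even={even} odd={odd} baseline={baseline:.4f} z={z:.2f}")
```

A |z| well under 2 indicates the split is unremarkable for 10,000 fair coin flips.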

What does "highly unbalanced" mean?

How often will a random sequence be "highly unbalanced"?

How many people used another model, found no pattern, and never reported it?

You have plenty of data to work with. Try the second 10,000, the third 10,000 and so on.
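Checking further blocks could look roughly like the following. The helper name `parity_by_window` and the file name `pi-digits.txt` are hypothetical, and the function assumes the input string contains decimal digits only.

```python
def parity_by_window(digits, window=10_000):
    """Yield (start_index, even_count, odd_count) for each full window.

    Assumes `digits` contains only the characters 0-9; a trailing
    partial window is skipped.
    """
    for start in range(0, len(digits) - window + 1, window):
        chunk = digits[start:start + window]
        even = sum(1 for ch in chunk if ch in "02468")
        yield start, even, window - even

# Hypothetical usage with a file of digits (the file name is an assumption):
# with open("pi-digits.txt") as f:
#     digits = f.read().strip()

# Stand-in so the block runs on its own: the first 20 decimals of pi.
demo = "14159265358979323846"
for start, even, odd in parity_by_window(demo, window=10):
    print(f"digits {start}-{start + 9}: even={even} odd={odd}")
```

Running the same evaluation on every window, rather than only on the windows that look balanced, avoids selecting the data after seeing the result.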

Keep clear in your mind that a lot of people worked on this problem, including trained mathematicians. It is far more likely that you do not fully understand what you are doing than that they are wrong. Believing otherwise is the path of crankdom.

Better to use statistical significance tests to talk about what is "far more likely".
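For the classifier itself, one such test could be a one-sided test of the model's hit count against the majority-class rate. The sketch below uses a normal approximation to the binomial. The test-set size and hit counts are made-up illustrative numbers, not results from the thread, and `one_sided_p_value` is a hypothetical helper.

```python
import math

def one_sided_p_value(correct, n, baseline):
    """Normal-approximation p-value for getting at least `correct` hits
    out of `n` predictions if the true accuracy were `baseline`."""
    mean = n * baseline
    sd = math.sqrt(n * baseline * (1 - baseline))
    z = (correct - mean) / sd
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical numbers: 2,000 held-out digits, baseline from the 5030/4970 split.
n_test = 2000
baseline = 0.503
for correct in (1030, 1060):
    p = one_sided_p_value(correct, n_test, baseline)
    print(f"{correct}/{n_test} correct: p = {p:.4f}")
```

If many models, ranges, or seeds were tried before one looked good, the threshold for calling a single p-value significant has to be tightened accordingly.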
It doesn't predict better than even; it predicts better than the distribution probability.
Because "somewhat predictable" doesn't mean "non-random". In fact, almost all prefixes of algorithmically random bit sequences are somewhat predictable with an appropriate definition of "somewhat", because you can find and exploit an accidental bias from taking the prefix and any such bias translates to some predictability.
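The accidental-bias point can be illustrated with a quick simulation, sketched here under simple assumptions: generate fair random bit sequences of length 10,000, then score the trivial predictor that always guesses whichever bit happened to be more common in that same sequence. The seed and trial count are arbitrary choices.

```python
import random

def majority_accuracy(bits):
    """In-sample accuracy of always guessing the more common bit."""
    ones = sum(bits)
    return max(ones, len(bits) - ones) / len(bits)

rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 10_000, 200

accuracies = [
    majority_accuracy([rng.getrandbits(1) for _ in range(n)])
    for _ in range(trials)
]
mean_acc = sum(accuracies) / trials

print(f"min={min(accuracies):.4f} mean={mean_acc:.4f} max={max(accuracies):.4f}")
```

Every sequence scores at least 0.5 by construction. For fair bits the expected gap between the majority count and n/2 is about sqrt(n / (2*pi)), roughly 40 for n = 10,000, so an accuracy near 0.504 is typical without any real structure in the data.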

(Another possibility is that, since pi itself is not really algorithmically random, the classifier was somehow able to learn how to partially compute pi! That's another pitfall you need to avoid even when you have a good understanding of information theory and statistics...)