| I got the same problem. When implementing the exact method as described in quanta magazine (without looking at the arxiv paper), I always had estimates like 461746372167462146216468796214962164. Then after reading the arxiv paper, I got the the correct estimate, with this code (very close to mudiadamz's comment solution): import numpy as np
L = np.random.randint(0, 3900, 30557)
print(f"{len(set(L))=}")
thresh = 100
p = 1
mem = set()
for k in L:
if k in mem:
mem.remove(k)
if np.random.rand() < p:
mem.add(k)
if len(mem) == thresh:
mem = {m for m in mem if np.random.rand() < 0.5}
p /= 2
print(f"{len(mem)=} {p=} {len(mem)/p=}")
Or equivalently: import numpy as np
L = np.random.randint(0, 3900, 30557)
print(f"{len(set(L))=}")
thresh = 100
p = 1
mem = []
for k in L:
if k not in mem:
mem += [k]
if np.random.rand() > p:
mem.remove(k)
if len(mem) == thresh:
mem = [m for m in mem if np.random.rand() < 0.5]
p /= 2
print(f"{len(mem)=} {p=} {len(mem)/p=}")
Now I found the quanta magazine formulation problem. By reading:> Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1. we want to write: for k in L:
if k not in mem:
mem += [k]
else:
if np.random.rand() > p:
mem.remove(k)
if len(mem) == thresh:
mem = [m for m in mem if np.random.rand() < 0.5]
p /= 2
whereas it should be (correct): for k in L:
if k not in mem:
mem += [k]
if np.random.rand() > p: # without the else
mem.remove(k)
if len(mem) == thresh:
mem = [m for m in mem if np.random.rand() < 0.5]
p /= 2
Just this little "else" made it wrong! |
Your fix is indeed correct; we may want to have either while loop instead of "if len(mem) == thresh" as there is very small (but non-zero) probability that length of mem is still thresh after executing: mem = [m for m in mem if np.random.rand() < 0.5]
["While" was Knuth's idea; and has added benefit of providing unbiased estimator.]