
by josephernest 804 days ago

I got the same problem.

When I implemented the method exactly as described in Quanta Magazine (without looking at the arXiv paper), I always got estimates like 461746372167462146216468796214962164.

Then, after reading the arXiv paper, I got the correct estimate with this code (very close to the solution in mudiadamz's comment):

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = set()
    for k in L:
        if k in mem:
            mem.remove(k)
        if np.random.rand() < p:
            mem.add(k)
        if len(mem) == thresh:
            mem = {m for m in mem if np.random.rand() < 0.5}
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Or equivalently:

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = []
    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Now I have found the problem with the Quanta Magazine formulation. Reading:

> Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

we are tempted to write:

    for k in L:
        if k not in mem:
            mem += [k]
        else:
            if np.random.rand() > p:
                mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2

whereas it should be (correct):

    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:    # without the else
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2

Just this little "else" made it wrong!
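
To see how much damage that "else" does, here is a minimal sketch that runs both variants on the same stream. It is not the exact code above: it uses the standard-library `random` module and a set instead of NumPy and a list, and the names `estimate_distinct`, `use_else` and `seed` are made up for this illustration.

```python
import random


def estimate_distinct(stream, thresh=100, use_else=False, seed=0):
    """Estimate the number of distinct items in `stream`.

    use_else=False: always flip the coin after (possibly) adding the word.
    use_else=True:  flip only when the word was already present, which is
                    the literal reading of the Quanta wording.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    p = 1.0
    mem = set()
    for k in stream:
        was_present = k in mem
        mem.add(k)
        # The only difference between the two variants is whether a
        # newly added word also gets a coin flip.
        if (was_present or not use_else) and rng.random() > p:
            mem.remove(k)
        if len(mem) == thresh:
            # Keep each stored word with probability 1/2, then halve p.
            mem = {m for m in mem if rng.random() < 0.5}
            p /= 2
    return len(mem) / p


if __name__ == "__main__":
    data_rng = random.Random(42)
    stream = [data_rng.randrange(3900) for _ in range(30557)]
    print("true distinct:", len(set(stream)))
    print("without else: ", estimate_distinct(stream, use_else=False))
    print("with else:    ", estimate_distinct(stream, use_else=True))
```

With `use_else=True`, every unseen word is stored unconditionally, so the memory keeps filling up and `p` gets halved again and again. That is where the astronomically large estimates come from.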

2 comments

kuldeepmeel 803 days ago

Yes, there is an error in the Quanta article [at the same time, I must add that writing popular science articles is very hard, so it would be wrong to blame them]

Your fix is indeed correct, though we may want a while loop instead of "if len(mem) == thresh", as there is a very small (but non-zero) probability that the length of mem is still thresh after executing: mem = [m for m in mem if np.random.rand() < 0.5]

["While" was Knuth's idea, and it has the added benefit of providing an unbiased estimator.]
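
A rough sketch of what the "while" variant could look like, assuming the set-based loop from the parent comment. The function name `cvm_estimate` and the `seed` parameter are hypothetical, and the standard-library `random` module stands in for NumPy.

```python
import random


def cvm_estimate(stream, thresh=100, seed=0):
    """Distinct-count estimate that repeats the halving step as needed."""
    rng = random.Random(seed)
    p = 1.0
    mem = set()
    for k in stream:
        # Forget any earlier decision about k, then keep it with probability p.
        mem.discard(k)
        if rng.random() < p:
            mem.add(k)
        # "while" instead of "if": in the rare case that every stored word
        # survives the coin tosses, halve again until there is room.
        while len(mem) >= thresh:
            mem = {m for m in mem if rng.random() < 0.5}
            p /= 2
    return len(mem) / p
```

When the stream has fewer than `thresh` distinct items, `p` never changes and the result is simply the exact count.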


Alexanfa 804 days ago

Quanta:

    Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

To:

    Round 1. Keep going through Hamlet, but now flip a coin for each word. If it’s tails, delete the word if it exists; heads, and add the word if it's not already on the list.

Old edit:

    Round 1. Keep going through Hamlet, adding words but now flipping a coin immediately after adding it. If it’s tails, delete the word; heads, and the word stays on the list.


josephernest 804 days ago

> adding words but now flipping a coin immediately after adding it

Edit: I thought your formulation was correct, but it is not quite:

We flip the coin after adding, but we also flip the coin even if we didn't add the word (because it was already there). This is subtle!

wrong:

    if k not in mem:
        mem += [k]
        if np.random.rand() > p:
            mem.remove(k)

wrong:

    if k not in mem:
        mem += [k]
    else:
        if np.random.rand() > p:
            mem.remove(k)

correct:

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)

correct:

    if k not in mem:
        mem += [k]
    if np.random.rand() > p:
        mem.remove(k)


kuldeepmeel 803 days ago

The following is also not correct.

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same as "else" here
        if np.random.rand() > p:
            mem.remove(k)

Your final solution is indeed correct, and I think more elegant than what we had in our paper [I am one of the authors].


Alexanfa 804 days ago

Ah, I'm using a set instead of a list, so I just always add and then toss to decide whether to remove.
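
For reference, the per-word step with a set might look roughly like this. The helper name `toss_update` is invented here, and the standard-library `random` module is assumed rather than NumPy.

```python
import random


def toss_update(mem, word, p, rng):
    """Always add `word`, then toss: it stays with probability p."""
    mem.add(word)  # no membership check needed, sets ignore duplicates
    if rng.random() >= p:
        mem.remove(word)
```

With a list, the unconditional add would create duplicates, which is why the list versions need the "if k not in mem" check first.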
