Hacker News new | ask | show | jobs
by pytness 757 days ago
Is it me or is the description of the algo wrong?

    > Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word;
If i follow this description of "check if exists in list -> delete":

    if hash_set.contains(word) {
        if !keep_a_word(round) {
            hash_set.remove(word);
            continue;
        }
    } else {
        hash_set.insert(word.to_string());
    }
The algorithm runs for ~20 iterations:

    Total word count: 31955 | limit: 1000
    End Round: 20, word count: 737
    Unique word count: 7233
    Estimated unique word count: 772800512
But if I save the word first and then delete the same word:

    hash_set.insert(word.to_string());

    if !keep_a_word(round) {
        hash_set.remove(word);
        continue;
    }
It gets the correct answer:

    Total word count: 31955 | 1000
    End Round: 3, word count: 905
    Unique word count: 7233
    Estimated unique word count: 7240
2 comments

I got the same problem.

When implementing the exact method as described in quanta magazine (without looking at the arxiv paper), I always had estimates like 461746372167462146216468796214962164.

Then after reading the arxiv paper, I got the the correct estimate, with this code (very close to mudiadamz's comment solution):

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem =  set()  
    for k in L:
        if k in mem:
            mem.remove(k)
        if np.random.rand() < p:
            mem.add(k)
        if len(mem) == thresh:
            mem = {m for m in mem if np.random.rand() < 0.5}
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Or equivalently:

    import numpy as np
    L = np.random.randint(0, 3900, 30557)
    print(f"{len(set(L))=}")
    thresh = 100
    p = 1
    mem = []
    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
    print(f"{len(mem)=} {p=} {len(mem)/p=}")

Now I found the quanta magazine formulation problem. By reading:

> Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list. Proceed in this fashion until you have 100 words on the whiteboard. Then randomly delete about half again, based on the outcome of 100 coin tosses. That concludes Round 1.

we want to write:

    for k in L:
        if k not in mem:
            mem += [k]
        else:
            if np.random.rand() > p:
                mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
whereas it should be (correct):

    for k in L:
        if k not in mem:
            mem += [k]
        if np.random.rand() > p:    # without the else
            mem.remove(k)
        if len(mem) == thresh:
            mem = [m for m in mem if np.random.rand() < 0.5]
            p /= 2
Just this little "else" made it wrong!
Yes, there is an error in the Quanta article [at the same time, I must add that writing popular science articles is very hard, so it would be wrong to blame them]

Your fix is indeed correct; we may want to have either while loop instead of "if len(mem) == thresh" as there is very small (but non-zero) probability that length of mem is still thresh after executing: mem = [m for m in mem if np.random.rand() < 0.5]

["While" was Knuth's idea; and has added benefit of providing unbiased estimator.]

Quanta:

    Round 1. Keep going through Hamlet, adding new words as you go. If you come to a word that’s already on your list, flip a coin again. If it’s tails, delete the word; heads, and the word stays on the list.

To:

    Round 1. Keep going through Hamlet, but now flipping a coin for each word. If it’s tails, delete the word if it exists; heads, and add the word  if it's not already on the list.

Old edit:

    Round 1. Keep going through Hamlet, adding words but now flipping a coin immediately after adding it. If it’s tails, delete the word; heads, and the word stays on the list.
> adding words but now flipping a coin immediately after adding it

Edit: I thought your formulation was correct but not really:

We flip the coin after adding, but we also flip the coin even if we didn't add the word (because it was already there). This is subtle!

wrong:

    if k not in mem:
        mem += [k]
        if np.random.rand() > p:
            mem.remove(k)
wrong:

    if k not in mem:
        mem += [k]
    else:
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same than "else" here
        if np.random.rand() > p:
            mem.remove(k)
correct:

    if k not in mem:
        mem += [k]
    if np.random.rand() > p:
        mem.remove(k)
The following is also not correct.

    if k not in mem:
        mem += [k]
    if k in mem:      # not the same than "else" here
        if np.random.rand() > p:
            mem.remove(k)
Your final solution is indeed correct, and I think more elegant than what we had in our paper [I am one of the authors].
Ah, I'm using a set instead of list so I just always add and then toss remove.
Was just now solving it and came to see if others had the same issue. Yep, you are right.

    function generateRandomNumbers(c, n) {
      let randomNumbers = new Array(c);
      for (let i = 0; i < randomNumbers.length; i++) {
        let randomNumber = Math.floor(Math.random() * (n + 1));
        randomNumbers[i] = randomNumber;
      }
      return randomNumbers;
    }
    function run(w, wS, m, r) {
        function round(r) {
            while(wS.size < m) {
                const next = w.next()
                if (next.done) return true;
                wS.add(next.value)
                prune(next.value, r)
            }
            return false
        }
        function prune(v,r) {
            for (let i = 0; i < r; i++) {
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(v)
                }
            }
        }
        function purge(wS) {
            const copy = new Set(wS)
            copy.forEach(ith=>{
                const flip = new Boolean(Math.round(Math.random()))
                if (flip == false) {
                    wS.delete(ith)
                }
            })
        }
        const done = round(r);
        if (!done) {
            purge(wS)
            return run(w, wS, r+1,m)
        }
        console.log(`Round ${r} done. ${wS.size} Estimate: ${wS.size / (1/Math.pow(2,r))}`)
    }
    const memory = 1000
    const words = generateRandomNumbers(3000000,15000)
    const w = words[Symbol.iterator]() // create an iterator
    const wS = new Set();
    run(w,wS, memory,0);
Noticed an error;

    return run(w, wS, r+1,m)
Should be changed to:

    return run(w, wS, m, r+1)