|
The way I'd describe it is: - You have a buffer. You initially try to fit every unique word on there.
- If the buffer gets full, you know that you can't fit all the unique words in there. So you decide to keep only a fraction, _p1_, of them. Run through the buffer, keep each value with probability _p1_.
- Keep adding new words, again only with probability _p1_.
- If the buffer gets full again, _p1_ was too large, so you choose a lower fraction, _p2_, retain existing elements only with probability _p2_/_p1_, and continue adding new words with probability _p2_.
- Every time the buffer gets full, you lower the faction of words you try to retain. The choice of using _pn_ = (1/2)^n is just easy for a computer, it only needs entire random bits at each step. What I _don't_ get is how this is correct for counting unique words. If I have a text of 16384 unique words, then I can accept that each will be in the final list with the same probability. But if I take that list and repeat the last word 30000 times, then it becomes overwhelmingly plausible that that word is in the final list, even though I haven't changed the number of unique words.
Maybe there is some statistical property that evens things out, but I couldn't see it from the article. |
If the input is known to be large, there is no reason to start by adding every element. It can treat the first round like any other, and start out with a _p0_ that is smaller than 1.