Hacker News new | ask | show | jobs
by Majromax 768 days ago
Yes, you're supposed to throw it away.

The key insight is that words should appear in your scratch space with equal probability, no matter how often they appear in the source text. If you have a scratch space of size one, then the sequence of "apple" x 16 + "banana" x 1 should have equal chances of of the scratch space containing [apple] or [banana] at the end of the sequence, at least averaged over all 17 permutations.

One way to achieve this result is to make the scratch space memory-free. Rather than think of it as "remove the word with probability x", think of it as "always remove the word, then re-add it with probability (1-x)".

1 comments

Aaaah, remove and the re-add (maybe) makes a bit more sense..

So if it is not in the list you just add it, right? Actually is that right? Won’t the list fill up to the max again at some point like this?

So if so, I add Bernardo. Now the very next word is Bernardo so I remove the last Bernardo and maybe re-add it based on a 50% chance.