

by eru 407 days ago

> However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?

Yes, of course.

You can fix this problem, however. There are (at least) two ways:

You can use an alternative interpretation and implementation of reservoir sampling: for each item, you generate and store a random priority as it comes into the system. For each interval (e.g. each second) you keep the top k items by priority. If you want to aggregate multiple intervals, you keep the top k (or fewer) items across those intervals.

This automatically treats all items the same, whether they arrived during busy or quiet periods: an item's chance of surviving depends only on its random priority, not on how much traffic its interval saw.
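A rough Python sketch of the priority idea, in case it helps. The names `sample_interval` and `merge_samples` are made up for illustration, and it assumes priorities are uniform random floats (ties are vanishingly unlikely):

```python
import heapq
import random


def sample_interval(items, k, rng=random):
    """Keep the k items with the highest random priority from one interval."""
    heap = []  # min-heap of (priority, seq, item); heap[0] is the weakest survivor
    for seq, item in enumerate(items):
        entry = (rng.random(), seq, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # New item beats the weakest survivor, so it takes its place.
            heapq.heapreplace(heap, entry)
    return heap


def merge_samples(samples, k):
    """Combine per-interval samples: keep the top k priorities overall."""
    combined = [entry for sample in samples for entry in sample]
    return heapq.nlargest(k, combined, key=lambda entry: entry[0])


if __name__ == "__main__":
    busy = sample_interval(range(1000), k=5)        # a busy second
    quiet = sample_interval(range(1000, 1010), k=5)  # a quiet second
    merged = merge_samples([busy, quiet], k=5)
    print([item for _, _, item in merged])
```

The merge works because any item in the overall top k must also be in the top k of its own interval, so nothing that matters was thrown away earlier.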

An alternative view of the same approach doesn't store any priorities, but stores the number of dropped items in each interval. You can then do some arithmetic to work out how to combine samples from different intervals; very similar to what's in the article.
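One way that arithmetic could look, as a sketch rather than what the article does: pick each merged item from an interval with probability proportional to how many items that interval still represents. `merge_counted` is a hypothetical name, and it assumes each input is a uniform sample whose capacity was at least k:

```python
import random


def merge_counted(sample_a, dropped_a, sample_b, dropped_b, k, rng=random):
    """Merge two uniform samples into one of size <= k using dropped counts."""
    a, b = list(sample_a), list(sample_b)
    rng.shuffle(a)
    rng.shuffle(b)
    # Number of original items each sample stands for (kept + dropped).
    n_a = len(a) + dropped_a
    n_b = len(b) + dropped_b
    total = n_a + n_b
    merged = []
    while len(merged) < k and n_a + n_b > 0:
        # Draw from A with probability n_a / (n_a + n_b), without replacement.
        if rng.random() * (n_a + n_b) < n_a:
            merged.append(a.pop())
            n_a -= 1
        else:
            merged.append(b.pop())
            n_b -= 1
    # Everything not kept in the merged sample counts as dropped.
    return merged, total - len(merged)


if __name__ == "__main__":
    # Busy second: 1000 items seen, 5 kept. Quiet second: 10 seen, 5 kept.
    merged, dropped = merge_counted(["b1", "b2", "b3", "b4", "b5"], 995,
                                    ["q1", "q2", "q3", "q4", "q5"], 5, k=5)
    print(merged, dropped)
```

With those counts, most of the merged sample will tend to come from the busy second, which is what removes the bias towards slow periods.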

> What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?

Anything you can do on any unbiased sample? Or are you talking about the specific variant in the article where you do reservoir sampling afresh each second?