Another interesting direction you can take reservoir sampling is instead of drawing a random number for each item (to see whether it replaces an existing item and which one), you generate a number from a geometric distribution telling you how many items you can safely skip before the next replacement.
That's especially interesting, if you can skip many items cheaply. Eg because you can fast forward on your tape drive (but you don't know up front how long your tape is), or because you send almost your whole system to sleep during skips.
For n items to sample from, this system does about O(k * log (n/k)) samples and skips.
Conceptually, I prefer the version of reservoir sampling that has you generate a fixed random 'priority' for each card as it arrives, and then you keep the top k items by priority around. That brings me to another related interesting algorithmic problem: selecting the top k items out of a stream of elements of unknown length in O(n) time and O(k) space. Naive approaches to reaching O(k) space will give you O(n log k) time, eg if you keep a min heap around.
What you can do instead is keep an unordered buffer of capacity up to 2k. As each item arrives, you add it to the buffer. When your buffer is full, you prune it to the top k element in O(k) with eg randomised quickselect or via median-of-medians. You do that O(2k) work every k elements for n elements total, given you the required O(n) = O(n * 2*k / k) runtime.
I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.
> I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.
If memory serves right, they don't do much about how you can efficiently support changes to your discrete probability distribution.
I appreciate the offer (and your contributions in the comments here!) but collaborations are very difficult for me atm. Most of the work I do on these posts I do when I can steal time away from other aspects of my life, which can sometimes take weeks. I wouldn’t be a dependable collaboration partner.
It is limited to integer weights only to make it easy to verify that the algorithm implements the requested distribution exactly. (See the test file in the same directory.)
You could probably restrict to rational numbers, and still verify? Languages like Python, Haskell, Rust etc have good support for arbitrary length rational numbers.
Each floating point number is also a rational number, and thus you could then restrict again to floating point afterwards.
Alias tables are neat and not super well known. We used to have an interview question around sampling from a weighted distribution (typical answer: prefix sum -> binary search) and I don’t think anyone produced this. I like the explanation in that blog. The way it was explained to me was first ‘imagine drawing a bar chart and throwing a dart at it, retrying if you miss. This simulates the distribution but runs in expected linear time’. Then you can describe how to chop up the bars to fit in the rectangle you would get if all weights were equal. Proof that the greedy algorithm works is reasonably straightforward.
I'm not actually sure this makes for a good interview question. Doesn't it mostly just test whether you've heard of the alias method?
Btw, a slightly related question:
Supposed you have a really long text file, how would you randomly sample a line? Such that all lines in the text file have the exactly same probability. Ideally, you want to do this without spending O(size of file) time preprocessing.
(I don't think this is a good interview question, but it is an interesting question.)
One way: sample random characters until you randomly hit a newline. That's the newline at the end of your line.
It’s a retired question (so I’m not really disagreeing that it wasn’t very good), and no one was expected to get the alias tables (if they did, just ask for updateable weights) and in fact there isn’t even much point in telling people about them as they can then get the impression they failed the interview. The point is more to get some kind of binary search and understanding of probability.
The Monte Carlo method you propose probably works for files where there are many short lines but totally fails in the degenerate case of one very long line. It also may not really work that well in practice because most of the cost of reading a random byte is reading a big block from disk, and you could likely scan such a block in ram faster than you could do the random read of the block from disk.
That's exactly the blog post that clicked when I put my alias method [0] together. Their other writing is delightful as well.
[0] https://github.com/hmusgrave/zalias It's nothing special, just an Array-of-Struct-of-Array implementation so that biases and aliases are always in the same cache line.
Does this method compose with itself? E.g. if I implement reservoir sampling in my service and then the log collector service implements reservoir sampling, is the result the same as if only the log collector implemented it?
Though I think it's only strictly true, if the intervals you sample over are the same. Eg they both sample some messages every second, and the all start their second-long intervals on the same nanosecond (or close enough).
I find it easier to reason about reservoir sampling in an alternative formulation: the article talks about flipping a random (biased) coin for each arrival. Instead we can re-interpret reservoir sampling as assigning a random priority to each item, and then keeping the items with the top k priority.
It's fairly easy to see in this reformulation whether specific combinations of algorithms would compose: you only need to think about whether they would still select the top k items by priority.
The second formulation sounds easier to use to adapt to specific use cases too: just bump the priority of a message based on your business rules to make it more likely that interesting events get to your log database.
You could do (category, random priority) and then do lexicographic comparison. That way higher categories always outrank lower categories.
But depending on what you need, you might also just do (random priority + weight * category) or so. Or you just keep separate reservoirs for high importance items and for everything else.
I would expect any way to get a truly fair sample from a truly fair sample would necessarily result in a truly fair sample. I can't imagine how it could possibly not.
In the first instance, every second we get a 'truly fair' random sample from all the messages in that second.
Going from there to eg a 'truly fair' random sample from all the messages in a minute is not trivial. And it's not even possible just from the samples, without auxiliary information.
Well done, I really like the animations and the explanation. Especially the case where it's a graph and we can drag ahead or click "shuffle 100 times"
One thing that threw me for a bit is when it switched from the intro of picking 3 cards at random from
a deck of 10 or 436,234 to picking just one card. It's seems as if it almost needs a section heading before "Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random?" indicating that now we're switching to a simplifying assumption that we're holding only 1 card not 3, but we also don't know the size of the deck.
Love your website’s design, I find all of interactivity, the dog character as an “audience”, and even the font/color/layout wonderful. Loved the article too!
Just noticed the physics simulator at the top is interactive. Then I was stacking squares on top of each other to see how tall I could make it, and started throwing things at it angry birds style. Fun stuff.
Something no one seems to have realised yet is that the hero simulation at the top of the page is using reservoir sampling to colour 3 of the shapes black.
So many nice touches that combine to be much more than the sum of the parts.
Doe's bandana is cool, your dogs must worship you for your commitment to them!
My only suggestion is a way to slow down or ^S the log to read the funny messages, since they were flying by so fast I could only get a glimpse, even with reservoir sampling.
Doe’s bandana is my attempt at tasteful solidarity and support. Glad you noticed it!
I did consider a pause button for the logs but it felt too unsubtle, and distracts from the content of the post. You could argue the log messages are already distracting, but I really wanted my own take on “reticulating splines.”
You're welcome! I think it's a beautiful palette, and I think people have come to associate me with it now so I don't think I'll ever change.
I view all of my posts using the various colour blindness filters in the Chrome dev tools during development, to make sure I'm not using any ambiguous pairings. I'm glad that effort made you feel welcome and able to enjoy the content fully.
However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?
For example, if I want to optimize my code, and I need to know which endpoints are using the most time across the entire fleet to optimize my total costs (CPU-seconds or whatever), this would be an inappropriate method to use, since endpoints that get bursty traffic would be disproportionally underrepresented compared to endpoints that get steady constant traffic. So I'd end up wasting my time working on endpoints that don't actually get a lot of traffic.
Or if I'm trying to plan capacity for different services, and I want to know how many nodes to be running for each service, services that get bursty traffic would be underrepresented as well, correct?
What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?
> However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?
Yes, of course.
You can fix this problem, however. There are (at least) two ways:
You can do an alternative interpretation and implementation of reservoir sampling: for each item you generate and store a random priority as it comes into the system. For each interval (eg each second) you keep the top k items by priority. If you want to aggregate multiple intervals, you keep the top k (or less) items over the intervals.
This will automatically deal with dealing all items the same, whether they arrived during busy or non-busy periods.
An alternative view of the same approach doesn't store any priorities, but stores the number of dropped items each interval. You can then do some arithmetic to tell you how to combine samples from different intervals; very similar to what's in the article.
> What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?
Anything you can do on any unbiased sample? Or are you talking about the specific variant in the article where you do reservoir sampling afresh each second?
Good question. I'm not sure how suitable this would be to then do statistical analysis on what remains. You'd likely want to try and aggregate at source, so you're considering all data and then only sending up aggregates to save on space/bandwidth (if you were at the sort of scale that would require that).
The use-case I chose in the post was more focusing on protecting some centralised service while making sure when you do throw things away, you're not doing it in a way that creates blind-spots (e.g. you pick a rate limit of N per minute and your traffic is inherently bursty around the top of the minute and you never see logs for anything in the tail end of the minute.)
A fun recent use-case you might have seen was in https://onemillionchessboards.com. Nolen uses reservoir sampling to maintain a list of boards with recent activity. I believe he is in the process of doing a technical write-up that'll go into more detail.
I just listened to your episode on fafo.fm a few days ago and recognised your handle and already knew the link was worth clicking. Your stuff is awesome!
The post looks interesting, but I'm still building my triangle castle so I haven't read it yet. This probably means I should read it later when I'm less distracted. Anyway, already love the layout and the playful visualisations.
Another interesting direction you can take reservoir sampling is instead of drawing a random number for each item (to see whether it replaces an existing item and which one), you generate a number from a geometric distribution telling you how many items you can safely skip before the next replacement.
That's especially interesting, if you can skip many items cheaply. Eg because you can fast forward on your tape drive (but you don't know up front how long your tape is), or because you send almost your whole system to sleep during skips.
For n items to sample from, this system does about O(k * log (n/k)) samples and skips.
Conceptually, I prefer the version of reservoir sampling that has you generate a fixed random 'priority' for each card as it arrives, and then you keep the top k items by priority around. That brings me to another related interesting algorithmic problem: selecting the top k items out of a stream of elements of unknown length in O(n) time and O(k) space. Naive approaches to reaching O(k) space will give you O(n log k) time, eg if you keep a min heap around.
What you can do instead is keep an unordered buffer of capacity up to 2k. As each item arrives, you add it to the buffer. When your buffer is full, you prune it to the top k element in O(k) with eg randomised quickselect or via median-of-medians. You do that O(2k) work every k elements for n elements total, given you the required O(n) = O(n * 2*k / k) runtime.
Another related topic is rendezvous hashing: https://en.wikipedia.org/wiki/Rendezvous_hashing
Tangentially related: https://www.keithschwarz.com/darts-dice-coins/ is a great write-up on the alias method for sampling from a discrete random distribution.