Hacker News new | ask | show | jobs
by bagavi 1169 days ago
Simple explaination of the general idea using nearest neighbor search as example:

1. You want to calculate the l2-nearest neighbor to y from a set of x1, x2, ... xn. Where each point is a d-dimensional vector.

2. Naive approach will take O(nd) computations. For very large d, say 1e4, this is very expensive

Multi-arm bandits idea:

3. Estimate the distance of y to each point (xi) by sampling only along log(d) dimensions.

4. Keep the nearest n/2 points (throw away the rest)

5. Repeat step 3 and 4 until you have one point left

6. The total run time = nlog(d) + nlog(d)/2 + nlog(d)/4.... = O (nlog(d)log(n))

This is a huge improvement over O(nd) for large d (which is usually the case today)

This paper (https://ar5iv.org/abs/1805.08321) from Stanford pioneered this idea for nearest neighbors and many such household problems. It also theoretically proves that the above approach gives the right answer and empirically gets huge (100x) speed boost on large datasets.

The k-medoid paper is based off of this work :)

2 comments

And the biggest caveats IMO:

- Like any other probabilistic clustering algorithm that gets is speed boost from just ignoring large chunks of the data (like MiniBatch KMeans), this will not take outliers into account very well or very often.

- They're advertising this as not just a better KMedoids implementation, but a better drop-in replacement for KMeans. For generic clustering tasks, sure, fine, whatever, but the metric the new paper is trying to optimize is different from the one KMeans uses, so if KMeans has the right metric for a given task then you'll be switching to a (maybe) faster algorithm that just computes the wrong result. The easiest example that comes to mind is a dataset of non-intersecting hollow spheres. KMeans will spit out (for some choice of NClusters) the sphere centers, and KMedoids will spit out sphere boundary points, decreasing performance on the far side of the sphere and potentially allowing classification jumping from one sphere to another.

Both of those things are just qualities you may or may not want, so KMedoids may be better purely because it has those biases and KMeans doesn't, but it's not totally uncommon to just want cluster centers minimizing some error and not care how you get there or how explainable they are (the bolt vector quantization algorithm comes to mind), where KMedoids would just be the wrong choice.

I understand what you say, and I don't disagree with your analysis for the collection of datapoints on nonintersecting spherical surfaces. (Although I do frown at this type of dataset, since distribution densities tend to peak at their peaks, the "in high enough dimensions everything lies on a sphere" is a misconception: it is only the radial histogram that peaks at a nonzero radius...).

Just thought that it might be worth pointing out that there are valid use cases of clustering with real data reference for each class available: consider pictures, to be inspectable it would be helpful to see a real picture instead of some blurry interpolated mess.

Would you mind providing a reference for the bolt vector quantization algorithm? It sounds interesting.

> frown on that sort of dataset

That example was definitely contrived and designed to strongly illustrate the point. I'll counter slightly that non-peaky topologies aren't uncommon, but they're unlikely to look anything that would push KMedoids to a pathological state rather than just a slightly worse state ("worse" assuming that KMeans is the right choice for a given problem).

> worth pointing out .. data reference

Totally agreed. I hope my answer didn't come across as too negative. It's good work, and everyone else was talking about the positives, so I just didn't want to waste too much time echoing again that while getting the other points across.

> bolt reference

https://github.com/dblalock/bolt

They say as much in their paper, but they aren't the first vector quantization library by any stretch. Their contributions are, roughly:

1. If you're careful selecting the right binning strategy then you can cancel out a meaningful amount of discretization error.

2. If you do that, you can afford to choose parameters that fit everything nicely into AVX2 machine words, turning 100s of branching instructions into 1-4 instructions.

3. Doing some real-world tests to show that (1-2) matter.

Last I checked their code wasn't very effective for the places I wanted to apply it, but the paper is pretty solid. I'd replace it with a faster KMeans approximation less likely to crash on big data (maybe even initializing with KMedoids :) ), and if the thing you're quantizing is trainable with some sort of gradient update step then you should do a few optimization passes in the discretized form as well.

We talk exactly about clustering pictures in our blog post! https://ai.stanford.edu/blog/banditpam/
Right! Though to clarify a few nits: we use successive elimination instead of successive halving. And we talk about the Maximum Inner Product Search problem (very similar to NN problem) in our followup work: https://ar5iv.org/abs/2212.07551