Hacker News new | ask | show | jobs
by DoctorOetker 1168 days ago
I understand what you say, and I don't disagree with your analysis for the collection of datapoints on nonintersecting spherical surfaces. (Although I do frown at this type of dataset, since distribution densities tend to peak at their peaks, the "in high enough dimensions everything lies on a sphere" is a misconception: it is only the radial histogram that peaks at a nonzero radius...).

Just thought that it might be worth pointing out that there are valid use cases of clustering with real data reference for each class available: consider pictures, to be inspectable it would be helpful to see a real picture instead of some blurry interpolated mess.

Would you mind providing a reference for the bolt vector quantization algorithm? It sounds interesting.

2 comments

> frown on that sort of dataset

That example was definitely contrived and designed to strongly illustrate the point. I'll counter slightly that non-peaky topologies aren't uncommon, but they're unlikely to look anything that would push KMedoids to a pathological state rather than just a slightly worse state ("worse" assuming that KMeans is the right choice for a given problem).

> worth pointing out .. data reference

Totally agreed. I hope my answer didn't come across as too negative. It's good work, and everyone else was talking about the positives, so I just didn't want to waste too much time echoing again that while getting the other points across.

> bolt reference

https://github.com/dblalock/bolt

They say as much in their paper, but they aren't the first vector quantization library by any stretch. Their contributions are, roughly:

1. If you're careful selecting the right binning strategy then you can cancel out a meaningful amount of discretization error.

2. If you do that, you can afford to choose parameters that fit everything nicely into AVX2 machine words, turning 100s of branching instructions into 1-4 instructions.

3. Doing some real-world tests to show that (1-2) matter.

Last I checked their code wasn't very effective for the places I wanted to apply it, but the paper is pretty solid. I'd replace it with a faster KMeans approximation less likely to crash on big data (maybe even initializing with KMedoids :) ), and if the thing you're quantizing is trainable with some sort of gradient update step then you should do a few optimization passes in the discretized form as well.

We talk exactly about clustering pictures in our blog post! https://ai.stanford.edu/blog/banditpam/