|
|
|
|
|
by DoctorOetker
1168 days ago
|
|
I understand what you say, and I don't disagree with your analysis for the collection of datapoints on nonintersecting spherical surfaces. (Although I do frown at this type of dataset, since distribution densities tend to peak at their peaks, the "in high enough dimensions everything lies on a sphere" is a misconception: it is only the radial histogram that peaks at a nonzero radius...). Just thought that it might be worth pointing out that there are valid use cases of clustering with real data reference for each class available: consider pictures, to be inspectable it would be helpful to see a real picture instead of some blurry interpolated mess. Would you mind providing a reference for the bolt vector quantization algorithm? It sounds interesting. |
|
That example was definitely contrived and designed to strongly illustrate the point. I'll counter slightly that non-peaky topologies aren't uncommon, but they're unlikely to look anything that would push KMedoids to a pathological state rather than just a slightly worse state ("worse" assuming that KMeans is the right choice for a given problem).
> worth pointing out .. data reference
Totally agreed. I hope my answer didn't come across as too negative. It's good work, and everyone else was talking about the positives, so I just didn't want to waste too much time echoing again that while getting the other points across.
> bolt reference
https://github.com/dblalock/bolt
They say as much in their paper, but they aren't the first vector quantization library by any stretch. Their contributions are, roughly:
1. If you're careful selecting the right binning strategy then you can cancel out a meaningful amount of discretization error.
2. If you do that, you can afford to choose parameters that fit everything nicely into AVX2 machine words, turning 100s of branching instructions into 1-4 instructions.
3. Doing some real-world tests to show that (1-2) matter.
Last I checked their code wasn't very effective for the places I wanted to apply it, but the paper is pretty solid. I'd replace it with a faster KMeans approximation less likely to crash on big data (maybe even initializing with KMedoids :) ), and if the thing you're quantizing is trainable with some sort of gradient update step then you should do a few optimization passes in the discretized form as well.