| HN Mirror

> frown on that sort of dataset

That example was definitely contrived and designed to strongly illustrate the point. I'll counter slightly that non-peaky topologies aren't uncommon, but they're unlikely to look anything that would push KMedoids to a pathological state rather than just a slightly worse state ("worse" assuming that KMeans is the right choice for a given problem).

> worth pointing out .. data reference

Totally agreed. I hope my answer didn't come across as too negative. It's good work, and everyone else was talking about the positives, so I just didn't want to waste too much time echoing again that while getting the other points across.

> bolt reference

https://github.com/dblalock/bolt

They say as much in their paper, but they aren't the first vector quantization library by any stretch. Their contributions are, roughly:

1. If you're careful selecting the right binning strategy then you can cancel out a meaningful amount of discretization error.

2. If you do that, you can afford to choose parameters that fit everything nicely into AVX2 machine words, turning 100s of branching instructions into 1-4 instructions.

3. Doing some real-world tests to show that (1-2) matter.

Last I checked their code wasn't very effective for the places I wanted to apply it, but the paper is pretty solid. I'd replace it with a faster KMeans approximation less likely to crash on big data (maybe even initializing with KMedoids :) ), and if the thing you're quantizing is trainable with some sort of gradient update step then you should do a few optimization passes in the discretized form as well.