

by elashri 1175 days ago

Hi Mo, thanks for this work. It seems interesting.

I had a chance to play with it a little and wanted to compare it with KMeans. I relied on sklearn's KMeans implementation.

I also ran some examples (mostly the ones already available). One interesting thing I did was generate some isotropic Gaussian blobs for clustering (using `make_blobs`) and compare the two methods. BanditPAM was a little better on a couple of the metrics I used, and also much faster. That was with `n_samples=1000`; when I increased it to `n_samples=10000`, BanditPAM became much slower than KMeans (see [1]; the code is in [2]). Is there a particular reason for that?

[1] https://imgur.com/a/VibpgNz

[2] https://paste.elashri.xyz/aXCE
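For readers who cannot open [2], a comparison along these lines could be sketched roughly as follows. This is not the original script: the cluster count, the silhouette metric, the seeds, and the helper names (`time_fit`, `fit_kmeans`, `fit_banditpam`, `run_benchmark`) are all assumptions for illustration, and the BanditPAM calls assume the `KMedoids` interface shown in the project's README.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def time_fit(fit_fn, X):
    """Run fit_fn(X) once and return (labels, elapsed_seconds)."""
    start = time.perf_counter()
    labels = fit_fn(X)
    return labels, time.perf_counter() - start


def fit_kmeans(X, k):
    # scikit-learn KMeans with a fixed seed so runs are repeatable.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)


def fit_banditpam(X, k):
    # Assumption: banditpam exposes KMedoids(n_medoids=..., algorithm=...)
    # with fit(X, loss) and a `labels` attribute, as in its README.
    from banditpam import KMedoids

    model = KMedoids(n_medoids=k, algorithm="BanditPAM")
    model.fit(X, "L2")
    return model.labels


def run_benchmark(sizes=(1000, 10000), k=5):
    """Time both methods on isotropic Gaussian blobs of each size."""
    results = []
    for n in sizes:
        X, _ = make_blobs(n_samples=n, centers=k, n_features=2, random_state=0)
        for name, fit_fn in (("KMeans", fit_kmeans), ("BanditPAM", fit_banditpam)):
            try:
                labels, elapsed = time_fit(lambda data: fit_fn(data, k), X)
            except ImportError:
                continue  # banditpam is not installed; skip that row
            # Score on a subsample so the metric stays cheap at larger n.
            score = silhouette_score(
                X, labels, sample_size=min(n, 1000), random_state=0
            )
            results.append((name, n, elapsed, score))
    return results


if __name__ == "__main__":
    for name, n, elapsed, score in run_benchmark():
        print(f"{name:10s} n={n:6d} time={elapsed:8.3f}s silhouette={score:.3f}")
```

Timing a single fit per size is noisy, so repeating each run a few times and reporting the median would give a fairer picture.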


motiwari 1173 days ago

Thanks for the bug report and repro steps! I've filed this issue at https://github.com/motiwari/BanditPAM/issues/244 on our repo.

I suspect that this is because the scikit-learn implementation of KMeans subsamples the data and uses some highly optimized data structures for larger datasets. I've asked the team to look into how we can use some of those techniques in BanditPAM, and we will update the GitHub repo as we learn more and improve our implementation.
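As a rough illustration of what subsampling can look like in scikit-learn, the sketch below fits the full-batch `KMeans` next to `MiniBatchKMeans`, which updates centers from small random batches rather than the whole dataset. It only demonstrates the idea; it does not establish what the plain `KMeans` code path does internally, and the function name and parameter values are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs


def compare_full_vs_minibatch(n_samples=10000, k=5, batch_size=1024):
    """Fit full-batch and mini-batch k-means on the same blobs.

    Returns (full_inertia, minibatch_inertia), the within-cluster sums
    of squared distances for each fitted model.
    """
    X, _ = make_blobs(n_samples=n_samples, centers=k, random_state=0)

    # Full-batch k-means: every iteration uses all points.
    full = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # Mini-batch k-means: each update uses only `batch_size` sampled points.
    mini = MiniBatchKMeans(
        n_clusters=k, batch_size=batch_size, n_init=3, random_state=0
    ).fit(X)

    return full.inertia_, mini.inertia_


if __name__ == "__main__":
    full_inertia, mini_inertia = compare_full_vs_minibatch()
    print(f"full-batch inertia: {full_inertia:.1f}")
    print(f"mini-batch inertia: {mini_inertia:.1f}")
```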
