by elashri 1175 days ago
Hi Mo, thanks for this work. It seems interesting.

I had the chance to play with it a little and wanted to compare it with KMeans. I relied on sklearn's KMeans implementation.

I ran some examples (mostly the ones that are available). One interesting thing I did was generate some isotropic Gaussian blobs for clustering (using `make_blobs`) and then compare the two methods. BanditPAM was a little better on a couple of the metrics I used, and also much faster. That was with `n_samples=1000`; when I increased it to `n_samples=10000`, BanditPAM became much slower than KMeans. See [1] for the results and [2] for the code. Is there a particular reason for that?

[1] https://imgur.com/a/VibpgNz

[2] https://paste.elashri.xyz/aXCE
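For anyone who wants to try this without opening the paste, here is a minimal sketch of the kind of comparison described above. It is not the code from [2]. The silhouette score, the cluster count, and the helper names (`time_kmeans`, `time_banditpam`, `benchmark`) are my own choices for illustration, and the BanditPAM calls (`KMedoids(n_medoids=..., algorithm="BanditPAM")`, `fit(X, "L2")`, `.labels`) assume the interface shown in the project's README, which may differ between versions.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

try:
    # Assumption: the `banditpam` package is installed and exposes KMedoids
    # as in the project's README. If it is missing, only KMeans is timed.
    from banditpam import KMedoids
except ImportError:
    KMedoids = None


def _score(X, labels, seed=0):
    # Silhouette on a subsample keeps the scoring cost low for large inputs.
    # The metric is an illustrative choice, not necessarily the one used in [1].
    return silhouette_score(
        X, labels, sample_size=min(1000, len(X)), random_state=seed
    )


def time_kmeans(X, n_clusters, seed=0):
    """Fit sklearn KMeans; return (fit seconds, silhouette score)."""
    start = time.perf_counter()
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X)
    elapsed = time.perf_counter() - start
    return elapsed, _score(X, labels, seed)


def time_banditpam(X, n_clusters, seed=0):
    """Fit BanditPAM k-medoids; return (fit seconds, silhouette score)."""
    start = time.perf_counter()
    kmed = KMedoids(n_medoids=n_clusters, algorithm="BanditPAM")
    kmed.fit(X, "L2")  # assumed signature: data, then loss name
    elapsed = time.perf_counter() - start
    return elapsed, _score(X, kmed.labels, seed)


def benchmark(n_samples, n_clusters=5, seed=0):
    """Time both methods on isotropic Gaussian blobs of the given size."""
    X, _ = make_blobs(n_samples=n_samples, centers=n_clusters, random_state=seed)
    results = {"KMeans": time_kmeans(X, n_clusters, seed)}
    if KMedoids is not None:
        results["BanditPAM"] = time_banditpam(X, n_clusters, seed)
    return results


if __name__ == "__main__":
    for n in (1000, 10000):
        for name, (secs, score) in benchmark(n).items():
            print(f"n={n:>6}  {name:<10} {secs:8.3f}s  silhouette={score:.3f}")
```

Timing only the fit call (and scoring outside the timer) keeps the comparison about the clustering step itself; wall-clock numbers will still vary with hardware and library versions.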

1 comment

Thanks for the bug report and repro steps! I've filed this as an issue on our repo: https://github.com/motiwari/BanditPAM/issues/244

I suspect this is because the scikit-learn implementation of KMeans subsamples the data and uses some highly optimized data structures for larger datasets. I've asked the team to look into how we can use some of those techniques in BanditPAM, and we will update the GitHub repo as we learn more and improve our implementation.