Hacker News new | ask | show | jobs
by Spivak 240 days ago
The total number of clusters. Determining this algorithmically is a fun open problem https://en.wikipedia.org/wiki/Determining_the_number_of_clus....

For the data I work with at $dayjob I've found the Silhouette algorithm to perform best but I assume it will be extremely field specific. Clustering your data and taking a representative sample of each cluster is such a powerful trick to make big data small but finding an appropriate K is an art more than a science.