Hacker News new | ask | show | jobs
by esafak 886 days ago
The first problem is picking k. The second problem is the definition of distance or, equivalently, the uniformity of the space. Naively using the Euclidean distance in an embedding where similarity is non-uniform leads to bad outcomes. This problem is solved by learning a uniform embedding, and this is much harder than running k-means.

k-means assumes these hard parts of the problem are taken care of and offers a trivial solution to the rest. Thanks for the help, I'll cluster it myself.