| HN Mirror

Given two clusters, I can predict the class of a new article by choosing the cluster whose centroid is closest to the article.

So, given some training data that produced two reasonable clusters with respect to the ground truth, I have a model that I can expect to generalize well on new data.

Now, this is not what that notebook shows, because it's missing the evaluating on testing data! The main point of the notebook is that the Jaccard Distance of the tokens of the HTML of the page, despite being very simple, appears to generate a reasonable model.