Hacker News
by CM30 3505 days ago
How is the clustering analysis going to work?

The current hard-coded list of URLs is a start, but it's not really a scalable solution to the issue.

However, from what I can see in this file:

https://github.com/jacquerie/stop-the-bullshit/blob/master/d...

It seems like it's just going to compare against the example articles included in the source files, as found here:

https://github.com/jacquerie/stop-the-bullshit/tree/master/d...

So how is it going to detect the difference between a real and a fake piece from this?

1 comment

Given two clusters, I can predict the class of a new article by choosing the cluster whose centroid is closest to the article.
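As a rough illustration of the nearest-centroid rule described above, here is a minimal sketch. It assumes each article has already been turned into a numeric feature vector and uses Euclidean distance; the vectors, the labels, and the helper names `centroid` and `predict` are made up for the example and are not taken from the repository.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def predict(article_vec, centroids):
    """Return the label of the centroid closest to article_vec."""
    return min(centroids, key=lambda label: math.dist(article_vec, centroids[label]))

# Toy, made-up feature vectors for two clusters (assumption, not real data).
clusters = {
    "real": [[0.0, 0.0], [0.0, 2.0]],
    "fake": [[10.0, 10.0], [10.0, 12.0]],
}
centroids = {label: centroid(pts) for label, pts in clusters.items()}

# A new article is assigned to whichever centroid is nearest.
label = predict([1.0, 1.0], centroids)
```

Any other distance could be swapped in for `math.dist`; the decision rule stays the same.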

So, given some training data that produced two reasonable clusters with respect to the ground truth, I have a model that I can expect to generalize well to new data.
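One way to check that expectation is to measure accuracy on labeled examples the model never saw during clustering. The sketch below is only illustrative: `accuracy`, `toy_predict`, and the `held_out` data are hypothetical stand-ins, not anything from the project.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (features, label) pairs that predict classifies correctly."""
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

# Hypothetical stand-in classifier on a single made-up score.
def toy_predict(score):
    return "fake" if score > 0.5 else "real"

# Made-up held-out examples: (feature, ground-truth label).
held_out = [(0.1, "real"), (0.9, "fake"), (0.7, "real"), (0.2, "real")]

held_out_accuracy = accuracy(toy_predict, held_out)
```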

Now, this is not what that notebook shows, because it's missing the evaluation on test data! The main point of the notebook is that the Jaccard distance over the tokens of each page's HTML, despite being very simple, appears to produce a reasonable model.
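For reference, the Jaccard distance between two token sets A and B is 1 - |A ∩ B| / |A ∪ B|. A minimal sketch follows, assuming a naive tokenizer that lowercases the HTML and splits on word characters; the notebook's actual tokenization may differ, and `tokenize` here is a hypothetical helper.

```python
import re

def tokenize(html):
    """Naive tokenizer (assumption): lowercase, then take runs of word characters."""
    return set(re.findall(r"\w+", html.lower()))

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets; two empty sets count as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Made-up page fragments for illustration.
page_a = tokenize("<p>Breaking news today</p>")
page_b = tokenize("<p>Breaking story today</p>")
distance = jaccard_distance(page_a, page_b)
```

A distance of 0 means the two pages share exactly the same tokens, and 1 means they share none.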