by nl 5261 days ago
What package out there implements the algorithms for this, and is well-documented and trivial enough to use that a 14-year-old can understand them?

Nutch[1].

Nutch doesn't handle modern web spam particularly well, but I'd say it's roughly on par with early Google. Specifically, it implements PageRank and has a reliable web crawler and a web-scale data store.
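
For a feel of what the PageRank piece involves, here is a minimal sketch of the textbook power-iteration form in Python. To be clear, this is not Nutch's code: the `pagerank` function, the `toy_web` graph, the 0.85 damping factor and the fixed iteration count are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rough power-iteration PageRank (illustrative only, not Nutch's implementation).

    `links` maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

On this toy graph the page everyone links to ("c") should come out on top and the page nobody links to ("d") at the bottom. A real crawler-scale version would need the link graph in a distributed store rather than a dict.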

[1] http://nutch.apache.org/about.html

1 comment

Wow, yeah, that actually looks like it would do the job. Part of me now wants to implement a spam classifier on top of Nutch to see how good a web crawler I can create… thanks for the link!