HN Mirror


by dpezely 2066 days ago

Have you considered using Common Crawl [1], and if so, how did it compare with running your own spiders?

Long-term, a combination of theirs and your own could be optimal.

Their dumps have strengths and weaknesses. On the plus side, they have already done the crawling and dealt with being throttled and similar obstacles. They offer monthly dumps for general content and daily dumps for news [2].

On the other hand, it's a huge pile of data to wade through, and their index format might not suit your preferred way of working. The archive and index officially reside on AWS, so that may decide where you process it. (I'm not sure whether other providers maintain a copy.)
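
To give a feel for the index format, here is a rough Python sketch of turning one index record into a byte-range request for a single page. It assumes index records are JSON lines carrying `filename`, `offset` and `length` fields, and that the archive is reachable over HTTP at the `DATA_BASE` URL below; both are assumptions to check against the current Common Crawl documentation. The sample record and the `range_request_for` helper are hypothetical, made up for illustration.

```python
import json

# Assumed base URL for HTTP access to the archive; verify before relying on it.
DATA_BASE = "https://data.commoncrawl.org/"


def range_request_for(index_line):
    """Build (url, headers) to fetch one record from an archive file.

    Assumes the index line is a JSON object with `filename`, `offset`
    and `length` fields, where offset and length may arrive as strings.
    """
    record = json.loads(index_line)
    offset = int(record["offset"])
    length = int(record["length"])
    url = DATA_BASE + record["filename"]
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
    return url, headers


# Hypothetical index record, not taken from a real crawl.
sample_line = json.dumps({
    "url": "https://example.com/",
    "filename": "crawl-data/EXAMPLE/segments/0/warc/example.warc.gz",
    "offset": "1000",
    "length": "500",
})

url, headers = range_request_for(sample_line)
print(url)
print(headers)
```

The point is that you can pull individual pages by offset rather than downloading whole archive files, which matters given the sizes involved.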

By "huge", specifically:

> October 2020 [...] contains 2.71 billion web pages or 280 TiB of uncompressed content.
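
For a rough sense of scale, the quoted figures can be turned into a per-page average and a read-time estimate. This is back-of-the-envelope arithmetic only: the 1 GiB/s throughput is a number I picked for illustration, not a measurement, and `hours_to_read` is a hypothetical helper.

```python
# Figures from the quoted October 2020 crawl announcement.
PAGES = 2.71e9           # web pages
TIB = 2 ** 40            # bytes per tebibyte
total_bytes = 280 * TIB  # uncompressed content

# Average uncompressed size per page.
avg_bytes_per_page = total_bytes / PAGES
avg_kib_per_page = avg_bytes_per_page / 1024


def hours_to_read(size_bytes, bytes_per_second):
    """Hours needed to stream size_bytes at a sustained throughput."""
    return size_bytes / bytes_per_second / 3600


# Assumed sustained throughput of 1 GiB/s (illustrative only).
ASSUMED_THROUGHPUT = 2 ** 30
hours = hours_to_read(total_bytes, ASSUMED_THROUGHPUT)

print("average page size: {:.1f} KiB".format(avg_kib_per_page))
print("hours for one pass at 1 GiB/s: {:.1f}".format(hours))
```

Even under that optimistic throughput, a single sequential pass over the uncompressed content takes days rather than hours, which is why processing close to where the data lives is worth considering.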

Based on our analysis a few years ago, that was going to be the approach for the now-defunct Snagz.net [3] (which never fully launched because co-founders were unable to join due to extenuating circumstances).

[1] https://CommonCrawl.org

[2] https://commoncrawl.org/2016/10/news-dataset-available/ - this one can be hard to find unless you know to look for it

[3] https://web.archive.org/web/20180320001756/http://snagz.net/

1 comment

ivyabc 2066 days ago

We think the quality of their crawled pages (both web and news) is not as good as ours. Our total dataset is larger than their monthly numbers.
