| HN Mirror

by nl 4626 days ago

You don't usually download this data; you process it on AWS according to your own requirements.

Seriously - they give you an easy way to create these subsets yourself[1]. That is a much better solution than them trying to anticipate the exact needs of every potential client.

[1] http://commoncrawl.org/get-started/

2 comments

malandrew 4626 days ago

I guess what I was suggesting is "given enough eyeballs, all spam and poor-quality content is shallow."

There is definitely a benefit in using the community to identify valuable subsets and then individually putting your energy towards building discovery/search products around that subset.

gsnedders 4626 days ago

Is the example code still right with the new file formats for this new crawl?
