| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Aloisius 4626 days ago

You can fetch a single WARC file directly like say:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz

They are around 850 MB each.

The text extracts and metadata files are generated off individual WARC files, so it is pretty easy to get the corresponding sets of files. For the above it would be:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wat/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wat.gz

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wet/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wet.gz

1 comments

ccleve 4626 days ago

Is there any way to get incrementals? It would be extremely valuable is to get the pages that were added/changed/deleted each day. Some kind of a daily feed of a more limited size.

link

froo 4626 days ago

  s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/

That should get you about 90% on your way.

link