Hacker News
by rwg 4589 days ago
I really wanted to love the Common Crawl corpus. I needed an excuse to play with EC2, I had a project idea that would benefit an open source project (Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But when I actually got to work, I found the choice of Hadoop sequence files containing JSON documents for the crawl metadata absolutely maddening, and I slammed headfirst into an undocumented gotcha that ultimately killed the project: the documents in the corpus are truncated at ~512 kilobytes.

It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...

1 comment

I'd have to check the last crawl settings, but I believe the last crawl was set to truncate at 1 MB (response body size, so that could be 1 MB uncompressed or 1 MB compressed depending on what the source web server sent out).
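
To make the truncation behavior concrete, here is a minimal Python sketch of capping a response body at a byte limit and remembering whether anything was cut off. It is not the crawler's actual code: `read_capped` and `LIMIT_BYTES` are hypothetical names, and the 1 MB figure is just the value recalled above. The cap counts bytes as they arrive, which is why the same limit can mean 1 MB compressed or 1 MB uncompressed.

```python
import io

# Assumed cap, taken from the "1 MB" figure recalled above; the real setting may differ.
LIMIT_BYTES = 1024 * 1024


def read_capped(stream, limit=LIMIT_BYTES, chunk_size=65536):
    """Read at most `limit` bytes from `stream`.

    Returns (body, truncated). The bytes are counted as received, so a
    gzip-encoded body is capped on its compressed size.
    """
    chunks = []
    # Ask for one byte more than the limit so that a body of exactly
    # `limit` bytes is not reported as truncated.
    remaining = limit + 1
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    data = b"".join(chunks)
    return data[:limit], len(data) > limit


# Usage with an in-memory stand-in for a network response:
body, truncated = read_capped(io.BytesIO(b"x" * 10), limit=4)
```

Keeping the `truncated` flag next to the body is the piece that lets a consumer tell a cut-off document from a complete one.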

At one point I tried out a 10 MB limit, but we try to limit crawls to webpages and few are that big, and occasionally we'd hit sites with ISDN-speed connections that would slow down the whole thing.

For the next crawl, we'll mark which pages are truncated and which aren't (an oversight in the last crawl) so at least you can skip over them.
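
As a rough illustration of skipping marked records, here is a Python sketch that filters on a truncation marker in a record's headers. It assumes the marker shows up as the `WARC-Truncated` named field defined by the WARC format; whether the next crawl will use that field is a guess. `parse_warc_headers` and `is_truncated` are hypothetical helpers, and the two sample records are invented for the example.

```python
def parse_warc_headers(header_block):
    """Parse one WARC record's header block into a dict keyed by lower-cased field name.

    Simplified on purpose: folded (continuation) header lines are not handled.
    """
    headers = {}
    lines = header_block.strip().splitlines()
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.0"
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def is_truncated(headers):
    """True if the record carries a WARC-Truncated field (assumed marker)."""
    return "warc-truncated" in headers


# Invented sample header blocks, one marked as truncated and one not.
records = [
    "WARC/1.0\r\nWARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/a\r\nWARC-Truncated: length\r\n",
    "WARC/1.0\r\nWARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/b\r\n",
]

# Keep only the records that were stored in full.
complete = [h for h in map(parse_warc_headers, records) if not is_truncated(h)]
```

A real job would read the records with a proper WARC library rather than splitting header text by hand; the point is only that one header check is enough to skip the cut-off documents.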

Also, hopefully you'll find the new metadata files to be a little clearer. We switched over to the same format the Internet Archive uses, and it contains quite a bit more data (truncated XPath paths for each link, for instance).