by benwills 1516 days ago
In a very different way, I'm also involved in a search-related project. (edited to add: I'm going solo on my project as well) If you ever want to bounce ideas around, I'd totally be up for that.

Related: you mention sources other than Common Crawl for WARC data. Is there a list of those somewhere?

1 comment

Sure, my email is in my profile if you want to chat.

Some (though not all) of the WARCs that go into IA get published on archive.org: https://archive.org/search.php?query=warc

It's also an all-around useful format, since you can produce it with wget and other common tools. But the big reason I'm moving toward something relatively homomorphic to WARCs is to be able to (in the future) publish my own crawls.

Thanks for that link. I've done a bit of work with the Common Crawl data (and proposed moving to ZSTD with a proof of concept and performance metrics in C a few years ago).

I'll send you an email later this weekend to connect.