by marginalia_nu 1516 days ago
Sure, my email is in my profile if you want to chat.

Some WARCs that go into IA get published on archive.org, not all of them, but some: https://archive.org/search.php?query=warc

It's also an all-around useful format, since you can produce it with wget and other common tools. But the big reason I'm moving toward something that maps closely onto WARC is to be able to publish my own crawls in the future.

1 comment

Thanks for that link. I've done a bit of work with the Common Crawl data (and proposed moving to ZSTD with a proof of concept and performance metrics in C a few years ago).

I'll send you an email later this weekend to connect.