by benwills
1516 days ago
In a very different way, I'm also involved in a search-related project. (Edited to add: I'm going solo on my project as well.) If you ever want to bounce ideas around, I'd totally be up for that. Related: you mention sources other than Common Crawl for WARC data. Is there a list of those somewhere?
Some (not all) of the WARCs that go into the Internet Archive get published on archive.org: https://archive.org/search.php?query=warc
It's also an all-around useful format as you can produce it from wget and other common tools. But the big reason I'm moving toward something relatively homomorphic to WARCs is to be able to (in the future) publish my own crawls.
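For anyone wondering how little it takes to produce one: wget can write a WARC alongside a normal fetch via its `--warc-file` option, and a record is simple enough to build by hand. Below is a rough sketch, not how any particular crawler does it. It assumes the WARC/1.0 record layout (version line, named header fields, blank line, content block, two trailing CRLFs) and the common convention of gzipping each record as its own member. The function names `build_warc_record` and `gzip_member` are made up for illustration.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Build one WARC/1.0 'resource' record as raw bytes (illustrative helper)."""
    headers = [
        ("WARC-Type", "resource"),
        # Record IDs are URIs in angle brackets; a random UUID URN is a common choice.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        # WARC-Date is a UTC timestamp in ISO 8601 form.
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts the bytes of the block only, not the headers.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def gzip_member(record: bytes) -> bytes:
    """Compress a single record as its own gzip member (assumed convention)."""
    return gzip.compress(record)


if __name__ == "__main__":
    record = build_warc_record("https://example.com/", b"<html>hi</html>")
    # Concatenating such members yields a .warc.gz that can be read record by record.
    with open("example.warc.gz", "ab") as out:
        out.write(gzip_member(record))
```

A real crawl would normally also store `warcinfo`, `request`, and `response` records rather than a bare `resource` record, so treat this as a picture of the format and not a replacement for wget or a proper WARC library.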