by toomuchtodo
411 days ago
https://commoncrawl.org/

This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into Common Crawl or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity as the storage system of last resort (although storing it elsewhere is also fine imho). How you provide access to this data is, I argue, similar to how custodian institutions provide access to science datasets (examples would be NOAA, CERN, etc.).

Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.

It's already the case that Googlebot is the common-denominator bot that's allowed everywhere; CCBot, not so much.
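
To make the asymmetry concrete, here is a minimal sketch of how a site's robots.txt can welcome one crawler and shut out the other. The robots.txt contents and the example.com URL are made up for illustration and are not taken from any real site; the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example: it admits Googlebot
# and blocks Common Crawl's crawler (user agent token "CCBot").
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: CCBot
Disallow: /
"""


def allowed(user_agent, url="https://example.com/page", robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "CCBot"):
        print(bot, "allowed" if allowed(bot) else "blocked")
```

Running the same check against real sites' robots.txt files would be one way to measure how common this pattern actually is.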