|
Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site, so googlebot wins because they’re the dominant search engine. It makes sense to break that out so everyone has access to the same dataset at FRAND pricing. My heart just wants Google to burn to the ground, but my brain says this is the more reasonable approach. |
This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (examples would be NOAA, CERN, etc).
Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.