HN Mirror

by toomuchtodo 411 days ago

https://commoncrawl.org/

This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (examples would be NOAA, CERN, etc).

Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.

5 comments

mullingitover 410 days ago

The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot. Everyone who only allows googlebot would need to change and allow ccbot instead.

It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.
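To make the allowlist problem concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up for illustration; they stand in for a site that admits only Googlebot, and real sites may enforce this in other ways (for example at the CDN or firewall).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that only allows Googlebot.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/page"):
    """Return a dict mapping each user agent to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["Googlebot", "CCBot"])
print(result)
```

Under these rules Googlebot is admitted and CCBot is not, so every site with a file like this would have to edit it before a shared crawler could take over.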

xp84 410 days ago

Wouldn’t a decent solution, if some action happened where Google was divesting the crawler stuff, be to just do like browser user agents have always done (in that case multiple times to comical degrees)? Something like ‘Googlebot/3.1 (successor, CommonCrawl 1.0)’
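As a rough sketch of why that could work for robots.txt rules: Python's standard `urllib.robotparser` matches on the product token before the first `/`, so the hypothetical user agent string above would still fall under a `Googlebot` rule. The robots.txt content and URL here are invented for illustration, and this says nothing about sites that verify Googlebot by other means, such as reverse DNS lookups.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that only allows Googlebot.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# The user agent string proposed in the comment (purely hypothetical).
SUCCESSOR_UA = "Googlebot/3.1 (successor, CommonCrawl 1.0)"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/page"
successor_allowed = parser.can_fetch(SUCCESSOR_UA, url)
plain_ccbot_allowed = parser.can_fetch("CCBot/2.0", url)

print("successor UA allowed:", successor_allowed)
print("plain CCBot allowed:", plain_ccbot_allowed)
```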

toomuchtodo 410 days ago

Lots of good replies to your comment already. I'd also offer up Cloudflare offering the option to crawl customer origins, with them shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and reduces unnecessary load as someone like Cloudflare can manage the crawler worker queue and network shipping internally.

(Cloudflare customer, no other affiliation)

kzrdude 410 days ago

That says that if google switches over to ccbot then the rest will follow.

CPLX 410 days ago

I mean if it’s created as part of setting the global rules for the internet you could just make it opt out.

sanderjd 411 days ago

Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.

toomuchtodo 411 days ago

With access to archived crawls, anyone can build and serve an index, or model weights (GPT).

fallingknife 410 days ago

Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?

everforward 410 days ago

A potentially shocking amount depending on the desired freshness if the bot isn’t custom tailored per site. I worked at a job posting site and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than the index.

Bots are typically tuned to work with generic sites rather than to crawl any one site efficiently.
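A back-of-the-envelope sketch of how freshness drives load, assuming every crawler refetches every page once per refresh window. The page count, crawler count, and refresh windows are hypothetical numbers chosen only to show the scaling, not figures from the job site.

```python
def crawl_requests_per_second(pages, crawlers, refresh_days):
    """Average request rate if each crawler refetches each page once per window."""
    seconds_per_window = refresh_days * 24 * 60 * 60
    return pages * crawlers / seconds_per_window


# Hypothetical site: 1,000,000 pages crawled by 100 independent search engines.
daily = crawl_requests_per_second(1_000_000, 100, refresh_days=1)
monthly = crawl_requests_per_second(1_000_000, 100, refresh_days=30)

print(f"daily refresh:   {daily:,.1f} requests/second")
print(f"monthly refresh: {monthly:,.1f} requests/second")
```

With these made-up inputs, daily freshness works out to roughly 1,157 requests per second on average, while a 30-day window is about 39. If those requests hit an uncached search backend instead of static pages, the cost per request is much higher.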

fallingknife 410 days ago

Where is the cost coming from? Wouldn't a crawler mostly just be accessing cached static assets served by a CDN?

And what do you mean by your search infrastructure? Are you talking about elasticsearch or some equivalent?

everforward 410 days ago

No, in our case they were indexing job posts by sending search requests. I.e., instead of pulling down the JSON files of jobs, they would search for them by sending stuff like “New York City, New York software engineer” to our search. Generally not cached because the searches weren’t something humans would search for (they’d use the location drop-down).

I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.

b112 411 days ago

One problem: it leaves a single place to censor.

I agree that each front end should do it, but you can bet it will be a core service.

vasco 411 days ago

> The Internet Archive can persist the data for ~$2/GB in perpetuity

No they can't, but do you have a source?

toomuchtodo 411 days ago

https://help.archive.org/help/archive-org-information/ and first hand conversations with their engineering team

> We estimate that permanent storage costs us approximately $2.00US per gigabyte.

https://webservices.archive.org/pages/vault/

> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.

https://blog.dshr.org/2017/08/economic-model-of-long-term-st...
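For a sense of scale, the quoted estimate is simple to apply. This is only arithmetic on the ~$2.00/GB figure above; the 100 TB dataset size is a hypothetical input, not the size of any actual crawl, and decimal units (1 TB = 1,000 GB) are assumed.

```python
COST_PER_GB_USD = 2.00  # one-time estimate quoted from the Internet Archive help page


def perpetual_storage_cost_usd(size_gb, cost_per_gb=COST_PER_GB_USD):
    """One-time cost to store size_gb gigabytes at a flat per-gigabyte price."""
    return size_gb * cost_per_gb


# Hypothetical dataset: 100 TB, using decimal units (1 TB = 1,000 GB).
example_size_gb = 100 * 1000
example_cost = perpetual_storage_cost_usd(example_size_gb)

print(f"{example_size_gb:,} GB -> ${example_cost:,.2f}")
```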

dmoy 410 days ago

What's the read throughput to get the data back out, and does it scale to what you'd need to have N search indexes building on top of this shared crawl?

adgjlsfhk1 410 days ago

They could charge data processing costs for reads.