
by mullingitover 410 days ago

Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site, so googlebot wins because they’re the dominant search engine.

It makes sense to break that out so everyone has access to the same dataset at FRAND pricing.

My heart just wants Google to burn to the ground, but my brain says this is the more reasonable approach.

9 comments

toomuchtodo 410 days ago

https://commoncrawl.org/

This is similar to the natural monopoly of root DNS servers (managed as a public good). There is no reason more money couldn't go into either Common Crawl, or something like it. The Internet Archive can persist the data for ~$2/GB in perpetuity (although storing it elsewhere is also fine imho) as the storage system of last resort. How you provide access to this data is, I argue, similar to how access to science datasets is provided by custodian institutions (examples would be NOAA, CERN, etc).

Build foundations on public goods, very broadly speaking (think OSI model, but for entire systems). This helps society avoid the grasp of Big Tech and their endless desire to build moats for value capture.

mullingitover 410 days ago

The problem with this is in the vein of `Requires immediate total cooperation from everybody at once` if it's going to replace googlebot. Everyone who only allows googlebot would need to change and allow ccbot instead.

It's already the case that googlebot is the common denominator bot that's allowed everywhere, ccbot not so much.
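As a rough sketch of the situation described here, the following uses Python's standard `urllib.robotparser` to evaluate a robots.txt that admits Googlebot and turns every other crawler away. The robots.txt contents, the `can_crawl` helper, and the example.com URL are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot is allowed everywhere, all other bots nowhere.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent, url="https://example.com/page"):
    """Return True if the sample robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "CCBot"):
        print(bot, can_crawl(bot))
```

Under a policy like this, a well-behaved CCBot simply stays away, which is why every such site would have to edit its rules before a shared crawler could stand in for Googlebot.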

xp84 410 days ago

Wouldn’t a decent solution, if some action happened where Google was divesting the crawler stuff, be to just do like browser user agents have always done (in that case multiple times to comical degrees)? Something like ‘Googlebot/3.1 (successor, CommonCrawl 1.0)’
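To make the idea concrete, here is a minimal sketch of why a compatibility-style user agent could work, assuming the site's allowlist does a case-insensitive substring match on the string `googlebot`. The `ua_allowed` helper and its default allowlist are hypothetical; sites that also verify the source IP would not be fooled by the string alone.

```python
def ua_allowed(user_agent, allowed_tokens=("googlebot",)):
    """Naive allowlist: accept any user agent containing an allowed token."""
    ua = user_agent.lower()
    return any(token in ua for token in allowed_tokens)


if __name__ == "__main__":
    print(ua_allowed("Googlebot/3.1 (successor, CommonCrawl 1.0)"))
    print(ua_allowed("CCBot/2.0"))
```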

toomuchtodo 410 days ago

Lots of good replies to your comment already. I'd also offer up Cloudflare offering the option to crawl customer origins, with them shipping the compressed archives off to Common Crawl for storage. This gives site admins and owners control over the crawling, and reduces unnecessary load as someone like Cloudflare can manage the crawler worker queue and network shipping internally.

(Cloudflare customer, no other affiliation)

kzrdude 410 days ago

That says that if google switches over to ccbot then the rest will follow.

CPLX 410 days ago

I mean if it’s created as part of setting the global rules for the internet you could just make it opt out.

sanderjd 410 days ago

Wait, is the suggestion here just about crawling and storing the data? That's a very different thing than "Google's search index"... And yeah, I would agree that it is undifferentiated.

toomuchtodo 410 days ago

If you have access to archived crawls, anyone can build and serve an index, or model weights (gpt).

fallingknife 410 days ago

Hosting costs are so minimal today that I don't think crawling is a natural monopoly. How much would it really cost a site to be crawled by 100 search engines?

everforward 410 days ago

A potentially shocking amount depending on the desired freshness if the bot isn’t custom tailored per site. I worked at a job posting site and Googlebot would nearly take down our search infrastructure because it crawled jobs via searching rather than the index.

Bots are typically tuned to work with generic sites rather than to crawl any particular site efficiently.

fallingknife 410 days ago

Where is the cost coming from? Wouldn't a crawler mostly just be accessing cached static assets served by a CDN?

And what do you mean by your search infrastructure? Are you talking about elasticsearch or some equivalent?

everforward 410 days ago

No, in our case they were indexing job posts by sending search requests. I.e., instead of pulling down the JSON files of jobs, they would search for them by sending stuff like “New York City, New York software engineer” to our search. Generally not cached because the searches weren’t something humans would search for (they’d use the location drop down).

I didn’t work on search, but yeah, something like Elasticsearch. Googlebot was a majority of our search traffic at times.

b112 410 days ago

One problem, it leaves one place to censor.

I agree that each front end should do it, but you can bet it will be a core service.

vasco 410 days ago

> The Internet Archive can persist the data for ~$2/GB in perpetuity

No they can't but do you have a source?

toomuchtodo 410 days ago

https://help.archive.org/help/archive-org-information/ and first hand conversations with their engineering team

> We estimate that permanent storage costs us approximately $2.00US per gigabyte.

https://webservices.archive.org/pages/vault/

> Vault offers a low-cost pricing model based on a one-time price per-gigabyte/terabyte for data deposited in the system, with no additional annual storage fees or data egress costs.

https://blog.dshr.org/2017/08/economic-model-of-long-term-st...

dmoy 410 days ago

What's the read throughput to get the data back out, and does it scale to what you'd need to have N search indexes building on top of this shared crawl?

adgjlsfhk1 410 days ago

they could charge data processing costs for reads

shadowgovt 410 days ago

Of all the bad ideas I've heard of where to slice Google to break it up, this... is actually the best idea.

The indexer, without direct Google influence, is primarily incentivized to play nice with site administrators. This gives them reasons to improve consideration of both network integrity and privacy concerns (though Google has generally been good about these things, I think the damage is done regarding privacy that the brand name is toxic, regardless of the behaviors).

oceanplexian 410 days ago

> Crawling the internet is a natural monopoly.

How so?

A caching proxy costs you almost nothing and will serve thousands of requests per second on ancient hardware. Actually there's never been a better time in the history of the Internet to have competing search engines since there's never been so much abundance of performance, bandwidth, and software available at historic low prices or for free.

sokoloff 410 days ago

Costs almost nothing, but returns even less.*

There are so many other bots/scrapers out there that literally return zero that I don’t blame site owners for blocking all bots except googlebot.

Would it be nice if they also allowed altruist-bot or common-crawler-bot? Maybe, but that’s their call and a lot of them have made it on a rational basis.

* - or is perceived to return

threeseed 410 days ago

> that I don’t blame site owners for blocking all bots except googlebot

I run a number of sites with decent traffic and the amount of spam/scam requests outnumbers crawling bots 1000 to 1.

I would guess that the number of sites allowing just Googlebot is 0.

Aurornis 410 days ago

> that I don’t blame site owners for blocking all bots except googlebot.

I doubt this is happening outside of a few small hobbyist websites where crawler traffic looks significant relative to human traffic. Even among those, it’s so common to move to static hosting with essentially zero cost and/or sign up for free tiers of CDNs that it’s just not worth it outside of edge cases like trying to host public-facing Gitlab instances with large projects.

Even then, the ROI on setting up proper caching and rate limiting far outweighs the ROI on trying to play whack-a-mole with non-Google bots.

Even if someone did go to all the lengths to try to block the majority of bots, I have a really hard time believing they wouldn’t take the extra 10 minutes to look up the other major crawlers and put those on the allow list, too.

This whole argument about sites going to great lengths to block search indexers but then stopping just short of allowing a couple more of the well-known ones feels like mental gymnastics for a situation that doesn’t occur.

fc417fc802 410 days ago

> sites going to great lengths to block search indexers

That's not it. They're going to great lengths to block all bot traffic because of abusive and generally incompetent actors chewing through their resources. I'll cite that anubis has made the front page of HN several times within the past couple months. It is far from the first or only solution in that space, merely one of many alternatives to the solutions provided by centralized services such as cloudflare.

luckylion 410 days ago

Regarding allowlisting the other major crawlers: I've never seen any significant amount of traffic coming from anything but Google or Bing. There's the occasional click from one of the resellers (ecosia, brave search, duckduckgo etc), but that's about it. Yahoo? haven't seen them in ages, except in Japan. Baidu or Yandex? might be relevant if you're in their primary markets, but I've never seen them. Huawei's Petal Search? Apple Search? Nothing. Ahrefs & friends? No need to crawl _my_ website, even if I wanted to use them for competitor analysis.

So practically, there's very little value in allowing those. I usually don't bother blocking them, but if my content wasn't easy to cache, I probably would.

Onavo 410 days ago

In the past month there were dozens of posts about using proof of work and other methods to defeat crawlers. I don't think most websites tolerate heavy crawling in the era of Vercel/AWS's serverless "per request" and bandwidth billing.

stackskipton 410 days ago

Not everyone wants to deal with a caching proxy, because they think the load on their site under normal operations is fine if it's rendered server side.

immibis 410 days ago

You don't get to tell site owners what to do. The actual facts on the ground are that they're trying to block your bot. It would be nice if they didn't block your bot, but the other, completely unnatural and advertising-driven, monopoly of hosting providers with insane per-request costs makes that impossible until they switch away.

AlexandrB 410 days ago

They try to block your bot because Google is a monopoly and there's little to no cost for blocking everything except Google.

This isn't a "natural" monopoly, it's more like Internet Explorer 6.0 and everyone designing their sites to use ActiveX and IE-specific quirks.

luckylion 410 days ago

One possible answer: pay them for their trouble until you provide value to them, e.g. by paying some fraction of a cent for each (document) request.

BobaFloutist 410 days ago

Cool, you wanna solve micropayments now or wait until we've got cold fusion rolling first...?

luckylion 409 days ago

You wouldn't have to make them micropayments, you can pay out once some threshold is reached.

Of course, it would incentivize the sites to make you want to crawl them more, but that might be a good thing. There would be pressure on you to focus on quality over quantity, which would probably be a good thing for your product.
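A minimal sketch of the threshold idea, assuming a crawler operator keeps a per-site balance and pays out only once it crosses a limit. The `CrawlLedger` class, the micro-cent unit, and the example prices are all hypothetical; integers are used so that fractions of a cent accumulate without floating-point drift.

```python
from collections import defaultdict


class CrawlLedger:
    """Accrue a tiny per-request fee per site; pay out once a threshold is reached."""

    def __init__(self, price_microcents, payout_threshold_microcents):
        self.price = price_microcents
        self.threshold = payout_threshold_microcents
        self.balances = defaultdict(int)

    def record_request(self, site):
        """Record one crawled document; return the payout due now (0 if none)."""
        self.balances[site] += self.price
        if self.balances[site] >= self.threshold:
            amount = self.balances[site]
            self.balances[site] = 0  # balance is settled in one larger payment
            return amount
        return 0


if __name__ == "__main__":
    ledger = CrawlLedger(price_microcents=250, payout_threshold_microcents=1000)
    print([ledger.record_request("example.com") for _ in range(5)])
```

Batching this way sidesteps per-request payments, though it still leaves open how a crawler and a site would agree on a price in the first place.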

threeseed 410 days ago

> The actual facts on the ground are that they're trying to block your bot

Based on what evidence?

immibis 408 days ago

based on them matching the user-agent and sending you a block page? I don't know what else to tell you. It's in plain sight.

hkpack 410 days ago

Most of the tech is set for being a monopoly due to the negligible variable cost associated with serving a customer.

Thus being even slightly in front of others is reinforced and the gap only widens.

tananaev 410 days ago

Google search is a monopoly not because of crawling. It's because of all the data it knows about website stats and user behavior. Original Google idea of ranking based on links doesn't work because it's too easily gamed. You have to know what websites are good based on user preferences and that's where you need to have data. It's impossible to build anything similar to Google without access to large amounts of user data.

luckylion 410 days ago

Sounds like you're implying that they are using Google Analytics to feed their ranking, but that's much easier to game than links are. User-signals on SERP clicks? There's a niche industry supplying those to SEOs (I've seen it a few times, I haven't seen it have any reliable impact).

AtlasBarfed 410 days ago

Page ranking sounds like a perfect application of artificial intelligence.

If China can apply it for total information awareness on their population, Google can apply it on page reliability

fc417fc802 410 days ago

I'm fairly certain many people have already tried to apply magical AI pixie dust to this problem. Presumably it isn't so simple in practice.

wslh 410 days ago

> so googlebot wins because they’re the dominant search engine.

I think it's also important to highlight that sites explicitly choose which bots to allow in their robots.txt files, prioritizing Google, which reinforces its position as the de facto monopoly even when other bots are technically able to crawl them.

1vuio0pswjnm7 410 days ago

CommonCrawl is not a valid comparison. Most robots.txt target CCBot.

Aurornis 410 days ago

> Crawling the internet is a natural monopoly. Nobody wants an endless stream of bots crawling their site,

Companies want traffic from any source they can get. They welcome every search engine crawler that comes along because every little exposure translates to incremental chances at revenue or growing audience.

I doubt many people are doing things to allow Googlebot but also ban other search crawlers.

> My heart just wants Google to burn to the ground

I think there’s a lot of that in this thread and it’s opening the door to some mental gymnastics like the above claim about Google being the only crawler allowed to index the internet.

nulld3v 410 days ago

> I doubt many people are doing things to allow Googlebot but also ban other search crawlers.

Sadly this is just not the case.[1][2] Google knows this too so they explicitly crawl from a specific IP range that they publish.[3]

I also know this, because I had a website that blocked any bots outside of that IP range. We had honeypot links (hidden to humans via CSS) that insta-banned any user or bot that clicked/fetched them. User-Agent from curl, wget, or any HTTP lib = insta-ban. Crawling links sequentially across multiple IPs = all banned. Any signal we found that indicated you were not a human using a web browser = ban.

We were listed on Google and never had traffic issues.

[1] https://onescales.com/blogs/main/the-bot-blocklist

[2] Chart in the middle of this page: https://blog.cloudflare.com/declaring-your-aindependence-blo... (note: Google-Extended != Googlebot)

[3] https://developers.google.com/search/docs/crawling-indexing/...
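A minimal sketch of the IP-range check described above, using Python's standard `ipaddress` module. The CIDR blocks below are reserved documentation ranges used as placeholders, not Google's real ranges; a real deployment would load the list the crawler operator publishes. The `is_from_published_range` helper is a hypothetical name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for a published list.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def is_from_published_range(client_ip, cidr_ranges=PLACEHOLDER_RANGES):
    """Return True if client_ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(client_ip)
    # Membership tests across IPv4/IPv6 simply evaluate to False.
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.7", "2001:db8::1"):
        print(ip, is_from_published_range(ip))
```

A request that claims to be Googlebot in its user agent but fails a check like this can be treated the same as any other unverified bot.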

mattmaroon 410 days ago

Are sites really that averse to having a few more crawlers than they already do? It would seem that it’s only a monopoly insofar as it’s really expensive to do and almost nobody else thinks they can recoup the cost.

natebc 410 days ago

A few?

We routinely are fighting off hundreds of bots at any moment. Thousands and thousands per day, easily. US, China, Brazil from hundreds of different IPs, dozens of different (and falsified!) user agents all ignoring robots.txt and pushing over services that are needed by human beings trying to get work done.

EDIT: Just checked our anubis stats for the last 24h

CHALLENGE: 829,586

DENY: 621,462

ALLOW: 96,810

This is with a pretty aggressive "DENY" rule for a lot of the AI related bots and on 2 pretty small sites at $JOB. We have hundreds, if not thousands of different sites that aren't protected by Anubis (yet).

Anubis and efforts like it are a godsend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.
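To put the counts quoted above in proportion, here is a small worked calculation. It assumes the three counters are mutually exclusive outcomes for the same 24-hour window, which the comment does not state explicitly; the `outcome_shares` helper is a hypothetical name.

```python
# Counts as quoted in the comment above.
counts = {"CHALLENGE": 829_586, "DENY": 621_462, "ALLOW": 96_810}


def outcome_shares(counts):
    """Return each outcome's fraction of the total number of decisions."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


shares = outcome_shares(counts)

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

Under that assumption, only roughly 6% of decisions were an outright allow.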

zrm 410 days ago

This seems like two different issues.

One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Uploading each resource on the site to each of a thousand bots once is going to be irrelevant to a site serving millions of users (because it's a trivial percentage) and to a site with a small amount of content (because it's a small absolute number), which together constitute the vast majority of all sites.
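One existing approximation of such a facility is the optional `<lastmod>` field in XML sitemaps. The sketch below, under that assumption, filters a sitemap down to URLs modified after a given time; the `changed_since` helper and the sample sitemap are illustrative, and it only helps when a site actually maintains accurate `<lastmod>` values.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-03-05T08:30:00+00:00</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""


def changed_since(sitemap_xml, since):
    """Return URLs whose <lastmod> is later than `since` (a timezone-aware datetime)."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc is None or lastmod is None:
            continue  # no modification date published; cannot diff this entry
        modified = datetime.fromisoformat(lastmod.strip())
        if modified.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > since:
            changed.append(loc.strip())
    return changed


if __name__ == "__main__":
    cutoff = datetime(2024, 2, 1, tzinfo=timezone.utc)
    print(changed_since(SAMPLE_SITEMAP, cutoff))
```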

The other is, there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed and ignore robots.txt. But then you set traps like disallowing something in robots.txt and then ban anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.
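A minimal sketch of that trap, assuming robots.txt disallows a path prefix that no legitimate crawler or human would request. The `/trap/` prefix, the `TrapFilter` class, and the status codes are hypothetical choices; the ban is keyed on IP address only.

```python
class TrapFilter:
    """Ban any client IP that requests a path disallowed in robots.txt."""

    def __init__(self, trap_prefix="/trap/"):
        self.trap_prefix = trap_prefix
        self.banned_ips = set()

    def handle(self, client_ip, path):
        """Return an HTTP-style status code for the request."""
        if client_ip in self.banned_ips:
            return 403
        if path.startswith(self.trap_prefix):
            # Crawlers that respect robots.txt never get here.
            self.banned_ips.add(client_ip)
            return 403
        return 200


if __name__ == "__main__":
    f = TrapFilter()
    print(f.handle("203.0.113.9", "/index.html"))
    print(f.handle("203.0.113.9", "/trap/secret"))
    print(f.handle("203.0.113.9", "/index.html"))
```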

fc417fc802 410 days ago

> then you set traps like disallowing something in robots.txt and then ban anything that tries to access it

That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis. All you can be certain of is that a significant portion of your traffic is abusive.

That results in aggressive filtering schemes which in turn means permitted bots must be whitelisted on a case by case basis.

zrm 409 days ago

> That doesn't work at all when the scraper rapidly rotates IPs from different ASNs because you can't differentiate the legitimate from the abusive traffic on a per-request basis.

Well sure you can. If it's requesting something which is allowed in robots.txt, it's a legitimate request. It's only if it's requesting something that isn't that you have to start trying to decide whether to filter it or not.

What does it matter if they use multiple IP addresses to request only things you would have allowed them to request from a single one?

fc417fc802 408 days ago

> If it's requesting something which is allowed in robots.txt, it's a legitimate request.

An abusive scraper is pushing over your boxes. It is intentionally circumventing rate limits and (more generally) accurate attribution of the traffic source. In this example you have deemed such behavior to be abusive and would like to put a stop to it.

Any given request looks pretty much normal. The vast majority are coming from residential IPs (in this example your site serves mostly residential customers to begin with).

So what if 0.001% of requests hit a disallowed resource and you ban those IPs? That's approximately 0.001% of the traffic that you're currently experiencing. It does not solve your problem at all: the excessive traffic that is disrespecting rate limits and gumming up your service for other well-behaved users.

mattmaroon 409 days ago

I mean sure, but if there were 3 search engines instead of one would you disallow two of them? The spam problem is one thing, but I don't think having ten search engines rather than two is going to destroy websites.

The claim that search is a natural monopoly because of the impact on websites of having a few more search competitors scanning them seems silly. I don’t think it’s a natural monopoly at all.

robinsonb5 410 days ago

A "few" more would be fine - but the sheer scale of the malicious AI training bot crawling that's happening now is enough to cause real availability problems (and expense) for numerous sites.

One web forum I regularly read went through a patch a few months ago where it was unavailable for about 90% of the time due to being hammered by crawlers. It's only up again now because the owner managed to find a way to block them that hasn't yet been circumvented.

So it's easy to see why people would allow googlebot and little else.