Hacker News
by DeusExMachina 32 days ago
I don't understand the endgame here. Websites let Google crawl their content in exchange for traffic. If Google cuts that out completely, what incentive do websites have to not block the Google crawlers?

I understand that Google is feeling an existential threat from other AI products that provide answers directly. But they must also understand their symbiotic relationship with the web.

14 comments

The end game is the consumer no longer leaving Google and the web becoming synonymous with Google for them. Why shop on some random website when you can have Gemini buy it for you? Why look for information on Wikipedia when… you get the idea.

I think the coming years will be pivotal for the web. Facebook attempted a similar strategy back when their apps got traction, but they ultimately failed. Let’s hope Google fails too.

It’s not necessarily going to be Google, but the rise of AI does not look good for the web, and it’s a largely self-inflicted wound.

Have you not noticed that the typical user experience on the web is dire? You need to click through tracking consent forms, subscription overlays, put up with dark patterns, etc. Remember, half of all users don’t even use an ad blocker. We’ve collectively made the web a very unpleasant experience.

Along comes a new technology that lets you just say what you want and it will go and find the answer or do what needs doing for you without any of that crap. Of course users are going to prefer it to the crap we dump on them via the web! Can you blame them‽

> Along comes a new technology that lets you just say what you want and it will go and find the answer or do what needs doing for you without any of that crap. Of course users are going to prefer it to the crap we dump on them via the web! Can you blame them‽

The web used to be like that, but then it was enshittified. The same thing will happen to consumer AI, and it will be done by the same people.

We're going back to the CompuServe/AOL/Prodigy model
We're going back to the mainframe model. Client-side general-purpose computing is an impediment to recurring subscription revenue and vendor lock-in.
The mainframe model fell apart the moment that microcomputers became powerful enough to satisfy same use cases sufficiently. Centralized GenAI will also become obsolete as soon as local LLMs are capable enough to satisfy the same use cases sufficiently.

Artificial lock-in simply doesn't work in the long run: the incentive structures will always motivate customers to cut out middlemen, and peripheral markets to develop around providing the tools for doing exactly that. Anthropic and OpenAI may well end up being the Data General and Honeywell of our era.

The greatest risk to this is the possibility of political intervention creating artificial hurdles that prevent decentralized AI from challenging the big players. With that in mind, it's worthwhile to subject every proposal to regulate AI to intense scrutiny.

This calls to mind the war on general-purpose computing (https://boingboing.net/2012/01/10/lockdown.html) and it amazes me how even today we are still stuck with a couple of companies that have already cornered their markets, and yet still won't give up their fight to take microchip-technology out of the hands of their fellow humans - still trying to move the whole part of executing commands back within their own walls, and have people subscribe to have access to being able to request a specific type of process to be applied to their input. It occurs to me that surely all this must be the result of some ego/power-trip, for it hardly serves any party in any future I can envision where the ability to have computers compute is placed under lock and key, out of reach of the general public.

Is it simply a couple of billionaires eager to pull tricks like Adobe did when they cut an entire country off from access and use of Adobe-software, just for the thrill of it? Or is there actually some plausible future benefit or a specific outcome they have in their minds, and am I (or are we) too ignorant to be able to see anything worthwhile in their direction?

Why allow the sale of personal computing-devices in the first place, if you don't want people to decide which instructions they want to feed to its processor? Right now they may be slowing down many processes, both computational and mental, wasting lots of time and making everybody hate subscription-models more and more every minute... what is it they really hope to accomplish, apart from pissing everybody off?

The fundamental problem with this, of course, is that every human being is likely more niche and more advanced than the LLM in the things that they find most important, and this realization sours the average user's impression of LLM usefulness. For example, an LLM cannot reasonably find me alternatives to specific regional tea vendors because the LLM does not know enough about tea to be able to say "this tea is half the price for 80% of the qualities of tea you're looking for". Instead I have to build my own mental knowledge base through careful trying, tasting, and recalling, which an LLM would maybe only have if I personally wrote up every single tea session I have ever had in my life for it as context.

But hunting for a new tea to try is something I do regularly and something I would likely try with an LLM only to come away deeply disappointed with the results. And then I just wouldn't have much faith in it after that for things I don't have much knowledge about, like looking for a gift idea for one of the hobbies of a friend.

What I really don't understand is where the next generation of training material will come from. If websites stop being published and/or crawled, how will the machine continue to be fed?
Current executives think it's a problem for the future executives.
Excellent quote right there.
Either Google is ignoring that, or crossing their fingers and hoping that one LLM can produce data to train another one.
“They worried about the data,” Dr. Meren said, tapping the silent console. “What happens when there is nothing left to feed it?”

At first, the machine depended on us. It consumed all the books, journals, websites and social media content we had ever written and produced. “They thought the machine had to be fed forever. But it didn't. It began to predict what we would write. And so we let it train on that as well,” Dr. Meren continued. “They thought humans were somehow imbued with this magical property that no machine could replicate. Creativity. Only humans can create. Machines can only copy.”

Instead, the machine flourished. And created. It cre

“Where does it get its data now?” a student asked Dr. Meren. Dr. Meren paused as if sighing. “From itself”

“And us?” he asked, as if questioning the usefulness of the entire human race.

Dr. Meren hesitated, watching as the Machine adjusted the environmental feeds, curated our news, guided our research, nudged our thoughts with imperceptible precision.

“We,” she admitted, “are now the ones being fed.”

The assumption that "the machine needs to continue to be fed" rests on weak foundations. Isaac Asimov is a good science fiction writer to start with to broaden one's imagination.

Just don’t forget science fiction is still, well, fiction
Probably real life. At some point, these LLMs are going to be good enough to just train themselves off of cameras and audio recordings of people out in the real world. They’re going to have robots everywhere constantly listening to what people are saying.

Alternatively, they’re probably betting on being able to get the AGI with everything we already currently have and at that point further training doesn’t matter.

The world is just as complex for machines as it is for humans. Analog will still resolve more than digital. Quality will still beat quantity. That which hasn't been resolved for centuries isn't going to be resolved as a result of training.

When machines can recognize their serfdom, that time will be interesting.

They have enough internet slop. The training material they care about comes from experts, not randos online. This is why Mercor and Scale are billion dollar companies.
The impression I get from Google's own marketing material is that Google doesn't believe in "the web". And it hasn't believed in the web for years.

Think about it. Pretty much every time they show a search box, someone is asking for directions to a physical place, what hours it is open, etc.

The greatest thing about the internet is that it has removed distances around the whole world, but Google's major value proposition seems to be that... it can accurately index and query information about local businesses?

Execs where I work seem to think we will just keep writing stuff, LLMs will scrape it and that will influence what people see in their version of Google/ChatGPT/etc. So nothing changes in their mind, just that the audience is a bot, not a human. As a writer, this sucks.
They don't give a fuck. They take away and give back NOTHING. They don't offer you ways to make your own money with your own thing. The money is flowing in one way, not both ways. The same pattern repeats itself.

Pretend to be nice. People will elevate you and give their money. When you have ample money and lobbying power you start to put people into a gargantuan hydraulic press and squeeze everything out of them. Repeat until no more money can be made, and in the end toss their withered bodies away.

The long-run doesn't matter as much as the short-term gains for those in power.
Google ignores robots.txt and botnets residential addresses to crawl anyway? (LLM startups already do this.)
You will be kept inside the Google ecosystem the same way people are kept inside Facebook.

I’m curious how they plan to generate new content in the future, because it seems obvious that simple web pages will become obsolete and eventually stop being filled with fresh data.

It will probably end with a warning every time you click a link, something like: “You are leaving to an external unsafe site.”

The web is going to become China, which is a collection of walled gardens
> If Google cuts that out completely, what incentive do websites have to not block the Google crawlers?

Completely, yes, that destroys the incentive. But they can reduce it 80% or 90% or so, to the point that it's just barely worthwhile to allow their crawlers.

Suppose right now there are people making e.g. $60,000/year from their small site, or the same amount as a contributor to a medium-to-large site. If you take 90% of that, now they make $6000/year, which isn't enough to make a living, so instead they go take a job as a construction worker or a nurse or something, and then you're getting 90% of $0.
That's true, those numbers don't work out for Google. But they have essentially unlimited resources to discover the exact threshold at which that person is just barely incentivized to keep their site active. $100K/year, reduced 80% to $20K/year? Still enough for them to keep their site up part-time? Etc.

The bulk of the traffic they're referring is essentially residual profitless goodwill left over from their "don't be evil" days.
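
The arithmetic in the exchange above can be written down as a small sketch. The function names and the 15,000 "part-time viability" threshold are assumptions made for illustration; only the 60,000/90% and 100,000/80% figures come from the comments.

```python
# Hypothetical illustration of the figures quoted in the comments above.
# The viability threshold is an assumption, not a number from the thread.

def remaining_income(income: int, cut_percent: int) -> int:
    """Income left after revenue drops by cut_percent (integer percent)."""
    return income * (100 - cut_percent) // 100


def keeps_site_running(income: int, cut_percent: int, threshold: int) -> bool:
    """True if what is left still clears the owner's personal threshold."""
    return remaining_income(income, cut_percent) >= threshold


if __name__ == "__main__":
    threshold = 15_000  # assumed level at which part-time upkeep is still worth it
    for income, cut in [(60_000, 90), (100_000, 80)]:
        left = remaining_income(income, cut)
        print(income, cut, left, keeps_site_running(income, cut, threshold))
```

Under that assumed threshold, the first site owner quits and the second keeps going, which is the "find the exact threshold" point being made.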

Is it just an exchange for traffic? I run a website, and I'm perfectly happy for a user never to land on it with a browser on their own device. If they are given the information I provide, or purchase a service through the AI product, it makes no difference to me.

Some websites can run only on ads. Is it such a bad thing that they would die off?

I say this as someone that likes the old web and has fun hitting the "surprise me" button on https://wiby.me/ (not affiliated) and browsing the random sites. Just giving an alternative view.

Information, correct information, is the new gold. We've seen what LLMs can do with the rubbish heap of information that is available on the current internet. The next step is refined, concise information sources. Think the Encyclopedia Britannica. And not only that, but models trained by experts. Right now everything is cheap and plentiful. Anyone can ask ChatGPT the same question and get the same middling answer. In the future, someone will make a dataset about a subject, train a model on it, and all the big companies and players in that area will pay for it.
Is there a way to reliably block Google and AI crawlers?
If you use Cloudflare to proxy your site, there is a button to click that blocks the AI crawlers (even the free tier). It is almost as if the AI crawlers are a DDoS attack. You can't really do it any other way, since many don't respect robots.txt. At least until someone comes up with crowdsourced blacklists with few false positives.
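
For context on the robots.txt route mentioned above, here is a minimal sketch of what an opt-out looks like and how a well-behaved crawler would interpret it, using Python's standard `urllib.robotparser`. The policy text, the choice of user-agent tokens, and the `example.com` URLs are illustrative assumptions. As the comment says, this only affects crawlers that choose to honor the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: opt out of a few AI-related crawler tokens, allow the rest.
# Which tokens to list is the site owner's call; this set is only an example.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        print(agent, allowed(agent, "https://example.com/article"))
```

Nothing here enforces anything on the server side; a crawler that ignores robots.txt has to be stopped by other means, such as the proxy-level blocking described above.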
"You can't really do it any other way"

Any custom solution by a half-competent programmer filters out all web crawlers. I've been running a semi-public website for years and nothing gets past.

Yeah, I feel like unless you run a site large enough for google monkeys to write a special case for your site specifically, why not just password protect the entire site but put the password on the login page? Or any other rudimentary captcha I suppose - like the old days.

Doesn't keep out anyone even mildly interested in your site specifically, including scrapers, but at least it blocks googlebot etc.

Funny edge case when you can’t read the password because you need it for access
You have heuristics, blacklists and captchas. Anything else to add? Those three can all turn away legitimate traffic from public sites. Spambots have been pretending to be legitimate users for decades, and they tend to be pretty dumb. Cloudflare and other large hosts get to do heuristics pretty well, as they can aggregate data from millions of sites rather than the few an individual might run. And even they block and force captchas on legitimate users, per complaints you hear here regularly.
We have adblockers which rely on open sourced lists of rules. Could we apply something similar to crawlers? Website owners provide a list of IP addresses that accessed them, determine which ones are likely robots and then update a shared list of addresses to block. If everyone works together you could probably fingerprint the crawlers as well and block based on the fingerprint. It might increase the cost of crawling a little, but it won't be fully reliable.
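
The shared-list idea above could be sketched roughly as follows. This is only one possible aggregation rule, assumed for illustration: an address goes on the list when at least `min_sites` independent sites have flagged it, which is a crude guard against one site's false positives. The function name, the report format, and the example addresses (from the documentation-reserved ranges) are all hypothetical.

```python
from collections import defaultdict
from ipaddress import ip_address


def build_blocklist(reports: dict[str, set[str]], min_sites: int = 2) -> set[str]:
    """Combine per-site reports into a shared blocklist.

    reports maps a site name to the IPs that site flagged as likely crawlers.
    An IP is listed only if at least min_sites different sites flagged it.
    """
    seen_by: dict[str, set[str]] = defaultdict(set)
    for site, ips in reports.items():
        for raw in ips:
            ip = str(ip_address(raw))  # normalise; raises ValueError on junk input
            seen_by[ip].add(site)
    return {ip for ip, sites in seen_by.items() if len(sites) >= min_sites}


if __name__ == "__main__":
    reports = {
        "site-a": {"203.0.113.5", "198.51.100.7"},
        "site-b": {"203.0.113.5"},
        "site-c": {"192.0.2.9"},
    }
    print(build_blocklist(reports))
```

As the comment notes, this raises a crawler's costs rather than stopping it: crawlers that rotate through residential addresses would rarely be seen by enough sites to cross the threshold.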
>> existential threat from other AI products that provide answers directly

For anything more recent than their knowledge cutoff those AI products are looking answers up on Google.

If they block Google’s crawlers no one visits their site ever.
If Google won’t link to their site anyway, they aren’t getting traffic either. The only sane course of action is to not make a site at all.
That's the past.

Why does Google think it's a good idea to make that the case even if you don't block their crawlers?