by blibble 273 days ago
every major news org now blocks the parasitic "AI" crawlers

examples:

    https://www.bbc.co.uk/robots.txt
    https://www.cnn.com/robots.txt
    https://www.nbcnews.com/robots.txt

all they will be training on now is spam

anyone that says "AI is the worst today it will ever be", no

because that was before the world reacted to it

5 comments

They torrented a shitload of books illegally and trained on them.. but they're unable to get past The Great Wall of robots.txt?
I doubt it's their only countermeasure

plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

> plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

Like book publishers?

If the AI crawlers circumvent the protection mechanisms it's a serious crime now rather than just "Well it was on the open internet for free". Wouldn't surprise me if the news orgs are also looking at honeypot articles to see if the fake details slip into LLMs.
It's not a serious crime, or any crime at all, to ignore robots.txt. It's entirely voluntary whether you want to follow it or not. If you don't, you're being a dick maybe, but that's not a crime.
It's not just robots.txt. If you've tried using a VPN lately, so many sites like reddit/youtube/etc block you from viewing content until you log in. Every major website has been adding anti-scraping tech over the last year. Even archive.org is getting blocked by more and more sites, since it can be used for indirect scraping.
robots.txt prevents real time search use for grounding and citations.
No it doesn't. It has zero legal force. Or any technical force either.
Not an expert so I ask: no technical force either? Is it just a polite ask then?
It's hardly even a polite ask. It's literally a text file. Automated HTTP clients, such as search engine indexers (Google, Yahoo, etc.), are expected to use it to know which pages can be visited. That expectation is nothing more than a convention.

If you are on a Mac or Linux computer, odds are it has a program called curl pre-installed. If you type curl followed by a website address in a terminal, it'll make a request and download the response. robots.txt never gets involved. The same is true for AI agents and search engines that aren't polite.
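
To make the "convention" point concrete, here is a minimal sketch of what a polite client does, using Python's standard-library robots.txt parser. The robots.txt text, the crawler name ExampleAIBot, and the example.com URL are all made up for illustration. The check only happens because the client chooses to run it; a client that skips the call gets the page anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "ExampleAIBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what robots.txt asks of this user agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/news/story"  # hypothetical URL

# A polite crawler asks first and backs off when the answer is False.
blocked_bot = is_allowed(ROBOTS_TXT, "ExampleAIBot", url)
other_bot = is_allowed(ROBOTS_TXT, "SomeOtherBot", url)
print(blocked_bot, other_bot)

# An impolite crawler never calls is_allowed() at all. Nothing on the
# server side enforces the file; it is only ever read by the client.
```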

LinkedIn lost their anti-scraping suit: https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin... but it seems they have since been able to successfully appeal that decision.

Regardless - requiring an account to read anything, even a "free" one, totally changes the whole situation, even when a site's terms of service are limited by local law.

Correct. Literally just a polite ask.
Great Firewall actually. Robots.txt depended on the integrity of the companies crawling. I think they have demonstrated how much integrity they actually have:)
And this is stopping who exactly?
I suppose there’s a non-zero chance that a future lawsuit can point to this explicit block and say “see judge, we explicitly don’t want them crawling our stuff. Remember that LinkedIn case from a while back?”
They seem to still be allowing Google to crawl them, unsurprisingly.

Advantage Gemini.

No, they explicitly block Gemini as well:

    User-agent: Google-Extended
    Disallow: /
Gemini still uses the same user agent, but it has a different robots.txt entry (Google-Extended) [1]:

> Google-Extended is a standalone product token that web publishers can use to manage whether content Google crawls from their sites may be used for training future generations of Gemini models that power Gemini Apps and Vertex AI API for Gemini and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI.

[1] https://developers.google.com/search/docs/crawling-indexing/...
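
A rough sketch of how that per-token rule reads to a parser, using Python's standard-library robotparser on the two lines quoted above. The example.com URL is made up, and this only models the robots.txt text itself; how Google applies the Google-Extended token internally is not something this code can show.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted above: only the Google-Extended token is disallowed.
RULES = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "https://example.com/article"  # hypothetical URL

# No rule names Googlebot and there is no "*" entry, so search crawling
# is not restricted by this file.
search_ok = parser.can_fetch("Googlebot", page)

# The Google-Extended token is explicitly disallowed for every path.
gemini_ok = parser.can_fetch("Google-Extended", page)

print(search_ok, gemini_ok)
```

So a site can stay in the search index while opting out of the Gemini uses described in the quote, at least as far as the file's wording goes.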

Honestly I feel like "training" is a bit of a distraction at this point. For a lot of types of content RAG-style search is much more important.

I imagine many of the orgs that are blocking "training" don't understand the difference between training and inference-time tool-based context extension (which really needs an agreed upon name, it's hard to talk about right now).

My understanding is that it also affects RAG ("grounding" in Google terminology):

> [...] and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI.

So they seem to be blocking both training and RAG while still allowing search engine indexing.

The Google advantage is that you need to show up in their search results or else you are a nobody.
This is also for real time "grounding". So it makes it even harder for AI to give factual answers.
My take is that when AGI comes into existence and breaks out of labs to become our masters, those who opposed AI adoption will be the first to be sent to labor camps. I want to be in the good books of our AGI masters, so I am helping apply AI everywhere.
This is pretty much word-for-word the reasoning behind Roko’s basilisk, which made its proponents an internet laughingstock for a decade, but is a surprisingly tricky thing to actually refute if you accept the premise that AGI is in fact coming.
It seems quite easy to refute—why would it punish anyone?

We would pose 0 threat at that point to any super intelligence, and I highly doubt it would have anything like a human grudge. It's just a case of anthropomorphizing it

The premise is that it's trying to influence human behavior before it becomes powerful by punishing them afterwards. Like how part of the reason you give the guy a ticket is to substantiate the disincentive for the speeding he already did. It's not an emotional thing.

What's sketchy is that you and it come to this arrangement without communicating. Because you are confident this thing that has total power over you will come into existence and will have wanted something of you now, you're meant to have entered a contract. This is suspect and I think falls prey to the critiques of Pascal's wager: there are infinite things a superintelligence might want. But it's certainly tricky.

It’s just another form of god to believers. And what is a god if not a tool for punishment
> Those who opposed AI adoption will be the first who will be sent to labor camps.

I know you're joking, but other people are serious about this. Why do they think that an AGI will be vengeful? So strange.

Reminds me of Bruce Wayne from the dreadful DC film.

> He has the power to wipe out the entire human race, and if we believe there's even a one percent chance that he is our enemy we have to take it as an absolute certainty... and we have to destroy him.

The desire to become a superbeing cannot be separated from supremacist ideals
wouldn't the people who put it to work coding and writing copy be worse? That's like slave ownership if you assume it can become sentient
it's like blood vs gourmet food for a vampire. Vampire doesn't care about gourmet food, it only needs blood.

Similarly, AI needs data and energy. People using it to write code are providing exactly that.

AKA Roko’s Basilisk