| HN Mirror

by blibble 273 days ago

every major news org now blocks the parasitic "AI" crawlers

examples:

    https://www.bbc.co.uk/robots.txt
    https://www.cnn.com/robots.txt
    https://www.nbcnews.com/robots.txt

all they will be training on now is spam

anyone that says "AI is the worst today it will ever be", no

because that was before the world reacted to it

5 comments

AstroBen 273 days ago

They torrented a shitload of books illegally and trained on them.. but they're unable to get past The Great Wall of robots.txt?

blibble 273 days ago

I doubt it's their only countermeasure

plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

selcuka 273 days ago

> plus it's a pretty dangerous game for them to play against large, powerful actors with legions of lawyers

Like book publishers?

Gigachad 273 days ago

If the AI crawlers circumvent the protection mechanisms it's a serious crime now rather than just "Well it was on the open internet for free". Wouldn't surprise me if the news orgs are also looking at honeypot articles to see if the fake details slip into LLMs.

crazygringo 273 days ago

It's not a serious crime, or any crime at all, to ignore robots.txt. It's entirely voluntary whether you want to follow it or not. If you don't, you're being a dick maybe, but that's not a crime.

Gigachad 273 days ago

It's not just robots.txt. If you've tried using a VPN lately, so many sites like reddit/youtube/etc block you from viewing content until you log in. Every major website has been getting anti-scraping tech over the last year. Even archive.org is getting blocked from more and more sites since it can be used for indirect scraping of sites.

simianwords 273 days ago

robots.txt prevents real time search use for grounding and citations.

crazygringo 273 days ago

No it doesn't. It has zero legal force. Or any technical force either.

simianwords 273 days ago

Not an expert so I ask: no technical force either? Is it just a polite ask then?

zdragnar 273 days ago

It's hardly even a polite ask. It's literally a text file. Automated HTTP clients, such as search engine indexers (Google, Yahoo, etc.), are expected to use it to know which pages can be visited and which cannot. That expectation is nothing more than a convention.

If you are on a Mac or Linux computer, odds are it has a program called curl pre-installed. If you type curl followed by a website address in a terminal, it'll make a request and download the response. robots.txt never gets involved. The same is true for AI agents and search engines that aren't polite.
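
A minimal sketch of that convention, using Python's standard urllib.robotparser. The bot name ExampleAIBot, the example.com URL and the robots.txt contents are made up for illustration. The point is that the check only happens if the client chooses to run it; nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one made-up crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def polite_may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return what robots.txt asks of this user agent. Purely advisory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story"
    for agent in ("ExampleAIBot", "SomeOtherClient"):
        print(agent, polite_may_fetch(agent, url))
```

A polite crawler would call something like polite_may_fetch before each request and skip the URL when it returns False. curl, or an impolite crawler, simply never makes the call.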

euLh7SM5HDFY 273 days ago

LinkedIn lost their anti-scraping suit: https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin... but it seems they have since been able to successfully appeal that decision.

Regardless, requiring an account to read anything, even a "free" one, totally changes the whole situation, even when a site's terms of service are limited by local law.

crazygringo 273 days ago

Correct. Literally just a polite ask.

strangattractor 273 days ago

Great Firewall, actually. Robots.txt depended on the integrity of the companies crawling. I think they have demonstrated how much integrity they actually have :)

klysm 273 days ago

And this is stopping who exactly?

ares623 273 days ago

I suppose there’s a non-zero chance that a future lawsuit can point to this explicit block and say “see judge, we explicitly don’t want them crawling our stuff. Remember that LinkedIn case from a while back?”

simonw 273 days ago

They seem to still be allowing Google to crawl them, unsurprisingly.

Advantage Gemini.

selcuka 273 days ago

No, they explicitly block Gemini as well:

    User-agent: Google-Extended
    Disallow: /

Gemini still uses the same user agent, but it has a different robots.txt entry (Google-Extended) [1]:

> Google-Extended is a standalone product token that web publishers can use to manage whether content Google crawls from their sites may be used for training future generations of Gemini models that power Gemini Apps and Vertex AI API for Gemini and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI.

[1] https://developers.google.com/search/docs/crawling-indexing/...

simonw 273 days ago

Honestly I feel like "training" is a bit of a distraction at this point. For a lot of types of content RAG-style search is much more important.

I imagine many of the orgs that are blocking "training" don't understand the difference between training and inference-time tool-based context extension (which really needs an agreed upon name, it's hard to talk about right now).

selcuka 273 days ago

My understanding is that it also affects RAG ("grounding" in Google terminology):

> [...] and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI.

So they seem to be blocking both training and RAG while still allowing search engine indexing.
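
A rough way to see that split, assuming a robots.txt that contains only the two Google-Extended lines quoted earlier in the thread (the real files list many more agents). This uses Python's standard urllib.robotparser; the example.com URL and the function name are placeholders. Per the Google documentation quoted above, Google-Extended is a product token rather than a separate crawler, so this only shows how the rule matches tokens, not how Google applies it internally.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt: only the Google-Extended rule from the thread.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def token_allowed(token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Check whether the rules permit the given user-agent token for a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    url = "https://example.com/news/story"
    # The AI-related token is disallowed; the search crawler is not mentioned,
    # so the parser falls back to "allowed".
    print("Google-Extended:", token_allowed("Google-Extended", url))
    print("Googlebot:", token_allowed("Googlebot", url))
```

With no rule naming Googlebot and no wildcard entry, the parser treats search indexing as permitted while the Google-Extended token is refused.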

strangattractor 273 days ago

The Google advantage is that you need to show up in their search results or else you are a nobody.

simianwords 273 days ago

This is also for real time "grounding". So it makes it even harder for AI to give factual answers.

faangguyindia 273 days ago

My take is that when AGI comes into existence and breaks out of labs to become our masters, those who opposed AI adoption will be the first to be sent to labor camps. I want to be in the good books of our AGI masters, so I am helping apply AI everywhere.

Analemma_ 273 days ago

This is pretty much word-for-word the reasoning behind Roko’s basilisk, which made its proponents an internet laughingstock for a decade, but is a surprisingly tricky thing to actually refute if you accept the premise that AGI is in fact coming.

AstroBen 273 days ago

It seems quite easy to refute—why would it punish anyone?

We would pose 0 threat at that point to any super intelligence, and I highly doubt it would have anything like a human grudge. It's just a case of anthropomorphizing it

ToValueFunfetti 273 days ago

The premise is that it's trying to influence human behavior before it becomes powerful by punishing them afterwards. Like how part of the reason you give the guy a ticket is to substantiate the disincentive for the speeding he already did. It's not an emotional thing.

What's sketchy is that you and it come to this arrangement without communicating. Because you are confident this thing that has total power over you will come into existence and will have wanted something of you now, you're meant to have entered a contract. This is suspect and I think falls prey to the critiques of Pascal's wager: there are infinite things a superintelligence might want. But it's certainly tricky.

ares623 273 days ago

It’s just another form of god to believers. And what is a god if not a tool for punishment?

JKCalhoun 273 days ago

> Those who opposed AI adoption will be the first who will be sent to labor camps.

I know you're joking, but other people are serious about this. Why do they think that an AGI will be vengeful? So strange.

macintux 273 days ago

Reminds me of Bruce Wayne from the dreadful DC film.

> He has the power to wipe out the entire human race, and if we believe there's even a one percent chance that he is our enemy we have to take it as an absolute certainty... and we have to destroy him.

teitoklien 273 days ago

The desire to become a superbeing cannot be separated from supremacist ideals.

nemomarx 273 days ago

wouldn't the people who put it to work coding and writing copy be worse? That's like slave ownership if you assume it can become sentient

faangguyindia 273 days ago

it's like blood vs gourmet food for a vampire. A vampire doesn't care about gourmet food, it only needs blood.

Similarly, AI needs data and energy. People using it to write code are providing exactly that.

vict7 273 days ago

AKA Roko’s Basilisk
