| HN Mirror


	by lolinder 637 days ago
	Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law. Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

4 comments

6gvONxR4sf7o 637 days ago

It's not a legal question but a behavior and sustainability question. If it is fair use, but is undesirable for content makers, then they’re still not under any obligation to allow scraping. So they’ll try stuff like this, and other more restrictive bot blockers.

Remember when news sites wanted to allow some free articles to entice people and wanted to allow Google to scrape, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone not logged in if they can't effectively block this kind of thing.

nitwit005 636 days ago

In the end the websites always lose that battle if humans are willing to put effort into sharing the content. You see people just pasting full article text or summaries into Reddit comments. Those people are probably subscribers.

Aachen 637 days ago

If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask various questions to learn what human answers are like in order to train your artificial neural network.

The amount of scraping that happened in ~2020 as compared to 2024 is orders of magnitude different. Not all of the scrapers have a user agent (looking at "alibaba cloud intelligence" unintelligently doing a billion requests from 1 IP address) or respect the robots file (looking at Huawei's Singapore department, which also pretends to be a normal browser and slurps craptons of pages through my proxy site; that site was meant to alleviate load from the slow upstream server, and the Huawei bot is therefore the only entry that my robots.txt denies).
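For anyone unsure what "respecting the robots file" means on the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the bot name `BadBot` are made up for illustration and are not the actual file from the proxy site; a well-behaved crawler would run a check like this before every fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that denies exactly one crawler and allows the rest.
# "BadBot" is an invented name, not any real crawler's user agent.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("BadBot", "https://example.com/page"))
    print(is_allowed("SomeOtherBot", "https://example.com/page"))
```

Of course this only helps with bots that bother to run the check and identify themselves honestly, which is exactly the problem described above.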

lolinder 636 days ago

But here we're talking about Common Crawl being included in this scheme, and Common Crawl is explicitly designed to make it easier to use their data than to make your own bad robot.

You block Common Crawl and all you'll be left with is the abusive bots that find workarounds.

chii 636 days ago

> you still can't walk into their office thousands of times per day

why not?

Esp. if that receptionist is an automaton, and isn't bothered by you. Of course, if you end up taking more resources and block others from asking as well, then you need to observe some etiquette (i.e., throttle, etc.).
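To make "throttle" concrete, a rough sketch of the simplest version: enforce a minimum gap between consecutive requests. The class name `Throttle` and the one-second interval are assumptions for illustration, not a recommendation for any particular site; the clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative only)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Block until min_interval has elapsed since the last call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)
    for i in range(3):
        throttle.wait()  # call this before each request
        print("request", i)
```

A fixed interval is the crudest form of etiquette; it says nothing about how much load the server can actually take.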

Aachen 636 days ago

> why not? Esp. if that receptionist is an automaton, and isn't bothered by you

I chose "thousands" to keep it within the realm of possibility while making it clear that it would bother a human receptionist precisely because humans aren't automatons, making the use of resources very obvious.

If you need an analogy to understand how an automated system could suffer from resources being consumed, perhaps picture a web server and billions of requests using a certain amount of bandwidth and CPU time each. Wait, now we're back to the original scenario!

MrDarcy 637 days ago

There is no objective, black-and-white "is" or "is not" in this situation.

There is litigation of multiple cases and a judge making a judgement on each one.

Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.

sensanaty 636 days ago

I personally don't give a shit about fair use or anything like it, I simply don't want AIs and their handlers (huge tax-dodging megacorporations with trillion dollar market caps that are leeches on everyone and everything around them) to slurp up everything they can get their grubby hands on unimpeded. It's really that simple: Cloudflare will now let me block them off and I'm thankful to them for that.

I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principle of the thing more than anything.
