Hacker News
by lolinder 637 days ago
Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

4 comments

It’s not a legal question but a behavior and sustainability question. Even if it is fair use, content makers who find it undesirable are still under no obligation to allow scraping. So they’ll try stuff like this, and other more restrictive bot blockers.

Remember when news sites wanted to allow some free articles to entice people and wanted to allow Google to scrape, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone not logged in, if they can't effectively block this kind of thing.

In the end the websites always lose that battle if humans are willing to put effort into sharing the content. You see people just pasting full article text or summaries into Reddit comments. Those people are probably subscribers.

Copyright is only part of the equation; there's also the use of other people's resources.

If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask various questions to learn what human answers are like in order to train your artificial neural network.

The amount of scraping that happened in ~2020 as compared to 2024 is orders of magnitude different. Not all scrapers send a user agent (looking at "alibaba cloud intelligence" unintelligently doing a billion requests from 1 IP address) or respect the robots file (looking at Huawei's Singapore department, who also pretend to be a normal browser and slurp craptons of pages through my proxy site that was meant to alleviate load from the slow upstream server, and who are therefore the only entry that my robots.txt denies).
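
For anyone wondering why a single robots.txt deny entry does so little against a bot that pretends to be a browser, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleBadBot`, the URL, and the file contents are hypothetical, not the actual rules from the site described above. The check is keyed entirely on the user agent string the client chooses to send, so it only binds crawlers that identify themselves honestly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is denied, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleBadBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The denied crawler is blocked only while it announces its real name.
honest = is_allowed("ExampleBadBot/1.0", "https://example.com/page")
# The same crawler sending a browser-like user agent passes the check.
disguised = is_allowed("Mozilla/5.0", "https://example.com/page")
```

Nothing in this check is enforced by the server; a crawler that never reads the file is unaffected by it.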

But here we're talking about Common Crawl being included in this scheme, and Common Crawl is explicitly designed to make it easier to use its data than to make your own bad robot.

You block Common Crawl and all you'll be left with is the abusive bots that find workarounds.

> you still can't walk into their office thousands of times per day

why not?

Esp. if that receptionist is an automaton and isn't bothered by you. Of course, if you end up taking more resources and blocking others from asking as well, then you need to observe some etiquette (i.e., throttle etc.).
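
As a rough illustration of what "throttle" could mean on the client side, here is a minimal sketch of a fixed minimum delay between requests to one host. The class name `Throttle` and the interval are assumptions made up for this example, not a standard or anything the crawlers discussed here are known to use.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to one host."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval_s - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A crawler would call `wait()` before each request, for example `throttle = Throttle(2.0)` followed by `throttle.wait()` inside the fetch loop.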

> why not? Esp. if that receptionist is an automaton, and isn't bothered by you

I chose "thousands" to keep it within the realm of possibility while making it clear that it would bother a human receptionist precisely because humans aren't automatons, making the use of resources very obvious.

If you need an analogy to understand how an automated system could suffer from resources being consumed, perhaps picture a web server and billions of requests using a certain amount of bandwidth and CPU time each. Wait, now we're back to the original scenario!

There is no objective, black-and-white "is" or "is not" in this situation.

There is litigation of multiple cases and a judge making a judgement on each one.

Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.

I personally don't give a shit about fair use or anything like it. I simply don't want AIs and their handlers (huge tax-dodging megacorporations with trillion dollar market caps that are leeches on everyone and everything around them) to slurp up everything they can get their grubby hands on unimpeded. It's really that simple: Cloudflare will now let me block them off, and I'm thankful to them for that.

I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principle of the thing more than anything.