by paxys 639 days ago
> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.


Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

It's not a legal question but a behavior and sustainability question. If it is fair use, but is undesirable for content makers, then they’re still not under any obligation to allow scraping. So they’ll try stuff like this, and other more restrictive bot blockers.

Remember when news sites wanted to allow some free articles to entice people and wanted to allow google to scrape, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone not logged in if they can't effectively block this kind of thing.

In the end the websites always lose that battle if humans are willing to put effort into sharing it. You see people just pasting full article text or summaries into reddit comments. Those people are probably subscribers.

Copyright is only part of the equation; there's also the use of other people's resources.

If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask various questions to learn what human answers are like in order to train your artificial neural network

The amount of scraping in ~2020 compared to 2024 differs by orders of magnitude. Not all scrapers send a user agent (looking at "alibaba cloud intelligence" unintelligently doing a billion requests from 1 IP address) or respect the robots file (looking at huawei's singapore department, who also pretend to be a normal browser and slurp craptons of pages through my proxy site that was meant to alleviate load from the slow upstream server, which is why they are the only entry that my robots.txt denies).
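
To make the robots.txt point concrete, here is a minimal sketch of a file that denies a single crawler and allows everyone else, checked with Python's standard `urllib.robotparser`. The bot name `ExampleBadBot` and the example.org URL are hypothetical, not taken from any real config. It only shows what a well-behaved crawler would conclude; a bot that ignores the file is unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is denied, all others are allowed.
ROBOTS_TXT = """\
User-agent: ExampleBadBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The single denied entry (hypothetical name).
print(is_allowed("ExampleBadBot", "https://example.org/page"))
# CCBot is the user-agent token used by Common Crawl's crawler.
print(is_allowed("CCBot", "https://example.org/page"))
```

Compliance is entirely voluntary on the crawler's side, which is why a robots.txt entry does nothing against a bot that pretends to be a normal browser.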

But here we're talking about Common Crawl being included in this scheme, which is explicitly designed to make it easier to use them than to make your own bad robot.

You block Common Crawl and all you'll be left with is the abusive bots that find workarounds.

> you still can't walk into their office thousands of times per day

why not?

Esp. if that receptionist is an automaton, and isn't bothered by you. Of course, if you end up taking more resources and block others from asking as well, then you need to observe some etiquette (i.e., throttle, etc.).
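
As a rough sketch of what "throttle" could mean for a crawler, the helper below enforces a minimum gap between consecutive requests. The class name `MinIntervalThrottle` and the interval values are assumptions for illustration, not something any particular crawler is known to use; real crawlers typically also throttle per host and back off on errors.

```python
import time


class MinIntervalThrottle:
    """Keep at least `min_interval` seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the delay logic can be tested
        self._sleep = sleep
        self._last = None      # time at which the previous request was released

    def wait(self):
        """Block until the next request may go out; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Usage: call wait() before each request (tiny interval so the demo is quick).
throttle = MinIntervalThrottle(min_interval=0.01)
for path in ["/a", "/b", "/c"]:
    throttle.wait()
    print("would fetch", path)
```

The first call goes through immediately; every later call sleeps only for whatever part of the interval has not already elapsed.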

> why not? Esp. if that receptionist is an automaton, and isn't bothered by you

I chose "thousands" to keep it within the realm of possibility while making it clear that it would bother a human receptionist precisely because humans aren't automatons, making the use of resources very obvious.

If you need an analogy to understand how an automated system could suffer from resources being consumed, perhaps picture a web server and billions of requests using a certain amount of bandwidth and CPU time each. Wait, now we're back to the original scenario!

There is no objective, black-and-white "is" or "is not" in this situation.

There is litigation of multiple cases and a judge making a judgement on each one.

Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.

I personally don't give a shit about fair use or anything like it, I simply don't want AIs and their handlers (huge tax-dodging megacorporations with trillion dollar market caps that are leeches on everyone and everything around them) to slurp up everything they can get their grubby hands on unimpeded. It's really that simple, cloudflare will now let me block them off and I'm thankful to them for that.

I don't even have anything on my websites that would be considered interesting to anyone but myself, but it's the principle of the thing more than anything.

The end result is browser extensions, like Recap the Law [1] for PACER, that stream data back from participating user browsers to a target for batch processing and eventual reconciliation.

Certainly, a race to the bottom and tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content and archives can't be reached.

[1] https://free.law/recap/faq

Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/ https://commoncrawl.org/faq

While this doesn't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

Licensing doesn't mean shit when no court in the country is actually willing to prosecute violations. Who have OpenAI, Anthropic, Microsoft, Google, Meta licensed all their training data from?

Copyright infringement is a civil matter.

And where do you think civil matters are handled?

In the U.S., civil cases are litigated by opposing attorneys in front of a judge, often without a jury, which differs from criminal cases led by prosecutors. Prosecutors (e.g., local DAs, AGs, DOJ) handle criminal trials, not civil ones like (usually) IP infringement.

If people are exploiting your work unfairly, it's on you to take legal action in civil court. Just be aware the statute of limitations is short (three years for civil copyright claims under U.S. federal law), so consult a real attorney quickly. (I'm not a lawyer, so this isn't legal advice!)

I mean, this is exactly what people like myself were predicting when these AI companies first started spooling up their operations. Abuse of the public square means that public goods are then restricted. It's perfectly rational for websites of any sort that have strong opinions on AI to forbid the use of Common Crawl, specifically because it is being abused by AI companies to train the AIs they are opposed to.

It's the same as when we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying you can't do it. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.