Hacker News new | ask | show | jobs
by billyhoffman 641 days ago
Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/ https://commoncrawl.org/faq

While this doesn't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

1 comments

Licensing doesn't mean shit when no court in the country is actually willing to prosecute violations. Who have OpenAI, Anthropic, Microsoft, Google, Meta licensed all their training data from?
Copyright infringement is a civil matter.
And where do you think civil matters are handled?
In the U.S., civil cases are litigated by opposing attorneys in front of a judge, often without a jury, which differs from criminal cases led by prosecutors. Prosecutors (e.g., local DAs, AGs, DOJ) handle criminal trials, not civil ones like (usually) IP infringement.

If people are exploiting your work unfairly, it's on you to take legal action in civil court. Just be aware the statute of limitations is short (often 1-4 years depending on the state), so consult a real attorney quickly. (I'm not a lawyer, so this isn't legal advice!)