| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Covenant0028 430 days ago

Another thing that would permanently taint models and open their creators to lawsuits is if they were trained on many terabytes worth of pirated ebooks. Yet that didn't seem to stop Meta with Llama[0]. This industry is rife with such cases; OpenAI's CTO famously could not answer a simple question about whether Sora was trained on Youtube data or not. And now it seems they might be trained on video game content [1], which opens up another lawsuit avenue.

The key question from the perspective of the company is not whether there will be lawsuits, but whether the company will get away with it. And so far, the answer seems to be: "yes".

The only exception that is likely is private repos owned by enterprise customer. It's unlikely that GitHub would train LLMs on that, as the customer might walk away if they found out. And Fortune 500 companies have way more legal resources to sue them than random internet activists. But if you are not a paying customer, well, the cliche is that you are the product.

[0]: https://cybernews.com/tech/meta-leeched-82-terabytes-of-pira... [1]: https://techcrunch.com/2024/12/11/it-sure-looks-like-openai-...