Hacker News new | ask | show | jobs
by Octoth0rpe 323 days ago
Given Meta's history of torrenting every book it could get its hands on for training, I'm not convinced that the majority of AI companies would respect that license. Maybe if we also had a better way to prove that such code was part of the training set and see a couple of solid legal victories with compensation awarded.
2 comments

I'm pretty astounded that "The Stack" at least did and effort, and continue to do so by weeding out GPL or similar strong copyleft source code from their trove, and even implemented an opt-out mechanism [0].

They look like saints when compared to today's companies.

[0]: https://huggingface.co/spaces/bigcode/in-the-stack

They're also getting sued for it, and the judge ruled they had no right to torrent those books so now it's just a matter of calculating how many trillions Meta has to pay, then extracting it from them.
Because Meta got caught. I'm not convinced that every random OSS lib will have the resources to audit every model out there for a hypothetical GPL+no training violation.