Hacker News
by rcxdude 389 days ago
Even if training on the copyrighted material is OK, just providing a data dump of it almost certainly is not.

No need for a data dump; just list the URLs, or whatever else identifies the sources, of their training data. AFAIK that's how the LAION training dataset was published.
Providing a large list of bitrotted URLs and titles of books that users would have to OCR themselves before attempting to reproduce the model doesn't seem very useful.
Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.
...no? They also use web crawlers.
The datasets are collected using web crawlers, but that doesn’t tell us anything about how they are stored and re-distributed, right?
Why would you store the data after training?