by jn5
634 days ago
I am pretty sure they do; this data is just too valuable. At least Meta admitted to using a dataset called "books3", which contains ~200k pirated ebooks, for Llama 1 and 2 [1].
Anna's Archive provides datasets for LLM training, but who knows who they are working with. I also wonder if Google is using its own dataset from books.google.com.

[1] https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...