Hacker News
by jn5 634 days ago
I am pretty sure they do; this data is just too valuable. At least Meta admitted to using a dataset called "Books3", which contains ~200k pirated ebooks, for LLaMA 1 and 2 [1]. Anna's Archive provides datasets for LLM training, but who knows who they are working with.

I also wonder if Google is using its own dataset from books.google.com.

[1] https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...