|
|
|
|
|
by mdp2021
424 days ago
|
|
> Your data really isn't that useful anyway ? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory... Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) -, but it seems possible. |
|
That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.
Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933
> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.