| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mdp2021 424 days ago

> Your data really isn't that useful anyway

? One single random document, maybe, but as an aggregate, I understood some parties were trying to scrape indiscriminately - the "big data" way. And if some of that input is sensitive, and is stored somewhere in the NN, it may come out in an output - in theory...

Actually I never researched the details of the potential phenomenon - that anything personal may be stored (not just George III but Random Randy) -, but it seems possible.

1 comments

simonw 424 days ago

There's a pretty common misconception that training LLMs is about loading in as much data as possible no matter the source.

That might have been true a few years ago but today the top AI labs are all focusing on quality: they're trying to find the best possible sources of high quality tokens, not randomly dumping in anything they can obtain.

Andrej Karpathy said this last year: https://twitter.com/karpathy/status/1797313173449764933

> Turns out that LLMs learn a lot better and faster from educational content as well. This is partly because the average Common Crawl article (internet pages) is not of very high value and distracts the training, packing in too much irrelevant information. The average webpage on the internet is so random and terrible it's not even clear how prior LLMs learn anything at all.

mdp2021 424 days ago

Obviously the training data should be preferably high quality - but there you have a (pseudo-, I insisted also elsewhere citing the rights to have read whatever is in any public library) problem with "copyright".

If there exists some advantage on quantity though, then achieving high quality imposes questions about tradeoffs and workflows - sources where authors are "free participants" could have odd data sip in.

And the matter of whether such data may be reflected in outputs remains as a question (probably tackled by some I have not read... Ars longa, vita brevis).