by naillo 1226 days ago
Never realized how little data it was fed with. 570GB can fit on my laptop.
2 comments

The original dataset was 45 TB.

The neural net model is condensed to 800 GB.

https://www.springboard.com/blog/data-science/machine-learni...

Note that the "compression" there also includes the "intelligence" that it presents - you might be able to get some powerful compression of English text... but you can't ask a gzip file to come up with a joke about cats and dinosaurs.
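To make the gzip comparison concrete, here is a minimal sketch using Python's standard `gzip` module. The sample sentence and the repeat count are made up for illustration. The point is that a compressor can shrink English text a lot, but the only thing it can give back is the exact bytes it was handed.

```python
import gzip

# Hypothetical sample: a short English sentence repeated many times.
text = ("The quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")

packed = gzip.compress(text)        # lossless compression
restored = gzip.decompress(packed)  # returns the original bytes, nothing more

ratio = len(text) / len(packed)
print(f"original: {len(text)} bytes, gzip: {len(packed)} bytes, ratio: {ratio:.1f}x")
print("round trip identical:", restored == text)
```

Repetitive text like this compresses far better than real prose would, so the ratio here says nothing about what a 45 TB crawl would shrink to.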

A 1 GB file would contain roughly 166,000,000 words. This counts the space between words and assumes the average word is 5 characters, so about 6 bytes per word.

A typical single-spaced page is 500 words long.

At 570 GB, that’s roughly 189,240,000 full pages of text.
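The back-of-the-envelope estimate above can be sketched in a few lines of Python. The assumptions are mine where the thread does not state them: 1 GB is taken as 10^9 bytes, every character is one byte, and the 570 GB figure is the one quoted in the top comment. The function names are just illustrative.

```python
BYTES_PER_GB = 10**9      # assumption: decimal gigabytes, one byte per character
BYTES_PER_WORD = 5 + 1    # from the thread: 5-character average word plus one space
WORDS_PER_PAGE = 500      # from the thread: typical single-spaced page
DATASET_GB = 570          # from the top comment


def words_per_gb() -> int:
    """Whole words that fit in one gigabyte of plain text."""
    return BYTES_PER_GB // BYTES_PER_WORD


def pages_for(gigabytes: int) -> int:
    """Whole pages of text in the given number of gigabytes."""
    return gigabytes * words_per_gb() // WORDS_PER_PAGE


print(f"{words_per_gb():,} words per GB")
print(f"{pages_for(DATASET_GB):,} pages in {DATASET_GB} GB")
```

Without rounding the words-per-GB figure down, this comes out at just under 190 million pages for 570 GB. Real text is messier (multi-byte characters, markup, varying word lengths), so treat it as an order-of-magnitude figure.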

I wonder if they excluded any duplicated text.

But it’s not just words…
I thought LLMs were fed text only in their training datasets?

I’ve only built image classifiers and object detectors, so I was assuming LLMs must be trained on similarly pure datasets.