| HN Mirror



	by naillo 1226 days ago
	Never realized how little data it was fed. 570 GB can fit on my laptop.

2 comments

shagie 1226 days ago

The original dataset was 45 TB.

The neural net model is condensed to 800 GB.

https://www.springboard.com/blog/data-science/machine-learni...

Note that the "compression" there also includes the "intelligence" that it presents - you might be able to get some powerful compression of English text... but you can't ask a gzip file to come up with a joke about cats and dinosaurs.
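To make the gzip comparison concrete, here is a minimal sketch using Python's standard `gzip` module. The sample sentence is a made-up placeholder, and because it is highly repetitive it compresses far better than real English prose would, so treat the ratio as illustrative only.

```python
import gzip

# Hypothetical sample text (an assumption for illustration). It is highly
# repetitive, so it compresses much better than real prose would.
sample = ("The quick brown fox jumps over the lazy dog. " * 2000).encode("utf-8")

# Lossless compression: the output is smaller, but carries the same bytes.
compressed = gzip.compress(sample)
ratio = len(sample) / len(compressed)

# Decompression returns exactly the original bytes and nothing new.
restored = gzip.decompress(compressed)

if __name__ == "__main__":
    print(f"original:   {len(sample):,} bytes")
    print(f"compressed: {len(compressed):,} bytes")
    print(f"ratio:      {ratio:.1f}x")
    print(f"round trip identical: {restored == sample}")
```

All a gzip file can do is hand back the bytes that went in, which is the point of the contrast: the model is "lossy" but can produce text that was never in the input.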


Pigalowda 1226 days ago

A 1 GB file would contain roughly 166,000,000 words, assuming an average word is 5 characters plus one space between words (6 bytes per word).

A typical single-spaced page is about 500 words long.

At 570 GB, that's roughly 189,240,000 full pages of text.

I wonder if they excluded any duplicated text.
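The back-of-the-envelope estimate above can be sketched in a few lines of Python. Every constant is an assumption taken from this thread (decimal gigabytes, one byte per character, 5 characters plus a space per word, 500 words per page, a 570 GB dataset), and the function names are just illustrative.

```python
# Rough estimate only; every constant below is an assumption from the thread.
BYTES_PER_GB = 1_000_000_000   # decimal gigabyte
BYTES_PER_WORD = 6             # 5 characters plus one space, 1 byte each (ASCII)
WORDS_PER_PAGE = 500           # typical single-spaced page
DATASET_GB = 570               # dataset size quoted at the top of the thread


def words_per_gb(bytes_per_word: int = BYTES_PER_WORD) -> int:
    """Whole words that fit in one gigabyte of plain text."""
    return BYTES_PER_GB // bytes_per_word


def pages_for(dataset_gb: int, words_per_page: int = WORDS_PER_PAGE) -> int:
    """Whole pages of text in a dataset of the given size."""
    return dataset_gb * words_per_gb() // words_per_page


if __name__ == "__main__":
    print(f"{words_per_gb():,} words per GB")
    print(f"{pages_for(DATASET_GB):,} pages in {DATASET_GB} GB")
```

Without rounding the words-per-GB figure down, this comes out at about 190 million pages. Real text is not pure ASCII and real words are not uniformly 5 characters, so the result is an order-of-magnitude figure at best.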


bdhcuidbebe 1226 days ago

But it's not just words…


Pigalowda 1226 days ago

I thought LLMs were fed text only in their training datasets?

I’ve only done image classifiers and object detectors so I was assuming they must be trained with similar pure datasets.
