Hacker News new | ask | show | jobs
by Pigalowda 1226 days ago
1GB file would contain roughly 166,000,000 words. This includes the space between words, so the average word is 5 characters.

A typical single-spaced page is 500 words long

That’s 179,280,000 full pages of text.

I wonder if they excluded any duplicated text.

1 comments

But its not just words…
I thought LLM were fed text only in their training data set?

I’ve only done image classifiers and object detectors so I was assuming they must be trained with similar pure datasets.