

by jesse__ 130 days ago

This sounds very wrong to me.

Take the C4 training dataset, for example. The uncompressed, uncleaned dataset is ~6TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

I could go on, but I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.


FeepingCreature 130 days ago

This would imply that the English internet is not much bigger than 20x the English Wikipedia.

That seems implausible.
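
For reference, the arithmetic behind that ratio can be sketched roughly as follows. The 1TB figure is the upper bound quoted upthread; the Wikipedia size printed here is only what the 20x ratio implies, not a measured number, and the function names are made up for illustration.

```python
# Figures taken from the thread. The Wikipedia size is NOT measured here;
# it is only what the "20x" ratio would imply (an assumption).
CLEANED_C4_TB = 1.0  # upper bound quoted upthread ("significantly less than 1TB")
RATIO = 20           # "not much bigger than 20x the English Wikipedia"


def implied_wikipedia_gb(corpus_tb: float, ratio: float) -> float:
    """Size in GB Wikipedia would have if corpus = ratio * Wikipedia (1 TB = 1000 GB)."""
    return corpus_tb * 1000 / ratio


def share_percent(ratio: float) -> float:
    """Share of the corpus that Wikipedia would make up, in percent."""
    return 100 / ratio


print(f"implied Wikipedia size: {implied_wikipedia_gb(CLEANED_C4_TB, RATIO):.0f} GB")
print(f"Wikipedia share of corpus: {share_percent(RATIO):.0f}%")
```

So the claim being questioned is that Wikipedia would make up about one twentieth of the cleaned English web text.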


jesse__ 130 days ago

> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.


onraglanroad 129 days ago

Because we can count? How could you possibly think that Wikipedia was 5% of the whole Internet? It's just such a bizarrely foolish idea.
