| HN Mirror



	by Wowfunhappy 388 days ago
	I'm also assuming. But I would ask the opposite question: why store all that data if you'll have to scrape again anyway? You will have to scrape again because you want the next AI to get trained on updated data. And, even at the scale needed to train an LLM, storing all of the text on the entire known internet is a very non-trivial task!

1 comment

anonymoushn 387 days ago

If you try to reproduce various open datasets like fineweb by scraping the pages again, you can't, because a lot of the pages no longer exist. That's why you would prefer to store them instead of losing the content forever.

It's not "all of the text", it's like less than 100 trillion tokens, which means less than 400TB assuming you don't bother to run the token streams through a general purpose compression algorithm before storing them.
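A rough sketch of the arithmetic behind that estimate. The 4 bytes per token figure is an assumption inferred from the numbers above (400TB / 100 trillion tokens), for example token IDs stored as uncompressed 32-bit integers; the function name is made up for illustration.

```python
def token_storage_bytes(num_tokens: int, bytes_per_token: int = 4) -> int:
    """Uncompressed size of a token stream in bytes.

    bytes_per_token=4 is an assumption (32-bit token IDs), not a figure
    stated in the comment; it is what makes 100T tokens come out to 400TB.
    """
    return num_tokens * bytes_per_token


TOKENS = 100 * 10**12   # upper bound from the comment: 100 trillion tokens
TB = 10**12             # decimal terabyte

total = token_storage_bytes(TOKENS)
print(f"{total / TB:.0f} TB at 4 bytes/token")

# If the vocabulary fit in 16 bits, the same stream would need half that.
print(f"{token_storage_bytes(TOKENS, 2) / TB:.0f} TB at 2 bytes/token")
```

Either way this is an upper bound on the token stream alone; compression would shrink it further, and storing raw HTML instead of tokens would be a different calculation.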
