| HN Mirror



	by anonymoushn 389 days ago
	Providing a large list of bitrotted URLs and book titles, which users would have to OCR themselves before attempting to reproduce the model, doesn't seem very useful.


echoangle 389 days ago

Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.


Wowfunhappy 389 days ago

...no? They also use web crawlers.


bee_rider 388 days ago

The datasets are collected using web crawlers, but that doesn’t tell us anything about how they are stored and re-distributed, right?


Wowfunhappy 388 days ago

Why would you store the data after training?


bee_rider 388 days ago

Are you saying that you know they don’t store the data after training?

I’d just assume they did because—why scrape again if you want to train a new model? But if you know otherwise, I’m not tied to this idea.


Wowfunhappy 388 days ago

I'm also assuming. But I would ask the opposite question: why store all that data if you'll have to scrape again anyway?

You will have to scrape again because you want the next AI to get trained on updated data. And, even at the scale needed to train an LLM, storing all of the text on the entire known internet is a very non-trivial task!
