| HN Mirror



	by dstroot 455 days ago
	Isn’t big LLM training data actually the closest analogue to the Internet Archive? Shouldn’t the title be “Big LLM training data is a piece of history”? Especially at this point in history, since a large portion of internet data going forward will be LLM-generated rather than human-generated? It’s kind of the last snapshot of human-created content.

1 comment

antirez 455 days ago

The problem is, where are the 20T tokens that are being used for this task? There is no way to access them. I hope that at least OpenAI and a few others have solid historical storage of the tokens they collect.
