Hacker News
by dstroot 455 days ago
Isn’t big LLM training data actually the closest analogue to the Internet Archive? Shouldn’t the title be “Big LLM training data is a piece of history”? Especially at this point in history, since a large portion of internet data going forward will be LLM-generated rather than human-generated. It’s kind of the last snapshot of human-created content.
1 comment

The problem is, where are the 20T tokens that are being used for this task? There is no way to access them. I hope that at least OpenAI and a few others keep solid historical storage of the tokens they collect.