The point about training data stands. We usually think only about scaling compute, but data has to scale as well, maybe even faster than compute. The problem is that we have already exhausted the supply of high-quality organic text, and it doesn't grow exponentially.

I think the best source of data at the moment is chat logs, with 1B users and over 1T tokens per day across all LLMs. These logs sit at the intersection of human interests and LLM execution errors, and they are on-policy for the model, which is exactly what it needs to improve in the next iteration.