Hacker News
by nightsd01 292 days ago
I am not an expert in AI by any means, but I think I know enough about it to comment on one thing: there was an interesting paper not too long ago showing that if you train a randomly initialized model from scratch on questions, like a bank of physics questions & answers, the model ends up with much higher quality if you teach it the simple physics questions first and then move up to more complex ones. This suggests that, in some ways, these large language models really do learn like we do.
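
To make the "simple questions first" idea concrete, here is a minimal sketch of easy-to-hard ordering. It is not taken from the paper mentioned above. The `Example` class, its `difficulty` score, and the `curriculum_batches` helper are all hypothetical names made up for illustration, and how you would actually score difficulty is left open.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    question: str
    answer: str
    # Hypothetical difficulty score (assumption): could come from human
    # labels, answer length, or any other heuristic.
    difficulty: float


def curriculum_batches(examples: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield batches ordered from the easiest examples to the hardest."""
    ordered = sorted(examples, key=lambda ex: ex.difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Toy question bank; the difficulty values are invented for the example.
bank = [
    Example("Derive the period of a pendulum.", "T = 2*pi*sqrt(L/g)", 3.0),
    Example("What is the SI unit of force?", "newton", 1.0),
    Example("What is F if m = 2 kg and a = 3 m/s^2?", "6 N", 2.0),
]

batches = list(curriculum_batches(bank, batch_size=2))
for batch in batches:
    # A real training loop would run an optimizer step on each batch here.
    print([ex.question for ex in batch])
```

A typical training loop shuffles the data every epoch, so the only change here is swapping the shuffle for a sort on some difficulty measure.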

I think the next steps will be more along this vein of thinking. Treating all training data the same is a mistake. Some data is significantly more valuable for developing an intelligent model than most other training data, even after it passes quality filters. I think we need to revisit how we 'train' these models in the first place, and come up with a more intelligent/interactive system for doing so.

6 comments

From my personal experience training models, this is only true when the parameter count is a limiting factor. Once the model is past a certain size, curriculum learning doesn't really lead to much improvement. I believe most research also applies it only to small models (e.g. Phi).
Wow. I really like this take. I've seen how time and time again nature follows the Pareto principle. It makes sense that training data would follow this principle as well.

Further, the idea that the order of training matters is new to me, and it seems so obvious in hindsight.

Maybe both of these points are common knowledge/practice among current leading LLM builders. I don't build LLMs, I build on and with them, so I don't know.

This is precisely why chain of thought worked. Written thoughts in plain English are a much higher SNR encoding of the human brain's inner workings than random pages scraped from Amazon. We just want the model to recover the brain, not Amazon's frontend web framework.
A relevant paper: https://arxiv.org/abs/2306.11644 -- the Phi models (and many others too) are based on this idea.
I have never heard of the order of training data mattering in backpropagation.