Everyone's obsessed with new training tokens... It doesn't need to be more knowledgeable, it just needs to practice more. Ask any student: practice is synthetic data.
Overfitting can have many causes. An overabundance of one kind of data in a training set is one of them.
It’s why many preprocessing steps in image training pipelines add copies of each image with random rotations, varying amounts of blur, and different crops.
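To make the augmentation idea concrete, here is a minimal sketch in plain Python. It treats an image as a nested list of pixel values; the helper names (`rotate90`, `crop`, `box_blur`, `augment`) and the specific choices (quarter-turn rotations, a 3x3 box blur, a one-pixel crop shift) are my own assumptions for illustration, not what any particular pipeline does.

```python
import random

def rotate90(img):
    """Rotate a 2D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def crop(img, top, left, height, width):
    """Return the height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def box_blur(img):
    """Replace each pixel with the mean of its 3x3 neighborhood (edges use fewer neighbors)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def augment(img, rng):
    """Produce one randomly perturbed copy of img."""
    out = img
    for _ in range(rng.randrange(4)):      # 0 to 3 quarter turns
        out = rotate90(out)
    if rng.random() < 0.5:                 # blur roughly half of the copies
        out = box_blur(out)
    h, w = len(out), len(out[0])
    top, left = rng.randrange(2), rng.randrange(2)  # shift the crop window by 0 or 1 px
    return crop(out, top, left, h - 1, w - 1)

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
rng = random.Random(0)
copies = [augment(image, rng) for _ in range(5)]
```

Each entry in `copies` is a slightly different 3x3 view of the same source image, so the model sees the subject more often without seeing the identical pixels every time.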
> The more concepts the model manages to grok, the more nonlinear its capabilities will be
These kinds of hand-wavy phrases, like “practice,” “grok,” and “nonlinear its capabilities will be,” are not very constructive, as they don’t have a solid meaning with respect to language models.
So earlier, when I referred to compounding bias in synthetic data, I meant a bias that gets trained on over and over again.
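A toy illustration of what I mean by compounding, with loud caveats: this is a two-class model I made up, and the `sharpen` exponent (how much each generation over-produces whatever it saw most) is an assumed knob, not a measured property of any real language model.

```python
def next_generation(p, sharpen=1.2):
    """Share of class A after retraining on the previous model's own samples.

    Assumption: the model slightly over-produces whichever class it saw
    most often (sharpen > 1). sharpen == 1 means no bias at all.
    """
    a, b = p ** sharpen, (1 - p) ** sharpen
    return a / (a + b)

def run(p0, generations, sharpen=1.2):
    """Track the share of class A across repeated self-training rounds."""
    history = [p0]
    for _ in range(generations):
        history.append(next_generation(history[-1], sharpen))
    return history

# Start with a mild 55/45 imbalance and retrain on our own output 20 times.
history = run(0.55, 20)
```

In this sketch a small initial skew grows every round until the minority class has almost vanished, while a perfectly balanced start stays balanced. The point is only that a bias fed back into training does not stay the same size.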
So, here's my hypothesis, as someone who is adjacent to ML but hasn't trained DNNs directly:
We don't understand how they work, because we didn't build them. They built themselves.
At face value this can be seen as an almost spiritual position, but I am not a religious person and I don't think there's any magic involved. Unlike traditional models, the behavior of DNNs is based on random changes that failed up. We can reason about their structure, but only loosely about their functionality. When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers. Given this, there will not be a direct correlation between inputs and capabilities, but some arrangements do work better than others.
If this is the case, higher-order capabilities should continue to increase with training cycles, as long as those cycles are performed in ways that don't interfere with what has been successfully learned. People lamented the loss of capability that GPT-4 suffered as its safety measures were increased. I think Anthropic has avoided this by choosing a less damaging way to tune a well-performing model.
> We don't understand how they work, because we didn't build them. They built themselves.
We do understand how they work; we did build them.
The mathematical foundations of these models are sound. The statistics behind them are well understood.
What we don’t know exactly is which parameters correspond to which results, since that differs across models.
We work backwards to see which parts of the network seem to relate to which outcomes.
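One simple way to "work backwards" is ablation: switch off one unit and see how much the output moves. The sketch below uses a tiny hand-wired network; the weights, the probe inputs, and the function names are all hypothetical and chosen so the result is easy to check by hand. Real interpretability work is far more involved than this.

```python
def relu(x):
    return x if x > 0 else 0.0

# Hypothetical hand-picked weights: 2 inputs, 3 hidden units, 1 output.
W_HIDDEN = [[1.0, 0.0],    # unit 0 looks only at input 0
            [0.0, 1.0],    # unit 1 looks only at input 1
            [0.5, 0.5]]    # unit 2 mixes both inputs
W_OUT = [2.0, 0.0, 1.0]    # unit 1 reaches the output with weight 0

def forward(x, ablated=None):
    """Run the network, optionally zeroing one hidden unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]
    if ablated is not None:
        hidden[ablated] = 0.0
    return sum(w * h for w, h in zip(W_OUT, hidden))

def ablation_effect(unit, probes):
    """Mean absolute change in output when `unit` is zeroed."""
    return sum(abs(forward(x) - forward(x, ablated=unit)) for x in probes) / len(probes)

probes = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
effects = [ablation_effect(u, probes) for u in range(3)]
```

Here `effects` ranks unit 0 as most influential, unit 2 next, and unit 1 as having no effect on these probes. We know the math of every step, yet we still had to probe the network to learn which unit mattered.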
> When they get better at drawing, it isn't because we taught them to draw. When they get better at reasoning, it isn't because the engineers were better philosophers.
Isn’t this the exact opposite of reality?
They get better at drawing because we improve their datasets, topologies, and training methods and, in doing so, teach them to draw.
They get better at reasoning because the engineers and data scientists building training sets do get better at philosophy.
They study what reasoning is and apply those learnings to the datasets and training methods.
And who will tell the model whether its practice results are correct or not? Students practice against external evaluators; it’s not a self-contained system.
Synthetic data is fine if you can ground the model somehow. That's why the improvements in o1/o3 are mostly in reasoning, maths, etc.: in those domains you can easily tell whether the data is wrong or not.
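A rough sketch of what grounding can look like when answers are checkable. Everything here is a stand-in: `noisy_model` is a fake generator that is sometimes off by one, and `verify` is an exact arithmetic check. I am not claiming this is how o1/o3 were trained, only that a cheap external verifier lets you throw away bad synthetic samples.

```python
import random

def noisy_model(a, b, rng, error_rate=0.3):
    """Stand-in for a model's answer to a + b: sometimes off by one."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer

def verify(a, b, answer):
    """External check: arithmetic can be verified exactly."""
    return a + b == answer

def build_dataset(n, seed=0):
    """Generate n synthetic samples and keep only the verified ones."""
    rng = random.Random(seed)
    kept, rejected = [], 0
    for _ in range(n):
        a, b = rng.randrange(100), rng.randrange(100)
        answer = noisy_model(a, b, rng)
        if verify(a, b, answer):
            kept.append((f"{a} + {b}", answer))
        else:
            rejected += 1
    return kept, rejected

kept, rejected = build_dataset(1000)
```

Every pair in `kept` is correct by construction, because the verifier, not the generator, decides what enters the training set. For open-ended tasks where no such checker exists, this filter is exactly the piece that is missing.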