by ynniv
538 days ago
It's overfitting when you train too large a model on too many details. Rote memorization isn't rewarding. The more concepts the model manages to grok, the more nonlinear its capabilities will be: we don't have a data problem, we have an educational one. Claude 3.5 was safety trained by Claude 3.0, and it's more coherent for it. https://www.anthropic.com/news/claudes-constitution
It’s why many image training pipelines include pre-processing steps that add copies of images with odd rotations, varying amounts of blur, and different crops.
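For anyone who hasn't seen this kind of augmentation, here is a rough sketch of the idea in plain NumPy. The function names (`box_blur`, `random_crop`, `augment`) are made up for illustration, the rotations are limited to multiples of 90 degrees to keep it short, and a mean filter stands in for whatever blur a real pipeline would use. Real pipelines typically use a library and arbitrary angles.

```python
import numpy as np

def box_blur(img, k):
    """Blur a 2-D image with a k x k mean filter (k odd), replicating edge pixels."""
    if k <= 1:
        return img.astype(float)
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    # Sum the k*k shifted views of the padded image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def random_crop(img, size, rng):
    """Cut a size x size window at a random position."""
    h, w = img.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]

def augment(img, n_copies, crop_size, seed=0):
    """Return n_copies perturbed versions of img (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        out = np.rot90(img, k=int(rng.integers(0, 4)))     # 0, 90, 180 or 270 degrees
        out = box_blur(out, k=int(rng.choice([1, 3, 5])))  # none, light or heavier blur
        out = random_crop(out, crop_size, rng)
        copies.append(out)
    return copies

# Toy 64x64 grayscale "image" so the sketch runs on its own.
image = np.arange(64 * 64, dtype=float).reshape(64, 64)
augmented = augment(image, n_copies=8, crop_size=48)
```

Each copy shows the same content from a slightly different view, so the model can't simply memorize exact pixel values.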
> The more concepts the model manages to grok, the more nonlinear its capabilities will be
These kinds of hand-wavy statements, like “practice,” “grok,” and “nonlinear its capabilities will be,” are not very constructive, as they don’t have a solid meaning wrt language models.
So earlier, when I was referring to compounding bias in synthetic data, I meant a bias that gets trained on over and over and over again.
That leads to overfitting.
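A toy way to see the compounding effect, with heavy caveats: this is not a language model. A Gaussian fit stands in for "the model", and each generation is trained only on samples drawn from the previous generation's fit. The function name and the parameter values are assumptions picked for illustration.

```python
import numpy as np

def simulate_self_training(n_samples=20, n_generations=200, seed=0):
    """Refit a Gaussian to its own samples and record the fitted std each generation."""
    rng = np.random.default_rng(seed)
    # Generation 0 is trained on "real" data from a standard normal.
    data = rng.normal(loc=0.0, scale=1.0, size=n_samples)
    stds = []
    for _ in range(n_generations):
        # "Training": maximum-likelihood fit. data.std() divides by n,
        # so it is biased slightly low on every single generation.
        mu, sigma = data.mean(), data.std()
        stds.append(sigma)
        # The next generation sees only synthetic data from this fit.
        data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    return stds

stds = simulate_self_training()
print(f"fitted std, generation 0: {stds[0]:.3f}")
print(f"fitted std, generation {len(stds) - 1}: {stds[-1]:.3g}")
```

The small per-generation bias never gets corrected by fresh data, so it compounds and the fitted distribution narrows toward a point.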