|
|
|
|
|
by visarga
1037 days ago
|
|
The inspiration for weight decay was to reduce the capacity to memorize of the model until it perfectly fits the complexity of the task, not more not less. A model more complex than the task is over-fitting, the other one is under-fitting. Got to balance them out. But the best cure for over-fitting is to make the dataset larger and ensure data diversity. LLMs have datasets so large they usually train one epoch. |
|
Just by trying to make the dataset diverse you could skew things to not reflect reality. I just don't think enough attention has been paid to the data, and too much the model. But I could be very wrong.
There is a natural temporality to the data humans receive. You can't relive the same moment twice. That said, human intelligence is on a scale too and may be affected in the same way.