by cubefox
1126 days ago
Increasing either the number of parameters or the number of training tokens improves results (more precisely: lowers training loss), and both cost compute. To get the most loss reduction per unit of training compute, model size and training tokens should be scaled up in equal proportion. That's the Chinchilla scaling law. (Though low loss is not always the same as good results; data quality also matters.) Further reading: https://dynomight.net/scaling/
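To make the "scale both equally" point concrete, here is a rough sketch of how a compute budget might be split. It rests on two common rules of thumb that are not stated in the comment itself, so treat them as assumptions: training compute is approximately 6 × parameters × tokens FLOPs, and the compute-optimal ratio is roughly 20 tokens per parameter. The function name `compute_optimal` is made up for illustration.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a training compute budget into (parameters, tokens).

    Assumptions (rules of thumb, not exact):
      - training FLOPs C ~= 6 * N * D  (N = parameters, D = tokens)
      - compute-optimal ratio D ~= tokens_per_param * N
    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Each step is 100x more compute, so parameters and tokens each grow 10x.
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal(budget)
        print(f"C={budget:.0e} FLOPs -> N={n:.3g} params, D={d:.3g} tokens")
```

Under these assumptions both quantities grow with the square root of compute, so 100x more compute means about 10x more parameters and 10x more tokens. As a sanity check, a budget of about 5.88e23 FLOPs comes out to roughly 70B parameters and 1.4T tokens, which is the configuration Chinchilla itself was trained with.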