by crosen99
1129 days ago
I'm surprised not to see anything about data-to-parameter ratios for compute-optimal scaling. My superficial understanding of the Chinchilla paper is that it targets roughly 20 training tokens per parameter.

I'm also confused about this:

> ~$1 million: Cost to train a 13 billion parameter model on 1.4 trillion tokens

This apparently comes from the LLaMA paper, but that paper seems to cite 1.0T tokens (rather than 1.4T) for the 13B model.

Also, if 20 to 1 is in fact the optimal data-to-parameter ratio, then using a roughly 100 to 1 ratio doesn't seem like an appropriate way to arrive at a magic number for training costs. The magic number should really be based on an optimal configuration. Or perhaps my superficial understanding leads me to miss some important distinctions.
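To make the ratios concrete, here is a rough sketch of the arithmetic. It assumes the 20-tokens-per-parameter rule of thumb is the right reading of Chinchilla, and the helper and constant names are made up for illustration; it says nothing about dollar costs.

```python
# Rough sketch: tokens-per-parameter ratios for a 13B-parameter model.
# CHINCHILLA_RATIO = 20 is an assumed rule of thumb, not an exact constant.

def tokens_per_param(tokens: float, params: float) -> float:
    """Return the number of training tokens per model parameter."""
    return tokens / params

PARAMS_13B = 13e9        # 13 billion parameters
CHINCHILLA_RATIO = 20    # assumed compute-optimal tokens per parameter

# Ratio implied by the quoted "magic number" (1.4T tokens).
ratio_quoted = tokens_per_param(1.4e12, PARAMS_13B)

# Ratio implied by the 1.0T-token figure from the LLaMA paper.
ratio_paper = tokens_per_param(1.0e12, PARAMS_13B)

# Token budget a 20:1 ratio would suggest for the same model size.
optimal_tokens = CHINCHILLA_RATIO * PARAMS_13B

print(f"1.4T tokens / 13B params: {ratio_quoted:.1f} tokens per parameter")
print(f"1.0T tokens / 13B params: {ratio_paper:.1f} tokens per parameter")
print(f"20:1 budget for 13B params: {optimal_tokens / 1e9:.0f}B tokens")
```

Either token count puts the 13B model well above 20 to 1, so the question about which configuration the cost figure should be based on applies to both.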