by WhitneyLand
723 days ago
On one hand, the paper suggests Gemma is on the same Pareto curve as Llama3; on the other, it seems to suggest Gemma has exceeded Llama3's efficiency. Is this a contradiction, or am I misunderstanding something? Btw, overall very impressive work, great job.

However, I wouldn't draw conclusions about different model families, like Llama and Gemma, based on their token count alone. Many other variables are at play (the quality of those tokens, number of epochs, model architecture, hyperparameters, distillation, etc.), and all of them influence training efficiency.