by janalsncm
7 days ago
It’s a data point. I could imagine that in a hardware-constrained setting we might not care about training on enormous token counts, and on smaller devices it’s great if we can simplify the architecture. I agree that this isn’t proof that it scales to trillions of tokens, but it does show that a scaled-up experiment would be worth a shot.
I do agree that it is a data point, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.