by stilist 1147 days ago
I have zero technical understanding of the math or statistics, but looking at the graphs it seems suspicious that supposed jumps happen across unrelated tasks and models at the same scales--for example, in figure 1, the discontinuities are consistently in the 10^22 to 10^24 range. Obviously I'm just going by what the authors have chosen to include, but I'd expect more variation. At best I'd assume it's something about LLMs in general.
3 comments

The number of data points is tiny. There are only a handful of LLMs trained from scratch in the world, and the sizes of models released in a given "generation" tend to be fairly close to each other. The field is very open source, so people all over are building on the same shared literature. Plus, I'm sure leaks happen often, and companies then rush to train their own pet architecture at whatever parameter count the competition is about to release.
I think that's just because there are only 2-3 points between 10^22 and 10^24, which is more about the data available (and that they have just seen dramatic improvements) than the measures or models themselves.
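To make the sparse-sampling point concrete, here is a rough sketch. It assumes a made-up logistic curve in log10(compute) (the `midpoint` and `steepness` values are invented for illustration and not fitted to anything in the paper) and compares the biggest step between adjacent points when you sample it at 3 points versus 41.

```python
import math

def smooth_score(log10_flops, midpoint=23.0, steepness=2.0):
    """Hypothetical smooth logistic curve in log10(compute); not real data."""
    return 1.0 / (1.0 + math.exp(-steepness * (log10_flops - midpoint)))

def largest_jump(log10_points):
    """Largest score difference between adjacent sampled points."""
    scores = [smooth_score(x) for x in log10_points]
    return max(b - a for a, b in zip(scores, scores[1:]))

# Only three "models" between 10^20 and 10^24 (assumed spacing).
sparse = [20.0, 22.0, 24.0]
# Same range sampled every 0.1 in log10(compute).
dense = [20.0 + 0.1 * i for i in range(41)]

print("largest jump, sparse:", largest_jump(sparse))
print("largest jump, dense: ", largest_jump(dense))
```

Same underlying curve in both cases; with only a few points the step between neighbors is large enough to look like a discontinuity, while the dense sampling shows a gradual rise.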
Could that have something to do with the things I keep reading about how knowledge from, say, an LLM for generative text somehow carries over to an LLM for image generation? I'm obviously not very knowledgeable in this area :).