by filterfiber 916 days ago
> Previous State-of-the-Art: [...] The number of parameters in the LSTM layers of these models vary from 2 million to 151 million.

> We present model architectures in which a MoE with up to 137 billion parameters

Back in 2017 most models were well under 1B parameters; GPT-2 (2019) was one of the first "big" non-MoE models at 1.5B. People weren't sure how well, or how far, they would scale.

The CoralAI TPU had a mere 8 MB of SRAM in 2019!

GPT3 was 175B in 2020.

Now nearly all LLMs are at least 1B, and dense 70B models are common.
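
For a rough sense of what those parameter counts mean in memory, here is a back-of-the-envelope sketch. The 2 bytes per parameter (fp16/bf16 weights) is my assumption, not something stated above, and the helper names are made up for illustration. It only counts weights and ignores activations, optimizer state, and so on.

```python
# Parameter counts quoted in the comment above.
PARAMS = {
    "LSTM baseline (small)": 2e6,
    "LSTM baseline (large)": 151e6,
    "GPT-2": 1.5e9,
    "dense 70B": 70e9,
    "MoE (2017 paper)": 137e9,
    "GPT-3": 175e9,
}

BYTES_PER_PARAM = 2             # assumption: fp16/bf16 weights
EDGE_SRAM_BYTES = 8 * 1024**2   # the 8 MB SRAM figure from the comment


def weight_bytes(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Bytes needed to store the weights alone (hypothetical helper)."""
    return int(n_params * bytes_per_param)


def human(n_bytes):
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1024 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024


for name, n in PARAMS.items():
    size = weight_bytes(n)
    ratio = size / EDGE_SRAM_BYTES
    print(f"{name:24s} {human(size):>10s}  {ratio:10.1f}x the 8 MB SRAM")
```

Under that assumption, only the 2M-parameter LSTM (about 4 MB of weights) fits in 8 MB, and the 137B MoE is roughly 900x the largest LSTM baseline the paper quotes.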

1 comment

It's also a good reminder to revisit a lot of ideas and to contextualize many works appropriately. We've seen that __in general__, independent of architecture, model output quality increases as parameter count and data scale, under the assumption that data quality is sufficiently good and does not degrade with scale (quality of data is exceptionally important).

I find this is a common misinterpretation of a lot of papers and work in the research community, especially by, but far from exclusive to, practitioners. There's a trend where big companies simply out-compute other models/methods, and those results get read as evidence that the architecture is better. But if your model is only better because you did more hyperparameter tuning than the other work, is your model actually better? We've seen extremely strong evidence that even the research community buys into hype: CNNs, when trained with techniques similar to those used for ViTs and at similar parameter counts, perform just as well as transformer-based models.

We likely leave a lot of potentially valuable models and architectures to rot because we don't properly contextualize our reading of works. I'd love to see universities without big tech partners explore new generative models, but it's hard for them to pass review when rejection is as simple as "performs worse than a model 1000x its size that uses massive pretraining and cost $2m to train," "not enough datasets to be convincing" (different from "needs x,y,z datasets to properly explore x',y',z' domains"), or "but does it scale?" These are all "pay to play" responses, and if anything, I think we've seen that the big boom in ML has come from letting people "fuck around and find out." My main concern is that we're pushing harder towards playing around with pretrained models (which are often proprietary) rather than doing that while also exploring new techniques. Context is everything, and there are big differences between a paper attempting to be SOTA and a work trying to explore different ideas (see NASA's Technology Readiness Levels, TRL). It's very easy to get caught up in the hype and lose sight of that, because evaluation is an exceptionally difficult task (if it were easy, we'd let LLMs review, but please, for the love of god, no).