Hacker News
by Der_Einzige 844 days ago
First it was Longformer and linear attention models. Then it was RWKV, and now it's Mamba. So many bombastic claims of improved architectural performance, and no open source models that beat the thing they purport to beat. The proof is always in the pudding, and these models will remain a curiosity for most until their weights are being benchmarked favorably on LLM leaderboards.
4 comments

Yes, that's technically accurate. But I prefer to think of the entire LLM space as a new scientific field that started when OpenAI released ChatGPT.

In that context, all new research directions are valuable simply for the fact that they're expanding the foundation of the field. 5 years from now, who knows what the most effective models will use under the hood, but the more we can learn about them in general, the better.

lol I think in general, LLM research traces its origins back to all the standard deep learning techniques: NNs, CNNs, LSTMs, RNNs, etc.

In 2018, the release of transformers (via Google) enabled much more rapid training of models and better generalization with less data. 100% of the LLMs (as you’d probably think of them) trace their origins to BERT.

That said, my team was working with hundred million to low billions of parameter LSTMs & CNNs back in 2016-2017 that were comparable to some lighter weight LLMs today.

In my opinion, the greatest strides in the space have less to do with the underlying architecture and more to do with improved data formatting, accessibility, and compute improvements.

The field of research here is far older than ChatGPT's release. Neural network research has been going on for at least 50 years.

Most of the research that enabled ChatGPT was also already known. "Attention is all you need" was a 2017 paper.

It still is a fast evolving field, but not one that just kicked off.

True, but bear in mind the Mamba preprint is less than three months old. A lot of people are probably experimenting with these ideas right now and training a completely new, large foundation model with a different architecture will take a significant amount of time.
GPT3-175B cost $30 million in compute, plus millions in design, preprocessing, and operations. Only with that investment was it able to perform as much better than prior architectures as it does today. You might want to include that in your challenge for competing models.

Let’s rephrase it. If their architecture is superior, and they have $30 million, and similar preparation for training, and similar operational teams during training, then we can see if they can beat the model they’re comparing themselves to. Except the alternatives don’t have tens of millions of dollars with the best support teams. So the proof you seek hasn’t had a chance to happen due to a severe lack of resources.

Hence the comparisons to GPT2 and small versions of GPT3. Even that might not be fair given the money and teams behind even small GPT3s. Execution of the project is as critical for success as the model architecture.

Most (all?) open-ish 7B+ models today are finetunes of proprietary/semi-closed/big-budget LLMs. There is no such foundation model for Mamba yet.