by behnamoh 843 days ago
Can the low adoption of Mamba be attributed to what is being discussed today on HN (https://news.ycombinator.com/item?id=39491863)?

Basically, Nvidia et al. don't want AI research to move in a direction that requires less GPU compute, less training data, and less inference compute.

Someone on HN (I don't remember the name) mentioned that the idea of deep learning is backed by big tech because it benefits them the most, as they are the only players in town with huge amounts of data. If the AI community were to find entirely different approaches to AGI (maybe not even learning), who do you think would suffer the most from the implications?

9 comments

This doesn’t make sense - there are literally thousands of academic AI research labs who are severely limited by compute resources. If anything could work better than transformers and require less compute they would be all over that.
I guess the argument is that most AI research is supported by the big tech, and they have heavily invested in the deep learning approach.

If the funding were funneled to research groups working on alternative approaches, maybe we'd see the same amount of progress in AI, only using another approach.

As a member of the research community: that's nonsense. Like already pointed out: academic groups (who by no means are dependent on big tech) would jump all over that. Mamba has been out long enough that you'd already see tons of papers at arxiv showing mamba dominating transformers in all sorts of applications. But that's not happening, despite the ton of hype. That doesn't mean that mamba is nonsense. Just that it isn't the immediate transformer killer. It remains to be seen if something comes from it, eventually.
As a member of the research community: that's nonsense. Publishing in ML is an extremely noisy process and is getting increasingly difficult for smaller labs that don't collaborate with big tech. Reviewers' go-to objections are: more datasets, more scale, not novel. The easiest way to approach this is to work off of pretrained models. This is probably more obvious in the NLP world.

I agree that Mamba doesn't solve everything and it still needs work. But I disagree with the logic that there isn't an issue of railroading.

What’s the main difference between an ape’s brain and a human brain? Scale. So that’s the train we’re riding at the moment. No roadblocks yet, aside from cost.
> What’s the main difference between an ape’s brain and a human brain? Scale.

This is incredibly naive and has absolutely no scientific basis. There is no evidence that the difference is in scale of data or scale of architecture.

There are a number of animals with larger brains in terms of both mass and total number of neurons. An African elephant has roughly 3x the number of neurons humans have. Dolphins beat humans in total surface area. Neanderthals are estimated to have had larger brains too! It isn't mass, neurons, neuron density, or surface area. We aren't just scaled-up chimps.

> Just that it isn't the immediate transformer killer.

What is the best/stable-ish linear alternative to transformers right now? Especially for text generation and summarization.

We have domain-specific ways of oversampling and search, so we much prefer less expensive models.

As someone who's worked at several NVIDIA competitors, including Groq, I can guarantee you that, based on my knowledge of existing products, they would be able to make much more money if models had a lower memory footprint. Given the amount of VC money deployed for this (on the order of hundreds of millions), I don't believe this is a reasonable take.

Sure, NVIDIA et al may not want that (although, again I don't see why... they too can't produce chips fast enough so being able to provide models for customers now ought to be good), but there's so much money out there that does...

Why would Meta, Microsoft, Amazon and Google want Nvidia to remain dominant in hardware? Are you treating “big tech” like they all have one hive mind?
For MSFT, AMZN, GOOG, the competitive advantage comes from having huge datasets (that Nvidia doesn't have). It's a symbiosis that benefits the data-rich and GPU-rich players.
This still makes little sense, as scale will always matter. If you can drop the compute cost of a model by 10x, it means you can increase model integrity/intelligence/speed etc. beyond what your compute-bound competitors have.

Simply put, for the time being huge datasets are going to be needed and those with bigger (cleaner?) datasets will have a better behaving model.

Where is the symbiosis? If data is the differentiator, how do the data owners benefit from Nvidia eating into their margins?
It's symbiosis based on a common factor they both appreciate:

- the data/processing having to be large means the data-owners have a benefit

- the data/processing having to be large means NVIDIA also has a benefit (sells more GPUs to handle all that load)

The fact that 'removing the “quadratic bottleneck”' involves either reduced expressibility compared to self-attention or disproving SETH is another reason.

The quadratic bottleneck is due to the lower bounds of exhaustive search.
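To make the scaling difference concrete, here is a rough sketch in plain Python. It is not Mamba and not a real transformer layer: `self_attention` is a toy single-head attention over scalar tokens (assuming q = k = v = x), and `linear_recurrence` is a toy fixed-size-state scan with made-up constants `a` and `b`. It only shows that the attention sketch computes one score per (query, key) pair, while the recurrence does one state update per token.

```python
import math


def self_attention(xs):
    """Toy single-head self-attention over scalar tokens (q = k = v = x).

    Returns the outputs and the number of pairwise scores computed.
    """
    outputs, scores_computed = [], 0
    for q in xs:
        scores = [q * k for k in xs]  # one score per (query, key) pair
        scores_computed += len(scores)
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return outputs, scores_computed


def linear_recurrence(xs, a=0.9, b=0.1):
    """Toy scan with a fixed-size state: h_t = a * h_{t-1} + b * x_t.

    a and b are illustrative constants, not parameters of any real model.
    """
    h, outputs, updates = 0.0, [], 0
    for x in xs:
        h = a * h + b * x  # one state update per token
        updates += 1
        outputs.append(h)
    return outputs, updates


for n in (8, 16, 32):
    xs = [i / n for i in range(n)]
    _, pairs = self_attention(xs)
    _, steps = linear_recurrence(xs)
    print(n, pairs, steps)  # pairs grows as n * n, steps grows as n
```

The trade-off being argued about is visible in the sketch: the recurrence has to squeeze everything it has seen into `h`, while attention can look back at every earlier token directly.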

The papers on this only ever seem to reference perplexity.

The fact it can append a word to "I'm going to the beach" that sounds good doesn't mean it is useful.

There is no free lunch, and this project hasn't shown that the costs are acceptable.

"I'm going to the beach" + house

Doesn't help if what you needed was

"I'm going to the beach" + tomorrow

I do hope that there is more information on the costs, or that they have disproven SETH soon.
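For reference, perplexity is just the exponential of the average negative log-probability a model assigns to the tokens that actually occurred. A minimal sketch, assuming you already have those per-token probabilities (the `token_probs` values below are made up for illustration, not taken from any paper or model):

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical probabilities a model assigned to each observed token.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_guess))  # as uncertain as a uniform choice among 4 tokens
```

The metric only scores how much probability lands on the reference continuation. It says nothing about whether a fluent alternative like "house" was the right answer for the task, which is the concern with the beach example above.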

What's SETH in this context? I googled to no avail.
Strong Exponential Time Hypothesis

Here is how it relates to attention.

https://arxiv.org/abs/2209.04881

It is a really recent development. Even if this architecture is technically superior, it could take time before a model using it becomes competitive.

Or maybe it does not pan out at all. We are still at the stage where people are throwing everything at the wall to see what sticks. Some promising ideas which work at small scale do not work at bigger.

This. Hyperparameter tuning and training include a lot of model specific black magic. Transformers have had time to mature, it'll take a while for other stuff to catch up even if they have a higher potential ceiling.
Definitely agree that a lot of work going into hyperparameter tuning and maturing the ecosystem will be key here!

I'm seeing the Mamba paper as the `Attention Is All You Need` of Mamba - it might take a little while before we get everything optimised to the point of a GPT-4 (it took 6 years for transformers but should be faster than that now with all the attention on ML)

Another interesting one is that the hardware isn't really optimised for Mamba yet either - ideally we'd want more fast SRAM so that we can store larger hidden states efficiently
Yes and no.

The thing is that Mamba is not perfect. There's no neural architecture to rule them all, if you will. I think the bigger issue is that we act like there is and get on bandwagons. Let me give a clearer example from the past so we can see. The predecessor to DDPM (the work that kicked off the diffusion model era) was published in 2015[0], only a year after GANs[1]. Diffusion showed promise then, so why did DDPM only come out in 2020[2]? Because everyone was working on GANs. GANs produced far better images and diffusion was (and still is) a lot more computationally intensive. Plus, all the people working on these diffusion models were in the same camp as those working on Normalizing Flows and other density-based models, and fewer people are interested in understanding density estimation.

So the entire problem is that the community hopped on a singular railroad for research direction. There was still work going on in that direction but it wasn't getting nearly the attention that GANs got. It's hard to know if things were blocked from publication because they weren't as good as GANs. I can say from personal experience I had a Flow publication blocked because reviewers were concerned with its quality compared to GANs (this was 2019/2020; this paper will never be published because now it is hard to publish even a GAN work).

So yes and no, because there is certainly railroading happening but there are also real critiques of Mamba. But what people often forget is that it is incredibly naive to compare new methods to existing methods on a direct one-to-one basis. You're comparing something that has had hundreds to thousands of hours from a handful to a few dozen eyes against works with millions of hours and millions of eyes. Evaluation is just a really fucking hard thing to do, but it is easy to just look at some benchmarks, even if they don't mean much. This is a fairly general notion though, so take the lesson to heart. But Mamba seems a bit different from our diffusion/GAN story, in that it is getting more attention than diffusion did in 2016-2019.

[0] https://arxiv.org/abs/1503.03585

[1] https://arxiv.org/abs/1406.2661

[2] https://arxiv.org/abs/2006.11239

I don't really think big tech has that much control. They are aiming for the most profitable solution. Things like an AI monopoly built on huge self-supervised learning only popped up recently, when ChatGPT performed really well; a couple of years ago, people still believed modular and supervised learning was the key to AI applications. So, simply put, scaling deep learning/LLMs is currently the most promising approach, and it works while traditional methods don't. If there were something that worked as well as the current solution and required fewer resources, they would go for it very fast; see the implementation of Flash attention as an example.
Low adoption is primarily because it is relatively recent and there are no public Mamba-based models of 7B or larger to compare in earnest with widely used transformer-based LLMs.
It’s less compute for the same model sizes. Rest assured that there will still be a race to scale model sizes (and data) to achieve better performance.
This early in the game? No. If LLMs become vastly cheaper and faster then adoption (and model size) will increase in line with that.
No, these things just take time.

There is no conspiracy against efficient training. Companies aren’t going to lower compute budgets with more efficiency.

All the top labs are increasing efficiency, but they are using that to get more out of their large runs, not to spend less. Most companies have a relatively fixed training budget for their large runs and are trying to get the most out of it, not save money.

Mamba is actually being scaled up and tested across other fields (bio) at a rapid pace compared to other architectures.

> There is no conspiracy

Fwiw, the OP isn't suggesting conspiracy. The notion is more about convergent thinking.