Hacker News
by pk-protect-ai 916 days ago
> is Mamba potentially more compute-efficient to train for a similar size of model/dataset?

I would like to understand this as well ...

Here is the relevant quote from the original paper:

"Computation. After the parameters have been transformed from (∆, A, B, C) ↦ (Ā, B̄, C), the model can be computed in two ways, either as a linear recurrence (2) or a global convolution (3). Commonly, the model uses the convolutional mode (3) for efficient parallelizable training (where the whole input sequence is seen ahead of time), and switched into recurrent mode (2) for efficient autoregressive inference (where the inputs are seen one timestep at a time)."

So training is parallelizable, as in RetNet's parallel forward mode. By default, inference runs in recurrent mode to allow the longest possible context. No chunking is available, so it is difficult for me to say how much RAM and VRAM it will consume during inference ...
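
To make the two modes in the quote concrete, here is a minimal pure-Python sketch of a toy time-invariant SSM with a diagonal state matrix. The values of A_bar, B_bar and C and the function names are made up for illustration and are not taken from the paper or from any Mamba implementation. It only covers the fixed-parameter case the quoted passage describes; as far as I understand, Mamba's selective layer makes the parameters input-dependent, so the paper trains it with a parallel scan rather than a convolution.

```python
# Toy time-invariant SSM with a diagonal state matrix.
# All parameter values below are illustrative assumptions.
A_bar = [0.9, 0.5, -0.3]   # discretized diagonal of A
B_bar = [1.0, 0.5, 0.2]    # discretized B
C = [0.3, -0.7, 1.1]


def ssm_recurrent(xs):
    """Recurrent mode: one input per step; only the fixed-size state h is kept."""
    h = [0.0] * len(A_bar)
    ys = []
    for x in xs:
        # h_t = A_bar * h_{t-1} + B_bar * x_t (elementwise, since A is diagonal)
        h = [a * h_n + b * x for a, b, h_n in zip(A_bar, B_bar, h)]
        # y_t = C . h_t
        ys.append(sum(c * h_n for c, h_n in zip(C, h)))
    return ys


def ssm_kernel(length):
    """Convolution kernel K[k] = sum_n C[n] * A_bar[n]**k * B_bar[n]."""
    return [
        sum(c * a ** k * b for a, b, c in zip(A_bar, B_bar, C))
        for k in range(length)
    ]


def ssm_convolutional(xs):
    """Convolutional mode: needs the whole sequence up front, but each
    output y[t] is computed independently, so the work parallelizes."""
    K = ssm_kernel(len(xs))
    return [
        sum(K[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
y_rec = ssm_recurrent(xs)
y_conv = ssm_convolutional(xs)
# Both modes should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_rec, y_conv))
```

The recurrent loop holds only h between steps, so its working memory does not grow with sequence length, while the convolutional form needs the full input and a kernel as long as the sequence. That is the trade-off behind training in one mode and running inference in the other.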

1 comment

I did some minimal testing: during inference, Mamba uses about 60% of the VRAM that RetNet (parallel forward mode) uses, with a model of the same size and a vocabulary of the same size.