by pk-protect-ai
916 days ago
> is Mamba potentially more compute-efficient to train for a similar size of model/dataset?

I would like to understand that as well ... Here is the relevant passage from the original paper:

"Computation. After the parameters have been transformed from (∆, A, B, C) ↦ (Ā, B̄, C), the model can be computed in two ways, either as a linear recurrence (2) or a global convolution (3). Commonly, the model uses the convolutional mode (3) for efficient parallelizable training (where the whole input sequence is seen ahead of time), and switched into recurrent mode (2) for efficient autoregressive inference (where the inputs are seen one timestep at a time)."

So training is parallelizable, as in RetNet with its parallel forward mode.
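To make the two modes in that passage concrete, here is a toy sketch in plain Python. It is only an illustration under my own assumptions, not the paper's implementation: A is diagonal, the input is a single scalar channel, the parameters are fixed over time, and the function names (`discretize`, `run_recurrent`, `run_convolution`) are made up for this example. It discretizes with a zero-order hold, then computes the same outputs once step by step and once as a causal convolution with the kernel C·Ā^k·B̄.

```python
import math

def discretize(delta, A, B):
    """Zero-order hold for a diagonal A (given as a list of its diagonal entries)."""
    A_bar = [math.exp(delta * a) for a in A]
    # For diagonal A: B_bar = (exp(delta*A) - 1) / A * B, element-wise.
    B_bar = [(ab - 1.0) / a * b for ab, a, b in zip(A_bar, A, B)]
    return A_bar, B_bar

def run_recurrent(A_bar, B_bar, C, xs):
    """Recurrent mode: h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t."""
    h = [0.0] * len(A_bar)
    ys = []
    for x in xs:  # one timestep at a time, constant-size state
        h = [ab * hi + bb * x for ab, bb, hi in zip(A_bar, B_bar, h)]
        ys.append(sum(c * hi for c, hi in zip(C, h)))
    return ys

def run_convolution(A_bar, B_bar, C, xs):
    """Convolutional mode: y = K * x with kernel K_k = C . (A_bar^k * B_bar)."""
    L = len(xs)
    kernel = [
        sum(c * (ab ** k) * bb for c, ab, bb in zip(C, A_bar, B_bar))
        for k in range(L)
    ]
    # Causal convolution; every y_t depends only on the inputs, not on y_{t-1},
    # so all positions could be computed in parallel.
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(L)]

# Hypothetical toy parameters, chosen only for illustration.
A = [-1.0, -0.5]
B = [1.0, 2.0]
C = [0.5, -1.0]
delta = 0.1
xs = [1.0, 0.0, -2.0, 3.0, 0.5]

A_bar, B_bar = discretize(delta, A, B)
ys_rec = run_recurrent(A_bar, B_bar, C, xs)
ys_conv = run_convolution(A_bar, B_bar, C, xs)
max_diff = max(abs(a - b) for a, b in zip(ys_rec, ys_conv))
```

Both functions should agree up to floating-point error (`max_diff` close to zero). The convolution needs the whole input sequence up front, which fits training; the recurrence only keeps the state `h` between steps, which fits token-by-token inference.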
By default, inference is done in recurrent mode, to get the longest possible context. No chunking is available, so it is difficult for me to say how much RAM and VRAM it will consume during inference ...