by valine
472 days ago
With a mixture-of-experts model, you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well, which reduces the size of the tensors you write to memory.
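To make that concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names are made up, the layer sizes (`d_model=4096`, 8 experts, top-2 routing, expert hidden size 2048, dense hidden size 16384) are hypothetical rather than taken from any particular model, and each feed-forward block is simplified to two weight matrices with no biases.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a simplified feed-forward block: up and down projections, no biases."""
    return 2 * d_model * d_ff


def moe_layer_weights(d_model: int, d_ff_expert: int, n_experts: int, top_k: int):
    """Return (weights stored, weights read per token) for one MoE feed-forward layer."""
    per_expert = ffn_params(d_model, d_ff_expert)
    router = d_model * n_experts             # small gating matrix, always read
    stored = n_experts * per_expert + router
    read = top_k * per_expert + router       # only the selected experts are touched
    return stored, read


# Hypothetical sizes, chosen only to illustrate the ratio.
D_MODEL, D_FF_EXPERT, N_EXPERTS, TOP_K = 4096, 2048, 8, 2
D_FF_DENSE = 16384  # assumed dense layer with a similar total weight count

stored, read = moe_layer_weights(D_MODEL, D_FF_EXPERT, N_EXPERTS, TOP_K)
dense = ffn_params(D_MODEL, D_FF_DENSE)

# Intermediate activation elements written per token inside the feed-forward block.
moe_hidden_per_token = TOP_K * D_FF_EXPERT
dense_hidden_per_token = D_FF_DENSE

print(f"MoE weights stored per layer:   {stored:,}")
print(f"MoE weights read per token:     {read:,}")
print(f"Dense weights read per token:   {dense:,}")
print(f"MoE hidden elements per token:  {moe_hidden_per_token:,}")
print(f"Dense hidden elements per token:{dense_hidden_per_token:,}")
```

Under these assumed sizes, a token reads about a quarter of the layer's stored weights, and the intermediate tensor it writes is about a quarter the size of the dense one. With larger batches, different tokens pick different experts, so the share of weights read per step rises toward the full set.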