by kristianp
1051 days ago
It would be nice if the relevant expert(s) could be loaded on demand after reading the prompt. If the MoE is, say, 8x8B params, then you could get good speed out of a 12GB GPU, despite the model being 64B params in size. Or am I misunderstanding how this all works?
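
For a rough sense of the numbers, here is a back-of-the-envelope sketch of the weight memory involved. It assumes the naive reading of "8x8B" (8 independent experts of 8B params each, no shared weights, which may not hold for real MoE models), counts weights only (no activations or KV cache), and treats 1 GB as 1e9 bytes. The function and constant names are made up for illustration.

```python
# Hypothetical helper: approximate weight memory in GB (1 GB = 1e9 bytes).
# Counts weights only; activations and KV cache are ignored.
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8

# Assumptions taken from the comment: 8 experts x 8B params, 12GB GPU.
NUM_EXPERTS = 8
PARAMS_PER_EXPERT_B = 8.0
GPU_GB = 12.0

def footprint(bits_per_param: int) -> tuple[float, float]:
    """Return (one expert, all experts) weight memory in GB."""
    one = weights_gb(PARAMS_PER_EXPERT_B, bits_per_param)
    full = weights_gb(NUM_EXPERTS * PARAMS_PER_EXPERT_B, bits_per_param)
    return one, full

if __name__ == "__main__":
    for bits in (16, 8, 4):
        one, full = footprint(bits)
        print(
            f"{bits:>2}-bit: one expert {one:.0f} GB "
            f"(fits in {GPU_GB:.0f} GB: {one <= GPU_GB}), "
            f"full model {full:.0f} GB (fits: {full <= GPU_GB})"
        )
```

Under these assumptions a single 8B expert only fits in 12GB at 8-bit precision or lower, and the full 64B never fits, so the idea depends on quantization plus swapping experts in and out of GPU memory.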