by lostmsu
5 days ago
Before batching. The slowdown is because the model does not fit in xRAM, so experts have to be read from SSD on every forward pass. That's why it is impractically slow. Batching could allow you to generate 10 tokens for 10 different conversations at a time, but it also means that you need to load different experts for different tokens, so it does not help as much as it does for dense models.

On top of that (as previously pointed out by zoz), for a single user running a single overarching task, the choice of experts is expected to be highly biased.
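
To put rough numbers on both points, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 128 experts per layer with 8 active per token, the 1/rank routing skew, and the function names are made up, and tokens are treated as routing independently. It estimates how many distinct experts a single layer has to pull in for a batch of tokens, first under uniform routing and then under biased routing.

```python
import random

def expected_distinct_uniform(num_experts, top_k, batch):
    # Under uniform routing, a given expert is skipped by one token with
    # probability 1 - top_k/num_experts, and by the whole batch with that
    # probability raised to the batch size.
    p_skip = 1.0 - top_k / num_experts
    return num_experts * (1.0 - p_skip ** batch)

def simulated_distinct(weights, top_k, batch, trials=200, seed=0):
    # Monte Carlo estimate for non-uniform routing: each token draws top_k
    # distinct experts with probability proportional to `weights`.
    rng = random.Random(seed)
    ids = range(len(weights))
    total = 0
    for _ in range(trials):
        touched = set()
        for _ in range(batch):
            chosen = set()
            while len(chosen) < top_k:
                chosen.add(rng.choices(ids, weights=weights)[0])
            touched |= chosen
        total += len(touched)
    return total / trials

E, K, B = 128, 8, 10  # hypothetical: experts per layer, active per token, batch size
uniform = expected_distinct_uniform(E, K, B)
skewed = simulated_distinct([1.0 / (i + 1) for i in range(E)], K, B)
print(f"uniform routing: ~{uniform:.1f} of {E} experts touched per layer")
print(f"skewed routing:  ~{skewed:.1f} of {E} experts touched per layer")
```

With uniform routing, a batch of 10 touches far more experts than a single token does, so the amount read from SSD per step grows with the batch instead of being amortized as it would be for dense weights. With skewed routing, the set of touched experts is smaller, which is the case where keeping the hot experts cached would pay off.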