by qeternity
807 days ago
4-bit should take up less than this, since quite a few parameters are shared between experts. But unless you're running bs=1, it will be painful compared with 8x GPU, because you're almost certain to be activating most or all of the experts in a batch.
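To put a rough number on the "most/all of the experts" point, here is a back-of-the-envelope sketch. It assumes 8 experts with top-2 routing per token, and that routing is uniform and independent across tokens; none of these figures come from the comment above, and real routers are not uniform. The function name `expected_active_experts` is made up for illustration.

```python
def expected_active_experts(batch_tokens: int, num_experts: int = 8, top_k: int = 2) -> float:
    """Expected number of distinct experts hit in one MoE layer by a batch of tokens.

    Assumption: each token independently picks `top_k` distinct experts
    uniformly at random (a simplification of a learned router).
    """
    # Chance that a single token does NOT route to one particular expert.
    p_miss_one_token = 1 - top_k / num_experts
    # Chance that at least one token in the batch routes to that expert.
    p_hit = 1 - p_miss_one_token ** batch_tokens
    # Linearity of expectation over all experts.
    return num_experts * p_hit


for tokens in (1, 2, 4, 8, 16, 32):
    print(f"{tokens:>3} tokens -> ~{expected_active_experts(tokens):.2f} of 8 experts active")
```

Under these assumptions a single token touches 2 experts, but by 16 tokens per step nearly all 8 are expected to be active, so the weights for every expert have to be read on each forward pass.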