by reitzensteinm
125 days ago
I think part of the issue is that in production deployments, you're batching high enough that you'll be paging in those long-tail experts constantly. Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting for host memory, which will kill your throughput. It makes much more sense for non-batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.
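
To put a rough number on the batching point, here's a back-of-the-envelope sketch. It assumes each token independently lands on a given long-tail expert with some small probability `p_route`; that independence, the function name, and the example values are all made up for illustration and not taken from any real router.

```python
def p_expert_hit(p_route: float, batch_tokens: int) -> float:
    """Chance that at least one token in the batch needs a given expert.

    Assumption (illustrative only): every token picks this expert
    independently with probability p_route.
    """
    return 1.0 - (1.0 - p_route) ** batch_tokens


if __name__ == "__main__":
    # Hypothetical long-tail expert chosen by 0.1% of tokens.
    for batch_tokens in (1, 64, 4096):
        print(batch_tokens, p_expert_hit(0.001, batch_tokens))
```

Under those assumptions, an expert that a single-token step almost never touches is close to guaranteed to be needed somewhere in a batch of a few thousand tokens, so it would have to be resident (or paged in) on nearly every step.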