|
|
|
|
|
by zozbot234
382 days ago
|
|
You could load different "experts" in a round-robin way on a single node and only aggregate "batches" opportunistically, when you just have multiple requests in-flight that all happen to rely on the same "expert". The difference being that instead of "batches", you would only really have queues. Of course this would come with a sizeable increase in latency, but that's acceptable for many applications (such as for "deep research" workflows) |
|