Hacker News new | ask | show | jobs
by bick_nyers 585 days ago
You could always split one of the experts up across multiple GPUs. I tend to agree with your sentiment, I think researchers in this space tend to not optimize that well for inference deployment scenarios. To be fair, there is a lot of different ways to deploy something, and a lot of quantization techniques and parameters.