by phire
436 days ago
BTW, I'd love to see a large model designed from scratch for efficient local inference on low-memory devices. While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you from tuning them to only switch experts once or twice per token, and ideally keep the same weights across multiple tokens. Well, nothing stopping you, but there is the question of whether it will actually produce a worthwhile model.
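
To make the "keep the same weights across multiple tokens" idea concrete, here is a rough sketch of a "sticky" top-1 router: it only swaps the loaded expert when another expert's score beats the current one by some margin. The names `sticky_route`, `count_switches` and `margin` are made up for illustration, the scores are toy numbers, and this only shows the routing behaviour. A real model would presumably have to be trained with a constraint like this, and whether that yields a good model is exactly the open question.

```python
from typing import List, Optional, Sequence


def sticky_route(scores: Sequence[Sequence[float]], margin: float) -> List[int]:
    """Pick one expert per token (hypothetical sketch, not a real library API).

    The currently loaded expert is kept unless the best-scoring expert
    beats it by more than `margin`, so expert weights are swapped rarely.
    """
    choices: List[int] = []
    current: Optional[int] = None
    for token_scores in scores:
        # Index of the highest-scoring expert for this token.
        best = max(range(len(token_scores)), key=token_scores.__getitem__)
        # Switch only on the first token or when the gain exceeds the margin.
        if current is None or token_scores[best] - token_scores[current] > margin:
            current = best
        choices.append(current)
    return choices


def count_switches(choices: Sequence[int]) -> int:
    """Number of times the selected expert changes between adjacent tokens."""
    return sum(1 for a, b in zip(choices, choices[1:]) if a != b)


# Toy router scores: 4 tokens, 2 experts (made-up numbers).
toy_scores = [
    [0.60, 0.40],
    [0.45, 0.55],
    [0.55, 0.45],
    [0.10, 0.90],
]

greedy = sticky_route(toy_scores, margin=0.0)  # plain top-1 routing
sticky = sticky_route(toy_scores, margin=0.3)  # switch only for a clear win
print(greedy, count_switches(greedy))
print(sticky, count_switches(sticky))
```

With `margin=0.0` this is ordinary top-1 routing and the expert flips on every small change in score. Raising the margin trades routing quality for fewer weight swaps, which is the cost that matters on a device that can't hold every expert in memory.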