Hacker News
by
Alifatisk
300 days ago
I think it's because of a combination of the MoE model architecture and inference being done in large batches that run in parallel.
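
A rough sketch of how those two things interact, in case it helps. This is a toy illustration only: the expert count, top-k value, and function names (`route`, `group_batch_by_expert`, `active_fraction`) are made up for the example and not taken from any real model or serving stack. The idea is that each token only runs through a few experts, and a large batch lets the server group tokens by expert so each expert processes its tokens together.

```python
import random
from collections import defaultdict

NUM_EXPERTS = 8  # hypothetical expert count, for illustration only
TOP_K = 2        # hypothetical number of experts activated per token


def route(token_scores, top_k=TOP_K):
    """Return the indices of the top_k highest-scoring experts for one token."""
    ranked = sorted(
        range(len(token_scores)),
        key=lambda expert: token_scores[expert],
        reverse=True,
    )
    return ranked[:top_k]


def group_batch_by_expert(batch_scores, top_k=TOP_K):
    """Map each expert index to the list of token indices routed to it."""
    groups = defaultdict(list)
    for token_idx, scores in enumerate(batch_scores):
        for expert in route(scores, top_k):
            groups[expert].append(token_idx)
    return dict(groups)


def active_fraction(num_experts=NUM_EXPERTS, top_k=TOP_K):
    """Fraction of the experts that a single token actually uses."""
    return top_k / num_experts


if __name__ == "__main__":
    rng = random.Random(0)
    # Random router scores stand in for a learned gating network.
    batch = [[rng.random() for _ in range(NUM_EXPERTS)] for _ in range(16)]
    groups = group_batch_by_expert(batch)
    for expert in sorted(groups):
        print(f"expert {expert}: {len(groups[expert])} tokens")
    print(f"experts used per token: {active_fraction():.0%}")
```

With a bigger batch, each expert tends to get a larger group of tokens, so its weights are loaded once and reused across many tokens, while per-token compute stays at roughly top-k out of all experts.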