by janice1999 539 days ago
> a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.

What kind of hardware do you need to run this?

They discuss it in the paper and recommend 32 GPUs (H800 in their case) for the prefill stage and 320 GPUs for the decoding stage.
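For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory the weights alone would take. It only uses the 671B figure from the quote; the 80 GB per GPU and the bytes-per-parameter choices are my assumptions, and it ignores KV cache, activations, and runtime overhead, so treat the results as a lower bound rather than a deployment recipe.

```python
import math

TOTAL_PARAMS = 671e9   # total parameters, from the quoted abstract
GPU_MEM_GB = 80        # assumption: 80 GB of memory per GPU (e.g. H800)


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights only, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def min_gpus_for_weights(params: float, bytes_per_param: float,
                         gpu_mem_gb: float = GPU_MEM_GB) -> int:
    """Smallest GPU count whose combined memory fits the weights alone."""
    return math.ceil(weight_gb(params, bytes_per_param) / gpu_mem_gb)


if __name__ == "__main__":
    # Assumed storage formats: 1 byte/param (FP8) and 2 bytes/param (BF16).
    for name, nbytes in [("FP8", 1), ("BF16", 2)]:
        gb = weight_gb(TOTAL_PARAMS, nbytes)
        gpus = min_gpus_for_weights(TOTAL_PARAMS, nbytes)
        print(f"{name}: {gb:.0f} GB of weights, at least {gpus} GPUs")
```

Under these assumptions the weights alone need about 9 GPUs at 1 byte per parameter and about 17 at 2 bytes per parameter. The 32 and 320 GPU figures from the paper are far above that, presumably because they are sized for serving throughput rather than for merely fitting the model.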

=)