	by janice1999 539 days ago
	> a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.

	What kind of hardware do you need to run this?

2 comments

8x H200s recommended:

They discuss it in the paper and recommend 32 GPUs (H800s in their case) for the prefill stage and 320 GPUs for decoding.
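
For a rough sanity check on the 8x H200 figure, here is a back-of-the-envelope sketch of the weight memory alone. It assumes 141 GB of memory per H200 and either 1 byte (FP8) or 2 bytes (BF16) per parameter; both are my assumptions, not numbers from the thread. It ignores KV cache, activations, and framework overhead, so real requirements are higher. Note that in a typical MoE deployment all 671B parameters have to be resident even though only 37B are active per token, so the 37B figure mostly helps compute, not memory.

```python
# Back-of-the-envelope estimate of weight memory for a 671B-parameter model.
# Assumptions (not from the thread): 141 GB per H200, 1 byte/param for FP8,
# 2 bytes/param for BF16. KV cache and activations are ignored.

TOTAL_PARAMS = 671e9   # total parameters, from the quoted model description
GPU_MEMORY_GB = 141    # assumed per-GPU memory for an H200
NUM_GPUS = 8           # the "8x H200s" suggestion above


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def fits(total_params: float, bytes_per_param: float,
         num_gpus: int = NUM_GPUS, gpu_memory_gb: float = GPU_MEMORY_GB) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(total_params, bytes_per_param) <= num_gpus * gpu_memory_gb


if __name__ == "__main__":
    for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
        need = weight_memory_gb(TOTAL_PARAMS, bytes_per_param)
        have = NUM_GPUS * GPU_MEMORY_GB
        verdict = "fits" if fits(TOTAL_PARAMS, bytes_per_param) else "does not fit"
        print(f"{name}: weights need {need:.0f} GB, {NUM_GPUS} GPUs have {have} GB ({verdict})")
```

Under these assumptions the FP8 weights fit on a single 8-GPU node with some headroom, while BF16 weights would not. That is about the minimum to load the model at all; the 32 and 320 GPU figures from the paper are for a throughput-oriented deployment, not a minimum.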