|
|
|
|
|
by ml_hardware
1544 days ago
|
|
Couple points: 1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB. 2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page. MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system). So it's complicated! |
|
I really do wonder how much more you could squeeze out of a full pod of gen2-H100's, obviously the model size would be ludicrous, but how far are we into the realm of dimishing returns.
Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.
You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?