by ml_hardware 1592 days ago

At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)

But keep in mind the model won't fit on a single H100 (80GB): it has 175B params, which is ~90GB even with sparse FP8 model weights, and more memory is needed for live activations. So you'll still want at least 2 H100s to run inference, and more realistically you would rent an 8xH100 cloud instance.
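The arithmetic behind that estimate can be sketched roughly as follows. This is a back-of-the-envelope calculation, not a measurement: it assumes 1 byte per FP8 weight, that "sparse" means half of the weights are stored (as in 2:4 structured sparsity), and 1 GB = 1e9 bytes. It ignores sparsity index metadata and activation memory, both of which push the real number higher. The function names are illustrative.

```python
import math


def weight_memory_gb(n_params, bytes_per_param=1.0, density=1.0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    density is the fraction of weights actually stored (assumption:
    0.5 for 2:4 structured sparsity). Index metadata is NOT counted.
    """
    return n_params * density * bytes_per_param / 1e9


def min_gpus_for_weights(weights_gb, gpu_memory_gb=80.0):
    """Lower bound on GPU count needed just to hold the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)


GPT3_PARAMS = 175e9  # 175B parameters, as stated in the comment

sparse_fp8_gb = weight_memory_gb(GPT3_PARAMS, bytes_per_param=1.0, density=0.5)
dense_fp16_gb = weight_memory_gb(GPT3_PARAMS, bytes_per_param=2.0, density=1.0)

print(f"sparse FP8 weights: {sparse_fp8_gb:.1f} GB "
      f"-> at least {min_gpus_for_weights(sparse_fp8_gb)} x 80GB GPUs")
print(f"dense FP16 weights: {dense_fp16_gb:.1f} GB "
      f"-> at least {min_gpus_for_weights(dense_fp16_gb)} x 80GB GPUs")
```

The sparse FP8 figure lands just under 90GB before any overhead, which is why a single 80GB card is not enough even in the best case.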

But yeah the latency will be insanely fast given how massive these models are!

TOMDM 1592 days ago

So, we're about a 25-50% memory increase off of being able to run GPT3 on a single machine?

Sounds doable in a generation or two.

ml_hardware 1592 days ago

Couple points:

1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.

2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.

MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).
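A rough way to see the memory-versus-compute tradeoff is to compare the parameters an MoE must keep resident against the parameters it actually uses per token. The sketch below assumes a simple top-k routed MoE where every expert has the same size; all of the numbers in the example are made up for illustration and do not describe any real model.

```python
def moe_total_params(shared_params, expert_params, num_experts):
    """Parameters that must sit in memory: shared layers plus every expert."""
    return shared_params + num_experts * expert_params


def moe_active_params(shared_params, expert_params, top_k):
    """Parameters touched per token: shared layers plus the top_k routed experts."""
    return shared_params + top_k * expert_params


# Hypothetical configuration, chosen only to make the ratio visible.
SHARED = 10 * 10**9   # attention, embeddings, etc.
EXPERT = 5 * 10**9    # parameters per expert
NUM_EXPERTS = 64
TOP_K = 2

total = moe_total_params(SHARED, EXPERT, NUM_EXPERTS)
active = moe_active_params(SHARED, EXPERT, TOP_K)

print(f"resident parameters:   {total / 1e9:.0f}B (drives memory footprint)")
print(f"active per token:      {active / 1e9:.0f}B (drives compute per token)")
print(f"memory/compute ratio:  {total / active:.1f}x")
```

Under these assumptions the model needs memory for 330B parameters while spending compute on only 20B per token, which is the shape of the tradeoff described above: a bigger minimum hardware footprint, but more queries per second once you have it.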

So it's complicated!

TOMDM 1592 days ago

Oh I totally expect the size of models to grow along with whatever hardware can provide.

I really do wonder how much more you could squeeze out of a full pod of gen2-H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?

Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.

You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?

ml_hardware 1592 days ago

I agree! The models will definitely keep getting bigger, and MoEs are a part of that trend, sorry if that wasn’t clear.

A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
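For what it's worth, the pod numbers check out with simple arithmetic. The sketch below assumes the hypothetical 2x-memory H100 variant has 160GB per GPU, that weights are stored at 1 byte per parameter (FP8, dense), and 1 TB = 1000 GB. It only counts weights, not activations or any runtime overhead.

```python
GPUS_PER_POD = 256
GPU_MEMORY_GB = 160          # assumption: 2x the 80GB H100
MODEL_PARAMS = 10 * 10**12   # 10T parameters
BYTES_PER_PARAM = 1          # assumption: FP8 weights, no sparsity

pod_memory_tb = GPUS_PER_POD * GPU_MEMORY_GB / 1000
model_weights_tb = MODEL_PARAMS * BYTES_PER_PARAM / 1e12
fraction_used = model_weights_tb / pod_memory_tb

print(f"pod memory:    {pod_memory_tb:.2f} TB")
print(f"model weights: {model_weights_tb:.2f} TB")
print(f"weights use {fraction_used:.0%} of pod memory")
```

With these assumptions the weights take roughly a quarter of the pod's memory, leaving the rest for activations and other runtime state.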

Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...

algo_trader 1592 days ago

Is it difficult/desirable to squeeze/compress an open-sourced 200B parameter model to fit into 40GB?

Are these techniques for specific architectures or can they be made generic?

ml_hardware 1592 days ago

I think it depends what downstream task you're trying to do... DeepMind tried distilling big language models into smaller ones (think 7B -> 1B) but it didn't work too well... it definitely lost a lot of quality (for general language modeling) relative to the original model.

See the paper here, Figure A28: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...

But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.

algo_trader 1592 days ago

Ah, found some stuff already

https://www.tensorflow.org/model_optimization/guide/pruning

https://www.tensorflow.org/model_optimization/guide/pruning/...
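For a feel of what pruning does, here is a minimal sketch of magnitude pruning in plain Python: zero out the smallest-magnitude fraction of a weight list. This is only an illustration of the general idea, not the TensorFlow API from the guides above; the function name is made up, and real pruning is usually applied gradually during training rather than in one shot.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction set to 0.0.

    sparsity is the fraction of weights to zero, between 0.0 and 1.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.3, -0.7, 0.0, 0.0]
```

The zeros only save memory or compute if the storage format and hardware can exploit them, which is where structured schemes like the sparse FP8 mode mentioned upthread come in.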
