
by penagwin 2428 days ago

My understanding is that a lot of these really high performance models that reach for every percentage-point possible require an absurd amount of hardware - specifically an absurd amount of GPU memory.

For example I have what I consider a fairly "high end" rig for being a hobbyist individual, with 32GB of RAM, i7 8700k, 1080ti - there's 0 chance their model would fit on my system.

So I mean maybe if you have a ton of money? Usually what happens is a slimmer model with not "quite" as high of a score gets released that actually fits on consumer hardware.

vagab0nd 2428 days ago

Maybe I'm oversimplifying, but it seems to me that once you have the model trained, it should be possible to partition it somehow when inferencing, to fit smaller machines. At least for a proof of concept it should be possible.

nmfisher 2427 days ago

I'm not aware of any "partitioning" strategies per se (at least during inference), but it's now common practice to distill a larger model into a smaller one by either (a) training a smaller "student" network to replicate the larger "teacher" network, or (b) pruning the smallest-magnitude weights from the larger network to reduce its size.
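To make (b) concrete, here is a minimal sketch of magnitude pruning in plain Python. It assumes the weights are a small list of lists and that "smaller weights" means smallest absolute value; the function name `prune_by_magnitude` and the toy numbers are made up for illustration, and real frameworks do this on tensors with their own utilities.

```python
def prune_by_magnitude(weights, fraction):
    """Return a copy of `weights` with the smallest-magnitude `fraction` set to 0.0.

    `weights` is a list of rows (a toy stand-in for a weight matrix).
    """
    # Sort all absolute values to find the cutoff magnitude.
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * fraction)  # how many weights to zero out
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]

    pruned = []
    dropped = 0
    for row in weights:
        new_row = []
        for w in row:
            # Cap at k so ties at the threshold do not prune extra weights.
            if abs(w) <= threshold and dropped < k:
                new_row.append(0.0)
                dropped += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


# Hypothetical 2x3 weight matrix, pruning half of the entries.
weights = [[0.9, -0.05, 0.3], [0.01, -0.7, 0.2]]
pruned = prune_by_magnitude(weights, 0.5)
```

The zeros only save memory if you then store the matrix in a sparse format (or remove whole rows/columns), and in practice the pruned network usually gets fine-tuned afterwards to recover accuracy.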

Just brainstorming here, but a vanilla network partition strategy might be to load one layer's weights into memory at a time and perform the forward pass sequentially. I think that would be prohibitively slow: some of these models (e.g. BERT) can already take up to 3-4 seconds to perform a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
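Roughly what I have in mind, as a toy sketch rather than anything you'd run on a real model: each layer's weights live in their own file, and the forward pass reads one file, applies that layer, and drops it before reading the next. The JSON file layout, the dense+ReLU layers, and the helper names (`save_layers`, `forward_streaming`) are all assumptions for the example.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's weights to its own file and return the paths in order."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def forward_streaming(layer_paths, x):
    """Run a forward pass while holding only one layer's weights at a time."""
    for path in layer_paths:
        with open(path) as f:
            layer = json.load(f)  # load just this layer from disk
        pre_activation = [a + b for a, b in zip(matvec(layer["W"], x), layer["b"])]
        x = [max(0.0, v) for v in pre_activation]  # ReLU
        del layer  # release the weights before loading the next layer
    return x


# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    {"W": [[1.0, -1.0], [0.5, 0.5]], "b": [0.0, 1.0]},
    {"W": [[2.0, 1.0]], "b": [-1.0]},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    output = forward_streaming(paths, [3.0, 1.0])
```

Peak memory here is one layer plus the activations, but every forward pass pays the full disk read for every layer, which is where I'd expect the slowdown to come from.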

penagwin 2428 days ago

The problem is that there are so many weights in the model that they don't fit in memory. You can lower the number of weights, but that will lower the effectiveness of the model.
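Back-of-the-envelope version, as a rough sketch: memory for the weights alone is about the parameter count times bytes per parameter. The 1.5 billion parameter count below is a hypothetical figure picked for the arithmetic, not the size of any specific model, and the estimate ignores activations and framework overhead.

```python
def weight_memory_gb(num_params, bytes_per_param=4):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Hypothetical parameter count, chosen only to illustrate the arithmetic.
HYPOTHETICAL_PARAMS = 1_500_000_000

fp32_gb = weight_memory_gb(HYPOTHETICAL_PARAMS)      # 4 bytes per weight
fp16_gb = weight_memory_gb(HYPOTHETICAL_PARAMS, 2)   # 2 bytes per weight

# Training with Adam in fp32 keeps roughly four copies per parameter:
# the weight, its gradient, and two optimizer moment estimates.
# Activations come on top of this and are not counted here.
training_gb = 4 * fp32_gb

print(f"inference fp32: {fp32_gb} GB, fp16: {fp16_gb} GB, training: {training_gb} GB")
```

So under these assumptions the weights alone might squeeze onto an 11 GB card like the 1080 Ti for inference, while training would not come close.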

The thing is that when you're going for leaderboards you're reaching for every last percentage point, so how efficiently the model trades size for performance isn't a concern. You want to ramp up resource usage to as much as you have access to.

TL;DR - Yeah basically most people will run a "slimmed down" version of the model that isn't "as" performant, but is still an improvement over previous models and actually fits on your machine.
