by vagab0nd
2428 days ago
Maybe I'm oversimplifying, but it seems to me that once you have the model trained, it should be possible to partition it somehow at inference time so that it fits on smaller machines. At least for a proof of concept, it should be possible.
Just brainstorming here, but a vanilla partitioning strategy might be to load each layer's weights into memory one layer at a time and perform the forward pass sequentially. I think that would be prohibitively slow: some of these models (e.g. BERT) can already take up to 3-4 seconds to perform a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
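
To make the idea concrete, here is a rough sketch of that layer-at-a-time strategy in plain Python. Everything in it is an assumption for illustration: the one-JSON-file-per-layer layout, the tiny dense-plus-ReLU layers, and the helper names (`save_layers`, `dense_relu`, `forward_layer_by_layer`, `demo`) are all made up, and it is nothing like how BERT or any real framework stores or runs a model. It only shows the shape of the loop: read one layer from disk, apply it, drop it, move on.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's (weights, bias) to its own file; return the paths in order."""
    paths = []
    for i, (weights, bias) in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump({"weights": weights, "bias": bias}, f)
        paths.append(path)
    return paths


def dense_relu(x, weights, bias):
    """Compute relu(W x + b) for one dense layer using plain lists."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def forward_layer_by_layer(x, layer_paths):
    """Run the forward pass while holding only one layer's weights in memory."""
    for path in layer_paths:
        # The disk read happens here, once per layer, on every forward pass.
        with open(path) as f:
            layer = json.load(f)
        x = dense_relu(x, layer["weights"], layer["bias"])
        del layer  # drop this layer's weights before loading the next one
    return x


def demo():
    # Hypothetical toy model: a 2->2 layer followed by a 2->1 layer.
    layers = [
        ([[1.0, 2.0], [0.0, -1.0]], [0.0, 1.0]),
        ([[1.0, 1.0]], [0.5]),
    ]
    with tempfile.TemporaryDirectory() as directory:
        paths = save_layers(layers, directory)
        return forward_layer_by_layer([1.0, 2.0], paths)


if __name__ == "__main__":
    print(demo())
```

Peak memory is roughly one layer plus the activations, but every forward pass pays for a full read of the weights from disk, which is where I'd expect the slowdown to come from.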