by nmfisher
2427 days ago
I'm not aware of any "partitioning" strategies per se (at least during inference), but it's now common practice to compress a larger model into a smaller one by either
(a) distillation: training a smaller "student" network to replicate the outputs of the larger "teacher" network, or
(b) pruning: removing small-magnitude weights from the larger network to reduce its size.
Just brainstorming here, but a vanilla network partition strategy might be to load each layer's weights into memory one at a time and perform the forward pass sequentially. I think that would be prohibitively slow: some of these models (e.g. BERT) can already take up to 3-4 seconds to perform a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
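To make the layer-at-a-time idea concrete, here's a rough sketch in plain Python. Everything in it is an assumption for illustration: the toy two-layer network, the one-JSON-file-per-layer layout, and the helper names (`save_layers`, `dense_relu`, `forward_layer_by_layer`) are all made up, and a real model would use a proper tensor library and weight format. The point is only that each layer's weights are read from disk right before they're needed and dropped afterwards, which is where the extra per-layer latency would come from.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's (weights, bias) to its own file; return the paths in order."""
    paths = []
    for i, (weights, bias) in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump({"weights": weights, "bias": bias}, f)
        paths.append(path)
    return paths


def dense_relu(x, weights, bias):
    """Fully connected layer followed by ReLU; weights holds one row per output unit."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def forward_layer_by_layer(x, layer_paths):
    """Forward pass that holds only one layer's weights in memory at a time."""
    for path in layer_paths:
        with open(path) as f:
            layer = json.load(f)  # fetch this layer's weights from disk
        x = dense_relu(x, layer["weights"], layer["bias"])
        del layer  # release the weights before loading the next layer
    return x


# Hypothetical toy network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]),
    ([[1.0, 1.0, 1.0]], [0.5]),
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(layers, tmp)
    output = forward_layer_by_layer([2.0, 3.0], paths)

print(output)  # [9.5]
```

Peak memory here is roughly one layer plus the activations, but every forward pass pays a disk read per layer, so the total time is compute plus the sum of all the load times.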