
I'm not aware of any "partitioning" strategies per se (at least during inference), but it's now common practice to compress a larger model into a smaller one by either (a) distilling it, i.e. training a smaller "student" network to replicate the outputs of the larger "teacher" network, or (b) pruning low-magnitude weights from the larger network to reduce its size.
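To make (a) and (b) a bit more concrete, here's a toy sketch in plain Python. It assumes the common formulation of each: a KL-divergence loss on temperature-softened outputs for the student/teacher setup, and magnitude-based pruning for the weight removal. The function names, the temperature of 2.0, and the flat weight list are all made up for illustration; real code would use a tensor library and train the student over many batches.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The student is trained to minimise this, i.e. to match the
    teacher's output distribution rather than only the hard labels.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

if __name__ == "__main__":
    # Identical logits give zero loss; mismatched logits give a positive loss.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    # Prune half of the weights: the two smallest by magnitude become 0.0.
    print(magnitude_prune([0.5, -0.01, 2.0, 0.02], 0.5))
```

Note that pruning like this only saves memory if the zeros are then stored in a sparse format or whole rows/units are removed; zeroing entries in a dense matrix on its own doesn't shrink anything.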

Just brainstorming here, but a vanilla network partitioning strategy might be to load one layer's weights into memory at a time and run the forward pass sequentially, layer by layer. I think that would be prohibitively slow: some of these models (e.g. BERT) can already take 3-4 seconds for a single forward pass on a CPU, and that's with all model weights already loaded into main memory. I suspect fetching/loading each layer separately would blow this out by an order of magnitude.
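Roughly what I have in mind, as a minimal sketch rather than anything you'd deploy: each layer's weights live in their own file, and the forward pass reads one file, applies that layer, and drops the weights before moving on. Everything here is hypothetical (the `layer_{i}.pkl` file layout, the tiny dense+ReLU layers, the helper names), and it uses plain Python lists and `pickle` only to stay self-contained.

```python
import os
import pickle
import tempfile

def save_layers(layers, directory):
    """Write each (weights, bias) pair to its own file; return the paths."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(layer, f)
        paths.append(path)
    return paths

def dense_relu(x, weights, bias):
    """One dense layer followed by ReLU; `weights` holds one row per output unit."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

def forward_streaming(x, layer_paths):
    """Forward pass that keeps only one layer's weights in memory at a time."""
    for path in layer_paths:
        with open(path, "rb") as f:
            weights, bias = pickle.load(f)  # the per-layer fetch is the slow part
        x = dense_relu(x, weights, bias)
        del weights, bias  # release this layer before loading the next
    return x

def forward_in_memory(x, layers):
    """Baseline: the same computation with every layer already resident."""
    for weights, bias in layers:
        x = dense_relu(x, weights, bias)
    return x

if __name__ == "__main__":
    layers = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1]),  # 2 inputs -> 2 units
        ([[1.0, 2.0]], [-0.5]),                   # 2 inputs -> 1 unit
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_layers(layers, tmp)
        print(forward_streaming([2.0, 1.0], paths))
    print(forward_in_memory([2.0, 1.0], layers))
```

Both paths compute the same result; the streaming version trades peak memory (one layer instead of the whole model) for a disk read per layer on every forward pass, which is where I'd expect the slowdown to come from.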