by miven
914 days ago
I wonder what would be the most efficient tactic for offloading select layers of such a model to a GPU within a memory-constrained system. As far as I understand, layer offloading in something like llama.cpp usually loads the first few consecutive layers into VRAM (the remainder being processed on the CPU) so that you don't have too much back and forth between the CPU and GPU. I feel like such an approach would waste too much of the GPU's potential when applied to an SMoE model, but on the other hand offloading non-consecutive layers and bouncing between the two processing units too often may be even slower...
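To make the "back and forth" concern concrete, here is a minimal sketch that counts CPU/GPU handoffs for two placements of layers. The layer counts and the `count_handoffs` helper are hypothetical, chosen only for illustration; this is not how llama.cpp itself decides placement.

```python
def count_handoffs(placement):
    """Count how often execution switches device between adjacent layers."""
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)


N_LAYERS = 8  # hypothetical model depth
N_GPU = 4     # hypothetical number of layers that fit in VRAM

# First N_GPU layers on the GPU, the rest on the CPU.
consecutive = ["gpu"] * N_GPU + ["cpu"] * (N_LAYERS - N_GPU)

# Same number of GPU layers, but alternating with CPU layers.
interleaved = ["gpu" if i % 2 == 0 else "cpu" for i in range(N_LAYERS)]

print("consecutive handoffs:", count_handoffs(consecutive))
print("interleaved handoffs:", count_handoffs(interleaved))
```

With the same number of layers in VRAM, the consecutive layout crosses between devices once per forward pass, while the alternating layout crosses at every layer boundary.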
An NVIDIA RTX 4090 has a memory bandwidth of 1008 GB/s [2], i.e. 11x as much.
Using these together is like a parcel delivery that goes 10 miles by Formula 1 race car, then 10 miles on foot. You don't want the race car or the handoff to go wrong, but in terms of total delivery time they're insignificant compared to the 10 miles on foot.
I'm not sure there's much potential for cleverness here, unless someone trains a model specifically targeting this use case.
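As a rough sketch of the arithmetic behind the analogy: assume per-token time is dominated by reading the weights once from memory, the GPU is 11x faster at that than the CPU (the ratio quoted above), and handoff cost is zero. The `offload_speedup` helper and the fractions tried are hypothetical, not measurements.

```python
def offload_speedup(gpu_fraction, bandwidth_ratio=11.0):
    """Speedup over CPU-only when gpu_fraction of the weights sit in VRAM.

    Assumes time is proportional to bytes read divided by memory bandwidth,
    and that the GPU's bandwidth is bandwidth_ratio times the CPU's.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    cpu_time = 1.0 - gpu_fraction              # share still read at CPU speed
    gpu_time = gpu_fraction / bandwidth_ratio  # share read at GPU speed
    return 1.0 / (cpu_time + gpu_time)


for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{fraction:.2f} of weights on GPU -> {offload_speedup(fraction):.2f}x")
```

Under these assumptions, putting half the weights on the GPU gives less than a 2x speedup, because the CPU half (the "10 miles on foot") still accounts for almost all of the time.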
[1] https://www.intel.com/content/www/us/en/products/sku/230502/...
[2] https://www.notebookcheck.net/NVIDIA-GeForce-RTX-4090-GPU-Be...