by liuliu
1311 days ago
CPU offloading doesn't work because Apple already has a shared-memory architecture (the CPU and GPU draw from the same pool). The head slicing is similar to https://machinelearning.apple.com/research/neural-engine-tra... I think it would be quite practical if only MPSGraph were less mysterious about its allocation strategy. It is not the ideal way, though. Ideally, FlashAttention / xFormers is the way to go.
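To make the head-slicing idea concrete, here is a minimal pure-Python sketch. It is not the code from the linked article or from any MPSGraph-based implementation; the function names (`attend_one_head`, `attention_sliced_by_head`) and the list-of-lists layout are assumptions made for illustration. The point is only that each head's score matrix is built and dropped before the next head starts, so the peak intermediate is one seq x seq matrix rather than heads x seq x seq.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend_one_head(q, k, v):
    # Scaled dot-product attention for a single head.
    # q, k, v are lists of row vectors (hypothetical layout: [seq][dim]).
    scale = 1.0 / math.sqrt(len(q[0]))
    # Only this head's seq x seq score matrix exists at this point.
    scores = [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]
    dim_v = len(v[0])
    return [
        [sum(w * vj[d] for w, vj in zip(row, v)) for d in range(dim_v)]
        for row in scores
    ]

def attention_sliced_by_head(q_heads, k_heads, v_heads):
    # Process heads sequentially instead of as one batched tensor op,
    # so the score matrices of different heads are never live together.
    outputs = []
    for q, k, v in zip(q_heads, k_heads, v_heads):
        outputs.append(attend_one_head(q, k, v))
    return outputs
```

Whether this actually lowers peak memory in a real framework depends on when the allocator frees each head's intermediates, which is exactly the MPSGraph caveat above. Fused kernels in the FlashAttention style avoid materializing the score matrix at all.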