by nl 1319 days ago
This is some impressive work.

You might like to look at the work HuggingFace has been doing (on non-iOS versions). They can run it in under 1 GB of RAM:

> It is also possible to chain it with attention slicing for minimal memory consumption, running it in as little as < 800mb of GPU vRAM

https://huggingface.co/docs/diffusers/optimization/fp16#offl...

1 comment

CPU offloading doesn't work here because Apple already has a shared-memory architecture. The head slicing is similar to https://machinelearning.apple.com/research/neural-engine-tra... I think it would be quite practical only if MPSGraph were less mysterious about its allocation strategy. It is not the ideal approach, though. Ideally, FlashAttention / xFormers is the way to go.
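
For anyone unfamiliar with the slicing idea discussed here, a rough sketch in plain Python may help. It is not how diffusers, MPSGraph, or the Apple article implement it; the function names (`attention_weights`, `apply_weights`, `sliced_attention`) and the `slice_size` parameter are made up for illustration. The point is only that the attention score matrices are built for a few heads at a time, so the peak number of live score entries is roughly `slice_size * seq_len * seq_len` rather than `num_heads * seq_len * seq_len`.

```python
import math


def attention_weights(q, k):
    """Softmax(q @ k^T / sqrt(d)) for one head; q and k are lists of rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def apply_weights(weights, v):
    """Multiply a (seq x seq) weight matrix by the value rows of one head."""
    dim = len(v[0])
    return [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(dim)]
        for w_row in weights
    ]


def sliced_attention(q_heads, k_heads, v_heads, slice_size=1):
    """Attention over all heads, materialising scores for slice_size heads at a time."""
    outputs = []
    for start in range(0, len(q_heads), slice_size):
        stop = start + slice_size
        # Only this slice's score matrices are alive at once.
        slice_weights = [
            attention_weights(q, k)
            for q, k in zip(q_heads[start:stop], k_heads[start:stop])
        ]
        outputs.extend(
            apply_weights(w, v) for w, v in zip(slice_weights, v_heads[start:stop])
        )
    return outputs
```

The result is the same for any `slice_size`; only the peak working memory changes, at the cost of more, smaller steps. FlashAttention-style kernels go further by never materialising the full score matrix at all, which is why they are the better option where available.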