|
|
|
|
|
by chsasank
907 days ago
|
|
This is basically fork of llama.cpp. I created a PR to see the diff and added my comments on it: https://github.com/ggerganov/llama.cpp/pull/4543 One thing that caught my interest is this line from their readme: > PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. Apple's Metal/M3 is perfect for this because CPU and GPU share memory. No need to do any data transfers. Checkout mlx from apple: https://github.com/ml-explore/mlx |
|