|
|
|
|
|
by int_19h
801 days ago
|
|
It doesn't beat RTX 4090 when it comes to actual LLM inference speed. I bought a Mac Studio for local inference because it was the most convenient way to get something fast enough and with enough RAM to run even 155b models. It's great for that, but ultimately it's not magic - NVidia hardware still offers more FLOPS and faster RAM. |
|
Sure, whisper.cpp is not an LLM. The 4090 can't even do inference at all on anything over 24GB, while ASi can chug through it even if slightly slower.
I wonder if with https://github.com/tinygrad/open-gpu-kernel-modules (the 4090 P2P patches) it might become a lot faster to split a too-large model across multiple 4090s and still outperform ASi (at least until someone at Apple does an MLX LLM).