
by diggan 475 days ago

> The question is if an LLM will run with usable performance at that scale?

This is the big question to answer. Many people claim Apple hardware can now reliably be used as an ML workstation, but from the benchmark numbers I've seen, the models may fit in memory, yet the tok/sec is so low that it doesn't feel worth it compared to running on NVIDIA hardware.

Although it would be expensive as hell to get 512GB of VRAM with NVIDIA today, maybe moves like this from Apple could push prices down at least a little bit.

2 comments

radlad 475 days ago

It is much slower than NVIDIA, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast, considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205

bastardoperator 475 days ago

On a Mac mini M4 Max, it's fast enough for me to cancel my monthly AI services.

diggan 475 days ago

Could you maybe share a lightweight benchmark with the exact model (+ quantization, if you're using one), the runtime, the settings used, and how many tokens/second you're getting? Or just a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?

Also, it would be neat if you could say which AI services you were subscribed to. There is a huge difference between a paid Claude subscription and the OpenAI Pro subscription, for example, both in terms of cost and the quality of responses.

lostmsu 475 days ago

Hm, the AI services over 5 years cost half as much as a minimal M4 Max configuration, which can barely run a severely lobotomized LLaMA 70B. And they provide significantly better models.

Matl 475 days ago

Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.

nomel 475 days ago

It's probably much worse than that, with the falling prices of compute.

staticman2 475 days ago

Smaller, dumber models are faster than bigger, slower ones.

What model do you find fast enough and smart enough?

Matl 475 days ago

Not OP, but I am finding the DeepSeek R1 distill of Qwen 2.5 32B to have a good speed/smartness ratio on the M4 Pro Mac Mini.

bastardoperator 474 days ago

I'm running the same exact models.

a1o 475 days ago

How much RAM?

Matl 474 days ago

It takes between 22GB and 37GB depending on the context size etc., from what I've observed.
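One plausible reason the footprint grows with context size is the KV cache, which scales linearly with the number of tokens kept in context. A rough sketch of that arithmetic follows; the layer count, KV head count, and head dimension are illustrative assumptions for a 32B-class model with grouped-query attention, not figures reported in this thread, and real runtimes add their own overhead on top.

```python
# Back-of-the-envelope KV cache size. All shape values passed in below are
# assumptions chosen for illustration, not measured numbers.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for context_len tokens.

    The leading factor of 2 accounts for storing both K and V per layer.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def to_gib(n_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical shape: 64 layers, 8 KV heads, head dimension 128.
    for ctx in (4096, 32768):
        size = kv_cache_bytes(64, 8, 128, ctx)
        print(f"context {ctx}: {to_gib(size):.2f} GiB of KV cache")
```

The cache size is added on top of the (quantized) weights, so the same model can need noticeably more memory at long context than at short context.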

jamesy0ung 475 days ago

I presume you're using the Pro, not the Max.

Anyways, what RAM config, and what model are you using?

fetus8 475 days ago

How much RAM are you running on?

hangonhn 475 days ago

Do we know whether it is slower because the hardware is not as well suited to the task, or is it mostly a software issue, i.e. the code hasn't been optimized to run on Apple Silicon?

titzer 475 days ago

AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.

woadwarrior01 475 days ago

The neural engine is perfectly capable of accelerating matmuls. It's just that autoregressive decoding in single-batch LLM inference is memory-bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although there's a huge power efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.
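To make the bandwidth argument concrete, here is a minimal sketch of the usual rule of thumb: in single-batch decoding, every generated token has to stream roughly all of the model weights from memory once, so tokens/sec is capped at about bandwidth divided by model size. The function names and the example numbers are assumptions for illustration only; the estimate ignores KV cache reads, compute time, and runtime overhead, so real throughput will be lower.

```python
# Rough upper bound on single-batch decode speed for a memory-bound model.
# The example inputs at the bottom are illustrative, not benchmark results.

def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes at a given quantization."""
    return n_params * bits_per_weight / 8


def decode_tokens_per_sec_upper_bound(bandwidth_gb_s: float,
                                      n_params: float,
                                      bits_per_weight: float) -> float:
    """Bandwidth-limited ceiling on tokens/sec.

    Assumes each token requires reading all weights exactly once and
    that memory bandwidth is the only bottleneck.
    """
    bytes_per_token = model_bytes(n_params, bits_per_weight)
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical: a 32B-parameter model at 4 bits per weight,
    # on devices with different memory bandwidths (GB/s).
    for bandwidth in (256.0, 512.0, 1024.0):
        ceiling = decode_tokens_per_sec_upper_bound(bandwidth, 32e9, 4.0)
        print(f"{bandwidth:.0f} GB/s -> at most ~{ceiling:.1f} tok/s")
```

Under this model, doubling memory bandwidth doubles the ceiling, while adding more matmul throughput (ANE or otherwise) changes nothing, which is the point being made above.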

kridsdale1 475 days ago

I have to assume they’re doing something like that in the lab for 4 years from now.

azinman2 475 days ago

Memory bandwidth is the issue