

by Aurornis 77 days ago

Excellent article.

The game benchmarks are fun but the LLM improvements are where this gets really interesting for practical use. I love Apple platforms as an approachable way to run local models with a lot of RAM, but their relatively slow prompt processing speed is often overlooked.

> Here you can see the big issue with Macs: the prompt processing (aka “prefill”) speed. It just gets worse and worse, the longer the prompt gets. At a 4K-token prompt, which doesn’t seem very long, it takes 17 seconds for the M4 MacBook Air to parse before we even start generating a response. Meanwhile, if you strap the eGPU to it, it’ll only take 150ms. It’s 120x faster.

The prefill problem goes unnoticed when you’re playing around with an LLM in short chats. Once you start using it for larger pieces of work, the compute limit becomes a bottleneck.

The time to first token (TTFT) charts don’t look bad until you notice that they had to be shown on a logarithmic scale because the Mac platforms were so much slower than full GPU compute.

3 comments

superlopuh 77 days ago

I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac? To elaborate, the article just says that this step is compute bound, but I'm wondering whether it is just that simple or if it might also be less optimised in MLX?

Aurornis 77 days ago

Prefill (prompt processing) is compute bound doing large matrix operations. Token generation (aka tokens/s) is memory bandwidth bound.

The RTX 5090 has an incredible amount of compute performance for matrix operations and a lot of memory bandwidth. The Apple Silicon parts have unusually high memory bandwidth for general purpose compute chips, which is why they can generate tokens so fast. Their raw matrix compute performance is amazing for their power envelope but not nearly as fast as a dedicated GPU consuming 400-500W.

Apple added tensor cores on the M5 generation which help with those matrix operations, which is why the M5 performs so much better than the M4 Max in that article.

Dedicated GPUs like the RTX 5090 are in another league, though.

You can see the divergence in the high resolution gaming benchmarks, too. Once he starts benchmarking at 4K or 6K where the CPU emulation stops being a bottleneck, the raw compute of the 5090 completely crushes any of the Apple Silicon GPUs.
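The compute-bound versus bandwidth-bound split in the comment above can be put into a rough back-of-envelope model. This is only a sketch, assuming the common approximation of about 2 FLOPs per active parameter per token for prefill, and assuming decode re-reads the active weights once per generated token. The two devices, the 8B model size, and the 4-bit weights are hypothetical placeholders, not figures from the article.

```python
def prefill_seconds(active_params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per active parameter per prompt token."""
    return 2.0 * active_params * prompt_tokens / flops_per_s


def decode_tokens_per_s(active_weight_bytes: float, bytes_per_s: float) -> float:
    """Bandwidth-bound ceiling: each generated token streams the active weights once."""
    return bytes_per_s / active_weight_bytes


if __name__ == "__main__":
    # Hypothetical model: 8B parameters stored at 4 bits (0.5 bytes) each.
    params = 8e9
    weight_bytes = params * 0.5

    # Hypothetical devices as (sustained FLOP/s, memory bytes/s). Not measured values.
    devices = {
        "device A": (1e13, 4e11),
        "device B": (2e14, 1.6e12),
    }
    for name, (flops, bandwidth) in devices.items():
        ttft = prefill_seconds(params, 4096, flops)
        rate = decode_tokens_per_s(weight_bytes, bandwidth)
        print(f"{name}: prefill ~{ttft:.2f} s, decode ceiling ~{rate:.0f} tok/s")
```

With these made-up numbers, device B is 20x faster at prefill but only 4x faster at decode, which is the general shape of the gap described in the thread. Real results also depend on attention cost, kernel quality, and achieved (not peak) throughput.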

PicardsFlute 75 days ago

The TTFT benchmarks don’t look right to me. I don’t use vLLM, but at 16k pre-fill, the M5 Max is 3.6 times faster than the M4 Max. The 5090 is surely faster, but the numbers in the article are not reflecting what I have seen thus far. Perhaps vLLM hasn’t been updated to use the new tensor APIs for Metal?

My point is this: The M5 should have reflected this in the charts, but it doesn’t. The situation on pre-fill is not nearly as bad as in the M4 generation.

ademeure 77 days ago

Apple GPUs didn’t have tensor cores until the M5 (aka “a neural accelerator in each core”), which is why, in the article’s charts, an M5 Pro significantly beats an M4 Max (while in other workloads the margin would be much smaller, since a Pro is ~1/2 of a Max).

EDIT: since Aurornis beat me by 3 minutes, I’ll add another interesting tidbit instead :)

NVIDIA tensor cores on consumer GPUs are massively less powerful per SM core than on their datacenter counterparts (which also makes them easier to get to peak efficiency on consumer GPUs because the rest of the pipeline is much more quickly a bottleneck as per Amdahl’s Law).

This is potentially changing with Vera Rubin CPX which looks an awful lot like an RTX 5090 replacement but with the full-blown datacenter tensor cores (that won’t be available unless you pay for the datacenter SKU) - so it will have very high TFLOPS relative to its bandwidth.

The target market for the CPX is exactly this: prefill and Time To First Token. You can basically just throw compute at the problem for (parts of) prefill performance (but it won’t help anything else past a certain point) and the 5090/M5 are nowhere near that limit.

So the design choice for NVIDIA/Apple/etc of how much silicon to spend for this on consumer GPUs is mostly dictated by economics and how much they can reuse the same chips for the different markets.
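The "throw compute at prefill" point above is essentially a roofline argument, which can be sketched as follows. This assumes a simplified model where a weight matmul does about 2 FLOPs per parameter per token and reads each weight once per pass. The function names and the example device (100 TFLOP/s, 1 TB/s, fp16 weights) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def ridge_point(peak_flops_per_s: float, bytes_per_s: float) -> float:
    """FLOPs per byte above which a kernel is compute bound on this device."""
    return peak_flops_per_s / bytes_per_s


def matmul_intensity(tokens_in_flight: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per weight byte when several tokens share one pass over the weights."""
    return 2.0 * tokens_in_flight / bytes_per_param


def is_compute_bound(tokens_in_flight: int, bytes_per_param: float,
                     peak_flops_per_s: float, bytes_per_s: float) -> bool:
    """True when the workload's intensity reaches the device's ridge point."""
    intensity = matmul_intensity(tokens_in_flight, bytes_per_param)
    return intensity >= ridge_point(peak_flops_per_s, bytes_per_s)


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOP/s and 1 TB/s, so the ridge is 100 FLOPs/byte.
    peak, bandwidth = 1e14, 1e12
    for label, tokens in [("decode (1 token)", 1), ("prefill (4096 tokens)", 4096)]:
        bound = "compute" if is_compute_bound(tokens, 2.0, peak, bandwidth) else "bandwidth"
        print(f"{label}: intensity {matmul_intensity(tokens, 2.0):.0f} FLOPs/byte, {bound} bound")
```

In this toy model, adding tensor throughput raises the ridge point and shortens prefill, while single-token decode stays pinned to memory bandwidth, which matches the claim that extra compute helps prefill and little else.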

tpurves 76 days ago

@Ademeure Where do you think the market will be, say, a year from now, when Apple has rolled out its M6 generation? Do you think one more process node and architecture revision will be enough to tip the balance so that local LLM starts to go mainstream?

Melatonic 77 days ago

Does that include stuff like the Pro Blackwell 6000? Or are its tensor cores comparably good per SM? They perform quite well on many tests.

aviinuo 77 days ago

Pro Blackwell 6000 is just a 5090 with more VRAM. It does not have the tcgen05 (5th gen tensor core) instructions despite the "5th gen tensor core" branding and thus does not support any optimized Blackwell (sm100) kernels.

Every Blackwell card other than the (G)B100, (G)B200, (G)B300 and Jetson Thor uses the Ampere tensor core instruction (mma.sync) but with fp4/6/8 added on. Beyond that, the DGX Spark (which is advertised as having the same architecture as B200) has especially weak (not tcgen05) tensor cores that have a very narrow operating window and low utilization.

mathisfun123 77 days ago

> I'm curious and not an expert here, do you know why the TTFT is so much worse on Mac?

because the GPUs aren't as fantastic as everyone assumes?

> might also be less optimised in MLX?

prefill has gotta be one of the most optimized paths in MLX...

bigyabai 77 days ago

No you don't understand, on Apple Silicon my CPU has comparable memory bandwidth to a $400 Pascal-era GPU. With the unified memory architecture, that means my iGPU gets 2016-levels of DDR transfer speed with none of the upsides of CUDA. It's the most cutting-edge hardware ever put in a personal computer, without a doubt.

fgfarben 77 days ago

Please show me on the 2016-era $400 Pascal GPU where you can install the 256 GB of VRAM.

bigyabai 76 days ago

We're quite lucky that Nvidia didn't ship a 256gb system at sub-500gb/s transfer rate, is my point.

fgfarben 76 days ago

> Nvidia didn't ship a 256gb system at sub-500gb/s transfer rate

DGX Spark has 128 GB and only 273 GB/s BW. Are we lucky that NVIDIA did ship something even worse than what you specified? I'm confused.

People have been complaining [1] about how little VRAM NVIDIA ships with their GPUs for decades. Their whole game has been "oh, you want more VRAM? Buy more or pay us 50x for server grade with 10x as much VRAM. The more you buy, the more you save."

Apple did everyone a solid by shipping something way out of that distribution. We now know more than we did before! We know that a 284B parameter model with 13B active params (or 35B with 3B active, or 671B with 37B active) can outperform a 2T model and draw a fraction as much power. How can you think that's a bad thing?

You could point out that Apple didn't invent the idea of MoE. Everyone knows that. But other than Macs, there simply were no machines with >100GB VRAM directly coupled to ~50 TFLOP/s of compute until the DGX Spark last Dec. If you wanted to run a model with more than 32 GB of weights, you had to either pay up for dozens of GPUs idling at hundreds of watts or really pay up for some $50,000 server GPUs idling at... also 100-200W each.

I feel lucky to have a $3k machine on my shelf that can run DS4-Flash with 1M context at 20t/s while drawing ~150W and making very little noise. The best part? It idles at 30W with DS4 loaded, dropping to 6W after a reboot. There isn't a single GPU on the market that can match that in the same shoebox volume.

[1] https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRlOW0N...
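The MoE argument above comes down to two separate quantities: total parameters set the memory footprint, while active parameters set how many bytes must be streamed per generated token. A minimal sketch, using the 284B total / 13B active and 273 GB/s figures quoted in the comment. The 4-bit (0.5 bytes per parameter) quantization is an assumption, and the result is an upper bound that ignores KV-cache reads and routing overhead.

```python
def weight_footprint_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


def decode_ceiling_tok_s(active_params: float, bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling: bytes available per second / bytes read per token."""
    return bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)


if __name__ == "__main__":
    BYTES_PER_PARAM = 0.5  # assumption: 4-bit quantized weights

    footprint = weight_footprint_gb(284e9, BYTES_PER_PARAM)
    moe_rate = decode_ceiling_tok_s(13e9, BYTES_PER_PARAM, 273)
    dense_rate = decode_ceiling_tok_s(284e9, BYTES_PER_PARAM, 273)

    print(f"weights: ~{footprint:.0f} GB")
    print(f"MoE decode ceiling (13B active): ~{moe_rate:.0f} tok/s")
    print(f"dense decode ceiling (all 284B read per token): ~{dense_rate:.1f} tok/s")
```

Under these assumptions the same weights that would cap a dense model at roughly 2 tok/s allow a ceiling of about 42 tok/s when only 13B parameters are active per token, which is why large memory with modest bandwidth suits MoE models.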

Moosdijk 77 days ago

It feels pedantic to point it out, but it’s actually 113x faster.

Seeing the author present their results like this gives off the impression that they’re biased, which I am sure they aren’t.

scottjg 77 days ago

the exact numbers in the graph are 17019ms vs 142ms. so you're right, it's not 120x, it's 119.85x.

Moosdijk 77 days ago

That explains it. Thanks!
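For anyone checking the arithmetic in this exchange, the two speedup figures come from dividing the same pair of timings at different precision. The helper name below is just illustrative.

```python
def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the fast run is than the slow run."""
    return slow_ms / fast_ms


# Rounded figures quoted from the article text: 17 seconds vs 150 ms.
print(round(speedup(17_000, 150), 1))  # 113.3

# Exact figures from the graph, as given by scottjg: 17019 ms vs 142 ms.
print(round(speedup(17_019, 142), 2))  # 119.85
```

So "113x" follows from the rounded numbers in the prose and "120x" from the unrounded numbers in the chart.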

brcmthrowaway 77 days ago

Use oMLX. Qwen3.6 - 300tok/s PP, 30tok/s TG.

mercutio2 77 days ago

This is The Way.