| HN Mirror



	by simonw 39 days ago
	I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.

2 comments

perfmode 39 days ago

How’s the token throughput / response time?

simonw 39 days ago

Healthy!

  prefill: 30.91 t/s, generation: 29.58 t/s

From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...

antirez 39 days ago

Prefill is 400 t/s on that hardware. It's just that if the prompt is very short you can't see the real speed, and it will default to single-token context processing.

simonw 39 days ago

Hah, that's my fault for just using "Generate an SVG of a pelican riding a bicycle" as my test prompt!

xienze 39 days ago

I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. With a mere 10k tokens in context, you're sitting there for 5+ minutes before the first token is generated.
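
For a rough sense of that arithmetic, here is a minimal Python sketch that treats time to first token as prompt tokens divided by prefill throughput. The function name is hypothetical, and the estimate ignores prompt caching and any change in speed with prompt length. The two rates are the ones quoted in this thread.

```python
def time_to_first_token_s(prompt_tokens: int, prefill_tps: float) -> float:
    """Rough seconds spent on prefill before the first output token."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


# Prefill rates quoted in this thread, in tokens per second.
rates = {
    "short-prompt measurement": 30.91,
    "figure quoted for long prompts": 400.0,
}

for label, tps in rates.items():
    seconds = time_to_first_token_s(10_000, tps)
    print(f"{label}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

The gap between the two lines is why the choice of test prompt matters so much here.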

fgfarben 39 days ago

That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...

hadlock 39 days ago

M5 studio is gonna sell like hot cakes

throwdbaaway 39 days ago

Hah, that's because the prompt itself was only about 30 tokens. We need a much bigger prompt to properly test PP (prompt processing).

aiscoming 39 days ago

if it's just the coding agent system prompt and tools, you can cache that

xienze 39 days ago

Yeah the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.

embedding-shape 39 days ago

Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

prefill: 121.76 t/s, generation: 47.85 t/s

Main target seems to be Apple's Metal, so makes sense. Might be fun to see how fast one could make it go though :) The model seems really good too, even though it's in IQ2.

incidentist 39 days ago

Someone is working on a fork that is optimized for M5, might be worth a look: https://github.com/Swival/ds4-m5

rtpg 39 days ago

what are token speeds like for frontier models, if that gives a rough idea of how much "slower" slow is?

chatmasta 39 days ago

So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.

simonw 39 days ago

I expect this to be my main machine for the next 3-4 years (which is how I justified the 128GB one). It's a beast of a machine - I love that I can run an 80GB model and still have 48GB left for everything else.

Can't say that it wouldn't be a better idea to spend that cash on tokens from the frontier hosted models though.

I'm an LLM nerd so running local models is worth it from a research perspective.

simpaticoder 39 days ago

An M5 Max MBP with 128GB of RAM costs ~$5k. An Nvidia RTX 5090 with 32GB of RAM is $4-5k, and an RTX PRO 6000 with 96GB of RAM is $10k. Do you have any data on which is the best price/performance for local inference? Do you know what the big OpenAI/Anthropic/Google datacenters are running?
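
One axis of that comparison, dollars per GB of memory, can be sketched in Python from the prices above. This is only an illustration: the helper name is hypothetical, the RTX 5090 uses the upper $5k figure, and memory capacity says nothing about bandwidth or compute.

```python
def dollars_per_gb(price_usd: float, memory_gb: float) -> float:
    """Price divided by memory capacity; one crude axis of price/performance."""
    return price_usd / memory_gb


# (price in USD, memory in GB), taken from the figures in this comment.
options = {
    "M5 Max MBP 128GB": (5_000, 128),
    "RTX 5090 32GB (upper $5k figure)": (5_000, 32),
    "RTX PRO 6000 96GB": (10_000, 96),
}

for name, (price, memory) in options.items():
    print(f"{name}: ${dollars_per_gb(price, memory):.0f} per GB")
```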

driese 39 days ago

As always: it depends on your needs. Here's a very basic rundown of the heuristics:

- More RAM: bigger models, more intelligence.

- More FLOPs: higher pre-fill (reading large files and long prompts before answering, the so-called "time to first token").

- More RAM bandwidth: higher token generation (speed of output).

So basically Macs (high RAM, okay bandwidth, lowish FLOPs) can run pretty intelligent models at an okay output speed but will take a long time to reply if you give them a lot of context (like code bases). Consumer GPUs have great speed and pre-fill time, but low RAM, so you need multiple if you want to run large intelligent models. Big boy GPUs like the RTX 6000 have everything (which is why they are so expensive).

There are some more nuances like the difference of Metal vs. CUDA, caching, parallelization etc., but the things above should hold true generally.
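
The three heuristics above can be sketched as back-of-the-envelope upper bounds in Python. This is a simplified model, not a benchmark: it assumes prefill is compute-bound at roughly 2 FLOPs per active parameter per token, and that generation is bandwidth-bound because each token reads the active weights once. The function names are hypothetical, and the FLOPs, bandwidth, and parameter figures in the example are made-up placeholders, not specs of any real chip or model.

```python
def fits_in_ram(model_gb: float, ram_gb: float) -> bool:
    """More RAM: the weights must fit in memory to run the model at all."""
    return model_gb <= ram_gb


def prefill_tps_upper_bound(flops_per_s: float, active_params: float) -> float:
    """More FLOPs: compute-bound prefill, ~2 FLOPs per active parameter per token."""
    return flops_per_s / (2.0 * active_params)


def generation_tps_upper_bound(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """More bandwidth: each generated token reads the active weights once."""
    return bandwidth_gb_s / active_weights_gb


# The 80GB model on a 128GB machine comes from this thread.
print(fits_in_ram(model_gb=80, ram_gb=128))

# Placeholder hardware and model numbers, for illustration only.
print(prefill_tps_upper_bound(flops_per_s=20e12, active_params=10e9))
print(generation_tps_upper_bound(bandwidth_gb_s=500, active_weights_gb=10))
```

Real throughput lands below these bounds, for the reasons listed above (Metal vs. CUDA, caching, parallelization).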

theturtletalks 39 days ago

Do you think Apple will fix prefill speed with the M6 Max MacBook Ultra 128GB?

jtbaker 39 days ago

It's already greatly improved over previous generations due to M5s having tensor cores (higher compute capacity for matmul operations, the bottleneck for prefill).
