| HN Mirror


by FloatArtifact 476 days ago

They didn't increase the memory bandwidth. You can get the same memory bandwidth on the M2 Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.

The question is if a llm will run with usable performance at that scale? The point is that there are diminishing returns: even with enough uRAM and the increased processing speed of the new chip for AI, the memory bandwidth stays the same.

So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.

7 comments

lhl 475 days ago

Since no one specifically answered your question yet, yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B active parameters per pass. You'd probably expect in the ballpark of 20-30 tok/s (depends on how much MBW can actually be utilized) for text generation.

From my napkin math, the M3 Ultra TFLOPs is still relatively low (around 43 FP16 TFLOPs?), but it should be more than enough to handle bs=1 token generation (should be way <10 FLOPs/byte for inference). Now as far as its prefill/prompt processing speed... well, that's another matter.
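A rough sketch of the napkin math above, for anyone who wants to plug in their own numbers. The 404GB file size, 671B/37B parameter counts and ~43 TFLOPs come from the comment; the 819GB/s peak bandwidth and the utilization levels are assumptions, and KV-cache reads are ignored entirely.

```python
# Napkin math for bs=1 decode speed on a bandwidth-bound machine.
# From the comment: 404 GB Q4_K_M file, 671B total / 37B active parameters,
# ~43 FP16 TFLOPs. The 819 GB/s bandwidth and the utilization levels are
# assumptions, not measurements. KV-cache traffic is ignored.

MODEL_BYTES = 404e9
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
BANDWIDTH = 819e9        # bytes/s (assumed peak)
PEAK_FLOPS = 43e12       # FP16 FLOP/s (rough figure from the comment)

# Average storage cost of one parameter in this quantization.
bytes_per_param = MODEL_BYTES / TOTAL_PARAMS
# Weight bytes that must be streamed from memory to produce one token.
bytes_per_token = ACTIVE_PARAMS * bytes_per_param


def tokens_per_second(utilization: float) -> float:
    """Upper bound on decode speed if weight reads are the only cost."""
    return BANDWIDTH * utilization / bytes_per_token


# Roughly 2 FLOPs (multiply + add) per active weight per token.
flops_per_byte = 2 * ACTIVE_PARAMS / bytes_per_token
# FLOPs the chip can do per byte it can read; above this you are compute-bound.
machine_balance = PEAK_FLOPS / BANDWIDTH

for u in (0.6, 0.8, 1.0):
    print(f"{u:.0%} of peak bandwidth: {tokens_per_second(u):.1f} tok/s")
print(f"decode intensity: {flops_per_byte:.1f} FLOPs/byte, "
      f"machine balance: {machine_balance:.1f} FLOPs/byte")
```

Under these assumptions, 60-80% bandwidth utilization works out to roughly 22 to 29 tok/s, and the decode intensity sits well below the machine balance, which is why bs=1 generation is bandwidth-bound rather than compute-bound.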

lynguist 475 days ago

I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.

tgma 475 days ago

Pretty sure this has absolutely nothing to do with Deepseek or even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp came around.

Fact is, Mac Pros in the Intel days supported 1.5TB RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers so they would have shipped such a product regardless. Local LLM is cherry-on-top. Deepseek in particular almost certainly had nothing to do with it. They will still need to double their supported RAM in their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.

[1]: https://support.apple.com/en-us/101639

saagarjha 475 days ago

The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.

tgma 474 days ago

I understand why they are excited about it—just pointing out it is a happy coincidence. They would have and should have made such a product to address the need of RAM users alone, not VRAM in particular, before they have a credible case to cut macOS releases on Intel.

water9 475 days ago

Intel integrated graphics technically also used unified memory with standard DRAM.

kergonath 475 days ago

Those also have terrible performance and worse bandwidth. I am not sure they are really relevant, to be honest.

McDaveNZ 475 days ago

Did the Xeons in the Mac Pro even have integrated graphics?

icedchai 474 days ago

So did the Amiga, almost 40 years ago...

kmacdough 475 days ago

That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.

That said, I doubt it was explicitly for R1, but rather based on the industry a few years ago, when GPT-3's 175B was SOTA but the industry was still looking to go larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.

icedchai 474 days ago

The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.

teknologist 473 days ago

Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.

angoragoats 470 days ago

As mentioned elsewhere in this thread, unified memory existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac Pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).

The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.

brookst 475 days ago

Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?

happyopossum 474 days ago

> they specifically built this M3 Ultra for DeepSeek R1 4-bit

Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.

tempaccount420 471 days ago

Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?

vaxman 474 days ago

"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).

It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)

SV_BubbleTime 474 days ago

No one is saying they built a new chip.

But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.

cyanydeez 474 days ago

Dies are designed in years.

This was just a coincidence.

SV_BubbleTime 474 days ago

What part of “no one is saying they designed a new chip” is lost here?

forrestthewoods 475 days ago

I don’t think you understand hardware timelines if you think this product had literally anything to do with anything DeepSeek.

reitzensteinm 475 days ago

Chip? Yes. Product? Not necessarily...

It's not completely out of the question that the 512GB version of M3 Ultra was built for their internal Apple silicon servers powering Private Cloud Compute, but not intended for consumer release, until a compelling use case suddenly arrived.

I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.

forrestthewoods 475 days ago

DeepSeek R1 came out Jan 20.

Literally impossible.

reitzensteinm 475 days ago

The scenario is that the 512GB M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256GB SKU for Mac Studio.

I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.

jahewson 475 days ago

That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus Apple is using OpenAI to provide its larger models anyway, so the need never even existed.

brookst 475 days ago

Apple is positively building custom servers, and quantities are closer to the 100k range than 1000 [0]

But I agree they are not using m3 ultra for that. It wouldn’t make any sense.

0. https://www.theregister.com/AMP/2024/06/11/apple_built_ai_cl...

teknologist 473 days ago

That could be why they're also selling it as the Mac Studio M3 Ultra

bustling-noose 475 days ago

My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory but it’s almost impossible to do that that close to a launch. Especially when memory is fused not just a module you can swap.

tgma 475 days ago

Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.

See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...

bustling-noose 473 days ago

I think by fused I meant it's stuck onto the SoC module, not part of the SoC as I may have worded it. While you could maybe still add NANDs later in the manufacturing process, it's probably not easy, especially if you need more NANDs and a larger module, which might cause more design problems. The NAND is closer because the controller is in the SoC. So the memory controller would probably also change with higher memory sizes, which would mean this cannot be a last-minute change.

fennecfoxy 471 days ago

Sheesh, the...comments on that link.

nightski 475 days ago

$10k to run a 4 bit quantized model. Ouch.

OriginalMrPink 475 days ago

That's today. What about tomorrow?

water9 475 days ago

The M4 MacBook Pro 128GB can run a 32B parameter model with 8-bit quantization just fine.

jrflowers 474 days ago

> they specifically built this M3 Ultra for DeepSeek R1 4-bit.

This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit

a1o 475 days ago

Any ideas on power consumption? I wonder how much power would that use. It looks like it would be more efficient than everything else that currently exists.

j45 475 days ago

Looks like up to 480W listed here

https://www.apple.com/mac-studio/specs/

a1o 474 days ago

Thanks!!

ryao 475 days ago

The M2 Ultra Mac Pro could reach a maximum of 330W according to Apple:

https://support.apple.com/en-us/102839

I assume it is similar.

drited 475 days ago

I would be curious what context window size could be expected when generating in the ballpark of 20 to 30 tokens per second using Deepseek-R1 Q4 on this hardware?

valine 475 days ago

Probably helps that models like deepseek are mixture of experts. Having all weights in VRAM means you don’t have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

FloatArtifact 475 days ago

> Probably helps that models like deepseek are mixture of experts. Having all weights in VRAM means you don’t have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the above quote? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.

Context window?

How much of the model can actively be processed, despite being fully loaded into memory, given the memory bandwidth?

valine 475 days ago

With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.

ein0p 475 days ago

What people who did not actually work with this stuff in practice don't realize is the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size of greater than 1 (speculative decoding), could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
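To put a rough number on the batch-size point, here is a small sketch of how quickly the "only read the active experts" advantage erodes as more tokens are processed at once. The expert counts (256 routed experts per layer, 8 picked per token) and the uniform, independent routing are assumptions made for illustration; real routers are not uniform.

```python
# Expected share of routed experts touched per layer as more tokens are
# processed together (larger batch, speculative decoding, or prefill).
# Assumptions (not from the thread): 256 routed experts per layer, 8 chosen
# per token, routing uniform and independent across tokens.

NUM_EXPERTS = 256
TOP_K = 8


def expected_fraction_touched(tokens_in_flight: int,
                              num_experts: int = NUM_EXPERTS,
                              top_k: int = TOP_K) -> float:
    """Expected fraction of experts hit by at least one token."""
    # Probability that one token does NOT route to a given expert.
    miss_one_token = 1 - top_k / num_experts
    # The expert's weights are skipped only if every token misses it.
    return 1 - miss_one_token ** tokens_in_flight


for n in (1, 8, 32, 128, 512):
    print(f"tokens in flight = {n:>3}: "
          f"{expected_fraction_touched(n):.1%} of experts read")
```

With these made-up but plausible numbers, a single token touches about 3% of the routed experts, while a few hundred tokens in flight touch essentially all of them, which is the "read the entire model" case described above.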

valine 475 days ago

No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT3 I had a conversation with someone where we tried to predict how long it would be before GPT3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B parameter GPT3 with 16bit precision, and I think that’s pretty cool.
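The arithmetic behind that claim is short enough to spell out. This sketch counts weights only; KV cache, activations and whatever the OS reserves for itself are ignored, so the real headroom is smaller.

```python
# Weight memory for a dense model at a given precision.
# 175B parameters and 16-bit precision come from the comment; 512 GB is the
# top unified-memory configuration discussed in the thread.
# Ignores KV cache, activations, and memory the OS keeps for itself.

def weight_gigabytes(params: float, bits_per_param: int) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


gpt3_fp16 = weight_gigabytes(175e9, 16)
print(f"GPT-3 175B at 16-bit: {gpt3_fp16:.0f} GB of weights")
print(f"Headroom in 512 GB: {512 - gpt3_fp16:.0f} GB")
```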

doctorpangloss 475 days ago

Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.

jonfromsf 475 days ago

Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.

ein0p 475 days ago

Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.

DevKoala 475 days ago

This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.

Der_Einzige 475 days ago

No one who is using this for home use cares about anything except batch size 1 sequence size 1.

ein0p 475 days ago

What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.

rfoo 475 days ago

For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).

Anything in between suffers.

bick_nyers 475 days ago

Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.

valine 475 days ago

Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.

diggan 475 days ago

> The question is if a llm will run with usable performance at that scale?

This is the big question to have answered. Many people claim Apple can now reliably be used as an ML workstation, but from the benchmark numbers I've seen, the models may fit in memory, but the tok/sec performance is so slow that it doesn't feel worth it compared to running on NVIDIA hardware.

Although it would be expensive as hell to get 512GB of VRAM with NVIDIA today, maybe moves like this from Apple could push down the prices at least a little bit.

radlad 475 days ago

It is much slower than nVidia, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205

bastardoperator 475 days ago

It's fast enough for me to cancel monthly AI services on a mac mini m4 max.

diggan 475 days ago

Could you maybe share a lightweight benchmark with the exact model (+ quantization if you're using that), runtime, settings used, and how many tokens/second you're getting? Or just a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?

Also, would be neat if you could say what AI services you were subscribed to, there is a huge difference between paid Claude subscription and the OpenAI Pro subscription for example, both in terms of cost and the quality of responses.

lostmsu 475 days ago

Hm, the AI services over 5 years cost half of m4 max minimal configuration which can barely run severely lobotomized LLaMA 70B. And they provide significantly better models.

Matl 475 days ago

Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.

nomel 475 days ago

It's probably much worse than that, with the falling prices of compute.

staticman2 475 days ago

Smaller, dumber models are faster than bigger, slower ones.

What model do you find fast enough and smart enough?

Matl 475 days ago

Not OP but I am finding the Qwen 2.5 32b distilled with DeepSeek R1 model to be a good speed/smartness ratio on the M4 Pro Mac Mini.

bastardoperator 474 days ago

I'm running the same exact models.

a1o 475 days ago

How much RAM?

jamesy0ung 475 days ago

I presume you're using the Pro, not the Max.

Anyways, what ram config, and what model are you using?

fetus8 475 days ago

How much RAM are you running on?

hangonhn 475 days ago

Do we know if it is slower because the hardware is not as well suited for the task, or is it mostly a software issue -- the code hasn't been optimized to run on Apple Silicon?

titzer 475 days ago

AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.

woadwarrior01 475 days ago

The neural engine is perfectly capable of accelerating matmuls. It's just that autoregressive decoding in single-batch LLM inference is memory bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although there's a huge power efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.

kridsdale1 475 days ago

I have to assume they’re doing something like that in the lab for 4 years from now.

azinman2 475 days ago

Memory bandwidth is the issue

bob1029 475 days ago

> The question is if a llm will run with usable performance at that scale?

For the self-attention mechanism, memory bandwidth requirements scale ~quadratically with the sequence length.
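One way to see where the quadratic term comes from, assuming standard full attention with a KV cache: each generated token reads the cached keys and values of every earlier token, so the per-step read is linear in context length and the total over a sequence is quadratic. The model shape below is hypothetical and only there to make the numbers concrete.

```python
# KV-cache traffic during decoding with standard (full) attention.
# The model shape below is hypothetical, chosen only to make the numbers
# concrete; it does not describe any particular model.

LAYERS = 60
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16


def kv_bytes_per_token() -> int:
    """Bytes of keys + values stored for one token across all layers."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def read_bytes_at_step(context_len: int) -> int:
    """Cache bytes read to generate one token: linear in context length."""
    return context_len * kv_bytes_per_token()


def total_read_bytes(seq_len: int) -> int:
    """Cache bytes read over a whole sequence: 1 + 2 + ... + n, quadratic."""
    return kv_bytes_per_token() * seq_len * (seq_len + 1) // 2


for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: {read_bytes_at_step(n) / 1e9:.2f} GB per step, "
          f"{total_read_bytes(n) / 1e12:.2f} TB over the sequence")
```

Growing the sequence 10x makes each step about 10x more expensive in cache reads and the whole run about 100x more expensive, on top of the fixed cost of streaming the weights.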

kridsdale1 475 days ago

Someone has got to be working on a better method than that. Hundreds of billions are at stake.

cxie 476 days ago

Guess what? I'm on a mission to completely max out all 512GB of mem...maybe by running DeepSeek on it. Pure greed!

swivelmaster 475 days ago

You could always just open a few Chrome tabs…

ksec 475 days ago

It may not be Firefox in terms of hundreds or thousands of tabs but Chrome has gotten a lot more memory efficient since around 2022.

petepete 475 days ago

Give Cities Skylines 2 a try.

zactato 473 days ago

It doesn't support Macs yet

deepGem 475 days ago

Any idea what the sRAM to uRAM ratio is on these new GPUs? If they have meaningfully higher sRAM than the Hopper GPUs, it could lead to meaningful speedups in large model training.

If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it, right? No speedups.

For any speedups you may need some new variant of FlashAttention3, or something along similar lines, purpose-built for Apple GPUs.

astrange 475 days ago

I don't know what you mean by s and u, but there is only one kind of memory in the machine, that's what unified memory means.

saagarjha 475 days ago

I assume they mean SRAM versus unified (D)RAM?

TheRealPomax 475 days ago

Yeah they did? The M4 Max has a max memory bandwidth of 546GBps, the M3 Ultra bumps that up to a max of 819GBps.

(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)

okanesen 475 days ago

Not that dramatic of an increase actually - the M2 Max already had 400GB/s and M2 Ultra 800GB/s memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4 Max's additional 146GB/s over the M2 Max is indeed a more noticeable improvement.

choilive 475 days ago

Also should note that 800/819GB/s of memory bandwidth is actually VERY usable for LLMs. Consider that a 4090 is just a hair above 1000GB/s

hereonout2 475 days ago

Does it work like that though at this larger scale? 512GB of VRAM would be across multiple NVIDIA cards, so the bandwidth and access is parallelized.

But here it looks more of a bottleneck from my (admittedly naive) understanding.

choilive 475 days ago

For inference the bandwidth is generally not parallelized because the computation has to go through the model layer by layer. The most common model splitting method is to assign each GPU a subset of the LLM layers, and it doesn't take much bandwidth to send the activations via PCIe to the next GPU.

manmal 475 days ago

My understanding is that the GPU must still load its assigned layers from VRAM into registers and L2 cache for every token, because those aren’t large enough to hold a significant portion. So naively, for a 24GB layer, you'd need to move up to 24GB for every token.
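A toy model of the layer-wise split described in the last two comments, to show why adding cards does not speed up a single request. Assumptions: bs=1, every stage streams its share of the weights once per token, stages run strictly one after another, and inter-GPU transfer time is ignored. The 24GB and 1000GB/s figures are illustrative, not measured.

```python
# Per-token latency when a model is split layer-wise across GPUs.
# Assumptions: a single request (bs=1), every stage streams its own share
# of the weights once per token, stages run one after another, and
# inter-GPU transfer time is ignored. All figures are illustrative.

def seconds_per_token(stage_bytes: list[float],
                      stage_bandwidth: list[float]) -> float:
    """Stages run sequentially, so their weight-read times add up."""
    return sum(b / bw for b, bw in zip(stage_bytes, stage_bandwidth))


WEIGHT_BYTES_PER_TOKEN = 24e9   # total weight bytes read per token (example)
GPU_BANDWIDTH = 1000e9          # bytes/s per card (roughly 4090-class)

for num_gpus in (1, 2, 4):
    # Split the weights evenly; each card reads only its own slice.
    stages = [WEIGHT_BYTES_PER_TOKEN / num_gpus] * num_gpus
    t = seconds_per_token(stages, [GPU_BANDWIDTH] * num_gpus)
    print(f"{num_gpus} GPU(s): {1 / t:.1f} tok/s")
```

In this model the tok/s figure is the same for 1, 2 or 4 cards: more GPUs add capacity, not per-request bandwidth. Tensor-parallel setups, where every card works on every layer at once, behave differently and are not covered by this sketch.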

angoragoats 474 days ago

But the memory bandwidth is only part of the equation; the 4090 is at least several times faster at compute compared to the fastest Apple CPU/GPU.