| HN Mirror



	by cxie 475 days ago
	512GB of unified memory is truly breaking new ground. I was wondering when Apple would overcome memory constraints, and now we're seeing a half-terabyte level of unified memory. This is incredibly practical for running large AI models locally ("600 billion parameters"), and Apple's approach of integrating this much efficient memory on a single chip is fascinating compared to NVIDIA's solutions. I'm curious about how this design of "fusing" two M3 Max chips performs in terms of heat dissipation and power consumption though

15 comments

FloatArtifact 475 days ago

They didn't increase the memory bandwidth. You get the same memory bandwidth that is already available on the M2 Studio. Yes, of course, you can get 512 gigabytes of uRAM for 10 grand.

The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with enough uRAM and the increased processing speed of the new chip for AI, the memory bandwidth stays the same.

So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.

lhl 475 days ago

Since no one specifically answered your question yet, yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect in the ballpark of 20-30 tok/s (depends on how much MBW can actually be utilized) for text generation.

From my napkin math, the M3 Ultra TFLOPs is still relatively low (around 43 FP16 TFLOPs?), but it should be more than enough to handle bs=1 token generation (should be way <10 FLOPs/byte for inference). Now as far as its prefill/prompt processing speed... well, that's another matter.
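
The napkin math above can be sketched roughly as follows. This is a minimal estimate, not a benchmark: it assumes decode is purely memory-bound, that every active weight is read once per token, and that KV-cache traffic is negligible. The 819 GB/s figure is the M3 Ultra bandwidth quoted elsewhere in this thread, and the utilization values are illustrative assumptions.

```python
# Memory-bound decode estimate for a MoE model at batch size 1.
# From the comment above: 404 GB file, 671B total params, 37B active per token.
# Assumed: 819 GB/s peak bandwidth (figure quoted elsewhere in the thread).

def bytes_per_param(file_bytes: float, total_params: float) -> float:
    """Average storage cost of one weight in the quantized file."""
    return file_bytes / total_params

def decode_tokens_per_sec(bandwidth_bytes_s: float, active_params: float,
                          bytes_per_weight: float, utilization: float = 1.0) -> float:
    """Upper bound on tok/s if each active weight is read once per token."""
    bytes_per_token = active_params * bytes_per_weight
    return utilization * bandwidth_bytes_s / bytes_per_token

bpw = bytes_per_param(404e9, 671e9)  # roughly 0.6 bytes (about 4.8 bits) per weight
for util in (1.0, 0.8, 0.6):         # hypothetical bandwidth utilization levels
    tps = decode_tokens_per_sec(819e9, 37e9, bpw, util)
    print(f"utilization {util:.0%}: ~{tps:.0f} tok/s")
```

Under these assumptions the ceiling lands in the high 30s of tok/s, and the 20-30 tok/s ballpark corresponds to using somewhere around 60-80% of peak bandwidth.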

lynguist 475 days ago

I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.

tgma 475 days ago

Pretty sure this has absolutely nothing to do with Deepseek or even local LLM at large, which has been a thing for a while and an obvious use case since the original Llama leak and llama.cpp coming around.

Fact is Mac Pros in the Intel days supported 1.5TB RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers so they would have shipped such a product regardless. Local LLM is cherry-on-top. Deepseek in particular almost certainly had nothing to do with it. They will still need to double their supported RAM in their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.

[1]: https://support.apple.com/en-us/101639

saagarjha 475 days ago

The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.

tgma 474 days ago

I understand why they are excited about it—just pointing out it is a happy coincidence. They would have and should have made such a product to address the need of RAM users alone, not VRAM in particular, before they have a credible case to cut macOS releases on Intel.

water9 475 days ago

Intel integrated graphics technically also used unified memory, with standard DRAM.

kmacdough 475 days ago

That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.

That said, I doubt it was explicitly for R1, but rather based on the industry a few years ago, when GPT-3's 175B was SOTA but the industry was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.

icedchai 474 days ago

The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.

teknologist 473 days ago

Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.

brookst 475 days ago

Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?

happyopossum 474 days ago

> they specifically built this M3 Ultra for DeepSeek R1 4-bit

Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.

tempaccount420 470 days ago

Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?

vaxman 474 days ago

"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).

It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)

SV_BubbleTime 474 days ago

No one is saying they built a new chip.

But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.

cyanydeez 474 days ago

Dies are designed in years.

This was just a coincidence.

forrestthewoods 475 days ago

I don’t think you understand hardware timelines if you think this product had literally anything to do with anything DeepSeek.

reitzensteinm 475 days ago

Chip? Yes. Product? Not necessarily...

It's not completely out of the question that the 512GB version of M3 Ultra was built for their internal Apple silicon servers powering Private Cloud Compute, but not intended for consumer release, until a compelling use case suddenly arrived.

I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.

forrestthewoods 475 days ago

DeepSeek R1 came out Jan 20.

Literally impossible.

jahewson 475 days ago

That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus Apple is using OpenAI to provide its larger models anyway, so the need never even existed.

bustling-noose 475 days ago

My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory but it’s almost impossible to do that that close to a launch. Especially when memory is fused not just a module you can swap.

tgma 475 days ago

Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.

See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...

nightski 475 days ago

$10k to run a 4 bit quantized model. Ouch.

OriginalMrPink 475 days ago

That's today. What about tomorrow?

water9 475 days ago

The M4 MacBook Pro 128GB can run a 32B parameter model with 8-bit quantization just fine.

jrflowers 474 days ago

> they specifically built this M3 Ultra for DeepSeek R1 4-bit.

This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit

a1o 475 days ago

Any ideas on power consumption? I wonder how much power would that use. It looks like it would be more efficient than everything else that currently exists.

j45 475 days ago

Looks like up to 480W listed here

https://www.apple.com/mac-studio/specs/

a1o 474 days ago

Thanks!!

ryao 475 days ago

The M2 Ultra Mac Pro could reach a maximum of 330W according to Apple:

https://support.apple.com/en-us/102839

I assume it is similar.

drited 475 days ago

I would be curious what context window size could be expected when generating at ballpark 20 to 30 tokens per second using Deepseek-R1 Q4 on this hardware?

valine 475 days ago

Probably helps that models like deepseek are mixture of experts. Having all weights in VRAM means you don’t have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

FloatArtifact 475 days ago

> Probably helps that models like deepseek are mixture of experts. Having all weights in VRAM means you don’t have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the above quote? I understand having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.

Context window?

How much the model can actively be processed despite being fully loaded into memory based on memory bandwidth?

valine 475 days ago

With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.

ein0p 475 days ago

What people who did not actually work with this stuff in practice don't realize is the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size of greater than 1 (speculative decoding), could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
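
A rough way to see why the "only the active parameters are read" argument weakens as batch or sequence size grows is sketched below. It assumes each token independently picks `top_k` of `num_experts` routed experts uniformly at random, which real routers do not do exactly. The values 256 and 8 are assumed for a DeepSeek-style layer and do not come from this thread.

```python
# Expected fraction of routed experts touched in one MoE layer when several
# tokens are processed together (batch, prefill, or speculative decoding).
# Simplifying assumption: uniform, independent routing per token.

def expected_expert_fraction(num_experts: int, top_k: int, tokens: int) -> float:
    miss = 1.0 - top_k / num_experts  # chance one token skips a given expert
    return 1.0 - miss ** tokens       # chance at least one token hits it

for tokens in (1, 4, 16, 64, 256):
    frac = expected_expert_fraction(256, 8, tokens)  # 256 and 8 are assumed values
    print(f"tokens={tokens:>3}: ~{frac:.0%} of routed expert weights read")
```

With one token only a few percent of the routed experts are read per layer; with a few dozen tokens in flight most of them are, which is the regime the comment above warns about.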

valine 475 days ago

No one should be buying this for batch inference obviously.

I remember right after OpenAI announced GPT3 I had a conversation with someone where we tried to predict how long it would be before GPT3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B parameter GPT3 with 16bit precision, and I think that’s pretty cool.
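
The fit claim is simple arithmetic, sketched here for a few model sizes mentioned in the thread. It counts weights only; KV cache, activations, and whatever memory the OS reserves are ignored, so real headroom is smaller than this suggests.

```python
# Weight-only memory footprint: params * bits_per_weight / 8.

def weight_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

UNIFIED_MEMORY_GB = 512  # top Mac Studio configuration discussed in the thread

models = {
    "GPT-3 175B @ 16-bit": weight_gb(175e9, 16),
    "70B @ 16-bit": weight_gb(70e9, 16),
    "32B @ 8-bit": weight_gb(32e9, 8),
}
for name, gb in models.items():
    verdict = "fits" if gb < UNIFIED_MEMORY_GB else "does not fit"
    print(f"{name}: ~{gb:.0f} GB of weights, {verdict} in {UNIFIED_MEMORY_GB} GB")
```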

doctorpangloss 475 days ago

Sure, nuance.

This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.

Der_Einzige 475 days ago

No one who is using this for home use cares about anything except batch size 1 sequence size 1.

rfoo 475 days ago

For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).

Anything in between suffers.

bick_nyers 475 days ago

Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.

valine 475 days ago

Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.

diggan 475 days ago

> The question is if a llm will run with usable performance at that scale?

This is the big question to have answered. Many people claim Apple can now reliably be used as a ML workstation, but from the numbers I've seen from benchmarks, the models may fit in memory, but the tok/sec performance is so slow that it doesn't feel worth it, compared to running it on NVIDIA hardware.

Although it would be expensive as hell to get 512GB of VRAM with NVIDIA today, maybe moves like this from Apple could push down the prices at least a little bit.

radlad 475 days ago

It is much slower than nVidia, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205

bastardoperator 475 days ago

It's fast enough for me to cancel monthly AI services on a mac mini m4 max.

diggan 475 days ago

Could you maybe share a lightweight benchmark where you share the exact model (+ quantization if you're using that) + runtime + used settings and how many tokens/second you're getting? Or just like a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?

Also, would be neat if you could say what AI services you were subscribed to, there is a huge difference between paid Claude subscription and the OpenAI Pro subscription for example, both in terms of cost and the quality of responses.

lostmsu 475 days ago

Hm, the AI services over 5 years cost half of m4 max minimal configuration which can barely run severely lobotomized LLaMA 70B. And they provide significantly better models.

Matl 475 days ago

Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.

nomel 475 days ago

It's probably much worse than that, with the falling prices of compute.

staticman2 475 days ago

Smaller, dumber models are faster than bigger, slower ones.

What model do you find fast enough and smart enough?

Matl 475 days ago

Not OP but I am finding the Qwen 2.5 32b distilled with DeepSeek R1 model to be a good speed/smartness ratio on the M4 Pro Mac Mini.

jamesy0ung 475 days ago

I presume you're using the Pro, not the Max.

Anyways, what ram config, and what model are you using?

fetus8 475 days ago

How much RAM are you running on?

hangonhn 475 days ago

Do we know if it is slower because the hardware is not as well suited for the task, or is it mostly a software issue -- the code hasn't been optimized to run on Apple Silicon?

titzer 475 days ago

AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.

woadwarrior01 475 days ago

The neural engine is perfectly capable of accelerating matmuls. It's just that autoregressive decoding in single batch LLM inference is memory bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although there's a huge power efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.
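
A roofline-style check illustrates why single-batch decode ends up bandwidth constrained. This is a sketch under simplifying assumptions: about 2 FLOPs per weight per token, weights read once per decode step as in a dense layer, and KV-cache traffic ignored. The 43 TFLOPs and 819 GB/s inputs are the figures quoted in this thread, and 0.6 bytes per weight approximates a 4-bit quant.

```python
# Compare the arithmetic intensity of decode with the machine's ridge point.

def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """FLOPs per byte above which the machine is compute-bound."""
    return peak_flops / bandwidth_bytes_s

def decode_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs done per byte of weights read (one multiply-add per weight per token)."""
    return 2.0 * batch_size / bytes_per_weight

ridge = ridge_point(43e12, 819e9)  # thread figures, not measured values
for bs in (1, 4, 16, 64):
    ai = decode_intensity(bs, 0.6)
    bound = "memory" if ai < ridge else "compute"
    print(f"batch={bs:>2}: {ai:6.1f} FLOPs/byte vs ridge {ridge:.1f} -> {bound}-bound")
```

At batch size 1 the intensity is a small fraction of the ridge point, so extra matmul throughput from any accelerator would sit idle waiting on memory.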

kridsdale1 475 days ago

I have to assume they’re doing something like that in the lab for 4 years from now.

azinman2 475 days ago

Memory bandwidth is the issue

bob1029 475 days ago

> The question is if a llm will run with usable performance at that scale?

For the self-attention mechanism, memory bandwidth requirements scale ~quadratically with the sequence length.
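
One way to read the quadratic claim, assuming a standard KV cache: each new token reads the keys and values of every earlier position, so per-token traffic grows linearly with context and the total over a whole sequence grows roughly quadratically. The model dimensions below are hypothetical and chosen only for illustration.

```python
# KV-cache read traffic during autoregressive decode.

def kv_bytes_per_position(layers: int, kv_heads: int, head_dim: int,
                          bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored for one token position across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def kv_read_cumulative(n_tokens: int, per_position: int) -> int:
    """Total bytes read if token t attends to t positions: sum of 1..n."""
    return per_position * n_tokens * (n_tokens + 1) // 2

per_pos = kv_bytes_per_position(layers=32, kv_heads=8, head_dim=128)  # hypothetical model
for n in (1_000, 2_000, 4_000):
    total_gb = kv_read_cumulative(n, per_pos) / 1e9
    print(f"{n} tokens: ~{total_gb:.0f} GB of cumulative KV reads")
```

Doubling the sequence length roughly quadruples the cumulative reads, which is the scaling the comment refers to.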

kridsdale1 475 days ago

Someone has got to be working on a better method than that. Hundreds of billions are at stake.

cxie 475 days ago

Guess what? I'm on a mission to completely max out all 512GB of mem...maybe by running DeepSeek on it. Pure greed!

swivelmaster 475 days ago

You could always just open a few Chrome tabs…

ksec 475 days ago

It may not be Firefox in terms of hundreds or thousands of tabs but Chrome has gotten a lot more memory efficient since around 2022.

petepete 475 days ago

Give Cities Skylines 2 a try.

zactato 473 days ago

It doesn't support Macs yet

deepGem 475 days ago

Any idea what the sRAM to uRAM ratio is on these new GPUs ? If they have meaningfully higher sRAM than the Hopper GPUs, it could lead to meaningful speedups in large model training.

If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it, right? No speedups.

For any speedups you may need some new variant of FlashAttention3 or something along similar lines to be purpose built for Apple GPUs.

astrange 475 days ago

I don't know what you mean by s and u, but there is only one kind of memory in the machine, that's what unified memory means.

saagarjha 475 days ago

I assume they mean SRAM versus unified (D)RAM?

TheRealPomax 475 days ago

Yeah they did? The M4 has a max memory bandwidth of 546GBps, the M3 Ultra bumps that up to a max of 819GBps.

(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)

okanesen 475 days ago

Not that dramatic of an increase actually - the M2 Max already had 400GB/s and M2 Ultra 800GB/s memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4's additional 146GB/s is indeed a more noticeable improvement.

choilive 475 days ago

Also should note that 800/819GB/s of memory bandwidth is actually VERY usable for LLMs. Consider that a 4090 is just a hair above 1000GB/s

hereonout2 475 days ago

Does it work like that though at this larger scale? 512GB of VRAM would be across multiple NVIDIA cards, so the bandwidth and access is parallelized.

But here it looks like more of a bottleneck from my (admittedly naive) understanding.

choilive 475 days ago

For inference the bandwidth is generally not parallelized because the weights need to go through the model layer by layer. The most common model splitting method is done by assigning each GPU a subset of the LLM layers and it doesn't take much bandwidth to send model weights via PCIE to the next GPU.

angoragoats 474 days ago

But the memory bandwidth is only part of the equation; the 4090 is at least several times faster at compute compared to the fastest Apple CPU/GPU.

sudoshred 475 days ago

Agree. Finally I can have several hundred browser tabs open simultaneously with no performance degradation.

protocolture 475 days ago

Well at least 20

Dban1 475 days ago

New update just came in, make that 15

nikisweeting 474 days ago

My M1 Max regularly pushes 1000+ tabs without breaking a sweat, I feel like this particular metric is no longer useful now that background tab memory is almost always unloaded by the browser.

nullc 474 days ago

I'm not sure that unified memory is particularly relevant for that-- so e.g. on zen4/zen5 epyc there is more than enough arithmetic power that LLM inference is purely memory bandwidth limited.

On dual (SP5) Epyc I believe the memory bandwidth is somewhat greater than this apple product too... and at apple's price points you can have about twice the ram too.

Presumably the apple solution is more power efficient.

PeterStuer 475 days ago

Is this on chip memory? From the 800GB/s I would guess more likely a 512bit bus (8 channel) to DDR5 modules. Doing it on a quad channel would just about be possible, but really be pushing the envelope. Still a nice thing.

As for practicality, which mainstream applications would benefit from this much memory paired with nice but relatively mid compute? At this price point (14K for a fully specced system), would you prefer it over e.g. a couple of NVIDIA Project DIGITS units (assuming that arrives on time and for around the announced 3K price point)?
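
The bus-width guess can be sanity-checked with the usual peak-bandwidth formula: bus width in bytes times transfers per second. The 6400 MT/s rate and the 1024-bit width below are illustrative assumptions, not figures from this thread; they are simply one combination that lands near the quoted ~800 GB/s.

```python
# Peak DRAM bandwidth = (bus width in bytes) * (transfers per second).

def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    return bus_width_bits / 8 * mega_transfers_per_s * 1e6 / 1e9

def required_mt_s(target_gb_s: float, bus_width_bits: int) -> float:
    """Transfer rate needed to reach a target bandwidth on a given bus width."""
    return target_gb_s * 1e9 / (bus_width_bits / 8) / 1e6

print(peak_bandwidth_gb_s(512, 6400))   # 512-bit bus at an assumed 6400 MT/s
print(peak_bandwidth_gb_s(1024, 6400))  # assumed 1024-bit bus at the same rate
print(required_mt_s(800, 512))          # rate a 512-bit bus would need for 800 GB/s
print(required_mt_s(800, 256))          # rate a 256-bit (quad channel) bus would need
```

Under these assumptions a 512-bit bus at 6400 MT/s gives only about half of 800 GB/s, so either the bus is wider or the memory is much faster than that.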

zitterbewegung 475 days ago

NVIDIA project DIGITS has 128 GB LPDDR5x coherent unified system memory at a 273 Gb/s memory bus speed.

bangaladore 475 days ago

It would be 273 GB/s (gigabytes, not gigabits). But in reality we don't know the bandwidth. Some ex employee said 500 GB/s.

Your source is a reddit post in which they try to match the size to existing chips, without realizing that it's very likely that NVIDIA is using custom memory here produced by Micron. Like Apple uses custom memory chips.

PeterStuer 474 days ago

Yes, but for the price of that single M3 ultra I could have 4 of those GB10's running in a 2x2 cluster with the full NVIDIA stack supported (which is still a big thing)

So M3 preference will depend on whether a niche can significantly benefit from a monolithic lower-compute, high-memory setup vs a higher-compute but distributed setup.

MBCook 475 days ago

Unless something has changed, it's on package, but not on the same die.

rlt 474 days ago

Is putting RAM on the same chip as processing economical?

I would have assumed you’d want to save the best process/node for processing, and could use a less expensive process for RAM.

RataNova 475 days ago

It's a game changer for sure.... 512GB of unified memory really pushes the envelope, especially for running complex AI models locally. That said, the real test will be in how well the dual-chip design handles heat and power efficiency

resters 475 days ago

The same thing could be designed with greater memory bandwidth, and so it's just a matter of time (for NVIDIA) until Apple decides to compete.

dheera 475 days ago

It will cost 4X what it costs to get 512GB on an x86 server motherboard.

valine 475 days ago

What would it cost to get 512GB of VRAM on an Nvidia card? That’s the real comparison.

dheera 475 days ago

Apples to oranges. NVIDIA cards have an order of magnitude more horsepower for compute than this thing. A B100 has 8 TB/s of memory bandwidth, 10 times more than this. If NVIDIA made a card with 512GB of HBM I'd expect it to cost $150K.

The compute and memory bandwidth of the M3 Ultra is more in-line with what you'd get from a Xeon or Epyc/Threadripper CPU on a server motherboard; it's just that the x86 "way" of doing things is usually to attach a GPU for way more horsepower rather than squeezing it out of the CPU.

This will be good for local LLM inference, but not so much for training.

pklausler 475 days ago

This prompts an "old guy anecdote"; forgive me.

When I was much younger, I got to work on compilers at Cray Computer Corp., which was trying to bring the Cray-3 to market. (This was basically a 16-CPU Cray-2 implemented with GaAs parts; it never worked reliably.)

Back then, HPC performance was measured in mere megaflops. And although the Cray-2 had peak performance of nearly 500MF/s/CPU, it was really hard to attain, since its memory bandwidth was just 250M words/s/CPU (2GB/s/CPU); so you had to have lots of operand re-use to not be memory-bound. The Cray-3 would have had more bandwidth, but it was split between loads and stores, so it was still quite a ways away from the competing Cray X-MP/Y-MP/C-90 architecture, which could load two words per clock, store one, and complete an add and a multiply.

So I asked why the Cray-3 didn't have more read bandwidth to/from memory, and got a lesson from the answer that has stuck. You could actually see how much physical hardware in that machine was devoted to the CPU/memory interconnect, since the case was transparent -- there was a thick nest of tiny blue & white twisted wire pairs between the modules, and the stacks of chips on each CPU devoted to the memory system were a large proportion of the total. So the memory and the interconnect constituted a surprising (to me) majority of the machine. Having more floating-point performance in the CPUs than the memory could sustain meant that the memory system was oversubscribed, and that meant that more of the machine was kept fully utilized. (Or would have been, had it worked...)

In short, don't measure HPC systems with just flops. Measure the effective bandwidth over large data, and make sure that the flops are high enough to keep it utilized.
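
The lesson can be put in numbers with a small attainable-performance calculation using the Cray-2 figures from the anecdote (about 500 MFLOP/s peak and 250M words/s per CPU). The kernel chosen here, a vector update doing 2 flops per 3 words moved, is an illustrative assumption and not from the story.

```python
# Attainable rate = min(peak compute, memory bandwidth * operand re-use).

def attainable_flops(peak_flops: float, words_per_s: float, flops_per_word: float) -> float:
    return min(peak_flops, words_per_s * flops_per_word)

PEAK = 500e6   # flop/s per CPU, from the anecdote
WORDS = 250e6  # words/s per CPU, from the anecdote

# Hypothetical kernels with different amounts of operand re-use.
for name, intensity in (("y = a*x + y (2 flops / 3 words)", 2 / 3),
                        ("2 flops per word", 2.0),
                        ("8 flops per word", 8.0)):
    rate = attainable_flops(PEAK, WORDS, intensity)
    print(f"{name}: {rate / 1e6:.0f} MFLOP/s ({rate / PEAK:.0%} of peak)")
```

Only kernels that perform at least 2 flops per word moved can reach peak on those figures, which is the "lots of operand re-use" requirement described above.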

worthless-trash 475 days ago

That is a great story. Please never hesitate to drop these in.

Do you have a blog?

musicale 475 days ago

> so you had to have lots of operand re-use to not be memory-bound

Looking at Nvidia's spec sheet, an H100 SXM can do 989 tf32 teraflops (or 67 non-tensor core fp32 teraflops?) and 3.35 TB/s memory (HBM) bandwidth, so ... similar problem?

pklausler 475 days ago

There is caching today.

LeifCarrotson 475 days ago

Yep, it's apples to oranges. But sometimes you want apples, and sometimes you want oranges, so it's all good!

There's a wide spectrum of potential requirements between memory capacity, memory bandwidth, compute speed, compute complexity, and compute parallelism. In the past, a few GB was adequate for tasks that we assigned to the GPU, you had enough storage bandwidth to load the relevant scene into memory and generate framebuffers, but now we're running different workloads. Conversely, a big database server might want its entire contents to be resident in many sticks of ECC DIMMs for the CPU, but only needed a couple dozen x86-64 threads. And if your workload has many terabytes or petabytes of content to work with, there are network file systems with entirely different bandwidth targets for entire racks of individual machines to access that data at far slower rates.

There's a lot of latency between the needs of programmers and the development and shipping of hardware to satisfy those needs, I'm just happy we have a new option on that spectrum somewhere in the middle of traditional CPUs and traditional GPUs.

As you say, if Nvidia made a 512 GB card it would cost $150k, but this costs an order of magnitude less than that. Even high-end consumer cards like a 5090 have 16x less memory than this does (average enthusiasts on desktops have maybe 8 GB) and just over double the bandwidth (1.7 TB/s).

Also, nit pick FTA:

> Starting at 96GB, it can be configured up to 512GB, or over half a terabyte.

512 GB is exactly half of a terabyte, which is 1024 GB. It's too late for hard drives - the marketing departments have redefined storage to use multipliers of 1000 and invented "tebibytes" - but in memory we still work with powers of two. Please.
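
Both readings of the marketing line can be checked with a few lines of arithmetic. This sketch assumes the "512GB" of memory means 512 × 2^30 bytes, as the comment argues is the convention for RAM; whether the press release intended the decimal comparison is not something the thread establishes.

```python
# Binary vs decimal units for a "512 GB" memory configuration.
GIB = 2**30    # gibibyte
TIB = 2**40    # tebibyte
TB = 10**12    # decimal terabyte

capacity_bytes = 512 * GIB  # assumption: RAM sizes use powers of two

print(capacity_bytes)        # total bytes
print(capacity_bytes / TIB)  # fraction of a binary terabyte
print(capacity_bytes / TB)   # fraction of a decimal terabyte
```

Measured in binary units the capacity is exactly half; measured in decimal terabytes it comes out a little above half, which is the only sense in which "over half a terabyte" holds.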

dheera 475 days ago

Sure, if you want to do training get an NVIDIA card. My point is that it's not worth comparing either Mac or CPU x86 setup to anything with NVIDIA in it.

For inference setups, my point is that instead of paying $10000-$15000 for this Mac you could build an x86 system for <$5K (Epyc processor, 512GB-768GB RAM in 8-12 channels, server mobo) that does the same thing.

The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.

KingOfCoders 475 days ago

But this is how it wonderfully works. +$4000 does two things: 1. Make Apple very very rich 2. Make people think this is better than a $10k EPYC. Win-Win for Apple. At the point when you have convinced people that you are the best, a higher price just means people think you are even better.

MBCook 475 days ago

> The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.

That requires an otherwise equivalent PC to exist. I haven’t seen anyone name a PC with a half-TB of unified memory in this thread.

Yeah it’s $4k. Yeah that’s nuts. But it’s the only game in town like that. If the replacement is a $40k setup from Nvidia or whatever that’s a bargain.

kombine 475 days ago

An X86 server comparable in performance to M3 Ultra will likely be a few times more energy hungry, no?

egorfine 475 days ago

> we still work with powers of two. Please.

We do. Common people don't. It's easier to write "over half a terabyte" than explain (again) to millions of people what the power of two is.

johnklos 475 days ago

Anyone who calls 512 gigs "over half a terabyte" is bullshitting. No, thank you.

zitterbewegung 475 days ago

Since the GH200 has over a terabyte of VRAM at $343,000 and the H100 has 80GB that makes that $195,993 with a bit over 512GB of VRAM. You could beat the price of the Apple M3 Ultra with an AMD EPYC build.

treesciencebot 475 days ago

GH200 is nowhere near the $343,000 number. You can get a single server order around 45k (with inception discount). If you are buying bulk, it goes down to sub-30k ish. This comes with an H100's performance and an insane amount of high bandwidth memory.

wmf 475 days ago

They probably meant 8xH200 for $343,000 which is in the ballpark.

zitterbewegung 475 days ago

Yes, this is what I meant, since 8 would cover 512GB of RAM.

bick_nyers 475 days ago

About $12k when Project Digits comes out.

MBCook 475 days ago

Apple is shipping today. No future promises.

valine 475 days ago

That will only have 128GB of unified memory

dragonwriter 475 days ago

128GB for 3K; per the announcement their ConnectX networking allows two Project Digits devices to be plugged into each other and work together as one device giving you 256GB for $6k, and, AFAIK, existing frameworks can split models across devices, as well, hence, presumably, the upthread suggestion that Project Digits would provide 512GB for $12k, though arguably the last step is cheating.

justincormack 475 days ago

The reason Nvidia only talks about two machines over the network is, I think, that they only have one network port, so you need to add the cost of a switch.

smith7018 475 days ago

You can build an x86 machine that can fully run DeepSeek R1 with 512GB VRAM for ~$2,500?

ta988 475 days ago

You will have to explain to me how.

bmelton 475 days ago

https://digitalspaceport.com/how-to-run-deepseek-r1-671b-ful...

muricula 475 days ago

Is that a CPU based inference build? Shouldn't you be able to get more performance out of the M3's GPU?

wmf 475 days ago

Inference is about memory bandwidth and some CPUs have just as much bandwidth as a GPU.

radlad 475 days ago

https://news.ycombinator.com/item?id=42897205

hbbio 475 days ago

How would you compare the tok/sec between this setup and the M3 Max?

aurareturn 475 days ago

3.5 - 4.5 tokens/s on the $2,000 AMD Epyc setup. Deepseek 671b q4.

The AMD Epyc build is severely bandwidth and compute constrained.

~40 tokens/s on M3 Ultra 512GB by my calculation.

wolfgangK 475 days ago

IMO, it would be more interesting to have a 3-way comparison of price/performance between DeepSeek 671b running on:

1. M3 Ultra 512
2. AMD Epyc (which Gen? AVX512 and DDR5 might make a difference in both performance and cost; Gen 4 or Gen 5 have 8 or 9 t/s https://github.com/ggml-org/llama.cpp/discussions/11733 )
3. AMD Epyc + 4090 or 5090 running KTransformers (over 10 t/s decode? https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...)

hbbio 475 days ago

Thanks!

If the M3 can run 24/7 without overheating it's a great deal to run agents. Especially considering that it should run only using 350W... so roughly $50/mo in electricity costs.
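
The electricity figure can be reproduced with a quick calculation. The 350W draw is the number from the comment above; the $0.20/kWh price is an assumed rate chosen because it reproduces the roughly $50/month estimate, and a machine that idles most of the time would cost far less.

```python
# Monthly electricity cost for a machine running at a constant power draw.

def monthly_kwh(watts: float, hours_per_day: float = 24, days: int = 30) -> float:
    return watts / 1000 * hours_per_day * days

def monthly_cost(watts: float, price_per_kwh: float) -> float:
    return monthly_kwh(watts) * price_per_kwh

PRICE_PER_KWH = 0.20  # assumed electricity price in $/kWh, not from the thread

print(f"{monthly_kwh(350):.0f} kWh per month at a constant 350 W")
print(f"${monthly_cost(350, PRICE_PER_KWH):.2f} per month at ${PRICE_PER_KWH:.2f}/kWh")
```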

aenis 475 days ago

Out of curiosity, if you don't mind: what kind of an agent would you run 24/7 locally?

I'd assume this thing peaks at 350W (or whatever) but idles at around 40W tops?

sgt 475 days ago

What kind of Nvidia-based rig would one need to achieve 40 tokens/sec on Deepseek 671b? And how much would it cost?

aurareturn 475 days ago

Around 5x Nvidia A100 80GB can fit 671b Q4. $50k just for the GPUs and likely much more when including cooling, power, motherboard, CPU, system RAM, etc.

matt-p 475 days ago

Not really like for like.

The pricing isn't as insane as you'd think, 96 to 256GB is 1500 which isn't 'cheap' but, it could be worse.

All in, 5,500 gets you an Ultra with 256GB memory, 28 cores, 60 GPU cores, 10Gb network - I think you'd be hard pushed to build a server for less.

kllrnohj 475 days ago

5,500 easily gets me either vastly more CPU cores if I care more about that or a vastly faster GPU if I care more about that. Or for both a 9950x + 5090 (assuming you can actually find one in stock) is ~$3000 for the pair + motherboard, leaving a solid $2500 for whatever amount of RAM, storage, and networking you desire.

The M3 strikes a very particular middle ground for AI of lots of RAM but a significantly slower GPU which nothing else matches, but that also isn't inherently the right balance either. And for any other workloads, it's quite expensive.

seanmcdirmid 475 days ago

You'll need a couple of 32GB 5090s to run a quantized 70B model, maybe 4 to run a 70b model without quantization, forget about anything larger than that. A huge model might run slow on a M3 Ultra, but at least you can run it all.

I have a Max M3 (the non-binned one), and I feel like 64GB or 96GB is within the realm of enabling LLMs that run reasonably fast on it (it is also a laptop, so I can do things on planes or trips). I thought about the Ultra: if you have 128GB for a top line M3 Ultra, the models that you could fit into memory would run fairly fast. For 512GB, you could run the bigger models, but not very quickly, so maybe not much point (at least for my use cases).

matt-p 475 days ago

That config would also use about 10x the power, and you still wouldn't be able to run a model over 32GB whereas the studio can easily cope with 70B llama and plenty of space to grow.

I think it actually is perfect for local inference in a way that that build, or any other PC build in this price range, wouldn't be.

kllrnohj 475 days ago

The M3 Ultra studio also wouldn't be able to run path traced Cyberpunk at all no matter how much RAM it has. Workloads other than local inference LLMs exist, you know :) After all, if the only thing this was built to do was run LLMs then they wouldn't have bothered adding so many CPU cores or video engines. CPU cores (along with networking) being 2 of the specs highlighted by the person I was responding to, so they were obviously valuing more than just LLM use cases.

dagmx 475 days ago

Bad game example because cyberpunk with raytracing is coming to macOS and will run on this.

kridsdale1 475 days ago

The core customer market for this thing remains Video Editors. That’s why they talk about simultaneous 8K encoding streams.

Apple’s Pro segment has been video editors since the 90s.

hot_gril 475 days ago

Well that's what (s)he meant, the Mac Studio fits the AI use case but not other ones so much.

jltsiren 475 days ago

Consumer hardware is cheap, if 192 GB of RAM is enough for you. But if you want to go beyond that, the Mac Studio is very competitively priced. A minimal Threadripper workstation with 256 GB is ~$7400 from Puget Systems. If you increase the memory to 512 GB, the price goes up to ~$10900. Mostly because 128 GB modules are about as expensive as what Apple charges for RAM. A Threadripper Pro workstation can use cheaper 8x64 GB for the same capacity, but because the base system is more expensive, you'll end up paying ~$11600.

TylerE 475 days ago

The Mac almost fits in the palm of your hand, and runs, if not silently, practically so. It doesn't draw excessive power or generate noticeable heat.

None of those will be true for any PC/Nvidia build.

It's hard to put a price on quality of life.

AnthonBerg 475 days ago

That’s not going to yield the same bandwidth or memory latency though, right?

rbanffy 475 days ago

You'd need a chip with 8 memory channels. 16 DIMM slots, IIRC.

TheRealPomax 475 days ago

I think the other big thing is that the base model finally starts at a normal amount of memory for a production machine. You can't get less than 96GB. Although an extra $4000 for the 512GB model seems Tim Apple levels of ridiculous. There is absolutely no way that the difference costs anywhere near that much at the fab.

And the storage solution still makes no sense of course, a machine like this should start at 4TB for $0 extra, 8TB for $500 more, and 16TB for $1000 more. Not start at a useless 1TB, with the 8TB version costing an extra $2400 and 16TB a truly idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives for $1000, SoC storage should set you back half that, not over double that.

jjtheblunt 475 days ago

> There is absolutely no way that the difference costs anywhere near that much at the fab.

price premium probably, but chip lithography errors (thus, yields) at the huge memory density might be partially driving up the cost for huge memory.

wtallis 475 days ago

> but chip lithography errors (thus, yields) at the huge memory density might be partially driving up the cost for huge memory.

Apple's not having TSMC fab a massive die full of memory. They're buying a bunch of small dies of commodity memory and putting them in a package with a pair of large compute dies. How many of those small commodity memory dies they use has nothing to do with yield.

jjtheblunt 474 days ago

Is there a teardown link available for what you wrote? If so, that’s interesting.

cayleyh 473 days ago

This has been pretty clear about all Apple chip designs, going back to some of the first A series afaik. They are "unified memory" but not "memory on die", they've always been "memory on package"-- ie. the ram is packaged together with the CPU, often under a single heat spreader, but they are separate components.

Apple's own product shots have shown this. Here's a bunch of links that clearly show the memory as separate. On lots of these modules you can make out the serial or model numbers and look up the manufacturer directly :)

- Side-by-side teardown of M1 Pro vs M2 Pro laptop motherboards showing separate ram chips with discussion on how apple is moving to different type of ram configurations: https://www.ifixit.com/News/71442/tearing-down-the-14-macboo...

- M2 teardown with the chip + ram highlighted: https://www.macrumors.com/2022/07/18/macbook-air-m2-chip-tea...

- Photo of the A12 with separate ram chips on a single "package": https://en.wikipedia.org/wiki/Apple_A12X

- M1 Ultra with heat spreader removed, clearly showing 3rd party ram chips on package: https://iphone-mania.jp/news-487859/

jjtheblunt 472 days ago

neat! thanks

MBCook 475 days ago

This is also a niche product. The number they sell is going to be very tiny compared to the base model MacBook, let alone the iPhone.

Apple absolutely loves to gouge for upgrades, but the chips in this have got to be expensive. I almost wonder if the absolute base model of this machine has noticeably lower margins than a normal Apple product because of that. But they expect/know that most everyone who buys one is going to spec it up.

TheRealPomax 475 days ago

It's Apple, price premium is a given.

tempest_ 475 days ago

Nvidia has had the Grace Hoppers for a while now. Is this not like that?

ykl 475 days ago

This is cheap compared to GB200, which has a street price of >$70k for just the chip alone if you can even get one. Also GB200 technically has only 192GB per GPU and access to more than that happens over NVLink/RDMA, whereas here it’s just one big flat pool of unified memory without any tiered access topology.

rbanffy 475 days ago

We finally encountered the situation where an Apple computer is cheaper than its competition ;-)

All joking aside, I don't think Apples are that expensive compared to similar high-end gear. I don't think there is any other compact desktop computer with half a terabyte of RAM accessible to the GPU.

nightski 475 days ago

I mean expensive relative to who, Nvidia? Both are enjoying little to no competition in their respective niche and are using that monopoly power to extract massive margins. I have no doubt it could be much cheaper if there was actual competition in the market.

Fortunately it seems like AMD is finally catching on and working towards producing a viable competitor to the M series chips.

kridsdale1 475 days ago

And yet all that cash still just goes to TSMC

rbanffy 475 days ago

They are selling the shovels for this gold rush. Also, ASML, who sells machines to make shovels.

ProAm 475 days ago

This is just Apple disrespecting their customer base.

asdffdasy 475 days ago

still not ECC

samstave 475 days ago

"unified memory"

funny that people think this is so new, when CRAY had Global Heap eons ago...

webworker 475 days ago

The real hardware needed for artificial intelligence wasn't NVIDIA, it was a CRAY XMP from 1982 all along

samstave 475 days ago

When I was with Mirantis, I flew to Austin TX to meet a client in a nondescript multi-tenant office building...

we walked in and, getting our bearings, came upon the CRAY office. WTF?!

I tried the doors, locked - and it was clearly empty... but damn did I want to steal their office door signage.

hot_gril 475 days ago

It's new for mainstream PCs to have it.

pjmlp 475 days ago

Nope, it was common in 8 and 16 bit home computers, and in respect to PCs themselves graphics memory was mapped into the main memory until the arrival of 3D dedicated cards.

And even with 3D, integrated GPUs have existed for years.

hot_gril 474 days ago

The CPUs with iGPUs didn't also have the memory on-chip. The Nintendo 64 did. Not sure about the old home computers, but I thought those had separate memory usually.

pjmlp 474 days ago

Of course not, because they are not designed as SOCs, the only memory on chip is cache, it doesn't change the fact the memory is one whole block shared between CPU and iGPU.

angoragoats 474 days ago

Apple does not have the memory on-chip (on the same die as the CPU) either.

djmips 475 days ago

Like pretty much every game console.

TylerE 475 days ago

New for performance machines maybe. I remember "integrated graphics" when that meant some shitty co-processor and 16 or 32MB of semi-reserved system RAM.

Vilian 475 days ago

It's not new for PC to block user ram upgrade

grandempire 474 days ago

You mean the room-sized supercomputer that sold tens of units?

samstave 474 days ago

Yes, but now it's in my pocket.

ddtaylor 475 days ago

Why did it take so long for us to get here?

RachelF 475 days ago

Some possible groups of reasons:

1. Until recently RAM amount was something the end user liked to configure, so little market demand.
2. Technically, building such a large system on a chip or collection of chiplets was not possible.
3. RAM speed wasn't a bottleneck for most tasks, it was IO or CPU. LLMs changed this.

hot_gril 475 days ago

M1 came out before the LLM rush, though

wtallis 475 days ago

The M1 is in a product segment where discrete GPUs have been gone for decades, in favor of integrated graphics that shares one pool of RAM with the CPU. The better question to ask is why Apple kept using that unified memory design even when moving up to larger chips like the M1 Max and M1 Ultra.

MBCook 475 days ago

The GPU is built into the same physical die as the CPU.

So if you wanted to give it a second ram pool you would have to add an entire second memory interface just for the on-die GPU.

Now all you’ve done is make it more complicated, slower because now you have to move things between the two pools, and gained what exactly?

I think it was a very clear and obvious decision to make. It’s an outgrowth of how the base chips were designed, and it turned out to be extremely handy for some things. Plus since all their modern devices now work this way that probably simplifies the software.

I’m not saying it’s genius foresight, but it certainly worked out rather well. There’s nothing stopping them from supporting discrete GPUs too if they wanted to. They just clearly don’t.

RachelF 475 days ago

I'd guess that they inherited it from the iPhone chips. It was nice and fast and also makes Apple a lot of profit as no third party RAM is possible.

hot_gril 475 days ago

They put the M1 into the desktops too

MatthiasPortzel 475 days ago

Apple debuted dedicated machine learning hardware in 2017 with the Neural Engine on iPhones. While I don’t think they predicted the LLM explosion in particular, they knew machine learning was important and they have been allowing that to influence hardware design.

philistine 475 days ago

Apple has always liked to integrate as much as possible on the same chip. It was only natural that they would come to this conclusion, with the improved perf the cherry on top.

hot_gril 475 days ago

Well also these chips originated in phones, where they kinda had to integrate it. And the quicker RAM and disk access are pretty nice.

wmf 475 days ago

Laptops have had unified memory for ten years or more. For desktops very few apps benefit from unified memory.

djmips 475 days ago

And game consoles that use similar parts as laptops.

baby_souffle 475 days ago

Just a guess, but fabricating this can't be easy. Yield is probably higher if you have less memory per chip.

astrange 475 days ago

It's regular memory on separate chips.

amelius 475 days ago

Why does it matter if you can run the LLM locally, if you're still running it on someone else's locked down computing platform?

PeterStuer 475 days ago

Running locally, your data is not sent outside of your security perimeter off to a remote data center.

If you are going to argue that the OS or even below that the hardware could be compromised to still enable exfiltration, that is true, but it is a whole different ballgame from using an external SaaS no matter what the service guarantees.

bigyabai 475 days ago

For enterprise markets, this is table stakes. A lot of datacenter customers will probably ignore this release altogether since there isn't a high-bandwidth option for systems interconnect.

pavlov 475 days ago

The Mac Studio isn’t meant for data centers anyway? It’s a small and silent desktop form factor — in every respect the opposite of a design you’d want to put in a rack.

A long time ago Apple had a rackmount server called Xserve, but there’s no sign that they’re interested in updating that for the AI age.

bigyabai 475 days ago

It's the Ultra chip, the same one that goes into the rackmount Mac Pro. I don't think there's much confusion as to who this is for.

> there’s no sign that they’re interested in updating that for the AI age.

https://security.apple.com/blog/private-cloud-compute/

wtallis 475 days ago

The rackmount Mac Pro is for A/V studios, not datacenters.

phillco 475 days ago

Don't forget CI/CD farms for iOS builds, although I think it's much more cost effective to just make Minis or Studios work, despite their nonstandard formfactor

kridsdale1 475 days ago

Google and Facebook have vast fleets of Minis in custom chassis for this purpose.

pavlov 475 days ago

I genuinely forgot the Mac Pro still exists. It’s been so long since I even saw one.

And I’ve had every previous Mac tower design since 1999: G4, G5, the excellent dual Xeon, the horrible black trash can… But Apple Silicon delivers so much punch in the Studio form factor, the old school Pro has become very niche.

Edit - looks like the new M3 Ultra is only available in Mac Studio anyway? So the existence of the Pro is moot here.

choilive 475 days ago

never understood the hate on the trash can. Isn't the mac studio basically the same idea as the trash can but even less upgradeable?

pavlov 475 days ago

The Mac Studio hit a sweet spot in 2023 that the trash can Mac Pro couldn't ten years earlier. It's mostly thanks to the high integration of Apple Silicon and improved device availability and speed of Thunderbolt.

The 2013 Mac Pro was stuck forever with its original choice of Intel CPU and AMD GPU. And it was unfortunately prone to overheating due to these same components.

Alupis 475 days ago

Outside of extremely niche use cases, who is racking apple products in 2025?

nordsieck 475 days ago

There's MacMiniVault (nee MacMiniColo) https://www.macminivault.com/

Not sure if they count as niche or not.

kube-system 475 days ago

Every provider who offers MacOS in the cloud.

Alupis 475 days ago

So MacOS is still not allowed to be virtualized per the EULA? Wow if that's true...

wpm 475 days ago

AWS

waveringana 475 days ago

GitHub for their macOS runners (pretty sure they're M1 Minis)

alwillis 475 days ago

Apple recently announced they’re building a new plant in Texas to produce servers. Yes, they need servers for their Private Cloud Compute used by Apple Intelligence, but it doesn’t only need to be for that.

From https://www.apple.com/newsroom/2025/02/apple-will-spend-more...

As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.

phonon 475 days ago

Thunderbolt 5 can do bi-directional 80 Gbps....and Mac Studio Ultra has 6 ports...

cibyr 475 days ago

That's still not even competitive with 100G Ethernet on a per-port basis. An overall bandwidth of 480 Gbps pales in comparison with, for example, the 3200 Gbps you get with a P5 instance on EC2.
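The comparison above is simple unit arithmetic. A small sketch, assuming all six ports could be driven at the full 80 Gbps at once (an optimistic assumption) and using the 3200 Gbps figure as quoted:

```python
def aggregate_gbps(ports: int, gbps_per_port: float) -> float:
    """Total link bandwidth if every port runs at its rated speed."""
    return ports * gbps_per_port


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


studio_gbps = aggregate_gbps(6, 80)  # 6 Thunderbolt 5 ports, per the thread
p5_gbps = 3200                       # EC2 P5 figure quoted in the comment

print(f"Mac Studio: {studio_gbps} Gbps ({gbps_to_gigabytes_per_s(studio_gbps)} GB/s)")
print(f"Fraction of P5 networking: {studio_gbps / p5_gbps:.0%}")
```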

phonon 475 days ago

A 3 year reservation of a P5 is over a million dollars though? Not sure how that's comparable....

nyrikki 475 days ago

To add to this, GPU servers like Supermicro's have a 400GbE port per GPU plus more for the CPU.

kridsdale1 475 days ago

Cost competitive though?

spiderfarmer 475 days ago

You can use Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 or 5 Mac Studios.

atwrk 475 days ago

But 80Gbit/s is way slower than even regular dual channel RAM, or am I missing something here? That would mean the LLM would be excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.

wmf 475 days ago

The weights don't go over the network so performance is OK.

atwrk 475 days ago

If I'm not mistaken, each token produced roughly equals the whole model in memory transfers (the exception being MoE models). That's why memory bandwidth is so important in the first place, or not?

wmf 475 days ago

My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
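To illustrate why keeping 1/Nth of the weights resident on each node sidesteps the link speed: in a layer-wise split, only the current token's activations cross the link, never the weights. A rough sketch, assuming a hidden size of 7168 and fp16 activations (illustrative values, not taken from the thread), the 404GB Q4 size quoted earlier, and an ideal 80 Gbps link with no latency or protocol overhead:

```python
def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Ideal time to move num_bytes over a link, ignoring latency and overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


HIDDEN_SIZE = 7168      # assumed model width (illustrative)
BYTES_PER_VALUE = 2     # assumed fp16 activations
WEIGHTS_BYTES = 404e9   # Q4 model size quoted earlier in the thread
LINK_GBPS = 80          # Thunderbolt 5 figure from the thread

# One token's hidden state is all that has to hop between nodes.
activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE

act_us = transfer_seconds(activation_bytes, LINK_GBPS) * 1e6
weights_s = transfer_seconds(WEIGHTS_BYTES, LINK_GBPS)
print(f"activations per hop: {act_us:.2f} microseconds")
print(f"full weights:        {weights_s:.1f} seconds")
```

Under these assumptions the per-token traffic is kilobytes rather than hundreds of gigabytes, so in practice per-hop latency, not bandwidth, would likely be the cost that matters.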

whimsicalism 475 days ago

why would you ever want to do that remains an open question

aurareturn 475 days ago

Probably some kind of local LLM server. 1TB of 1.6 TB/s memory if you link 2 together. $20k total. Half the price of a single Blackwell chip.

whimsicalism 475 days ago

with a vanishingly small fraction of flops and a small fraction of memory bandwidth

aurareturn 475 days ago

It's good enough to run whatever local model you want. 2x 80core GPU is no joke. Linking them together gives it effectively 1.6 TB/s of bandwidth. 1TB of total memory.

You can run the full Deepseek 671b q8 model at 40 tokens/s. Q4 model at 80 tokens/s. 37B active params at a time because R1 is MoE.

Linking 2 of these together lets you run a model more capable (R1) than GPT4o at a comfortable speed at home. That was simply fantasy a year ago.
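The token rates quoted here follow from a bandwidth-bound estimate: each generated token has to read every active weight once, so tokens/s is at most memory bandwidth divided by active bytes. A rough sketch of that napkin math, assuming 1 byte per parameter for q8 and 0.5 for Q4, ignoring KV-cache reads and compute, and taking the comment's premise that two linked machines behave like one 1.6 TB/s pool (an optimistic assumption):

```python
def max_decode_tokens_per_s(bandwidth_bytes_per_s: float,
                            active_params: float,
                            bytes_per_param: float) -> float:
    """Upper bound on decode speed when reading weights is the only cost."""
    return bandwidth_bytes_per_s / (active_params * bytes_per_param)


ACTIVE_PARAMS = 37e9  # active parameters per token for R1, per the thread

for label, bandwidth in [("1x M3 Ultra", 0.8e12), ("2x linked", 1.6e12)]:
    for quant, bytes_per_param in [("q8", 1.0), ("q4", 0.5)]:
        rate = max_decode_tokens_per_s(bandwidth, ACTIVE_PARAMS, bytes_per_param)
        print(f"{label:12s} {quant}: ~{rate:.0f} tok/s ceiling")
```

These are ceilings, not predictions; the 20-30 tok/s estimate given earlier in the thread for a single machine reflects that only part of the bandwidth is usable in practice.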

burnerthrow008 475 days ago

> with a vanishingly small fraction of flops and a small fraction of memory bandwidth

Is it though?

Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPs of FP32. So M3 Ultra is 1/3 the speed of a Blackwell.

Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.

I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.

nVidia likes throwing around numbers like "20 petaFLOPs" for FP4, but that's not real floating point... it's just 1990's-vintage uLaw/aLaw integer math.

[1] https://en.wikipedia.org/wiki/Apple_silicon#Comparison_of_M-...

[2] https://resources.nvidia.com/en-us-blackwell-architecture/da...

Edit: Further, most (all?) of the TFLOPs numbers you see on nVidia datasheets for "Tensor FLOPs" have a little asterisk next to them saying they are "effective" TFLOPs using the sparsity feature, where half the elements of the matrix multiplication are zeroed.
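The ratio argued above, written out. A small sketch using only the figures quoted in this comment, and assuming the Ultra scales perfectly to twice the Max and that a sparsity-based "effective" number is exactly 2x the dense one (both simplifying assumptions):

```python
M3_MAX_FP32_TFLOPS = 14     # figure cited from Wikipedia in the comment
BLACKWELL_FP32_TFLOPS = 80  # figure cited from the NVIDIA datasheet in the comment

# Assumes two fused Max dies deliver exactly double the throughput.
m3_ultra_tflops = 2 * M3_MAX_FP32_TFLOPS
ratio = m3_ultra_tflops / BLACKWELL_FP32_TFLOPS


def dense_from_sparse(effective_tflops: float) -> float:
    """Dense-equivalent throughput if the quoted number assumes half the elements are zeroed."""
    return effective_tflops / 2


print(f"M3 Ultra ~{m3_ultra_tflops} TFLOPS FP32, about {ratio:.0%} of Blackwell")
```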

PaulHoule 475 days ago

That article says you can connect them through the Thunderbolt 5 somehow to form clusters.

burnerthrow008 475 days ago

I wonder if that’s something new, or just the same virtual network interface that’s been around since the TB1 days (a new network interface appears when you connect two Macs with a TB cable)

jauntywundrkind 475 days ago

It's the same host-to-host USB network, I believe.

I'm super interested in the clustering capability. At launch people said they were only getting like 11Gbps from their TB4 drive arrays, which was really way less than expected.

Apple does kind of advertise that each TB port has its own controllers. Which gives me hope that whatever 1x port can do 6x can do 6x better.

AMD's Strix Halo victory feels much more shallow today. Eventually 48GB or 64GB sticks will probably expand Strix Halo to 192 then 256GB. But Strix Halo is super super io starved, is basically a desktop of IO, with no way to easily host-to-host, and Apple absolutely understands that the use of a chip is bounded by what it can connect to. 6x TB5, if even half true, will be utterly outstanding.

It's been so so so so cool to see Non-Transparent Bridging atop Thunderbolt, so one host can act like a device. Since it's PCIe, that hypothetically would allow amazing RDMA over TB. USB4 mandates host-to-host networking, but I have no idea how it is implemented and I suspect it's nowhere near as close to the metal.

PaulHoule 475 days ago

In 2017 I was working for a company that was trying to develop foundation models and I was developing a framework for training what were then large neural networks [1] and other models.

It was "yet another mac-oriented startup" but I had them get me an Alienware laptop because I could get one with a 1070 mobile card that meant I could train on my laptop whereas the data sci's had to do everything on our DGX-1. [2]

Today it is the other way around, the Mac Studio looks like the best AI development workstation you can get.

[1] I was really partial to a character-level CNN model we had

[2] CEO presented next to Jensen Huang at a NVIDIA conference, his favorite word was "incredible". I thought it was "incredible" when I heard they got bought by Nike, but it was true.

PaulHoule 475 days ago

Well already it is faster than GigE...

https://arstechnica.com/gadgets/2013/10/os-x-10-9-brings-fas...

Thunderbolt is PCIe-based and I could imagine it being extended to do what https://en.wikipedia.org/wiki/Compute_Express_Link and https://en.wikipedia.org/wiki/InfiniBand