by bigyabai 476 days ago
For enterprise markets, this is table stakes. A lot of datacenter customers will probably ignore this release altogether since there isn't a high-bandwidth option for systems interconnect.
4 comments

The Mac Studio isn’t meant for data centers anyway? It’s a small and silent desktop form factor — in every respect the opposite of a design you’d want to put in a rack.

A long time ago Apple had a rackmount server called Xserve, but there’s no sign that they’re interested in updating that for the AI age.

It's the Ultra chip, the same one that goes into the rackmount Mac Pro. I don't think there's much confusion as to who this is for.

> there’s no sign that they’re interested in updating that for the AI age.

https://security.apple.com/blog/private-cloud-compute/

The rackmount Mac Pro is for A/V studios, not datacenters.
Don't forget CI/CD farms for iOS builds, although I think it's much more cost-effective to just make Minis or Studios work, despite their nonstandard form factor.
Google and Facebook have vast fleets of Minis in custom chassis for this purpose.
I genuinely forgot the Mac Pro still exists. It’s been so long since I even saw one.

And I’ve had every previous Mac tower design since 1999: G4, G5, the excellent dual Xeon, the horrible black trash can… But Apple Silicon delivers so much punch in the Studio form factor, the old school Pro has become very niche.

Edit - looks like the new M3 Ultra is only available in Mac Studio anyway? So the existence of the Pro is moot here.

I never understood the hate for the trash can. Isn't the Mac Studio basically the same idea as the trash can, but even less upgradeable?
The Mac Studio hit a sweet spot in 2023 that the trash can Mac Pro couldn't ten years earlier. It's mostly thanks to the high integration of Apple Silicon and improved device availability and speed of Thunderbolt.

The 2013 Mac Pro was stuck forever with its original choice of Intel CPU and AMD GPU. And it was unfortunately prone to overheating due to these same components.

The trash can also suffered from hitting the market right around when the industry gave up on making dual-GPU work.
Folks that want to keep the customisation aspect of Mac Pro hardly see that.

In fact, a very famous podcaster is still holding on to his.

The Studio also hits a sweet spot for home users like me that want tons of IO and no built in input devices.
Outside of extremely niche use cases, who is racking Apple products in 2025?
There's MacMiniVault (née MacMiniColo): https://www.macminivault.com/

Not sure if they count as niche or not.

Every provider who offers macOS in the cloud.
So macOS is still not allowed to be virtualized per the EULA? Wow, if that's true...
macOS is permitted to be virtualized... as long as the host is a Mac. :)
AWS
GitHub, for their macOS runners (pretty sure they're M1 Minis).
Apple recently announced they’re building a new plant in Texas to produce servers. Yes, they need servers for the Private Cloud Compute used by Apple Intelligence, but it doesn’t only need to be for that.

From https://www.apple.com/newsroom/2025/02/apple-will-spend-more...

As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.

Thunderbolt 5 can do bi-directional 80 Gbps....and Mac Studio Ultra has 6 ports...
That's still not even competitive with 100G Ethernet on a per-port basis. An overall bandwidth of 480 Gbps pales in comparison with, for example, the 3200 Gbps you get with a P5 instance on EC2.
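To make the comparison above concrete, here is a minimal sketch of the arithmetic. It uses only the figures quoted in this thread (80 Gbps per Thunderbolt 5 port, 6 ports, 100G Ethernet, 3200 Gbps for a P5 instance) and assumes, optimistically, that every port can run at line rate at the same time. The function and constant names are made up for illustration.

```python
# Figures quoted in the thread; none of these are measured values.
TB5_GBPS_PER_PORT = 80   # Thunderbolt 5, as quoted by the parent comment
TB5_PORTS = 6            # ports on the Ultra Mac Studio, per the parent comment
ETHERNET_GBPS = 100      # one 100G Ethernet port
P5_GBPS = 3200           # network bandwidth quoted for an EC2 P5 instance


def aggregate_gbps(ports: int, gbps_per_port: float) -> float:
    """Best-case total, assuming all ports run at line rate simultaneously."""
    return ports * gbps_per_port


studio_total = aggregate_gbps(TB5_PORTS, TB5_GBPS_PER_PORT)

print(f"Mac Studio aggregate: {studio_total:.0f} Gbps")
print(f"One TB5 port vs one 100GbE port: {TB5_GBPS_PER_PORT / ETHERNET_GBPS:.0%}")
print(f"Studio aggregate vs P5: {studio_total / P5_GBPS:.0%}")
```

This only compares headline link rates. It says nothing about price, latency, or what the software stack can actually sustain.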
A 3 year reservation of a P5 is over a million dollars though? Not sure how that's comparable....
To add to this, GPU servers like Supermicro's have a 400GbE port per GPU, plus more for the CPU.
Cost competitive though?
You can use Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 or 5 Mac Studios.
But 80Gbit/s is way slower than even regular dual channel RAM, or am I missing something here? That would mean the LLM would be excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.
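For a rough sense of scale, the unit conversion behind that comparison can be sketched as follows. The DDR5-5600 dual-channel configuration is an assumption picked for illustration (the comment only says "regular dual channel RAM"), and both numbers are theoretical peaks, not measurements.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link rate in Gbit/s to GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8


def ddr_peak_gbyte_per_s(mega_transfers_per_s: float, channels: int,
                         bus_bytes: int = 8) -> float:
    """Theoretical peak of a DDR setup: transfers/s * 8-byte bus * channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


link_gb_s = gbit_to_gbyte_per_s(80)                # Thunderbolt 5 link
ram_gb_s = ddr_peak_gbyte_per_s(5600, channels=2)  # assumed DDR5-5600, dual channel

print(f"TB5 link:           {link_gb_s:.1f} GB/s")
print(f"Dual-channel DDR5:  {ram_gb_s:.1f} GB/s")
print(f"RAM is about {ram_gb_s / link_gb_s:.0f}x the link")
```

So the link really is much slower than local RAM. Whether that matters depends on what has to cross it, which is what the replies get into.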
The weights don't go over the network so performance is OK.
If I'm not mistaken, each token produced roughly equals the whole model in memory transfers (the exception being MoE models). That's why memory bandwidth is so important in the first place, or not?
My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
You're correct about the weights: each machine could in fact store all of the weights. However I think you still have to transfer the activations and the KV-Cache while performing inference.
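A back-of-the-envelope sketch of how much activation data would cross the link per generated token, under stated assumptions: the model is split layer-wise across machines (so each machine keeps the KV cache for its own layers and only the hidden state is handed over), the hidden size is 7168 (my assumption for a DeepSeek-R1-class model), and activations are 16-bit. Link latency and protocol overhead are ignored, and in practice those probably dominate.

```python
HIDDEN_SIZE = 7168     # assumption: hidden dimension of a DeepSeek-R1-class model
BYTES_PER_VALUE = 2    # assumption: fp16/bf16 activations
LINK_GB_PER_S = 10.0   # 80 Gbit/s link expressed in GB/s, no overhead


def activation_bytes_per_token(hidden_size: int, bytes_per_value: int) -> int:
    """Bytes handed from one machine to the next for a single decoded token."""
    return hidden_size * bytes_per_value


def transfer_time_us(num_bytes: int, link_gb_per_s: float) -> float:
    """Pure wire time in microseconds (latency not modeled)."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e6


per_token = activation_bytes_per_token(HIDDEN_SIZE, BYTES_PER_VALUE)
wire_us = transfer_time_us(per_token, LINK_GB_PER_S)

print(f"{per_token} bytes per token per hop")
print(f"{wire_us:.2f} microseconds of wire time per hop")
```

Under these assumptions the per-token traffic is kilobytes, not gigabytes, which is why the weights staying put matters so much more than the link speed.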
why would you ever want to do that remains an open question
Probably some kind of local LLM server. 1TB of 1.6 TB/s memory if you link 2 together. $20k total. Half the price of a single Blackwell chip.
with a vanishingly small fraction of flops and a small fraction of memory bandwidth
It's good enough to run whatever local model you want. 2x 80core GPU is no joke. Linking them together gives it effectively 1.6 TB/s of bandwidth. 1TB of total memory.

You can run the full Deepseek 671b q8 model at 40 tokens/s. Q4 model at 80 tokens/s. 37B active params at a time because R1 is MoE.

Linking 2 of these together lets you run a model more capable (R1) than GPT4o at a comfortable speed at home. That was simply fantasy a year ago.
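The token rates quoted above line up with a simple bandwidth-bound estimate: if decoding one token means reading every active parameter once, tokens/s is at most memory bandwidth divided by bytes of active weights. A minimal sketch, using the thread's own numbers (1.6 TB/s combined, 37B active parameters) and treating the combined bandwidth of two machines as fully usable, which is an optimistic assumption:

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_billion: float,
                          bytes_per_param: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


COMBINED_BANDWIDTH_GB_S = 1600   # 1.6 TB/s, as quoted for two linked machines
ACTIVE_PARAMS_B = 37             # active parameters per token for R1 (MoE)

q8 = max_tokens_per_second(COMBINED_BANDWIDTH_GB_S, ACTIVE_PARAMS_B, 1.0)
q4 = max_tokens_per_second(COMBINED_BANDWIDTH_GB_S, ACTIVE_PARAMS_B, 0.5)

print(f"q8 upper bound: {q8:.0f} tokens/s")
print(f"q4 upper bound: {q4:.0f} tokens/s")
```

These are ceilings, not benchmarks. Compute, KV cache reads, and the link between the two machines all push real numbers lower.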

> with a vanishingly small fraction of flops and a small fraction of memory bandwidth

Is it though?

Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPs of FP32. So M3 Ultra is 1/3 the speed of a Blackwell.

Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.

I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.

nVidia likes throwing around numbers like "20 petaFLOPs" for FP4, but that's not real floating point... it's just 1990's-vintage uLaw/aLaw integer math.

[1] https://en.wikipedia.org/wiki/Apple_silicon#Comparison_of_M-...

[2] https://resources.nvidia.com/en-us-blackwell-architecture/da...

Edit: Further, most (all?) of the TFLOPs numbers you see on nVidia datasheets for "Tensor FLOPs" have a little asterisk next to them saying they are "effective" TFLOPs using the sparsity feature, where half the elements of the matrix multiplication are zeroed.

TFLOPS are teraflops not “tensor flops”.

Blackwell and modern AI chips are built for FP16. The B100 has 1750 TFLOPS of FP16. The M3 Ultra has ~80 TFLOPS of FP16, or roughly 4-5% of the B100's.
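The two ratios being argued about in this subthread, computed side by side. All four throughput figures are the ones quoted by the commenters above, not independently verified, and the names are just for illustration.

```python
def ratio(a_tflops: float, b_tflops: float) -> float:
    """Throughput of chip A as a fraction of chip B."""
    return a_tflops / b_tflops


# FP32 figures from the grandparent comment.
M3_ULTRA_FP32 = 28     # 2 x the 14 TFLOPS quoted for an M3 Max
BLACKWELL_FP32 = 80

# FP16 figures from the parent comment.
M3_ULTRA_FP16 = 80
B100_FP16 = 1750

fp32_ratio = ratio(M3_ULTRA_FP32, BLACKWELL_FP32)
fp16_ratio = ratio(M3_ULTRA_FP16, B100_FP16)

print(f"FP32: M3 Ultra is {fp32_ratio:.0%} of Blackwell")
print(f"FP16: M3 Ultra is {fp16_ratio:.1%} of B100")
```

Which ratio is the relevant one depends on the precision the workload actually runs at, which is the real disagreement here.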

That article says you can connect them through the Thunderbolt 5 somehow to form clusters.
I wonder if that’s something new, or just the same virtual network interface that’s been around since the TB1 days (a new network interface appears when you connect two Macs with a TB cable)
It's the same host-to-host USB network, I believe.

I'm super interested in the clustering capability. At launch people said they were only getting like 11Gbps from their TB4 drive arrays, which was really way less than expected.

Apple does kind of advertise that each TB port has its own controller, which gives me hope that whatever one port can do, six can do six times better.

AMD's Strix Halo victory feels much more shallow today. Eventually 48GB or 64GB sticks will probably expand Strix Halo to 192GB, then 256GB. But Strix Halo is super, super I/O-starved: it basically has a desktop's worth of I/O, with no easy way to do host-to-host, and Apple absolutely understands that the use of a chip is bounded by what it can connect to. 6x TB5, if even half true, will be utterly outstanding.

It's been so, so cool to see Non-Transparent Bridging atop Thunderbolt, so one host can act like a device. Since it's PCIe, that hypothetically would allow amazing RDMA over TB. USB4 mandates host-to-host networking, but I have no idea how it is implemented, and I suspect it's nowhere near as close to the metal.

In 2017 I was working for a company that was trying to develop foundation models, and I was developing a framework for training what were then large neural networks [1] and other models.

It was "yet another mac-oriented startup" but I had them get me an Alienware laptop because I could get one with a 1070 mobile card that meant I could train on my laptop whereas the data sci's had to do everything on our DGX-1. [2]

Today it is the other way around, the Mac Studio looks like the best AI development workstation you can get.

[1] I was really partial to a character-level CNN model we had

[2] CEO presented next to Jensen Huang at a NVIDIA conference, his favorite word was "incredible". I thought it was "incredible" when I heard they got bought by Nike, but it was true.

Well already it is faster than GigE...

https://arstechnica.com/gadgets/2013/10/os-x-10-9-brings-fas...

Thunderbolt is PCIe-based, and I could imagine it being extended to do what https://en.wikipedia.org/wiki/Compute_Express_Link and https://en.wikipedia.org/wiki/InfiniBand do.