Hacker News
by mkaic 1498 days ago
This is really cool for a number of reasons:

1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.

Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency

2.) Apple Silicon probably will compete directly with Nvidia GPUs in the near future in terms of raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.

3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
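For anyone wondering what "proper PyTorch support" would look like in practice, here is a minimal sketch of device selection. It assumes a PyTorch build recent enough to expose `torch.backends.mps` (the Metal backend); `pick_device` is a hypothetical helper name, not part of PyTorch, and it falls back to the CPU if PyTorch is missing or has no GPU backend.

```python
def pick_device() -> str:
    """Return the name of the best available PyTorch device as a string.

    Hypothetical helper for illustration only. Prefers Apple's Metal
    backend ("mps"), then CUDA, then falls back to "cpu".
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to accelerate.
        return "cpu"

    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"training would run on: {device}")
```

With the returned string you would typically call `model.to(device)` and move each batch the same way; on a unified-memory Mac the "mps" device draws from the same pool as the CPU.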

8 comments

> but they're already way ahead on energy efficiency

1) Nope. For neural network training that's not the case: https://tlkh.dev/benchmarking-the-apple-m1-max

And that's with the 3090 set at a very high 400W power limit; it can get far more efficient when clocked lower.

(which is to be expected, notably because there are no dedicated matrix math accelerators on the GPU)

2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)

3) If you need a giant pool of VRAM above everything else at a relatively low price tag, Apple is indeed quite an enticing option. If you can stand Metal for your use case, of course.

What do you mean by "if you can stand Metal for your use case"? What is Metal?
Metal is Apple's API for writing software that uses their GPUs.

https://en.m.wikipedia.org/wiki/Metal_(API)

> but they're already way ahead on energy efficiency.

For raw compute like you need for ML training, the M1's efficiency doesn't matter. Under the hood, at the hardware level, you have a direct mapping of power consumption to compute circuit activation that you really can't get around.

The general efficiency of the M1 is due to its architecture and how it fits together with normal consumer use: less work in instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.

And yet somehow Apple's GPU ALUs are more efficient, at 3.8 watts per TFLOP. Mind, I am not talking about specialized matrix multiplication units, which have a different internal organization and can do things like matrix multiplication much more efficiently, but about basic general-purpose GPU ALUs.
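To put the 3.8 W/TFLOP figure in context, here is a rough back-of-the-envelope sketch. It uses the 1024-ALU count quoted elsewhere in this thread; the 1.278 GHz clock is an assumption brought in for illustration and does not come from this discussion, so treat the result as a ballpark only.

```python
# Peak FP32 throughput and implied power for a GPU, from ALU count and clock.
# ASSUMPTION: the 1.278 GHz clock is an illustrative figure for the M1 GPU
# and is not stated anywhere in this thread.

FLOPS_PER_FMA = 2  # one fused multiply-add counts as two floating-point ops


def peak_tflops(alus: int, clock_ghz: float) -> float:
    """Peak TFLOPS if every ALU retires one FMA per cycle."""
    gflops = alus * FLOPS_PER_FMA * clock_ghz
    return gflops / 1000.0


def implied_watts(tflops: float, watts_per_tflop: float) -> float:
    """Power draw implied by an efficiency figure given in watts per TFLOP."""
    return tflops * watts_per_tflop


m1_tflops = peak_tflops(alus=1024, clock_ghz=1.278)
m1_watts = implied_watts(m1_tflops, watts_per_tflop=3.8)
print(f"{m1_tflops:.2f} TFLOPS peak, roughly {m1_watts:.1f} W at 3.8 W/TFLOP")
```

Under those assumptions the GPU ALUs alone would draw on the order of ten watts at full load, which is the scale of number the efficiency claim is about.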

The comparison of efficiency between Apple and Nvidia here is a bit misleading because it compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.

As to how Apple achieves such high efficiency, nobody knows. The fact that they are on a 5nm node might help, but there must be something special about the ALU design as well. My speculation is that they are wider and much simpler than in other GPUs, which directly translates to efficiency wins.

How they do it at a conceptual level isn't a big secret: they don't need to minimize die area the way other companies do. For Apple, the die is just part of a chip that is part of the larger system they sell, so they can amortize the cost over that system. nVidia doesn't have a system to do that with, so their natural inclination is to keep the die as small as possible and just overclock the hell out of it. (Right there is the 'trick': Apple can afford to do things that chew up die space that nVidia and others can't while maintaining their profit margins.) Being a process generation ahead is also a rather huge thing. (That is another cost they can amortize over large numbers of complete systems and mobile devices, which their competitors can't.)

Also related: Apple designs their hardware to do just what they want it to while everyone else is designing for a more general use case. This also costs die area, IP licensing fees etc.

But how does that apply to GPU ALUs? Looking at M1 die shots, they are comparatively tiny, and when comparing to other vendors, it doesn't seem like Apple is dedicating more logic space to the GPU. The M1 die is roughly 120mm², and an Nvidia Turing TU117 (GTX 1650) is roughly 200mm². Both feature the same number of GPU ALUs (1024 32-bit units). Of course, M1's 5nm is around 5-6 times denser than Turing's 12nm, but M1 is an entire SoC with all kinds of components, not to mention a huge cache, and the GPU takes maybe 20% of the die (let's say 1/3 if you also count the display controller and memory controllers). All in all, the amount of normalised die space dedicated to GPU ALUs seems comparable.
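The normalisation above can be written out as a quick sketch. All inputs are the rough figures from this comment (120mm² die, 20% to 1/3 of it for the GPU, a 5-6x density advantage); the function name is made up for illustration, and the result is only as good as those estimates.

```python
def normalised_gpu_area(die_mm2: float, gpu_fraction: float,
                        density_ratio: float) -> float:
    """GPU area scaled to the older node.

    die_mm2:       total die area on the newer (denser) node
    gpu_fraction:  share of the die attributed to the GPU (0..1)
    density_ratio: how many times denser the newer node is
    """
    return die_mm2 * gpu_fraction * density_ratio


M1_DIE_MM2 = 120.0     # rough figure from the comment
TU117_DIE_MM2 = 200.0  # rough figure from the comment (whole die)

# Sweep the low and high ends of both estimates.
for fraction in (0.20, 1 / 3):
    for density in (5, 6):
        area = normalised_gpu_area(M1_DIE_MM2, fraction, density)
        print(f"GPU share {fraction:.0%}, density x{density}: "
              f"{area:.0f} mm2 equivalent vs {TU117_DIE_MM2:.0f} mm2 for TU117")
```

The sweep lands between roughly 120 and 240 mm² of 12nm-equivalent area, which brackets the TU117 die and is why the two look comparable.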

Of course, my perspective here might be extremely naive, I know very little about semiconductor technology, just trying to understand the principal design differences.

Also, I thought Apple was adding a large slice of cache. When you look at the 3D V-Cache on Ryzen for performance (+15%?), this has a large impact. And because they sell expensive stuff, they can afford to build expensive CPUs.
Cache doesn't matter for pure ALU efficiency though. I mean, I did tests on long dependent chains of FMAs, the only memory touched there are two internal registers.
Apple Silicon is not ahead at all on energy efficiency for desktop workloads. If they were ahead on energy efficiency, they would simply be ahead on power. Indeed, GPUs are massively parallel architectures, and they are generally limited by the transistor and power budget (and memory, of course).

Apple is simply behind in the GPU space.

> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.

The reason it's cheaper is that its memory has only a fraction (around 20-35%) of the memory bandwidth of an equivalent 128GB GPU setup, and that bandwidth also has to be shared with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications it is a terminal performance bottleneck.

That's the reason you don't have a GPU with 128GB of normal DDR5. It would just be quite limited. Perhaps for some cases it can be useful.

> its memory is at a fraction (around 30-40%) of the memory bandwidth of a 128GB equivalent GPU setup

Here's some info about M1 memory bandwidth: https://www.anandtech.com/show/17024/apple-m1-max-performanc...

I'm not sure what you meant with the link, but the parent is right, so adding an explanation here: the M1 Max has about 400GB/s of theoretical bandwidth, but Anandtech shows that none of the SoC blocks can actually reach that, pretty far from it. It seems that Apple summed the bandwidth to all the blocks to get there, which does mean something, but not that the GPU has access to all of it (the GPU memory controllers seem to be the bottleneck).

By contrast, a 3080 laptop does reach 400GB/s. I'm personally seeing this routinely on AI workloads, so that's part of the explanation for the subpar perf here (the others probably being matrix math and mixed precision).

Yes. And the M1 Ultra has even more memory bandwidth than the M1 Max. But a 128GB system made of 3 Nvidia A6000s has 3x768GB/s of memory bandwidth, and a more common AI-grade card has 2x2TB/s of memory bandwidth, which simply dwarfs the M1 Ultra.
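As a rough check on the bandwidth fractions being discussed, here is a small sketch. The multi-GPU figures are the ones quoted in this comment; the 800GB/s for the M1 Ultra is an assumption taken from Apple's headline spec rather than from this thread, and it is a theoretical peak, not what the GPU can actually sustain.

```python
def bandwidth_fraction(system_gbps: float, reference_gbps: float) -> float:
    """Ratio of one system's memory bandwidth to a reference setup's."""
    return system_gbps / reference_gbps


# ASSUMPTION: headline theoretical figure, not measured and not from the thread.
M1_ULTRA_GBPS = 800.0

# Aggregate bandwidth of the setups mentioned in the comment, in GB/s.
reference_setups = {
    "3x A6000 (3 x 768 GB/s)": 3 * 768.0,
    "AI-grade setup (2 x 2 TB/s)": 2 * 2000.0,
}

for name, gbps in reference_setups.items():
    share = bandwidth_fraction(M1_ULTRA_GBPS, gbps)
    print(f"M1 Ultra has {share:.0%} of the bandwidth of {name}")
```

On paper that puts the M1 Ultra somewhere around a fifth to a third of those setups, and the aggregate numbers for the multi-GPU rigs only hold if the workload spreads cleanly across cards.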
For researchers, sure, but it's still quite an apples-to-oranges comparison.

A6000 is ~$5k per card. I guess you're referring to something like an A100 on that other spec, which is $10k/card (for 40GB of memory).

I do a fair bit of neural/AI art experimentation, where memory on the execution side is sometimes a limiting factor for me. I'm not training models, I'm not a hardcore researcher--those folks will absolutely be using NVIDIA's high-end stuff or TPU pods.

128GB in a Studio is super compelling if it means I can up-res some of my pieces without needing to use high-memory-but-super-slow CPU cloud VMs, or hope I get lucky with an A100 on Colab (or just pay for a GPU VM).

I have a 128GB/Ultra Studio in my office now. It's a great piece of kit, and a big reason I splurged on it--okay, maybe "excuse"--was that I expect it'll be useful for a lot of my side project workloads over the next couple of years...

Hmm, that's interesting. What kind of inference workload requires more than the 48GB of memory you'd get from 2 3090s, for example? I'm genuinely curious because I haven't run across them and it sounds interesting.
Mostly it's old-school style transfer! Well, "old" in the sense that it's pre-CLIP. I've played with CLIP-guided stuff too, but I've been tinkering with a custom style transfer workflow for a few years. The pipeline here is fractal IFS images (Chaotica/JWildfire) -> misc processing -> style transfer -> photo editing, basically.

Only the workflow is the custom part--the core here is literally the original jcjohnson implementation. Occasionally I look around at recent work in the area, but most seems focused on fast (video-speed) inference or pre-baked style models. I've never seen something that retains artistic flexibility.

My original gut feeling on style transfer was that it would be possible to mold it into a neat tool, but most people bumped into it, ran their profile photo against Starry Night, said "cool" and bounced off. And I get that--parameter tuning can be a sloooow process. When I really explore a series with a particular style I start to feed it custom content images made just for how it's reacting with various inputs.

Here's a piece that just finished a few minutes ago: https://mwegner.com/misc/styled_render-BMrHXWz_2RBaUq8pAYKfL...

That's from a local server in my garage with a K80. At some point I had two K80s in there (so basically four K40s with how they work), but dialed it back for power consumption/power reasons.

I do have a 3090 in the house, and a decent amount of cloud infra that I sometimes tap. The jcjohnson implementation is so far back that it doesn't even run against modern hardware. At some point I need to sort that out, or figure out how to wrangle a more modern implementation into behaving in the way that I like.

I don't really post these anywhere, although I do throw them over the wall on Twitter if anyone is curious to see more. These are a mix of things, although the CLIP/Midjourney/etc stuff is pretty easy to spot: https://twitter.com/mwegner/media

Not sure about inference but for training, 128GB is big enough to fit a decent-sized dataset entirely into memory, which causes a massive speedup. It's also probably cheaper to get a 128GB Mac Studio than a dual-3090 rig unless you're willing to build the rig yourself and pay the bare minimum for every component except the GPUs themselves.

As for models that need 128GB of memory at inference time and that a consumer would be interested in, I got nothing, though it certainly seems like it would be fun to mess around with haha

GPT-3 sized models need that kind of memory for inference
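A quick way to sanity-check that kind of claim is to estimate the memory taken by the weights alone. This is only a sketch: the 175 billion parameter count is the published GPT-3 size and is an assumption not stated in this thread, and real inference needs extra memory for activations and caches on top of the weights.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(num_params: float, precision: str) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def max_params_for(memory_gb: float, precision: str) -> float:
    """Largest parameter count whose weights fit in the given memory."""
    return memory_gb * 1e9 / BYTES_PER_PARAM[precision]


GPT3_PARAMS = 175e9  # ASSUMPTION: published GPT-3 size, not from the thread

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weights_gb(GPT3_PARAMS, precision):.0f} GB of weights")

fits = max_params_for(128, "fp16") / 1e9
print(f"128 GB holds about {fits:.0f}B parameters at fp16 (weights only)")
```

By this estimate the weights of a full 175B model exceed 128GB even at 8 bits per parameter, so a single 128GB machine is better matched to models in the tens of billions of parameters.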
Interesting, I wasn't aware of the memory bandwidth point, though it makes sense. TIL!
I remain skeptical that Apple's best GPU silicon will match Nvidia's premier products (either the top-end desktop card, or a server monster) for training.

It seems like this is ideal as an accelerator for already trained models; one can imagine Photoshop utilizing it for deep-learning based infill-painting.

I was already doing training on battery with a laptop that had a 1080; I have trained models on an airplane while totally unplugged and still had enough power to websurf afterwards.

To me the cool thing is that working through a PyTorch-based course like FastAI on a local Mac may now be above the tolerably-fast threshold.
The thing with the efficiency (which I'm not sure of) and the competition (probably possible) is that the current Nvidia lineup is pretty old and on an even older process. They have a big moat.
There's definitely competition, and it's going to be really interesting to watch Nvidia and Apple duke it out over the next few years:

- Apple undoubtedly owns the densest nodes, and will fight TSMC tooth-and-nail over first dibs on whatever silicon they have coming next.

- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.

- Nvidia has insane engineers. Despite the fact they're using silicon that's more than twice as large by-area when compared to Apple, they're still doubling their numbers across the board. And that's their last-gen tech too, the comparison once they're on 5nm later this summer is going to be insane.

I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.

> an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory

Interesting observation. I wonder what the biggest-memory iGPU configuration you can get on the x86 side is?