by Scene_Cast2 1492 days ago
I'm curious about the performance compared to something like, say, the RTX 3070.
3 comments

Low. Apple doesn't have matrix math accelerators in their current GPUs.

The neural engine is small and inference only. It's also only exposed by a far higher level interface, CoreML.

Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but given the price of a Mac, I'm not sure that makes sense in most scenarios either, compared to paying for a big dGPU.

> Apple doesn't have matrix math accelerators in their current GPUs.

That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.

[1] https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...

AMX is indeed very nice for FP64, where consumer GPUs aren't an alternative at all.

However, for lower precisions (which is what deep learning uses), you're much better off with a GPU.

Have you actually benchmarked that? I think (someone please correct me if I'm way off here) the AMX instructions can hit ~2.8 TFLOPS (FP16) per co-processor, and there are 2 on the 7-core M1. That's 5.6 TFLOPS vs. the 4.6 TFLOPS the GPU can hit.

Yeah, that holds within the M1 family, but compare it against dGPUs and it doesn't even come close.

30 TFLOPS for a 3080 for vector FP32, but 119 TFLOPS for dense FP16 with FP16 accumulate and 59.5 TFLOPS with FP32 accumulate, and if you exploit sparsity that can go even higher.
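
For scale, the figures quoted in this exchange can be put side by side. This is only a sketch of the arithmetic: every number below is a peak figure as claimed by the commenters, not a benchmark, and the dictionary keys are labels made up for illustration.

```python
# Peak throughput figures as quoted in this thread (claims, not measurements).
quoted_tflops = {
    "M1 AMX, FP16 (2 x 2.8)": 2 * 2.8,
    "M1 GPU": 4.6,
    "RTX 3080, vector FP32": 30.0,
    "RTX 3080, FP16 dense, FP32 accumulate": 59.5,
    "RTX 3080, FP16 dense, FP16 accumulate": 119.0,
}

def ratio_to(baseline, figures):
    """Express every figure as a multiple of the chosen baseline entry."""
    base = figures[baseline]
    return {name: value / base for name, value in figures.items()}

ratios = ratio_to("M1 AMX, FP16 (2 x 2.8)", quoted_tflops)
for name, ratio in ratios.items():
    print(f"{name}: {quoted_tflops[name]:.1f} TFLOPS ({ratio:.2f}x the AMX figure)")
```

Peak numbers like these say nothing about sustained throughput, which depends on the workload and on memory bandwidth.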

Ah yes, I misunderstood your original comment.

Often the limiting factor is memory bandwidth rather than raw FLOPS, so dealing with 4 times larger data types (FP64 vs. FP16) is a disadvantage.

To clarify: I am comparing FP16 performance, which both the GPU and AMX have native support for.
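
The memory bandwidth point can be made concrete with a rough sketch. The matrix size and the 100 GB/s bandwidth below are hypothetical values chosen for illustration, not figures for any chip mentioned in this thread; only the element sizes are fixed by the formats themselves.

```python
import struct

# Element sizes in bytes, read from Python's struct format codes:
# 'e' = IEEE 754 half (FP16), 'f' = single (FP32), 'd' = double (FP64).
ELEMENT_BYTES = {
    "FP16": struct.calcsize("e"),
    "FP32": struct.calcsize("f"),
    "FP64": struct.calcsize("d"),
}

def transfer_seconds(num_elements, dtype, bandwidth_gb_per_s):
    """Time to stream num_elements of the given type once at a fixed bandwidth."""
    return num_elements * ELEMENT_BYTES[dtype] / (bandwidth_gb_per_s * 1e9)

NUM_ELEMENTS = 4096 * 4096   # hypothetical matrix size
BANDWIDTH_GB_PER_S = 100.0   # hypothetical memory bandwidth

for dtype, size in ELEMENT_BYTES.items():
    seconds = transfer_seconds(NUM_ELEMENTS, dtype, BANDWIDTH_GB_PER_S)
    print(f"{dtype}: {size} bytes/element, {seconds * 1e3:.3f} ms to stream the matrix once")
```

When a kernel is bandwidth-bound, the time per pass scales with the element size, so FP64 data takes four times as long to move as FP16 data regardless of the peak FLOPS available.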

FP64 is also supported by AMX, making it quite an impressive region of silicon.

> The neural engine is small and inference only

Why is it inference only? At least the operations are the same...just a bunch of linear algebra

Inference is often done fixed point, whereas training is (usually) floating point.
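
As a minimal sketch of what fixed-point inference means in practice, here is symmetric 8-bit quantization of a few weights. The scheme and the sample values are assumptions picked for illustration; nothing in this thread says which scheme the Neural Engine uses.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.42, -1.30, 0.07, 0.95]   # hypothetical trained weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

The rounding error is bounded by half the scale, which a trained network usually tolerates at inference time. Gradient updates during training are often much smaller than that step, which is one reason training usually stays in floating point.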

Inference also prefers different IO patterns, because you don't need to keep the activations for every layer ready for backpropagation.
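
A simplified model of that difference, assuming a plain feed-forward chain where each layer only needs the previous layer's output. The layer widths are hypothetical, and real frameworks complicate the picture with fused ops, recomputation, and so on.

```python
def peak_activation_elements(layer_widths, training):
    """Rough count of activation values that must be held in memory at once.

    layer_widths lists the input width followed by each layer's output width.
    Training keeps every layer's activations for the backward pass; inference
    only needs the current layer's input and output at any one time.
    """
    if training:
        return sum(layer_widths)
    return max(a + b for a, b in zip(layer_widths, layer_widths[1:]))

widths = [1024, 4096, 4096, 1024, 10]   # hypothetical network
train_peak = peak_activation_elements(widths, training=True)
infer_peak = peak_activation_elements(widths, training=False)
print(f"training: {train_peak} values, inference: {infer_peak} values")
```

The gap widens with depth: the training figure grows with every added layer, while the inference figure is set by the widest pair of adjacent layers.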

I wrote a comment comparing TensorFlow on M1 to some cloud providers. I imagine PyTorch on M1 would give similar results. I think the gist would be that the 3070 is going to be a better investment.

https://news.ycombinator.com/item?id=30608125

Here are some comparison numbers I've come across: https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning...

It is not really comparable on a steps-per-second level, but the power consumption and, now, the GPU memory will make it pretty enticing.