Hacker News
by fxtentacle 1988 days ago
"trainable_params 12,810"

laughs

(for comparison, GPT3: 175,000,000,000 parameters)

Can Apple's M1 help you train tiny toy examples with no real-world relevance? You bet it can!

Plus it looks like they are comparing Apples to Oranges ;) This seems to be 16-bit precision on the M1 and 32-bit on the V100. So the M1-trained model will most likely yield worse or unusable results, due to the lower precision.
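As a rough illustration of the precision concern (not a claim about how the benchmark was actually run), the sketch below uses Python's standard `struct` module to round a value to IEEE half and single precision. The helper names `to_half` and `to_single`, and the weight and update values, are made up for this example.

```python
import struct

def to_half(x: float) -> float:
    """Round x to IEEE 754 half precision (fp16) and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_single(x: float) -> float:
    """Round x to IEEE 754 single precision (fp32) and back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0
update = 1e-4  # hypothetical small gradient step

half_result = to_half(weight + update)      # the update is rounded away in fp16
single_result = to_single(weight + update)  # fp32 keeps most of the update

print(half_result)
print(single_result)
```

Whether this kind of rounding actually hurts a given model depends on the training setup; mixed-precision training commonly keeps an fp32 copy of the weights for exactly this reason.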

And lastly, they are plainly testing against the wrong target. The V100 is great, but it is far from NVIDIA's flagship for training small low-precision models. At the FP16 that the M1 is using, the correct target would have been an RTX 3090 or the like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.

So they compare the M1 against an NVIDIA model from 2017 that lacks the relevant hardware acceleration and, thus, is a whopping 60% slower than what people actually use for such training workloads.
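For reference, the "60% slower" figure follows directly from the two throughput numbers quoted in this comment. A minimal sketch, taking 14 and 35 TFLOPS at face value without verifying them:

```python
# Peak throughput figures as quoted in the comment (not independently verified).
v100_tflops = 14.0
rtx3090_tflops = 35.0

# Fraction by which the V100 falls short of the RTX 3090.
shortfall = 1.0 - v100_tflops / rtx3090_tflops
print(f"V100 is {shortfall:.0%} slower")
```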

I'm sure my bicycle will also compare very favorably against a car that is lacking two wheels :p

7 comments

GPT-3 is so big it would take 355 years to train on an NVIDIA V100, so your example is also not really useful for comparison. It would be interesting to see some mid-sized neural network benchmarks though.
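A back-of-the-envelope sketch of where a figure like 355 years can come from. Both inputs are assumptions that do not appear in this thread: a commonly cited estimate of about 3.14e23 floating-point operations for GPT-3 training, and an assumed sustained throughput of 28 TFLOPS for a single V100.

```python
# Assumed inputs, not stated in this thread:
total_flops = 3.14e23        # commonly cited training compute estimate for GPT-3
v100_flops_per_sec = 28e12   # assumed sustained throughput of one V100

seconds = total_flops / v100_flops_per_sec
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.0f} years")
```

With these assumptions the result lands close to the 355-year figure; a different throughput assumption shifts it proportionally.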
This, not to mention one could get the GPU usage on the V100 way higher by training with larger batch sizes, which would also make training much faster.
Thanks for the thorough comment. The article is, unfortunately, just clickbait.
It seems like a common trend with M1 articles on HN lately.
The comment is bogus empty snark (and factually wrong).

The arguments made (and I use the word arguments loosely):

"Too few trainable_params compared to GPT-3".

GPT-3 is several orders of magnitude larger than what people typically train, so it's a useless comparison. It's like we're comparing a bike to an e-bike, and someone says "yeah, but can the e-bike go faster than a rocket?"

Second argument: "Sure, it's faster than a machine that costs 3-4 times more, but you should instead compare it to a machine that costs even more than that".

I can only take it as a troll comment.

Thorough? Their comment is noisy snark.

A huge number of models are "small". I'm currently training game units for autonomous behaviors. The M1 is massively oversized for my needs.

Saying "Oh look, GPT-3" just stupidifies the conversation, and is classic dismissive nonsense.

Hard disagree. V100s are a perfectly valid comparison point. They're usually what's available at scale (on AWS, in private clusters, etc.) because nobody's rolled out enough A100s at this point. If you look at any paper from OpenAI et al. (basically: not Google), you'll see performance numbers for large V100 clusters.
Yes, and you'll see parameters tuned for the V100, not parameters tuned for the M1 somehow limping along on a V100 in emulation mode.

I wouldn't complain about a benchmark executing any real-world SOTA model on the M1 and the V100, but those will most likely not even run on the M1 due to memory constraints.

So this article is like using an iOS game to evaluate a Mac Pro. You can do it, but it's not really useful.

You can count the GPUs with more memory than the M1 (16 GB) on one hand.
Isn't the M1 GPU memory shared with everything else? Can the GPU realistically use that much? Won't the OS and base apps use up at least 2-3 GB?
The M1 can only address 8 GB with its NPU/GPU.
> The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.

V100 has both vec2 hfma (i.e. fp16 multiply-add is twice the rate of fp32), getting ~30 TFLOPS, and tensor cores which can achieve up to 4x that for matrix multiplications.
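The numbers in this reply can be laid out as simple arithmetic. A minimal sketch using only the figures given here; the fp32 rate is inferred from "twice the rate of fp32" rather than stated directly, and the variable names are made up for illustration.

```python
# Figures taken from the comment; the fp32 rate is inferred, not quoted.
fp16_vector_tflops = 30.0                         # "~30 TFLOPS" via vec2 hfma
fp32_tflops = fp16_vector_tflops / 2              # inferred from "twice the rate of fp32"
tensor_core_peak_tflops = fp16_vector_tflops * 4  # "up to 4x that" for matrix multiplications

print(f"fp32: ~{fp32_tflops:.0f} TFLOPS")
print(f"fp16 (vector): ~{fp16_vector_tflops:.0f} TFLOPS")
print(f"fp16 (tensor cores, peak): ~{tensor_core_peak_tflops:.0f} TFLOPS")
```

These are peak rates, so real training throughput on either path would be lower.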

For the first graph:

  trainable parameters: 2236682
So it's a toy model...
Many models of that size are in serious productive use.
Even the RTX 3090 is double the price of an M1 for just 1 card.

The V100 is roughly 5-10x the price of an M1.