by bullen 1547 days ago
I think we're all missing the forest because all the cores are in the way:

The contention on that memory means that only segregated, non-cooperative workloads (as in not "joint parallel on the same memory atomic") will scale better per watt on this hardware than on a vanilla 4-core Xeon from 2018.

So you might as well buy 20 Jetson Nanos and connect them over the network.

Let that sink in... NOTHING is improving at all... there is ZERO point to any hardware that CAN be released for eternity at this point.

Time to learn JavaSE and roll up those sleeves... electricity prices are never coming down (in real terms) no matter how high the interest rate.

As for GPUs, I'm calling it now: nothing will dethrone the 1030 in Gflops/W in general and below 30W in particular; DDR4 or DDR5, doesn't matter.

Memory has been the latency bottleneck since DDR3.

Please respect the comment on downvote principle. Otherwise you don't really exist; in a quantum physical way anyway.

4 comments

1030 has been dethroned a while ago. Apple G13 delivers 260GFLOPS/W in a general-purpose GPU. I mean, their phone has more GPU FLOPS than a 1030.
Nope, 1030 has 37 Gflops/W... G13 786/20W = 40... and that's 14nm vs 5nm... still I'm pretty sure there are things the 1030 can do that the A13 will struggle with.

Game Over!

G13 (in the 8-core/1024 ALU config as in M1) delivers 2.6TFLOPS with sustained power consumption of 10W. That's almost an order of magnitude better than 1030. Sure, node definitely matters, but going from 14nm to 5nm cannot explain the massive power efficiency difference alone.
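
For anyone following the arithmetic, here is a minimal sketch that puts the competing claims side by side. It only uses the figures quoted in this thread (37 Gflops/W for the 1030, 786 Gflops at 20W, and 2.6 TFLOPS at 10W); none of them are independently measured here, and the function and variable names are made up for illustration.

```python
# Figures are the ones quoted in this thread, not independent measurements.

def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return efficiency in GFLOPS per watt."""
    return gflops / watts

GT1030_QUOTED = 37.0                   # GFLOPS/W, as stated in the thread
g13_low = gflops_per_watt(786, 20)     # the "786/20W" claim
g13_high = gflops_per_watt(2600, 10)   # the "2.6 TFLOPS at 10 W" claim

for label, value in [
    ("GT 1030 (quoted)", GT1030_QUOTED),
    ("G13 at 786 GFLOPS / 20 W", g13_low),
    ("G13 at 2600 GFLOPS / 10 W", g13_high),
]:
    print(f"{label}: {value:.1f} GFLOPS/W")

print(f"high claim vs 1030: {g13_high / GT1030_QUOTED:.1f}x")
```

The whole disagreement comes down to which throughput and power inputs you accept; the division itself is not in dispute.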

What are the things that 1030 can do that G13 will struggle with?

Your numbers are completely wrong. The claim of 2.6 Tflops for the M1 was independently verified.

https://www.realworldtech.com/forum/?threadid=197759&curpost...

There is not a single page on the whole internet that states Gflops and Watt on the same page. I did 2 googlings: "apple g13 gflops" and "apple g13 watt"... the results were completely disturbing, seeing as this info should be clearly available. When you google 1030 gflops and watt, all the links on google point to pages stating both figures, and they are the same...

M1 comes in MANY flavours with different watt and gflops.

And for CPU Gflops I had to get friends to measure it themselves: 2.5 Gflops/W compared to Raspberry 4 at 2 Gflops/W, and this time it's 5nm vs 28nm.

Please give me official Gflops and Watt sources and we can discuss.

The page you link does NOT state watts in a clear way.

> What are the things that 1030 can do that G13 will struggle with?

In real life, when you develop games for real hardware, you notice their real limitations, like fill rates and how they scale different behaviours because they have enough registers to do things in one blow or they have to remember things. It's complicated, but eventually you realize you can just benchmark things for your own needs and for me 1030 is for all purposes as good as 1050 so far: 2000 non-instanced animated characters on 1030 at 30W vs like 2500 for 80W 1050!
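
The per-watt comparison behind that benchmark can be sketched as below. The character counts and wattages are the approximate ones stated in the comment (the 1050 figure is given as "like 2500"), and the names are hypothetical.

```python
# Approximate numbers from the comment above; not re-measured.

def characters_per_watt(characters: int, watts: float) -> float:
    """Animated characters rendered per watt of board power."""
    return characters / watts

gt1030 = characters_per_watt(2000, 30)   # 1030: 2000 characters at 30 W
gtx1050 = characters_per_watt(2500, 80)  # 1050: roughly 2500 characters at 80 W

print(f"1030: {gt1030:.1f} characters/W")
print(f"1050: {gtx1050:.1f} characters/W")
print(f"1030 advantage: {gt1030 / gtx1050:.2f}x")
```

This is a workload-specific metric, so it says nothing about how either card fares on other tasks.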

Without knowing, I'm pretty sure the M1 cannot do more than 1000 at whatever watt it uses... not that I would ever compile anything for a machine where I need to sign the executable.

“Official sources” in this case is testing done by me personally. I am the author of the post on RWT linked previously. I would be happy to share my benchmarking code for your scrutiny if you want. The M1 variant tested was base M1 in a 13” MacBook Pro.

I don’t know what your friends have tested but the results make zero sense. Firestorm reaches 1700 points in GB5 at 5W. Pi 4 is under 300 at similar wattage.

How do you measure watts?

First answer on google: "maximal power consumption is around 50 watts"

Firestorm is GPU (again google has little info) I'm talking CPU for the Raspberry.

The Raspberry 4 GPU uses 1W. You are conflating because of sunk cost.

You need to compare the same things, apples to apples (no pun): one CPU core on the Raspberry consumes 1W; on the M1 it is 4W.

GPU is 1W vs 5W (if you are correct which I HIGHLY doubt, I suspect 20W for the GPU alone, wikipedia states 39 watts at maximum load so yes 20W for the GPU)!

You need to start looking at the world objectively and understand how it really works, because peak energy is not going to be forgiving if you don't.

Well, it seems like you made up your mind without doing any testing or educating yourself, so I am not quite sure what I can do to help you. Already your entirely nonsensical comment of “needing to sign the executable” speaks volumes. Why did I never need to sign anything despite building software on M1 machines daily for the last year? I wonder…
All code that runs on Apple silicon must be signed. If you don't explicitly sign your executable, the linker will inject an ad-hoc signature into your binary for you.
I wouldn’t say it’s all over. People have been saying that it’s all over for longer than I can remember, and there is always someone smarter and more clever. The GPU space is ripe for disruption, the memory space is ripe for disruption, and the CPU space is being disrupted presently. For all I know, some genius has it worked out now and is going to launch a new startup sometime this month.
Aren't you ignoring use cases where all cores read shared data but rarely write to it contentiously? You should get much more read bandwidth and much lower latency than over a network.
Sure, but my point is: why cram more and more cores into the same SoC if they can't talk to each other more efficiently than separate computers over ethernet?
This point feels like arguing why any organization would seek density in computing if they can just buy more of something and spread it out. I don't know about you but I've saved a ton of effort on design complexity by not distributing workloads when it can be avoided (but distributed computing is a solved problem).

I recognize what you are calling out: that performance will be the same on some workloads whether you distribute or not. I would just point out that less manufacturing causes less e-waste. I would rather live in a world where Nvidia sells 50 million 10*0 cards than 500 million 1030 cards to create the same amount of compute in the world. It's not just the power costs to consider (but it could be there is a reality where running 500 million 1030s for their lifetime wastes so much less power that the manufacturing costs to the planet are worth it).

Your point is rooted in wrong facts. On-chip fabrics are much more efficient than separate computers over ethernet. More energy efficient and lower latency.

Not only that. On-chip gives you high precision synchronous time (all cores observe the same time) so you can use synchronous distributed algorithms that are unsuitable for ethernet networks.

This type of hardware allows for much better solutions to some problems.

Latency! Nanoseconds versus microseconds or even milliseconds makes a huge difference.
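
To make the orders of magnitude concrete, here is a rough sketch of how many strictly dependent round trips fit into one second at each scale. The latencies (100 ns, 100 µs, 1 ms) are assumed round numbers chosen only to represent "nanoseconds", "microseconds" and "milliseconds"; they are not measurements of any particular chip or network.

```python
# ASSUMED illustrative latencies; not measurements of real hardware.
NANOSECONDS_PER_SECOND = 1_000_000_000

ASSUMED_LATENCY_NS = {
    "on-chip (assumed 100 ns)": 100,
    "fast network (assumed 100 us)": 100_000,
    "slow network (assumed 1 ms)": 1_000_000,
}

def dependent_round_trips_per_second(latency_ns: int) -> int:
    """Round trips per second when each must wait for the previous one."""
    return NANOSECONDS_PER_SECOND // latency_ns

for label, latency_ns in ASSUMED_LATENCY_NS.items():
    trips = dependent_round_trips_per_second(latency_ns)
    print(f"{label}: {trips:,} round trips/s")
```

Under these assumptions, an algorithm that synchronizes on every step gets three to four orders of magnitude more steps per second on-chip, which is the gap the comment is pointing at.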
Performance per watt isn’t so useful for a GPU. People training ML algorithms would gladly increase power consumption if they could train larger models or train models faster.
And that's exactly my point: they can't. Power does not solve contention and latency! It's over, permanently... (or at least until some photon/quantum alternative, which honestly we don't have the energy to imagine, let alone manufacture, anymore)