Hacker News new | ask | show | jobs
by morphle 538 days ago
> Where does the 270 gbit/s networking figure come from? Is it the aggregate bandwidth from the pcie slots on the Mac pro

We both should restate and specify the calculation for each different Apple Silicon chip and the PCB/machine model it is wired onto.

The $599 M4 Mac mini base model networking (aggregated Wifi, USB-C, 10G Ethernet, Thunderbolt PCIe) is almost 270 Gbps. Your 720 Gbps is for a >$8000 Mac Pro M2 Ultra but the number is to high because the 2x Gen4x16 is shared/oversubscribed with the other PCIe lanes for x8 PCIe slots, SSD and Thunderbolt. You need to measure/benchmark it, not read the marketing PR.

I estimate the $1400 M4 Pro Mac mini networking bandwidth by adding the external WiFi, 10 Gbps Ethernet, two USC-C ports (2 x 10 Gbps) and three Thunderbolt 4 ports (3 x 80/120 Gbps) but subtracting the PCIe 64 Gbps limit and not counting the internal SSD. Two $599 M4 Mac mini base models are faster and cheaper than one M4 Pro Mac mini.

The point of the precise actual measurements I did of the trillion opereations per second and the billion of bits per second networking/interconnect of the M4 Mac mini against all the other Apple silicon machines is to find which package (chip plus pcb plus case) has the best price/performance/watt balanced against them networked together. On januari 2025 you can build the cheapest fastest supercomputer in the world from just off the shelf M4 16Gb Mac mini base models with 10G Ethernet, Mikrotek 100G switches and a few FPGA's. It would outperform all Nvidia, Cerebras, Tenstorrent and datacenter clusters I know of, mainly because of the low power Apple Silicon.

Note that the M4 has only 1,2 Tips unified memory bandwidth and the M4 Pro has double that. The 8 Tops unified memory bandwidth is on the M1 and M2 Studio Ultra with 64/128/192GB DRAM. Without it you cant's reach 50 trillion operations per second. A Mac Studio has only around 190 Gbps external networking bandwidth but does not reach 43 trillion TOPS, as does the 720 Gbps of your Mac Pro estimate. By reverse engineering the instruction set you could squeeze a few percent extra performance out of this M4 cluster.

The 43 trillion TOPS of the M4 itself is an estimate. The ANE does 34 TOPS, the CPU less than 5 TOP depending on float type and we have no reliable benchmarks for the CPU floating point.

2 comments

It's very weird to add together all kinds of very different networking solutions (WiFi, wired ethernet, TB) and talk about their aggregate potential bandwidth as a single number.
Adding together all the different standards/feature sets a chip supports and then aggregating the bandwidth into a single number is actually a very reasonable way to arrive at an approximation for total chip computational throughput.

Ultimately, unless the chip architecture is oversubscribed or overloaded (unsure what the right term is), the features are all meant to be used simultaneously and thus the bits being read/written have to come from somewhere.

That somewhere is a % of the total throughput of the chip.

Stated another way — people forget that there’s almost always a single piece of silicon backing the total bandwidth throughput of modern computing devices regardless of what ‘standard’ is being used.

The pcie configuration was taken from the mac pro and it's m2 ultra. https://www.apple.com/mac-pro/

I'd assume the mac mini has a less extensive pcie/tb subsystem.

No idea what people are doing with all those pcie slots except for nvme cards. I wonder how hard it would be to talk to a pcie fpga.

You use SerDes high speed serial links (up to 224 Gbps in 2025) to communicate between chips. A PCIe lane is just a Serdes with a 30% packet protocol overhead that uses DMA to copy bytes between to SRAM or DRAM buffers.

You aggregate PCIe lanes (x16, x8, x4/Thunderbolt, x1). You could also built mesh networks from SerDes but now instead of PCIe switches You would need SerDes switches or routers (Ethernet, NVlink, Infiniband).

You need those high speed links between chips for much more than SSD/NVME cards. Other NAS, Processors, Ethernet/internet, Camera, Wifi, Optics, DRAM, SRAM, power etc. For intercore communication (between processors or between chiplets), between networked PCB's, between DRAM chips (DDR5 is just another SerDes protocol), Flash Chips, camera chips, etc. Any other chip at faster then 250 Mbps speeds.

I aggregate all the M4 Mac mini ports into a M4 cluster by mesh networking all its Serdes/PCIe with FPGAs into a very cheap low power supercomputer with exaflop performance. Cheaper than NVDIA. I'm sure Apple does the same in their data centers.

My talk [1] on Wafer Scale Integration and free space optics goes deeper into how and why SerDes and PCIe will be replaced by fiber optics and free space optics for power reasons. I'm sure several parallel 2 Ghz optic lambdas per fiber (but no SerDes!) will be the next step in Apple Silicon as well: the M4 power budget already is mostly in the off-chip SerDes/Thunderbolt networking links.

[1] https://vimeo.com/731037615

> I aggregate all the M4 Mac mini ports into a M4 cluster by mesh networking all its Serdes/PCIe with FPGAs into a very cheap low power supercomputer with exaflop performance. Cheaper than NVDIA. I'm sure Apple does the same in their data centers.

That sounds super interesting, do you happen to have some further information on that? Is it just a bunch of FPGAs issuing DMA TLPs?

sounds (at least at a high level) similar to EXO[1]

[1] https://github.com/exo-explore/exo

Here a video of testing Exo to run huge LLMs on a cluster of M4 Macs[1] more cheaply than with a cluster of NVDIA RTX 4090s.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho

They show a test-run of a 1B llama-3.2 model. Doesn't that fit in a single mac? Distributing the workload in this case must be slower than running it on a single machine.

However, this is interesting and I'm confused why aren't they showcasing the test-run of a larger model that actually necessitates distributing the workload across the cluster.

It is not the first time they built super computers from off the shelf Apple machines [1].

M4 supercomputers are cheaper and it also will be lower Capex and Apex for most datacenter hardware.

>do you happen to have some further information on that?

Yes, the information is in my highly detailed custom documentation for the programmers and buyers of 'my' Apple Silicon super computer, Squeak and Ometa DSL programming languages and adaptive compiler. You can contact me for this highly technical report and several scientific papers (email in my profile).

Do you know of people who might buy a super computer based on better specifications? Or even just buyers who will go for 'the lowest Capex and the lowest Opex supercomputer in 2025-2027'?

Because the problem with HPC is that almost all funders and managers buy supercomputers with a safe brand name (Nvidia, AMD, Intel) at triple the cost and seldom from a super computer researcher as myself. But some do, if they understand why. I have been designing, selling, programming and operating super computers since 1984 (I was 20 years old then), this M4 Apple Silicon Cluster will be my ninth supercomputer. I prefer to build them from the ground up with our own chip and wafer scale integration designs but when an off-the-shelf chip is good enough I'll sell that instead. Price/Performance/Watt is what counts, ease of programming is a secondary consideration for what performance you achieve. Alan Kay argues you should rewrite your software from scratch [2] and do your own hardware [3] so that is what I've done sinds I learned from him.

>Is it just a bunch of FPGAs issuing DMA TLPs?

No. The FPGA's are optional for when you want to flatten the inter-core (=inter-SRAM cache) networking with switches or routers to a shorter hop topology for the message passing like a Slim fly diameter two hop topology [4].

DMA (Direct Memory Access) TLPs (Transaction Layer Packets) are one of the worst ways of doing inter-core and inter-SRAM communication and on PCIe it has a huge 30% protocol overhead at triple the cost. Intel (and most other chip companies like NVIDIA, Altera, AMD/XILINX) can't design proper chips because they don't want to learn about software [2]. Apple Silicon is marginally better.

You should use pure message passing between any process, preferably in a programming language and a VM that uses pure message passing at the lowest level (Squeak, Erlang). Even better if you then map those software messages directly to message passing hardware as in my custom chips [3].

The reason to reverse Apple Silicon instructions for CPU, GPU and ANE are to be able to adapt my adaptive compiler to M4 chips but also to repurpose PCIe for low level message passing with much better performance and latency than DMA TLPs.

To conclude, if you want to get the cheapest Capex and Opex M4 Mac mini supercomputer you need to rewrite your supercomputing software in a high level language and message passing system like the parallel Squeak Smalltalk VM [3] with adaptive load balancing compilation. C, C++, Swift, MPI or CUDA would result in sub-optimal software performance and orders of magnitude more lines of code when optimal performance of parallel software is the goal.

[1] https://en.wikipedia.org/wiki/System_X_(supercomputer)

[2] https://www.youtube.com/watch?v=ubaX1Smg6pY

[3] https://vimeo.com/731037615

[4] https://www.youtube.com/watch?v=rLjMrIWHsxs

I forgot to add links to talk [5] by IBM Research on massively parallel Squeak Smalltalk and why it might be relevant for Apple Silicon reverse engineering and M4 clusters.

Talk [6] on free space optical interconnects without SerDes some day showing up on low power Apple Silicon (around M6-M8 models).

[5] https://www.youtube.com/watch?v=GBtqQwcJoN0

[6] https://www.youtube.com/watch?v=-dQoImLNgWs