Hacker News
by kortex 2023 days ago
https://news.ycombinator.com/item?id=25257932

My 10,000' view understanding:

- Small feature size. M1 is a 5nm process. Intel is struggling to catch up to TSMC's 7nm process

- more efficient transistors. 7nm uses finFet. M1 probably uses GAAFET, which means you can cram more active gate volume in less chip footprint. This means less heat and more speed

- layout. M1 is optimized for Apple's use case. General purpose chips do more things, so they need more real estate

- specialization. M1 offloads oodles of compute to hardware accelerators. No idea how their code interfaces with it, but I know lots of the demos involve easy-to-accelerate tasks like codecs and neural networks

- M1 has tons of cache, IIRC, something like 3x the typical amount

- some fancy stuff that allows them to really optimize the reorder buffer, decoder, and branch prediction, which also leverages all that cache

6 comments

One small correction: TSMC's 5nm still uses FinFET. Samsung's 3nm is being tested with GAAFET, which Samsung has slated for 2021. TSMC's roadmap also has it, but for 2022, and Intel's for 2025.

The fancy stuff with the reorder buffer, decoder, and branch prediction is the most important thing. The M1 is just as general purpose as any Intel/AMD CPU, and indeed even more so because it's designed to scale from cell phones to desktops.

Specialization helps when it helps, but doesn't do much on typical programs and M1 still excels on those.

Can you explain what you mean by Apple's use case vs general purpose chips?

The M1 does one thing: run macOS and macOS apps. They can control the vast majority of the compiled code that will be run on the chip - unlike an x86 platform where the exact same architecture is used for desktops, servers and everything in between - including Linux, Windows, and Mac.

Specifically, there is a reference counting optimization on the M1 that dramatically helps performance of compiled Swift apps - something only worthwhile if you know the majority of what the chip will ever do is run Swift apps.
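
To make the reference-counting point concrete, here is a toy Python model. It is not Apple's runtime and not Swift; it only assumes that each time a reference is handed to a new owner the count is incremented and later decremented, which is roughly how automatic reference counting behaves. The names `RefCounted` and `pass_around` are made up for illustration, and the lock merely stands in for an atomic instruction.

```python
import threading


class RefCounted:
    """Toy reference-counted object (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self._count = 1                 # the creator holds the first reference
        self._lock = threading.Lock()   # stand-in for a hardware atomic op
        self.atomic_ops = 0             # how many counter updates happened
        self.alive = True

    def retain(self):
        # Every new owner bumps the count atomically.
        with self._lock:
            self._count += 1
            self.atomic_ops += 1
        return self

    def release(self):
        # Every owner that goes away drops the count; zero means "free it".
        with self._lock:
            self._count -= 1
            self.atomic_ops += 1
            if self._count == 0:
                self.alive = False


def pass_around(obj, hops):
    """Hand the object to `hops` short-lived owners, one after another."""
    for _ in range(hops):
        ref = obj.retain()
        ref.release()


obj = RefCounted("view")
pass_around(obj, 1000)
obj.release()  # the creator lets go last; the object is now dead
```

Handing the object around 1,000 times costs 2,001 counter updates in this model, so the cost of an uncontended atomic increment or decrement can add up quickly in code written this way.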

It also does that for, basically, one device (the three models available so far are almost identical: fan or no fan, and some binning on the GPU).

That’s one further reason their system is faster: they designed a system, and the design of their CPU, GPU, etc. was driven by what the final system needed.

Other system manufacturers buy individual parts, where the manufacturer of each part extends/optimizes it with only a vague knowledge of what the system it will be used in will look like (and they don’t want to focus on one specific system, as that would mean they can sell it to fewer device manufacturers)

> The M1 does one thing: run macOS and macOS apps. They can control the vast majority of the compiled code that will be run on the chip - unlike an x86 platform where the exact same architecture is used for desktops, servers and everything in between

The M1 is strongly related to their A14, which runs phones and tablets.

Also: what's between a server and a desktop machine?

Could there also be hardware acceleration or built-in support for Objective-C's message passing? I've always wondered how Apple gets decent performance with Objective-C in spite of message passing, given its complexity compared to vtables (and vtables have the advantage of being easily cacheable in L1/L2).

I haven't read it, but this article might address your performance question: https://www.mikeash.com/pyblog/friday-qa-2017-06-30-dissecti...

Typical Objective-C code is mostly C (and sometimes C++). It doesn’t use message-passing in the hot loops.
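
For readers wondering how selector-based dispatch can be made cheap in software, here is a minimal Python sketch of a lookup with a per-class cache. It is an assumption-laden toy, not the Objective-C runtime: `ToyClass`, `lookup_slow` and `msg_send` are hypothetical names, and a real runtime does this in hand-tuned assembly. The idea shown is only that the expensive walk up the class hierarchy happens once per selector, after which a send is a single table hit.

```python
class ToyClass:
    """Hypothetical class record: own methods, a superclass, and a cache."""

    def __init__(self, name, methods, superclass=None):
        self.name = name
        self.methods = methods        # selector -> implementation
        self.superclass = superclass
        self.cache = {}               # filled lazily by msg_send
        self.slow_lookups = 0         # counts hierarchy walks


def lookup_slow(cls, selector):
    """Walk up the superclass chain until some class implements selector."""
    c = cls
    while c is not None:
        if selector in c.methods:
            return c.methods[selector]
        c = c.superclass
    raise AttributeError(f"{cls.name} does not respond to {selector}")


def msg_send(cls, selector, *args):
    """Dispatch by selector, consulting the per-class cache first."""
    imp = cls.cache.get(selector)
    if imp is None:
        cls.slow_lookups += 1
        imp = lookup_slow(cls, selector)
        cls.cache[selector] = imp     # later sends skip the walk
    return imp(*args)


base = ToyClass("Base", {"describe": lambda: "base description"})
child = ToyClass("Child", {}, superclass=base)
results = [msg_send(child, "describe") for _ in range(3)]
```

Three sends to `child` trigger only one slow lookup here; the other two are served from the cache. A vtable call skips even the cache probe, which is the gap the question above is asking about.
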
> - Small feature size. M1 is a 5nm process. Intel is struggling to catch up to TSMC's 7nm process

See: https://news.ycombinator.com/item?id=25277124

>- specialization. M1 offloads oodles of compute to hardware accelerators. No idea how their code interfaces with it, but I know lots of the demos involve easy-to-accelerate tasks like codecs and neural networks

That wouldn't help in benchmarks, would it? Only in the most dishonest benchmarks would they compare x264 to a hardware H.264 encoder, for instance.

Maybe - but how do you know they are offloading to a hardware encoder?

Example: I run a benchmark for AES encryption - a modern CPU will have circuitry designed explicitly for this task and its asm instructions. An old CPU just supporting the base x86 instructions probably doesn't have a hardware solution. Is it unfair to compare them?

If the utilisation of the hardware accelerators is completely opaque to the user (not importing special libraries), is it unfair that one CPU has a specific hardware implementation for common tasks and one only has the generic circuitry?

> I run a benchmark for AES encryption - a modern CPU will have circuitry designed explicitly for this task and its asm instructions. An old CPU just supporting the base x86 instructions probably doesn't have a hardware solution. Is it unfair to compare them?

YES! Unless you're specifically searching for the fastest AES CPU.

If you want to compare general performance this benchmark is flawed. E.g.: I could have the fastest CPU in existence, but since it happens to be lacking hardware AES circuitry, your benchmark will always show another CPU as the 'fastest'.

It's not 'unfair' or whatever. It just makes it so that you need to think better about your benchmark, what you want to measure, and what you're actually measuring. Or you need to adjust your conclusion: "this is the fastest CPU" -> "this CPU performs best on this specific task"
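
A small worked example of that distinction, in Python. The scores below are invented for illustration (they are not measurements of any real chip), and `cpu_a`, `cpu_b`, `best_on` and `overall` are hypothetical names. The sketch assumes higher scores are better and uses a geometric mean as one common way to summarise several tasks.

```python
from statistics import geometric_mean

# Hypothetical throughput scores, higher is better. Not real measurements.
# cpu_b has dedicated AES hardware; cpu_a is faster at everything else.
scores = {
    "cpu_a": {"aes": 10.0, "compile": 15.0, "sort": 15.0},
    "cpu_b": {"aes": 20.0, "compile": 8.0, "sort": 8.0},
}


def best_on(task):
    """Return the CPU with the highest score on a single task."""
    return max(scores, key=lambda cpu: scores[cpu][task])


def overall(cpu):
    """Summarise one CPU across all tasks with a geometric mean."""
    return geometric_mean(list(scores[cpu].values()))


aes_winner = best_on("aes")               # winner on the one accelerated task
overall_winner = max(scores, key=overall)  # winner across the whole suite
```

With these made-up numbers `cpu_b` wins the AES-only benchmark while `cpu_a` wins the aggregate, so an AES-only result supports "performs best on this specific task" and nothing broader.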

Thanks for the link to the old HN item. Missed it the first time. Seems like having the ability to control the entire SoC helped Apple make a lot of decisions specific to how they wanted to use the chips.