Hacker News new | ask | show | jobs
by inkyoto 1477 days ago
> No, it's because they lost the lead in fabs. AMD passed Intel, was it because of their designs? No, it's because they use TSMC's fabs and TSMC passed Intel. Apple blew everyone away with M1, was it because of their designs? No, it's because they paid TSMC boatloads of money for exclusivity on their latest fabs.

The fixation on the fab process is bewildering. Yes, it does help, but it is also an optimisation step that is decoupled from and that bears no relevance on the chip design. Yes, the smaller node size also brings the increased density along and an increased number of things that can be whacked into the same sized piece of silicon, but it will not magically improve the overall system performance or result in the linear architecture scalability.

The article is specifically calling out a potentially decreased ROB size in M2 cores, and ARMv9 also potentially not arriving until M3 which are crucial to the speed or software performance. There is absolutely nothing the fab process can do to make SVE2 and matrix instructions automagically appear in lithographic chip designs – those are the «silicon» design time decisions. As we have recently been seeing more and more practical, mainstream use cases of the advanced use of the SIMD instructions at the C/C++/Rust runtime level that bring an order of magnitude level performance gains, having the SVE2 implementation at the ISA level is becoming somewhat critical.

3 comments

Stuff like adding SVE2 can be great for specific applications but it's really marginal when looking at whole system performance. What's not marginal are the improvements in power efficiency and room for more cache that come with new process nodes. These chips are power constrained in almost everything they do, because of heat dissipation or battery life or both. Less power and more cache benefits everything automatically, not just the very few things that actually start using new SIMD instructions or other new hardware blocks each year.
> Stuff like SVE2 is really marginal when looking at whole system performance.

It is not. A recent paper (https://arxiv.org/pdf/2205.05982.pdf) from Google engineering has compared performance of a vectorised (SIMD) vs non-vectorised implementation of the quick sort in the Highway library as well as the performance difference of the AVX-512 vs NEON/SVE1 implementations. By switching to the SIMD processing alone, the 9-19x speedup has been reported, depending on the SIMD unit size (32/64/128-bit numbers have been sampled and measured up). Even the smallest of the two, the 9x perfomance gain factor, is far from being marginal.

On the SIMD unit size of things, the performance difference between AVX-512 (the average of 1120 Mb/sec has been measured) and NEON implementation (the 478 Mb/sec throughput on average) is 2.4x smaller for NEON/SVE1 largely due to the smaller width of the units of processing. Again, the 2.4x factor is not in the marginal territory.

> What's not marginal is the improvements in power efficiency that come with new process nodes.

And that is an optimisation step, albeit a very important one. However, it will not make a quick sort implementation run 2.4x faster alone.

You completely ignored the "whole system performance" part of my statement. What percentage of your CPU time is spent running SIMD-optimized implementations of Quicksort? Now apply Amdahl's law.
«Whole system performance» is a meaningless term as it is a function of many, usually poorly controlled, input variables, and your whole system is different from my whole system. If my VPN tunnel allows me to have faster transfer speeds simply by virtue of having ISA assisted optimisations in the cryptographic library it uses, the net result will be very noticeable to me but perhaps not for you unless you also have to use the same VPN client.

Even the web browser you are using right now to comment on HN likely makes use of the very same Highway library (Chrome and Firefox certainly do, unsure about Safari) the speedup gains have been reported for. The «overall» browser performance will also improve as the result due to it receiving gains transparently, by simply dropping an optimised implementation into the browser build.

The idea that my overall browser performance might improve in a noticeable way because it is switching from NEON optimized Quicksort to SVE2 optimized Quicksort is simply laughable. On the other hand, switching to a processor fabbed on a better process node could easily have a noticeable impact on overall browser performance, or battery life while browsing, or both.
It is certainly not laughable to me. You have singled out Quicksort as the sole example of performance gains whereas I have used it as a single isolated example of very large performance gains that can be had. SIMD instructions have seen a lot of other mainstream use cases recently which also includes the memory copying or memcpy(3) optimisations amongst others. Your browser has a Javascript engine, and since it a Javascript engine, it has a garbage collector. Garbage collectors move memory blocks around all the time, and a SIMD optimised memcpy will yield substantial performance gains. Or faster JSON processing. Therefore, SIMD + an improved fab process will result in much larger overall performance gains for you and me as browser users as opposed to an improved fab process alone. It is also a realistic example of the «whole system performance» improvements if the browser is treated as a «whole system».

And an optimised QuickSort can also come in handy if one pokes around a large browser history or uses it as a knowledge base, which I do and use it on a regular basis. My browser keeps a uninterrupted record of all visited websites over the last 15+ years and being able to zoom in on a particular time span to find something within that temporal range quickly is important to me. I am almost certain that a sorting of sorts is involved somewhere behind the scenes.

Soo... basically a 2x speedup in going from 4x128b to 2x512b ALUs, after discounting the frequency difference. But realistically, Intel's client configurations are 3x256b, which is only 25-40% faster in that paper.

(I suspect any application doing enough quicksort that the 2x speedup is significant, would be even happier going slightly off-core to a coprocessor more specialized in vector processing, like Hwacha. There's plenty of space between "tightly-coupled CPU SIMD" and "GPU" that I think makes more sense than needing to implement 512-bit registers in little cores.)

> Soo... basically a 2x speedup in going from 4x128b to 2x512b ALUs, after discounting the frequency difference. But realistically, Intel's client configurations are 3x256b, which is only 25-40% faster in that paper.

2.4x difference was, in fact, reported, however I still find it somewhat difficult to interpret the reported results. The processing unit size difference alone and the number of LU's can't account for such a big difference in transfer speeds as the M1 Max that was used in the assessment has a very wide memory bus (256 bit wide for a performance core cluster or 512 bit wide for the entire SoC) as well as unusually large L1-D cache and a large L2 cache, with both caches having deep TLB's. The test set they used could also fully fit into the L2 cache. I have asked the Google engineer a question in a separate thread about what else could influence the observed performance difference but have not received a satisfactory explanation.

I am that engineer; happy to clarify/go into more detail. First, just to ensure we're on the same page: the speeds we quote are sort goodput, i.e. amount of sorted data produced per time. Memory bus is only incidental to keeping the vector units fed.

The key bottleneck is partitioning. AVX-512 does really well there because it has dedicated compressstore instructions, and it's actually even faster to partition a vector via vperm* (because we only need to do that once, whereas two compressstore are required to partition). So AVX-512 reaches >25 GB/s partition throughput per core; it's instead limited by the memory bandwidth each core can access (around 11 GB/s if a single core is active, less when all are competing for the total "128 GB/s").

By contrast, NEON for example in the M1 has 128-bit vectors. Its "4 vector units" (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely - Intel can also only execute some instructions on certain ports) are definitely not as good as actual 512 bit vectors, because partitioning only has a left and right side, and we don't have enough ILP for each of those to keep 2 vector units busy. Hence NEON reaches 11 GB/s partition throughput. It would seem like this matches Skylake, but no: once a subarray fits into cache, Skylake is freed from the memory bottleneck and is at least twice as fast there (which is a sizable fraction of the total sort time).

Does this help explain the results?

> The test set they used could also fully fit into the L2 cache.

This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?

> (even if they can actually execute all instructions concurrently, which is not clear to me and unlikely - Intel can also only execute some instructions on certain ports)

It's pretty symmetrical, moreso than Cortex-X2's 4 pipelines; there's analysis that on M1 only some floating point and crypto instructions can't execute on all 4 pipelines. [1] (TP in that table is inverse throughput)

Which means that, for example, byte permutes from tables 256b or less can actually achieve the same throughput on M1 as with Intel's AVX-512, since M1 can sustain 4x 2-register TBL per cycle. And doing the exact equivalent of a 512b vpermb (3 cycle latency, 1/cycle throughput) can be done with 5 cycles latency and 0.33/cycle throughput on M1, via 4x 4-register TBL.

Well, a vpermd in NEON would need an extra MLA to convert indexes, and vpermi2* equivalents fall off a cliff. And Intel still has p01 free, and COMPACT is SVE. But in general, a lot of the parallelism that enables AVX-512 implementations will convert directly into ILP across 128b vectors.

> This seems unlikely because we're sorting 8 MB and my understanding is that cores (unless L2==LLC) generally have private, partitioned L2 caches, so 3 MB in the case of M1. Is that incorrect?

Anandtech [2] measured the same L2 latency up to about 8MB single-core, so regardless of the details, 8MB is a pretty significant cliff on M1. Regardless, RAM bandwidth is ~60GB/s, and unlike Intel, can be just about saturated by a single core.

[1] https://dougallj.github.io/applecpu/firestorm-simd.html

[2] https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

If something is bound by memory/cache bandwidth, then increasing ALU width wouldn't help in the first place.
This is an interesting question. Has there been much uptake of such accelerators/coprocessors? One concern is that by the time the HW is ready, SW wants to do something different, perhaps fusing some other step with the sort. Another is deployability: everyone has SIMD/vectors on board, but even GPUs aren't quite everywhere nor so easy to scale out.

Also, there are now several RISC-V CPUs with 512-bit vectors, and it seems fair to call them little cores especially compared to x86 and M1/M2. Perhaps 512-bit is more feasible (and sensible) than is widely believed?

> adding SVE2 can be great for specific applications but it's really marginal when looking at whole system performance. What's not marginal are the improvements in power efficiency

Depends on the applications, I suppose. But did you know that (at least on OoO x86), the energy cost of scheduling an instruction dwarfs that of the actual computation? That is why SIMD, including SVE2, can be so important - it amortizes that cost over several elements. Let's spend (more of) our energy budget on actual work.

Is it really just "very few things that actually start using new SIMD"? I'm not a huge fan of autovectorization, but even that is able to vectorize some fraction of STL algorithms. And there are several widely used libraries, including image/video codecs and encryption, that use SIMD and wouldn't be feasible otherwise.

Android flagships are shipping with SVE2 as of this year, which I actually didn't realize until like two weeks ago because there's been nearly zero buzz about it. What's SVE2 being used for over NEON as of now?
Low level runtime optimisation that yields substantial performance gains in the user facing or system level software, ranging from cryptography through to data processing algorithms and very high throughput JSON parsing.

Take OpenSSL as an isolated example. By simply fiddling with the C compiler flags to allow it to use NEON on M1, the sha256 calculation speed-up is 4x for 128 and 256 block sizes, with performance gains quickly tapering off for larger block sizes and resutling in a modest 10% increase only. And that performance increase happens without the involvement of hash functions having been manually optimised for NEON/SVE1.

SVE2 with its variable vector size support could improve performance for larger unit sizes. Perhaps it is the time to spin up a Graviton3 instance and poke around with clang/gcc to see how actually good or faster the SVE2 is.

Yeah that's NEON. And there's instructions that literally calculate SHA256 so generalizing that is moot. My point was first, what real benchmarks are there of SVE2's benefits over NEON with mainstream CPUs that M2 would compete against? Unlike AVX-512, NEON was already pretty rich, so the new instructions have rather specialized usefulness.

Because outside of servers where little cores don't exist, 256b ALUs in big cores mean 256b registers in little cores, and Cortex-A510 is way smaller than Gracemont. And then you're giving Samsung another opportunity to screw up big.LITTLE...

And even the server CPUs with SVE are 2x256b except A64FX which is HPC exclusive, so no better than 4x128b.

SVE2 does not increase the maximum speed. That depends only on the width and number of the ALUs, on the number of cores and on the clock frequency.

The purpose of SVE2 is to simplify the writing of the software that exploits the data parallelism, both when that is done manually and when that is done automatically by an autovectorizing compiler.

With SVE2 it should become much easier to deal with data structures where the sizes and the alignments are not multiples of the ALU width and it will also no longer be necessary to write many alternative code paths, to take advantage of any future better CPUs, like when optimizing for Intel SSE/AVX/AVX2/AVX-512.

There are still a majority of programs that do not utilize as frequently as possible the existing SIMD units. With SVE2, their number should diminish.

Fab process is very important. The fast design is nothing if you can't build it.