Hacker News
by fhqghds 2186 days ago
Get ready for a surprise then: all those FLOPS are coming from the ARM cores... This beast has no GPUs:

https://postk-web.r-ccs.riken.jp/spec.html


It looks like this is not an ARM core, but a Fujitsu implementation of the Armv8-A instruction set and the Fujitsu-developed Scalable Vector Extension. Most likely the latter is doing all the heavy lifting.

https://www.fujitsu.com/global/about/resources/news/press-re...

>A64FX is the world's first CPU to adopt the Scalable Vector Extension (SVE), an extension of Armv8-A instruction set architecture for supercomputers. Building on over 60 years' worth of Fujitsu-developed microarchitecture, this chip offers peak performance of over 2.7 TFLOPS, demonstrating superior HPC and AI performance.
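
To see where a figure like "over 2.7 TFLOPS" can come from, here is a back-of-the-envelope sketch. The core count, number of FMA pipes, and clock below are assumptions taken from commonly cited A64FX specs, not from the quoted press release, and `peak_tflops` is a hypothetical helper name.

```python
def peak_tflops(cores, fma_pipes, vector_bits, clock_ghz, element_bits=64):
    """Theoretical peak in TFLOPS for a CPU with SIMD fused multiply-add pipes."""
    lanes = vector_bits // element_bits              # elements per vector register
    flops_per_cycle = cores * fma_pipes * lanes * 2  # one FMA counts as 2 FLOPs
    return flops_per_cycle * clock_ghz / 1000.0      # GFLOPS -> TFLOPS


# Assumed A64FX-like parameters (not stated in the quoted text):
# 48 compute cores, 2 FMA pipes per core, 512-bit vectors, 1.8 GHz.
a64fx_peak = peak_tflops(cores=48, fma_pipes=2, vector_bits=512, clock_ghz=1.8)
print(f"{a64fx_peak:.2f} TFLOPS double precision")
```

Under those assumptions the result lands slightly above 2.7 TFLOPS, which is consistent with the quoted claim; a higher clock would scale it linearly.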

The text you linked to actually says that the SVE was developed cooperatively by Fujitsu and ARM, without, however, going into details about who did what.
There are rumors floating around that the A64FX is basically a SPARC design with an ARM ISA, without much ARM IP in it. No idea how accurate that is, but it's intriguing.
So looking at AnandTech's breakdown, the CPUs are closer to a Knights Landing 'CPU/GPU' than to a traditional CPU (currently). They also have a ton of HBM2 right next to the dies, so this should be insanely fast: they can feed those cores very quickly regardless of how fast each core is by clock and pipeline. That should massively reduce stalls.
The "traditional CPU" portion of the core is a bit more capable than KNL, but yeah, that's roughly accurate.
Oh agreed, but honestly what makes this so interesting is how tuned it is. I'm honestly surprised we haven't seen Intel or AMD ship an HPC CPU with on-package HBM2 yet.
Besides FLOP/Watt, what's also very interesting here is the FLOP/Byte ratio (compute relative to memory bandwidth). It has kept the same balance as the K computer, i.e. it is geared toward scientific workloads and not just benchmarks (duh, but worth pointing out here, as it makes this machine quite special, especially compared to Xeon-based clusters; Intel IMO has dropped the ball on bandwidth over the last 5 years or so).
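
A rough way to make the FLOP/Byte point concrete is a roofline-style estimate. This is only a sketch: the 2700 GFLOPS peak comes from the "over 2.7 TFLOPS" quote above, the 1024 GB/s HBM2 bandwidth is an assumption from commonly cited A64FX specs, and the "DDR node" numbers are entirely hypothetical, chosen just for contrast.

```python
def bytes_per_flop(bandwidth_gb_s, peak_gflops):
    """Machine balance: bytes of memory traffic available per peak FLOP."""
    return bandwidth_gb_s / peak_gflops


def attainable_gflops(intensity, bandwidth_gb_s, peak_gflops):
    """Roofline bound for a kernel with `intensity` FLOPs per byte moved."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


# (bandwidth GB/s, peak GFLOPS); the second entry is a made-up comparison node.
machines = {
    "A64FX (assumed 1024 GB/s HBM2)": (1024.0, 2700.0),
    "hypothetical DDR node": (250.0, 3000.0),
}

# STREAM-triad-like kernel a[i] = b[i] + s * c[i]:
# 2 FLOPs per 24 bytes of double-precision traffic (write-allocate ignored).
triad_intensity = 2.0 / 24.0

for name, (bw, peak) in machines.items():
    print(name,
          f"balance={bytes_per_flop(bw, peak):.2f} B/FLOP",
          f"triad bound={attainable_gflops(triad_intensity, bw, peak):.0f} GFLOPS")
```

For a bandwidth-bound kernel like this, the bound is set almost entirely by memory bandwidth rather than peak FLOPS, which is why the balance matters more for many scientific codes than the headline number.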
As an early user of KNL, I don't get the "GPU" bit. KNL runs normal x86_64 code and doesn't look that much different to the AMD Interlagos systems I once used apart from the memory architecture.
It comes from the fact that KNL came from Larrabee, which was initially developed as a GPU (and even ran games... sort of) but was never actually released. The next revision of that was the Xeon Phi chips you used. So the connection is "Lots of small cores with lots of high bandwidth RAM", although these cores are definitely superscalar, whereas Larrabee and its derivatives were not really.

https://en.wikipedia.org/wiki/Xeon_Phi https://en.wikipedia.org/wiki/Larrabee_(microarchitecture)

Sure, but people don't normally think of "GPU" in this context, as it just runs all your x86_64 code.
That's pretty cool! That probably means that applications will have an easier time. Looks like it has 512-bit SIMD.

I wonder what BLAS they are using, and if the contributions are open sourced.

(SVE isn't 512-bit SIMD like AVX512.) I don't know what BLAS they're using, though I know they've long worked on their own, but BLIS has gained SVE support recently, for what it's worth.
SVE is whatever width the chip designer wants; Fujitsu's implementation is 512-bit according to AnandTech.
I know, but the difference goes beyond just coming in different hardware widths, as ARM techies will gush.
Yes, SVE, like the RISC-V vector extension, is a "real" vector ISA, with things like a vector length register (no need for a scalar loop epilogue), scatter/gather memory ops for sparse matrix work, mask registers for if-conversion, and looser alignment requirements (no/less need for loop prologues).

That being said, apart from becoming wider, AVX-NNN has also gotten more "real" vector features with every generation. The difference might not be as huge anymore.
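
The "no scalar loop epilogue" point can be illustrated with a small model. This is plain Python, not real SVE intrinsics: `whilelt` is a hypothetical helper named after SVE's WHILELT instruction (SVE expresses the tail through a predicate, whereas RISC-V uses a vector length register), and `lanes` stands in for the number of elements per hardware vector, e.g. 8 doubles at 512 bits.

```python
def whilelt(start, n, lanes):
    """Build a predicate: lane k is active while start + k < n."""
    return [start + k < n for k in range(lanes)]


def vla_axpy(a, x, y, lanes):
    """Compute a*x + y in vector-sized steps, masking the tail instead of
    falling back to a separate scalar cleanup loop."""
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)   # all-true except possibly in the last step
        for k, active in enumerate(pred):
            if active:                # inactive lanes touch no memory
                out[i + k] = a * x[i + k] + out[i + k]
        i += lanes
    return out


x = [float(i) for i in range(11)]     # length is not a multiple of any width below
y = [1.0] * 11

# The same code runs unchanged for different (modelled) hardware vector widths.
results = [vla_axpy(2.0, x, y, lanes) for lanes in (2, 8, 32)]
print(results[0] == results[1] == results[2])
```

The loop body never mentions a fixed width, which is the property that lets one binary run on implementations with different vector lengths.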

I am really happy to have come across this post, mainly due to this fact.