by adrian_b 1478 days ago
Besides being the first system to exceed the 1 Exaflop/s threshold, what is more impressive is that it is also the system with the highest ratio of computational speed to power consumption (i.e. the AMD devices hold first place in both Top500 and Green500).

The AMD GPUs with the CDNA ISA have surpassed in energy efficiency both the NVIDIA A100 GPUs and the Fujitsu ARM with SVE CPUs, which had been the best previously.

Unfortunately, AMD has stopped selling GPUs suitable for double-precision computations at retail.

Until 5 or 6 years ago, the AMD GPUs were neither the fastest nor the most energy-efficient, but they had by far the best performance per dollar of any devices that could be used for double-precision floating-point computations.

However, when they made the transition to RDNA, they separated their gaming and datacenter GPUs. The former are useless for DP computations and the latter cannot be bought by individuals or small companies.


> The former are useless for DP computations

Looking at the “double-precision GFlops” columns there [1], they don’t seem terribly bad: more than twice as fast as similar nVidia chips [2].

While specialized, extremely expensive GPUs from both vendors are way faster, with many TFlops of FP64 compute throughput, I wouldn’t call high-end consumer GPUs useless for FP64 workloads.

The compute speed is not terribly bad, and due to some architectural features (ridiculously high RAM bandwidth, RAM latency hiding by switching threads) in my experience they can still deliver a large win compared to CPUs of comparable prices, even in FP64 tasks.

[1] https://en.wikipedia.org/wiki/Radeon_RX_6000_series#Desktop

[2] https://en.wikipedia.org/wiki/GeForce_30_series#GeForce_30_(...

"Useless" means that both DP Gflops/s/W and DP Gflops/s/$ are worse for the modern AMD and NVIDIA gaming GPUs than for many CPUs, so the latter are a better choice for such computations.

The opposite relationship between many AMD GPUs and the available CPUs was true until 5-6 years ago. NVIDIA had reduced the DP computation abilities of their non-datacenter GPUs many years before AMD, despite their previous aggressive claims about GPGPU being the future of computation, which eventually proved to be true only for companies and governments with exceedingly deep pockets.

My desktop PC has a Ryzen 7 5700G; on paper it can do 486 GFlops FP64 (8 cores at 3.8 GHz base frequency, two 4-wide FMAs every cycle). However, that would require 2 TB/sec of memory bandwidth, while the actual figure is 51 GB/sec. For large computational tasks where the source data doesn’t fit in caches, the CPU can only achieve a small fraction of the theoretical peak performance because it is bottlenecked by memory.
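
For anyone who wants to check the arithmetic in the comment above, here is a rough sketch in Python. The core count, clock, FMA width and the 51 GB/sec figure come from the comment; the "one fresh 8-byte double from RAM per FMA" rule is my assumption about how the roughly 2 TB/sec requirement was derived, and the variable names are just illustrative.

```python
# Back-of-envelope peak FP64 throughput for the CPU described above.
cores = 8
clock_hz = 3.8e9            # base frequency
fma_units_per_core = 2      # "two 4-wide FMAs every cycle"
doubles_per_vector = 4      # 256-bit vector = 4 doubles
flops_per_fma = 2           # one multiply + one add

peak_flops = (cores * clock_hz * fma_units_per_core
              * doubles_per_vector * flops_per_fma)

# Assumption (not stated in the comment): every scalar FMA needs one
# new 8-byte double streamed from main memory.
bytes_per_fma = 8
required_bandwidth = peak_flops / flops_per_fma * bytes_per_fma

actual_bandwidth = 51e9     # bytes/sec, figure from the comment
fraction = actual_bandwidth / required_bandwidth

print(f"peak FP64:          {peak_flops / 1e9:.1f} GFlops")
print(f"required bandwidth: {required_bandwidth / 1e12:.2f} TB/sec")
print(f"fraction reachable: {fraction:.1%}")
```

Under that assumption, streaming workloads would see only a few percent of the paper peak on this CPU.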

The memory in graphics cards is an order of magnitude faster; my current one has 480 GB/sec of bandwidth. For this reason, even gaming GPUs can be much faster than CPUs on some workloads, even though the theoretical peak FP64 GFlops number is about the same.
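
The argument above is essentially a roofline estimate: attainable speed is the smaller of the compute peak and bandwidth times flops-per-byte. A minimal sketch, assuming the same 486.4 GFlops peak for both devices (the comment only says "about the same") and a hypothetical kernel doing 2 flops per 8-byte double read; the function name is made up for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

PEAK_GFLOPS = 486.4   # assumed equal for CPU and GPU, per "about the same"
CPU_BANDWIDTH = 51.0  # GB/sec, from the thread
GPU_BANDWIDTH = 480.0 # GB/sec, from the thread

# Hypothetical streaming kernel: 2 flops for every 8 bytes read.
intensity = 2 / 8

cpu = attainable_gflops(PEAK_GFLOPS, CPU_BANDWIDTH, intensity)
gpu = attainable_gflops(PEAK_GFLOPS, GPU_BANDWIDTH, intensity)

print(f"CPU: {cpu:.2f} GFlops, GPU: {gpu:.2f} GFlops, ratio: {gpu / cpu:.1f}x")
```

With these numbers both devices sit on the memory roof, so the ratio is just the bandwidth ratio. A kernel with much higher intensity would hit the compute roof on both and the gap would close.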

You are right that there are problems whose solving speed is limited by the memory bandwidth, and for such problems GPUs may be better than CPUs.

Nevertheless, many of the problems of this kind require more memory than the 8 GB or 16 GB that are available on cheap GPUs, so the CPUs remain better for those.

On the other hand, there are a lot of problems whose time-consuming part can be reduced to multiplications of dense matrices. During the solution of all such problems, the CPUs will reach a large fraction of their maximum computational speed, regardless of whether the operands fit in the caches or not (when they do not fit, the operations can be decomposed into sub-operations on cache-sized blocks, and in such algorithms the cache lines are reused enough times that the time used for transfers does not matter).
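
The decomposition into cache-sized blocks mentioned above can be sketched like this. It is pure Python to show the loop structure only (a real implementation would call an optimized BLAS), and the function names and the tile-traffic model in `block_intensity` are my own simplifications, not anything from the thread.

```python
def matmul_blocked(a, b, block=2):
    """C = A * B computed tile by tile; `block` stands in for a cache-sized tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Work on one tile of A, one of B and one of C at a time.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ik * row_b[j]
    return c

def block_intensity(block, bytes_per_double=8):
    """Flops per byte for one tile update: 2*b^3 flops over three b-by-b tiles."""
    return (2 * block ** 3) / (3 * block ** 2 * bytes_per_double)

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b))
print(block_intensity(48))  # intensity grows linearly with the tile size
```

The point is that flops per byte grow with the tile edge, so a large enough tile pushes the kernel off the memory roof, which is why dense matrix multiplication behaves differently from streaming workloads.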

I guess I was lucky with the CAM/CAE software I’m working on. We don’t have too many GB of data, the stuff fits in VRAM of inexpensive consumer cards.

One typical problem is multiplying a dense vector by a sparse matrix. Unlike multiplication of two dense matrices, I don’t think it’s possible to decompose it into manageable pieces which would fit into caches to saturate the FP64 math of the CPU cores.
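
To illustrate why this kind of product tends to stay memory-bound, here is a minimal sketch of y = A·x with A in CSR layout. CSR and the 4-byte index size are assumptions on my part (the comment does not say which storage format is used), and the traffic estimate ignores reads of x and row_ptr.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A * x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Each stored nonzero is read exactly once: no reuse to exploit.
        for p in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[p] * x[col_idx[p]]
        y[row] = acc
    return y

def csr_intensity(bytes_per_value=8, bytes_per_index=4):
    """Rough flops per byte: 2 flops per nonzero over its value plus column index."""
    return 2 / (bytes_per_value + bytes_per_index)

# A = [[10,  0,  0],
#      [ 0, 20, 30],
#      [40,  0, 50]]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
col_idx = [0, 1, 2, 0, 2]
row_ptr = [0, 1, 3, 5]
x = [1.0, 2.0, 3.0]

print(csr_matvec(values, col_idx, row_ptr, x))
print(f"{csr_intensity():.3f} flops/byte")
```

Because every matrix entry is touched once per product, the flops-per-byte figure is fixed by the storage format and does not improve with blocking, unlike the dense case.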

We have tested our software on nVidia Teslas in a cloud (the expensive ones with many theoretical TFlops of FP64 compute), and the performance wasn’t too impressive.

SP is sixteen times the performance of DP here for no other reason than market segmentation. Nvidia might have started that, but that's no reason not to call AMD out for it.

fp32 uses much less silicon and power than fp64. I think the scaling is roughly quadratic in both, so 4x performance is free.

I vaguely remember a consumer card having 1/4 the fp64 units of a similar data center one so that would get the 16x on paper.

Memory bandwidth / register file size would suggest another 2x from moving less data. My working heuristic on these things is that compute is free because I fail to saturate the memory bus, but no doubt some applications do actually run into that slowdown in practice. Matrix multiply probably does.

Computational speed is important, but more important is the data transfer speed. At least in ML. Is AMD the best for data transfer speed?