by TomVDB 1366 days ago
AMD's decision to have different architectures for gaming and datacenter is still a major mystery. It's clear from Nvidia's product line that there's no reason to do so. (And, yes, Hopper and Ada are different names, but there was nothing in today's announcement that makes me believe that Ada and Hopper are a bifurcation in core architecture.)

Moreover, CDNA is not a new architecture, but just a rebranding of GCN.

CDNA 1 had few changes over the previous GCN variant, except for the addition of matrix operations, which have double the throughput of the vector operations, as NVIDIA had done before (the so-called "tensor" cores of NVIDIA GPUs).

CDNA 2 had more important changes, with the double-precision operations becoming the main operations around which the compute units are structured, but the overall structure of the compute units has remained the same as in the first GCN GPUs from 2012.

The changes made in RDNA vs. GCN/CDNA would have been as useful in scientific computing applications as they are in the gaming GPUs and RDNA is also defined to potentially have fast double-precision operations, even if no such RDNA GPU has been designed yet.

I suppose that the reason why AMD has continued with GCN for the datacenter GPUs was their weakness in software development. Until today ROCm and the other AMD libraries and software tools for GPU computational applications have good support only for GCN/CDNA GPUs, while the support for RDNA GPUs was non-existent in the beginning and very feeble now.

So I assume that they have kept GCN rebranded as CDNA for datacenter applications because they were not ready to develop appropriate software tools for RDNA.

Some guy on Reddit claiming to be an AMD engineer was telling me a year or so ago that RDNA took up 30% more area per FLOP than GCN / CDNA.

That's basically the reason for the split. Video game shaders need the latency improvements from RDNA (particularly the cache, but also the pipeline-level latency improvements: an instruction completes every clock rather than once every 4 clocks as on GCN).

But supercomputers care more about bandwidth. The once-every-4-clocks design of GCN/CDNA is far denser and more power efficient.

GCN/CDNA is denser with more FLOPS.

RDNA has more cache and runs with far less latency. Like 1/4th the latency of CDNA/Vega. This makes it faster for video game shaders in practice.

Density, I can accept.

But what kind of latency are we talking about here?

CDNA has 16-wide SIMD units that retire 1 64-wide warp instruction every 4 clock cycles.

RDNA has a 32-wide SIMD unit that retires 1 32-wide warp every clock cycle. (It's uncanny how similar it is to Nvidia's Maxwell and Pascal architectures.)
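To make the two issue models above concrete, here is a minimal Python sketch. The class and field names are made up for illustration; only the widths (16-lane SIMD with 64-wide waves, 32-lane SIMD with 32-wide waves) come from the comment, and the model ignores everything else about the real pipelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimdModel:
    """Toy model of one SIMD unit (hypothetical names, not an AMD API)."""
    name: str
    simd_width: int  # lanes physically executed per clock
    wave_width: int  # work-items covered by one wave instruction

    def issue_cycles(self) -> int:
        # Clocks needed to push one wave instruction through the SIMD
        # (ceiling division of wave width by SIMD width).
        return -(-self.wave_width // self.simd_width)

    def lanes_per_clock(self) -> float:
        # Work-items completed per clock by this one SIMD unit.
        return self.wave_width / self.issue_cycles()


gcn = SimdModel("GCN/CDNA", simd_width=16, wave_width=64)
rdna = SimdModel("RDNA", simd_width=32, wave_width=32)

for unit in (gcn, rdna):
    print(unit.name, unit.issue_cycles(), unit.lanes_per_clock())
```

In this model a single wave on GCN/CDNA needs 4 clocks per instruction while a wave on RDNA needs 1, which is the per-instruction latency gap being discussed, separate from any memory latency.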

Your 1/4 number makes me think that you're talking about a latency that has nothing to do with reads from memory, but with the rate at which instructions are retired? Or does it have to do with the depth of the instruction pipeline? As long as there's sufficient occupancy, a latency difference of a few clock cycles shouldn't mean anything in the context of a thousand-clock-cycle latency for accessing DRAM?
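The occupancy argument can be sketched with a toy latency-hiding model in Python. Everything here is an assumption for illustration: the function name, the idea that each wave alternates a fixed burst of ALU work with one memory wait, and the example values of 8 or 32 resident waves and 48 ALU cycles. Only the "thousand clock cycle" figure is taken from the comment.

```python
def alu_utilization(resident_waves: int, alu_cycles: int, mem_latency: int) -> float:
    """Fraction of clocks the ALU is busy in a simple round-robin model.

    Each wave does `alu_cycles` of work, then waits `mem_latency` cycles.
    With enough resident waves the waits overlap and utilization caps at 1.0.
    """
    busy = resident_waves * alu_cycles
    period = alu_cycles + mem_latency
    return min(1.0, busy / period)


# A few extra pipeline cycles on top of ~1000 cycles of DRAM latency
# barely move the result at fixed occupancy.
base = alu_utilization(resident_waves=8, alu_cycles=48, mem_latency=1000)
plus3 = alu_utilization(resident_waves=8, alu_cycles=48, mem_latency=1003)
print(round(base, 3), round(plus3, 3))

# With enough waves in flight the memory latency is fully hidden.
print(alu_utilization(resident_waves=32, alu_cycles=48, mem_latency=1000))
```

Under these assumptions a 3-cycle change in latency shifts utilization by well under one percent, which is the point of the question. A change of several hundred cycles is a different matter when occupancy is low.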

> thousand clock cycle latency for accessing DRAM?

That's what's faster.

Vega64 accesses HBM in like 500 nanoseconds. (https://www.reddit.com/r/ROCm/comments/iy2rfw/752_clock_tick...)

RDNA2 accesses GDDR6 in like 200 nanoseconds. (https://www.techpowerup.com/281178/gpu-memory-latency-tested...)

EDIT: So it looks like my memory was bad. I could have sworn RDNA2 was faster (maybe I was thinking of the faster L1/L2 caches of RDNA?). Either way, it's clear that Vega/GCN has much, much worse memory latency. I've updated the numbers above and also edited this post a few times as I looked stuff up.

Thanks for that.

The weird part is that this latency difference has to be due to a terrible memory controller (MC) design by AMD, because there's not a huge difference in latency between any of the current DRAM technologies: the interface between HBM and GDDR (and regular DDR) is different, but the underlying method of accessing the data is similar enough for the access latency to be very similar as well.

Or... supercomputer users don't care about latency in GCN/CDNA applications.

500ns to access main memory, and lol 120 nanoseconds to access L1 cache is pretty awful. CPUs can access RAM with less latency than Vega/GCN needs to access its L1 cache. Indeed, RDNA's main-memory access is approaching Vega/GCN's L2 latency.

----------

This has to be an explicit design decision on behalf of AMD's team to push GFLOPS higher and higher. But as I stated earlier: video game programmers want lower latency on their shaders. "More like NVidia", as you put it.

Seemingly, the supercomputer market is willing to put up with these bad latency scores.

But why would game programmers care about shader core latency??? I seriously don't understand.

We're not talking here about the latency that gamers care about, the one that's measured in milliseconds.

I've never seen any literature that complained about load/store access latency in the shader core. It's just so low level...

Nvidia has been making different architectures for gaming and datacenter for a few generations now. Volta and Turing, Ampere and Ampere (same name, different architectures on different nodes). And Hopper and Lovelace are different architectures. SMs are built differently: different cache amounts, different numbers of shading units per SM, different FP16/FP32 rate ratios, no RT cores in Hopper, and I can go on and on. They are different architectures where some elements are the same.

No, the NVIDIA datacenter and gaming GPUs do not have different architectures.

They have some differences besides the different set of implemented features, e.g. ECC memory or FP64 speed, which are caused much less by their target market than by the offset in time between their designs, which gives the opportunity to add more improvements in whichever comes later.

The architectural differences between NVIDIA datacenter and gaming GPUs of the same generation are much less than between different NVIDIA GPU generations.

This can clearly be seen in the CUDA version numbers, which correspond to lists of implemented features.

For example, datacenter Volta is 7.0, automotive Volta is 7.2 and gaming Turing is 7.5, while different versions of Ampere are 8.0, 8.6 and 8.7.

The differences between any Ampere and any Volta/Turing are larger than between datacenter Volta and gaming Turing, or between datacenter Ampere and gaming Ampere.

The differences between two successive NVIDIA generations can be as large as between AMD CDNA and RDNA, while the differences between datacenter and gaming NVIDIA GPUs are less than between two successive generations of AMD RDNA or AMD CDNA.

I don't agree.

Turing is an evolution of Volta. In fact, in the CUDA slides of Turing, they mention explicitly that Turing shaders are binary compatible with Volta, and that's very clear from the whitepapers as well.

Ampere A100 and Ampere GeForce have the same core architecture as well.

The only differences are in HPC features (MIG, ECC), FP64, the beefiness of the tensor cores, and the lack of RTX cores on HPC units.

The jury is still out on Hopper vs Lovelace. Today's presentation definitely points to a similar difference as between A100 and Ampere GeForce.

It's more: the architectures are the same with some minor differences.

You can also see this with the SM feature levels:

Volta: SM 70, Turing SM 75

Ampere: SM 80 (A100) and SM 86 (GeForce)
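A small Python sketch of how those SM numbers group, assuming the usual reading of "SM 75" as major version 7, minor version 5. The dictionary and helper name are hypothetical; the values are the ones quoted in this thread.

```python
# SM feature levels as quoted above (labels are informal, not NVIDIA's names).
SM_LEVELS = {
    "Volta": 70,
    "Turing": 75,
    "Ampere A100": 80,
    "Ampere GeForce": 86,
}


def split_sm(sm: int) -> tuple:
    """Split an SM number such as 75 into (major, minor) = (7, 5)."""
    return divmod(sm, 10)


# Group chips by major version: same major means the same generation,
# the minor digit distinguishes variants within it.
by_major = {}
for chip, sm in SM_LEVELS.items():
    major, minor = split_sm(sm)
    by_major.setdefault(major, []).append((chip, minor))

for major, chips in sorted(by_major.items()):
    print(major, chips)
```

Volta and Turing land in the same major group, as do A100 and GeForce Ampere. Whether a shared major number counts as "the same architecture" is exactly what is being argued here; the sketch only shows how the numbering groups them.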

Turing is an evolution of Volta, but they are different architectures.

A100 and GA102 DO NOT have the same core architecture. 192KB of L1 cache in an A100 SM, 128KB in a GA102 SM. That already means that it is not the same SM. And there are other differences. For example, Volta introduced a second datapath that could process one INT32 instruction in addition to floating point instructions. This datapath was upgraded in GA102 so now it can handle FP32 instructions as well (not FP16, only the first datapath can process them). A100 doesn't have this improvement, which is why we see such a drastic (basically 2x) difference in FP32 flops between A100 and GA102. It is not a "minor difference", and neither is the huge difference in L2 cache (40MB vs 6MB). It's a different architecture on a different node designed by a different team.
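The two-datapath difference can be put into a toy Python model. This is a sketch under stated assumptions, not NVIDIA's scheduling logic: the function name is made up, the 64 lanes per datapath per SM is my recollection of the public whitepapers, and the model assumes instruction issue is never the bottleneck.

```python
def fp32_per_clock(int_fraction: float, second_path_does_fp32: bool,
                   lanes_per_path: int = 64) -> float:
    """FP32 lane-operations per clock per SM for a given INT32 share of the mix.

    Path A runs FP32 only. Path B runs INT32 only (A100-style) or
    either INT32 or FP32 (GA102-style). The work is normalized to one lane-op.
    """
    fp_fraction = 1.0 - int_fraction
    if second_path_does_fp32:
        # INT32 must use path B; FP32 can spill onto whatever B has left.
        clocks = max(int_fraction / lanes_per_path, 1.0 / (2 * lanes_per_path))
    else:
        # Each instruction type is confined to its own path.
        clocks = max(int_fraction / lanes_per_path, fp_fraction / lanes_per_path)
    return fp_fraction / clocks


for share in (0.0, 0.25, 0.5):
    a100_style = fp32_per_clock(share, second_path_does_fp32=False)
    ga102_style = fp32_per_clock(share, second_path_does_fp32=True)
    print(share, a100_style, ga102_style)
```

With no integer work the GA102-style SM reaches twice the FP32 rate of the A100-style SM, which matches the "basically 2x" above. As the INT32 share grows, the shared datapath is taken up by integer instructions and the gap narrows, closing entirely at a 50% share in this model.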

GP100 and GP GeForce have a different shared memory structure as well, so much so that GP100 was listed as having 30 SMs instead of 60 in some Nvidia presentations. But the base architecture (ISA, instruction delays, …) was the same.

It’s true that GA102 has double the FP32 units, but the way they work is very similar to the way SMs have 2x FP16, in that you need to go out of your way to benefit from them. Benchmarks show this as well.

I like to think that Nvidia’s SM version nomenclature is a pretty good hint, but I guess it just boils down to personal opinion about what constitutes a base architecture.

AMD as well. The main difference being that Nvidia kills you big time with the damn licensing (often more expensive than the very pricey card itself) while AMD does not. Quite unfortunate we do not have more budget options for these types of cards, as it would be pretty cool to have a bunch of VMs or containers with access to "discrete" graphics.

Nvidia's datacenter product licensing costs are beyond onerous, but even worse to me is that their license server (both its on-premise and cloud version) is fiddly and sometimes just plain broken. Losing your license lease makes the card go into super low performance hibernation mode, which means that dealing with the licensing server is not just about maintaining compliance -- it's about keeping your service up.

It's a bit of a mystery to me how anyone can run a high availability service that relies on Nvidia datacenter GPUs. Even if you somehow get it all sorted out, if there was ANY other option I would take it.