| HN Mirror


by TomVDB 1366 days ago

Density, I can accept.

But what kind of latency are we talking about here?

CDNA has 16-wide SIMD units that each retire one 64-wide warp instruction every 4 clock cycles.

RDNA has a 32-wide SIMD unit that retires one 32-wide warp instruction every clock cycle. (It's uncanny how similar it is to Nvidia's Maxwell and Pascal architectures.)
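The two issue rates above fall out of simple division: warp width over SIMD width. A minimal sketch of that arithmetic, where the function name is made up for illustration and the widths are the ones quoted in this comment:

```python
from math import ceil

def cycles_per_warp_instruction(warp_width: int, simd_width: int) -> int:
    """Cycles one SIMD needs to push every lane of a warp through once."""
    return ceil(warp_width / simd_width)

# Widths as quoted in the comment above.
cdna_cycles = cycles_per_warp_instruction(warp_width=64, simd_width=16)
rdna_cycles = cycles_per_warp_instruction(warp_width=32, simd_width=32)

print(cdna_cycles, rdna_cycles)  # 4 1
```

Under this view the "1/4" is an issue cadence per warp instruction, not a memory latency.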

Your 1/4 number makes me think that you're talking about a latency that has nothing to do with reads from memory, but with the rate at which instructions are retired? Or does it have to do with the depth of the instruction pipeline? As long as there's sufficient occupancy, a latency difference of a few clock cycles shouldn't mean anything in the context of a thousand-clock-cycle latency for accessing DRAM?
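The occupancy argument can be put into numbers with a toy round-robin model. This is only a sketch under loud assumptions: each warp issues a fixed number of busy cycles, then stalls on one memory access, and the SIMD switches warps for free. The function name and the 50-cycle figure are hypothetical, not measured values.

```python
from math import ceil

def warps_to_hide_latency(latency_cycles: int, busy_cycles_per_warp: int) -> int:
    """Warps needed so the SIMD always has one ready to issue.

    One warp is issuing while the others wait out the memory latency,
    so we need enough other warps to cover latency_cycles of waiting.
    """
    return 1 + ceil(latency_cycles / busy_cycles_per_warp)

# Assumed: 50 busy cycles between memory accesses (illustrative only).
warps_at_1000 = warps_to_hide_latency(latency_cycles=1000, busy_cycles_per_warp=50)
warps_at_400 = warps_to_hide_latency(latency_cycles=400, busy_cycles_per_warp=50)

print(warps_at_1000, warps_at_400)  # 21 9
```

In this model a handful of pipeline cycles is noise next to a DRAM access, but a large change in memory latency changes the occupancy required quite a lot.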


dragontamer 1366 days ago

> thousand clock cycle latency for accessing DRAM?

That's what's faster.

Vega64 accesses HBM in like 500 nanoseconds. (https://www.reddit.com/r/ROCm/comments/iy2rfw/752_clock_tick...)

RDNA2 accesses GDDR6 in like 200 nanoseconds. (https://www.techpowerup.com/281178/gpu-memory-latency-tested...)

EDIT: So it looks like my memory was bad. I could have sworn RDNA2 was faster (maybe I was thinking of the faster L1/L2 caches of RDNA?). Either way, it's clear that Vega/GCN has much, much worse memory latency. I've updated the numbers above and also edited this post a few times as I looked stuff up.
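To compare these nanosecond figures with the "thousand clock cycle" estimate, they need converting to cycles. A rough sketch follows; the clock speeds are assumptions picked for illustration, not numbers taken from the linked posts.

```python
def ns_to_cycles(latency_ns: float, clock_ghz: float) -> int:
    """Convert nanoseconds to clock cycles (at 1 GHz, 1 cycle takes 1 ns)."""
    return round(latency_ns * clock_ghz)

# ASSUMED shader clocks, for illustration only.
ASSUMED_VEGA64_GHZ = 1.5
ASSUMED_RDNA2_GHZ = 2.0

vega64_cycles = ns_to_cycles(500, ASSUMED_VEGA64_GHZ)
rdna2_cycles = ns_to_cycles(200, ASSUMED_RDNA2_GHZ)

print(vega64_cycles, rdna2_cycles)  # 750 400
```

With those assumed clocks, both figures land in the hundreds of cycles rather than a full thousand.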


TomVDB 1366 days ago

Thanks for that.

The weird part is that this latency difference has to be due to a terrible memory controller (MC) design by AMD, because there's not a huge difference in latency between any of the current DRAM technologies: the interface differs between HBM and GDDR (and regular DDR), but the underlying method of accessing the data is similar enough for the access latency to be very similar as well.


dragontamer 1366 days ago

Or... supercomputer users don't care about latency in GCN/CDNA applications.

500ns to access main memory, and lol 120 nanoseconds to access L1 cache, is pretty awful. CPUs can access RAM with lower latency than Vega/GCN can access its L1 cache. Indeed, RDNA's main-memory access is approaching Vega/GCN's L2 latency.

----------

This has to be an explicit design decision on the part of AMD's team to push GFLOPS higher and higher. But as I stated earlier: video game programmers want lower latency on their shaders. "More like NVidia", as you put it.

Seemingly, the supercomputer market is willing to put up with these bad latency scores.


TomVDB 1366 days ago

But why would game programmers care about shader core latency??? I seriously don't understand.

We're not talking here about the latency that gamers care about, the one that's measured in milliseconds.

I've never seen any literature that complained about load/store access latency in the shader core. It's just so low level...


dragontamer 1366 days ago

> But why would game programmers care about shader core latency??? I seriously don't understand.

Well, I don't know per se. What I can say is that the various improvements AMD made to RDNA did the following:

1. Barely increased TFLOPs -- Especially compared to CDNA, it is clear that RDNA has fewer FLOPs

2. Despite #1, improved gaming performance dramatically

--------

When we look at RDNA, we can see that many, many latency numbers improved (though throughput numbers, like TFLOPs, aren't that much better than Vega 7). It's clear that the RDNA team did some kind of analysis into the kinds of shaders that are used by video game programmers, and tailored RDNA to match them better.

> I've never seen any literature that complained about load/store access latency in the shader core. It's just so low level...

Those are just things I've noticed about the RDNA architecture. Maybe I'm latching onto the wrong things here, but... it's clear that RDNA was aimed at the gaming workload.

Perhaps modern shaders are no longer just brute-force vertex/pixel style shaders, but are instead doing far more complex things. These more complicated shaders could be more latency bound rather than TFLOPs bound.
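One way to picture "latency bound rather than TFLOPs bound" is a toy model where a batch of warps takes whichever is longer: the total ALU work, or one warp's chain of dependent loads. This is a sketch under heavy assumptions (ALU work serializes across warps, dependent loads within a warp cannot overlap, warp switching is free); every number and name below is hypothetical.

```python
def batch_cycles(alu_cycles: int, dependent_loads: int,
                 latency_cycles: int, warps: int) -> int:
    """Cycles for `warps` identical warps to finish on one SIMD (toy model)."""
    compute_bound = warps * alu_cycles                            # ALU time summed over warps
    latency_bound = dependent_loads * latency_cycles + alu_cycles  # one warp's critical path
    return max(compute_bound, latency_bound)

# Hypothetical shader: 200 ALU cycles, 4 dependent loads, 8 warps in flight.
slow_memory = batch_cycles(200, 4, latency_cycles=750, warps=8)
fast_memory = batch_cycles(200, 4, latency_cycles=400, warps=8)
cached = batch_cycles(200, 4, latency_cycles=100, warps=8)

print(slow_memory, fast_memory, cached)  # 3200 1800 1600
```

In this model the first two cases are limited by the load chain, so lower latency helps directly; only the third is limited by ALU throughput, where extra FLOPs would matter.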
