by paulmd
4156 days ago
It's somewhat misleading to look at an arithmetic average of the bandwidth of the fast and slow segments. Due to the way they architected it, you cannot access both the fast segment and the slow segment during the same memory-fetch cycle; it's either/or. If control flow depends on data that's stuck in the slow segment, performance could be significantly degraded. Now, blah blah prediction, blah blah heuristics, yadda yadda. If you don't fill the memory (which mostly takes compute, 4K, etc.) there's no problem, and even then you can optimize the problem away somewhat. This will work pretty well for AAA-grade game engines that get special attention: Unreal, CryEngine, Unity. But for memory-bound (especially latency-sensitive) compute applications, what you have here is a 3.5GB card, not a 4GB card. Regardless, having a card show up with 1/8th of its specced memory units turned off is not acceptable.
Looking at it on a 2-cycle basis: since the fast segment is 7x as fast as the slow one, in two cycles you can access either (7+7) or (7+1) chunks of memory. That's a 43% drop in throughput (8 chunks instead of 14) if even 1 of the 32 threads in a warp consistently needs to touch the slow segment.
If that data is used for control flow, the problem gets amplified, of course, since latency will double.
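
For anyone who wants to check the arithmetic, here is a rough Python sketch of that 2-cycle model. It is a toy under stated assumptions, not a description of the real memory controller: the relative units (7 chunks per fast cycle, 1 per slow cycle) come from the 7x figure above, and the helper names `chunks_moved` and `relative_drop` are made up for illustration.

```python
# Toy model of the either/or access pattern: each cycle hits EITHER the fast
# segment OR the slow segment, never both.
# Assumption: relative units only, taken from the 7x ratio in the comment.
FAST_CHUNKS = 7  # chunks moved in a cycle that reads the fast segment
SLOW_CHUNKS = 1  # chunks moved in a cycle that reads the slow segment


def chunks_moved(slow_cycles: int, total_cycles: int) -> int:
    """Chunks transferred when `slow_cycles` of `total_cycles` hit the slow segment."""
    if not 0 <= slow_cycles <= total_cycles:
        raise ValueError("slow_cycles must be between 0 and total_cycles")
    fast_cycles = total_cycles - slow_cycles
    return fast_cycles * FAST_CHUNKS + slow_cycles * SLOW_CHUNKS


def relative_drop(slow_cycles: int, total_cycles: int) -> float:
    """Fraction of throughput lost versus spending every cycle on the fast segment."""
    best_case = total_cycles * FAST_CHUNKS
    return 1 - chunks_moved(slow_cycles, total_cycles) / best_case


# One slow access in every two cycles: (7+1) chunks instead of (7+7).
print(chunks_moved(0, 2), chunks_moved(1, 2))  # 14 8
print(f"{relative_drop(1, 2):.1%}")            # 42.9%
```

Changing the ratio of slow to total cycles shows how quickly the loss shrinks when the slow segment is touched only occasionally, which is why the "consistently" part of the claim matters.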