by GregarianChild 377 days ago
Modern GPU instructions are often VLIW, and the compiler has to do a lot of work to schedule them. For example, Nvidia's Volta (from 2017) uses 128 bits to encode each instruction. According to [1], the 128 bits in a word are used as follows:

• at least 91 bits are used to encode the instruction

• at least 23 bits are used to encode control information associated with that instruction

• the remaining 14 bits appear to be unused
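To make the breakdown concrete, here is a minimal Python sketch that splits a 128-bit word into three fields with the widths listed above. The widths come from the list (and two of them are only lower bounds, "at least"); the order of the fields inside the word and the names `split_word` and `pack_word` are assumptions for illustration, not the actual Volta encoding.

```python
# Field widths as quoted above. The position of each field inside the
# word is an assumption for illustration, not the real Volta layout.
INSTRUCTION_BITS = 91
CONTROL_BITS = 23
UNUSED_BITS = 14
WORD_BITS = INSTRUCTION_BITS + CONTROL_BITS + UNUSED_BITS  # 128


def pack_word(instruction: int, control: int, unused: int = 0) -> int:
    """Pack three fields into one 128-bit integer (hypothetical layout)."""
    if instruction >> INSTRUCTION_BITS or control >> CONTROL_BITS or unused >> UNUSED_BITS:
        raise ValueError("field value does not fit in its width")
    return (
        instruction
        | (control << INSTRUCTION_BITS)
        | (unused << (INSTRUCTION_BITS + CONTROL_BITS))
    )


def split_word(word: int) -> dict:
    """Split a 128-bit integer back into the three hypothetical fields."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word must fit in 128 bits")
    return {
        "instruction": word & ((1 << INSTRUCTION_BITS) - 1),
        "control": (word >> INSTRUCTION_BITS) & ((1 << CONTROL_BITS) - 1),
        "unused": word >> (INSTRUCTION_BITS + CONTROL_BITS),
    }
```

The only thing this really demonstrates is the arithmetic: 91 + 23 + 14 = 128, so the three groups account for the whole word.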

AMD GPUs are similar, I believe. VLIW is good for instruction density. VLIW was unsuccessful in CPUs like Itanium because the compiler was expected to handle (unpredictable) memory access latency. Predicting that latency statically is not possible, even today, for largely sequential workloads. But GPUs typically run highly parallel workloads (e.g. MatMul), and the dynamic scheduler can just 'swap out' threads that wait for memory loads. The flip side is that your GPU will perform terribly on highly sequential workloads.
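The 'swap out' effect can be illustrated with a toy model. The sketch below is a deliberately simplified simulation, not a model of any real GPU: it assumes a single issue slot, threads that alternate a fixed number of compute cycles with a load of fixed latency, and a scheduler that picks the first ready thread. The function name `utilization` and all parameters are made up for this example.

```python
def utilization(num_threads: int, compute_cycles: int,
                memory_latency: int, total_cycles: int = 10_000) -> float:
    """Fraction of cycles in which the single issue slot does useful work.

    Each thread repeats: `compute_cycles` cycles of arithmetic, then a load
    that keeps it from issuing for `memory_latency` cycles.
    """
    ready_at = [0] * num_threads                 # first cycle each thread may issue
    remaining = [compute_cycles] * num_threads   # compute cycles left before next load
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Issue one compute cycle from the first ready thread.
                busy += 1
                remaining[t] -= 1
                if remaining[t] == 0:
                    # Thread issues its load and waits; another thread takes over.
                    ready_at[t] = cycle + 1 + memory_latency
                    remaining[t] = compute_cycles
                break
    return busy / total_cycles


for n in (1, 8, 32):
    print(n, utilization(n, compute_cycles=4, memory_latency=100))
```

With these made-up numbers a single thread keeps the issue slot busy only about 4% of the time, while 32 threads keep it busy essentially all the time. A sequential workload corresponds to the single-thread case, which is why it runs so poorly.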

[1] Z. Jia, M. Maggioni, B. Staiger, D. P. Scarpazza, Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. https://arxiv.org/abs/1804.06826


Personally, I have a soft spot for VLIW/EPIC architectures, and I really wish they had been more successful in mainstream computing.

I didn't consider GPUs precisely for the reason you mentioned: they are unsuited to running sequential workloads, which is what most end-user applications are, even though nearly every modern computing contraption in existence has a GPU today.

One most assuredly radical departure from the von Neumann architecture that I completely forgot about is the dataflow CPU architecture, which is vastly different from what we have been using for the last 60+ years. Even though there have been no productionised general-purpose dataflow CPUs, the approach has been successfully implemented for niche applications, mostly in networking. So, circling back to the original point raised, dataflow CPU instructions would certainly qualify as a new design.
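For readers who have not met the idea: in a dataflow machine there is no program counter, and an instruction fires as soon as all of its operands are available. A minimal Python sketch of that firing rule follows. The graph, the node names and the function `run_dataflow` are hypothetical, and a software loop like this says nothing about how real dataflow hardware is built.

```python
import operator

# Hypothetical graph for (a + b) * (a - c): node -> (operation, operand names).
GRAPH = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "c"]),
    "prod": (operator.mul, ["sum", "diff"]),
}


def run_dataflow(graph, inputs):
    """Fire every node whose operands are available until none are left.

    Returns the computed values and the list of firing "waves"; nodes in
    the same wave are independent and could fire in parallel.
    """
    values = dict(inputs)
    pending = dict(graph)
    waves = []
    while pending:
        ready = [name for name, (_, srcs) in pending.items()
                 if all(s in values for s in srcs)]
        if not ready:
            raise ValueError("unsatisfied operands for: " + ", ".join(sorted(pending)))
        for name in ready:
            op, srcs = pending.pop(name)
            values[name] = op(*(values[s] for s in srcs))
        waves.append(sorted(ready))
    return values, waves


values, waves = run_dataflow(GRAPH, {"a": 5, "b": 3, "c": 2})
print(values["prod"], waves)
```

Here `sum` and `diff` fire in the first wave because their inputs are present from the start, and `prod` fires only once both results exist. The execution order falls out of data availability rather than instruction order.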

The reason that VLIW/EPIC architectures have not been successful for mainstream workloads is the combination of

• the "memory wall",

• the static unpredictability of memory access, and

• the lack of sufficient parallelism for masking latency.

Together, those make dynamically scheduling instructions just much more efficient.

Dataflow has been tried many, many, many times for general-purpose workloads. And every time it failed for general-purpose workloads. In the early 2020s I was part of an expensive team doing a blank-slate dataflow architecture for a large semi company: the project got cancelled b/c the performance figures were weak relative to the complexity of the micro-architecture, which was high (hence expensive verification and high area). As one of my colleagues on that team says: "Everybody wants to work on dataflow until he works on dataflow." Regarding the history of dataflow architectures, [1] is from 1975, so half a century old this year.

[1] J. Dennis, A Preliminary Architecture for a Basic Data-Flow Processor. https://courses.cs.washington.edu/courses/cse548/11au/Dennis...