
> Footnote: I am convinced that we'll see a fork (of sorts) of RISC-V for the GBOoO segment.

Yeah, seems likely.

Qualcomm has been trying to push RISC-V to be better for GBOoO after ARM fucked them over with their Nuvia purchase. They have a high performance AArch64 core and no AArch64 licence for it.

They have been pushing to drop the compressed 16-bit instruction extension from the core profile, and have proposed a new extension that improves code density by adding new addressing modes stolen from AArch64.

> For instance, I consider MRISC32 to be a RISC ISA, but an implementation may expand vector instructions into multiple operations that take several clock cycles to complete

ARM Inc takes this approach for vector instructions on their little cores (like the A53).
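To make the "expand into multiple operations" idea concrete, here's a rough sketch of splitting one vector instruction into several narrower operations. The function name, the tuple layout, and the lane counts are all made up for illustration; this is not how any particular ARM core or MRISC32 implementation actually does it.

```python
def crack_vector_op(op, num_elements, lanes):
    """Split one vector instruction into micro-ops that fit the datapath.

    op           -- mnemonic, e.g. "vadd" (hypothetical)
    num_elements -- elements the architectural instruction operates on
    lanes        -- elements the hardware can process per cycle (assumed)

    Returns a list of (op, first_element, end_element) micro-ops,
    each covering at most `lanes` elements.
    """
    return [
        (op, start, min(start + lanes, num_elements))
        for start in range(0, num_elements, lanes)
    ]


if __name__ == "__main__":
    # A 16-element add on a 4-lane datapath becomes 4 micro-ops,
    # so it occupies the vector unit for several cycles.
    for uop in crack_vector_op("vadd", 16, 4):
        print(uop)
```

The ISA still sees a single instruction; only the implementation decides how many cycles it takes.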

> I still think that there is potential in software-aided decoding/translation (heck, most of the software that we run on a daily basis is JIT-translated, so it can't be that bad).

Ironically, the prevalence of JITs in modern software is one of the major reasons why Project Denver had a hard time. It took a noticeably worse performance hit when executing JITted code, not that its performance on static code was great. This is despite the fact that Denver had a hardware translator, so it didn't have to send all code through the software translator.

I suspect Transmeta fell into the classic trap of underestimating just how large a performance advantage a GBOoO core gets from hiding the latency of memory ops with out-of-order execution. With hindsight, we now know that advantage is massive, but nobody really knew about it 20 years ago.

I'm not entirely sure what Denver's problem was. I understand they did aggressive memory prefetching to try and compensate. Maybe that just wasn't good enough. Or maybe it was just translation overhead: scheduling VLIW code is a hard problem, and it's the same reason Itanium failed.

> However... If software decoding is only used for small power efficient cores (and maybe they use something else than VLIW?

Yeah, might work. Well, more for a medium sized core than small.

I'm thinking maybe you keep the instruction bundling from VLIW so your frontend is significantly simpler, but still use an out-of-order backend so you get the latency-hiding advantage. And because it's only an efficiency core on a heterogeneous SoC, the OS can identify code (or processes) that doesn't work well with software decoding and kick it to the performance cores.
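The OS side could be something as simple as the sketch below. Everything here is hypothetical: the counters, the `ProcStats` type, and the 25% threshold are assumptions I picked for illustration, not anything a real scheduler or chip exposes.

```python
from dataclasses import dataclass


@dataclass
class ProcStats:
    """Hypothetical per-process counters sampled by the scheduler."""
    pid: int
    translate_cycles: int  # cycles spent in the software decoder/translator
    execute_cycles: int    # cycles spent running already-translated code


def pick_core(stats, threshold=0.25):
    """Choose a core type based on the share of time lost to translation.

    Processes that spend more than `threshold` of their cycles being
    translated (e.g. JIT-heavy workloads that keep generating new code)
    get moved to a performance core with a conventional hardware decoder.
    """
    total = stats.translate_cycles + stats.execute_cycles
    if total == 0:
        return "efficiency"  # no data yet, start on the cheap core
    overhead = stats.translate_cycles / total
    return "performance" if overhead > threshold else "efficiency"


if __name__ == "__main__":
    jit_heavy = ProcStats(pid=1, translate_cycles=900, execute_cycles=100)
    steady = ProcStats(pid=2, translate_cycles=10, execute_cycles=990)
    print(jit_heavy.pid, pick_core(jit_heavy))
    print(steady.pid, pick_core(steady))
```

A real policy would want hysteresis so processes don't bounce between core types, but the basic signal (translation overhead per process) seems cheap to collect.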

> IIRC a main driver for Transmeta was to circumvent x86 licensing issues

But then Nvidia tried the same approach. Apparently there was a lawsuit, which they lost, and Project Denver was repurposed as an AArch64 core. It might have been a good product if it could run x86 code.