Hacker News
by rickmode 2684 days ago
How about CPUs without speculative execution and simultaneous multithreading (SMT / Hyper-Threading, which has similar issues)? We would, of course, need other optimizations to claw back the performance loss--an engineering problem I feel we can solve.

I've wondered if the solution is more, simpler cores. We concentrate on smaller, faster cores, and the programming to utilize them better. Perhaps advances in memory architectures as well. Hardware isn't my specialty, so I'm just brainstorming here.

Perhaps this is where ARM and even RISC-V based systems can step in.

But I'm a software guy, so what do I know? I just know I'd feel more comfortable with systems based on simpler CPUs that simply cannot be exploited by the recently discovered side-channel attacks, rather than playing whack-a-mole with patches and trying to reason about when it might be safe to use CPUs with these optimizations.

2 comments

I don't know why this is being down-voted.

A few things: ARM and RISC-V definitely have SpecEx baked in (though on RISC-V you can choose not to include the SpecEx module). There are interesting alternatives to SpecEx. DSPs use delay slots, and I've seen delay slots used quite well in a GP CPU. Getting high instruction saturation on a CPU with delay slots is a "hard compiler problem", but I have a few things to say about that:

Despite jokes about "better compilers", compilers are getting better (e.g. polyhedral optimization). One way to think of OOOEx/SpecEx is that it's figuratively the CPU JITting your code on the fly. The most popular programming language JITs aggressively anyway, so one wonders if there isn't some duplication of effort going on.

Furthermore, the most popular programming language isn't exactly the fastest in raw performance, and it's pretty clear that in our current ecosystem just pushing operations through the FPU (which is what x86 optimizes for) isn't necessarily the most important thing in the world; uptime, reliability, fault-tolerance, safe parallelization, distribution, and power conservation might be more important moving forward.

HM, oops, apparently RISC-V has OOOEx, not SpecEx.

I understand this is nitpicking, but it's not accurate to say "RISC-V has speculative execution" or "ARM has OoO execution" and that they therefore suffer from Spectre and friends.

RISC-V and ARM are specifications of instruction sets, for which there exists an enormous domain of possible implementations. Spectre and Meltdown are not inherent features of instruction set architectures. They are emergent properties of certain implementations of those instruction set architectures.

For example, the BOOM implementation of RISC-V does out-of-order execution. The Rocket chip implementation does not. Both implement the RISC-V architecture.

I'm not replying to you specifically. But I see this sort of thing on HN all the time and I feel like it's an important distinction to make.

Thank you, I should have been more careful. Spectre and Meltdown are in fact specific interactions that happen because OOO and SpecEx are hard, and it's easy to mess up given the high level of statefulness and complexity in contemporary chip designs (in this case, memory caching). But OOO and SpecEx make chip architectures difficult to reason about, and I'm sure more errors will emerge.

> Despite jokes about "better compilers", compilers are getting better

The compiler has to make static decisions. The hardware knows what is actually happening. There is an inherent information asymmetry at work that a "sufficiently smart" compiler seems unlikely to overcome.

My intuition says software can't beat the speed of a superscalar OOO CPU any more than a GP CPU can beat a roughly equivalent DSP for algorithms suitable to run on the DSP, but I have no proof for that.

I'll also note that we've been promised "smarter compilers" for decades. Intel has tried that route several times. No one has ever made it work.

> The compiler has to make static decisions.

Pretty sure I mentioned JITting in my comment.

> My intuition says software can't beat the speed of a superscalar OOO CPU any more

How good is good enough? I mean, we have distributed TensorFlow, which is basically on-the-fly compilation that can reorganize your computational graph around nodes with GPUs separated by network latency, or Julia, where you can drop in a GPUArray as a datatype and move computation to the GPU without changing your code.

If we go to something a bit more baroque, Java is within 1.5x of C/C++ these days.

Could you hand roll a better solution? Probably. Would it be worth it? Doubtful.

I think it's definitely worth exploring this angle because modern JIT compilers have become very advanced, and there's still a lot of juice left to squeeze there. Look at some of the things Graal is doing and it looks a lot like what OOO speculation is doing - it'll recompile branches on the fly based on profiling information and things like that.

Nvidia Denver couples a software-based JIT/translator with an in-order VLIW backend. It is vulnerable to Spectre.

It is a natural alternative. The "simpler but more cores" project works on paper (i.e. potential instruction throughput). In reality it falls apart for a variety of reasons. The most fundamental is the difficulty of exploiting thread-level parallelism. The complex out-of-order cores do a really good job of improving throughput by finding independent instructions to execute in parallel. The path to parallelism is much easier at the granularity of instructions than at the granularity of cores. Parallel programming is hard. Amdahl's law cannot be avoided except through speculation, so we are back to complexity again.

> In reality it falls apart for a variety of reasons.

On the contrary, the "many wimpy cores" approach has been very successful on GPUs (and other vector processors such as TPUs, etc.) What it hasn't been successful at is running existing software unmodified.

The best solution (the one we use today, in fact) seems to be a hybrid approach. We have a few powerful cores to run sequential code and many weaker units (often vector units; e.g. GPUs and SIMD) to run parallel code. Not all algorithms can be parallelized, so we'll likely always need a few fast sequential cores. But a lot of code can.

Amdahl's law for sure. We all know most software sucks, so in this vision (simpler cores), that reduces us to taking the performance hit and optimizing elsewhere. For example, older ARM-based systems do not use speculative execution, and can handle encryption and video transcoding through co-processors.
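
For anyone who wants to put numbers on the Amdahl's law point, here is a minimal Python sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of runtime that can be parallelized and n is the number of cores. The function name and the sample values of p and n are assumptions chosen for illustration, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when `parallel_fraction` of the runtime is spread over `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at full cost; only the parallel part is divided up.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative values only: how speedup scales for three parallel fractions.
    for p in (0.50, 0.90, 0.99):
        row = ", ".join(
            f"{n} cores: {amdahl_speedup(p, n):.1f}x" for n in (4, 64, 1024)
        )
        print(f"p={p:.2f} -> {row}")
```

With p = 0.90 the speedup can never reach 10x no matter how many simple cores you add, which is why the sequential part still wants a fast core.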