HN Mirror


by Tuna-Fish 2999 days ago

Processing in memory has real promise for the cases where your work can be distributed. Specifically, I think it can have a great future in AI. However, for general-purpose code I doubt it can do anything. Your example of an indirect load would be greatly sped up if the target of the pointer is on the same device as the pointer. However, the second it isn't, moving things from one RAM chip to another isn't any faster than moving them from a RAM chip to the CPU, and at that point defining a single central location that tries to be close to everything just makes sense. If your operation needs 8 values from 8 different places, having a central location means doing 8 transfers, while PIM can mean forwarding each value (or the intermediate values) multiple times to get to the next location.
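To put rough numbers on the 8-values example, here is a toy count of transfers in Python. It is a sketch under loud assumptions, not a model of any real PIM design: the function names are made up, every value lives on a different chip, and in the PIM case the work hops from chip to chip and has to carry every value collected so far because nothing can be combined early.

```python
def central_transfers(n_values: int) -> int:
    """Central CPU: each value moves exactly once, from its chip to the CPU."""
    return n_values


def pim_chain_transfers(n_values: int) -> int:
    """Assumed worst case for PIM: the computation hops chip to chip and
    forwards every value gathered so far on each hop."""
    total = 0
    carried = 0
    for _ in range(n_values - 1):  # n chips means n - 1 hops between them
        carried += 1               # pick up the value stored on this chip
        total += carried           # forward everything carried so far
    return total


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, central_transfers(n), pim_chain_transfers(n))
```

If the operation is something like a sum, where the intermediate state collapses to a single value, the chain only carries one value per hop and the count drops to n - 1, but the hops are still serialized one after another.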

None of the changes to x86 that people have thought of over the years really helps enough to justify breaking backcompat, simply because the things they fix aren't on the fast path of the critical execution stage. The limit that power imposes on frequency in current CPUs is not really the total amount of power consumed; it's the amount of power consumed in the <0.25mm of chip that houses the register file, forwarding network and ALUs. That is, the place where things actually happen during the most important pipeline stage. This is why an 8-core CPU running just a single thread cannot make one of the cores consume as much power as all 8 would if running 8 threads: the register file of the running core would just melt, even if the total power stayed below chip limits.

x86 decoding is hairy and takes a long time and a lot of transistors. However, it is placed in its own pipeline stages, which run in parallel with execute and only slow it down by making a branch miss a little more expensive. And the power cost is limited today by caching the decoded uops in their own cache, so during any tight loop, the decode hardware is idle and consumes no power. The same sort of goes for the stack engine: as it runs early in the pipeline, it is basically a way to compress instructions a little, which saves power by making code more compact when it is in use, and it does nothing when it is not used. Removing it would not really help, even if all code instantly changed to accommodate. Many of the remaining ugly warts of the x86 architecture are handled in the time-honored CISC way: just punt it to microcode, performance be damned. Today, self-modifying code technically works, but you never want to do it, because invalidating lines in the L1i has been implemented in whatever way is fastest and cheapest for the common case of code that does not modify itself. (And that mechanism has to exist even if you don't support self-modifying code, because there has to be some way of invalidating L1i entries.) Similarly, a lot of the CISC instructions that make more sense to implement as software routines (FPU sin/cos, for example) are today just abandoned ucode routines that are slower than rolling your own.
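The uop cache point can be illustrated with a toy simulation in Python. This is only a sketch: the function name, the FIFO eviction policy and the capacities are assumptions made for illustration, and real uop caches are organized quite differently. It just counts how often the expensive decoder has to run for a stream of instruction addresses.

```python
from collections import OrderedDict


def count_decodes(trace, cache_capacity):
    """Count decoder activations for a trace of instruction addresses,
    using a small FIFO cache of already-decoded instructions (toy model)."""
    cache = OrderedDict()
    decodes = 0
    for addr in trace:
        if addr in cache:
            continue                   # hit: decoder stays idle
        decodes += 1                   # miss: run the decoder
        cache[addr] = True
        if len(cache) > cache_capacity:
            cache.popitem(last=False)  # evict the oldest entry
    return decodes


if __name__ == "__main__":
    tight_loop = list(range(16)) * 1000  # 16-instruction loop, 1000 iterations
    print(count_decodes(tight_loop, cache_capacity=64))
    print(count_decodes(tight_loop, cache_capacity=8))
```

When the loop fits in the cache, the decoder runs only during the first iteration. When it does not fit, this particular FIFO model misses on every instruction, which is the case the real hardware is sized to avoid for tight loops.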

1 comment

Tobba_ 2999 days ago

I'm not talking about the fundamentally misguided memory-distributed computing stuff, I mean "improve flexibility enough that you can bolt some additional units on as offload" (address translation in this case would take some work though). The magic of presenting software with a more or less monolithic core in this case is that you don't have that problem, since you can simply do it the usual way.

Also, I don't think the trouble with added complexity outside the hot path is any added latency; it's that it needlessly burns up the thermal budget. Not that raising the voltage is the best way of increasing frequency, but it's sure to do so.
