by DarkShikari 5318 days ago
While the 8-module chip does share a few things (mainly a vector processing unit

And more importantly, the decode and dispatch unit, which only runs every other clock for a given core -- thus limiting any given core to a theoretical maximum of a mere 2 IPC, and in practice a lot less than that, since the dispatch unit has limitations of its own, never mind branch mispredictions and such.

2 comments

My understanding is that it can give every cycle to a given thread as long as the other thread doesn't need to decode anything (if it's idle or whatever), i.e. it can give one thread 4 ops/cycle sustained, given the right workload. But for your purposes, that's probably not any improvement.
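To put numbers on the two cases being discussed, here is a rough sketch of the arithmetic. It assumes a 4-wide decoder (the figure used in this thread) that alternates evenly between threads with work and hands every cycle to a thread whose sibling is idle. The function name is made up for illustration, and this is a back-of-the-envelope bound, not a model of the real front end.

```python
DECODE_WIDTH = 4  # ops per decode cycle, the figure quoted in this thread


def per_thread_decode_rate(active_threads: int, width: int = DECODE_WIDTH) -> float:
    """Upper bound on ops/cycle one thread can get from a shared decoder.

    Assumption: the decoder alternates cycles evenly among the threads that
    have work, and gives every cycle to a lone active thread.
    """
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    return width / active_threads


print(per_thread_decode_rate(2))  # both cores in the module are busy
print(per_thread_decode_rate(1))  # the sibling core is idle
```

With both cores busy the bound is the 2 IPC from the parent comment; with an idle sibling it rises to the 4 ops/cycle mentioned here.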
i.e. it can give one thread 4 ops/cycle sustained, given the right workload.

Dispatch can only do 2 loads and 1 store per cycle. Any more, and it stops on that instruction and dispatches nothing more for that cycle. On plenty of workloads, especially typical compiler output for C code, this will not come close to the 4 ops/cycle maximum, even on a single thread.
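A toy model of that rule, for anyone who wants to play with instruction mixes. It only encodes what is stated above (at most 4 ops, 2 loads and 1 store per cycle, in order, stopping at the first op that would exceed a limit). The op labels and function names are hypothetical, and everything else about the real pipeline (dependencies, cache misses, branch mispredictions) is ignored, so treat the result as an optimistic upper bound.

```python
from typing import Iterable, List

MAX_OPS = 4     # dispatch width per cycle (figure from this thread)
MAX_LOADS = 2   # loads allowed per cycle
MAX_STORES = 1  # stores allowed per cycle


def dispatch_cycles(ops: Iterable[str]) -> List[List[str]]:
    """Group ops ('load', 'store', 'alu') into dispatch cycles, in order.

    A cycle ends when it is full, or when the next op would exceed the
    load or store limit; that op then starts the following cycle.
    """
    cycles: List[List[str]] = []
    current: List[str] = []
    loads = stores = 0
    for op in ops:
        over = (
            len(current) >= MAX_OPS
            or (op == "load" and loads >= MAX_LOADS)
            or (op == "store" and stores >= MAX_STORES)
        )
        if over:
            cycles.append(current)
            current, loads, stores = [], 0, 0
        current.append(op)
        loads += op == "load"
        stores += op == "store"
    if current:
        cycles.append(current)
    return cycles


def ops_per_cycle(ops: Iterable[str]) -> float:
    """Average ops dispatched per cycle under the toy model."""
    ops = list(ops)
    cycles = dispatch_cycles(ops)
    return len(ops) / len(cycles) if cycles else 0.0


alu_only = ["alu"] * 8
memory_heavy = ["load", "load", "load", "alu", "store", "store", "alu", "load"]
print(ops_per_cycle(alu_only))
print(ops_per_cycle(memory_heavy))
```

In this model the pure-ALU stream sustains the full 4 ops/cycle, while the memory-heavy stream needs three cycles for eight ops because the third load and the second store each cut a cycle short.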

I'm not sure what it would look like in the video [d]ecoder world, but I don't think that would matter, since most of the time you'd want to use the 256-bit vector instructions (in practice this would hardly be a high priority until they're nearly ubiquitous...). For use cases where you are addressing large memory regions, this hardly seems like that big a deal. There are times when you can schedule tons of calculations without leaving L1, but for some odd reason people are finding 500GB+ of RAM useful.
since most of the time you'd want to use the 256-bit vector instructions

There are no 256-bit integer vector instructions on x86, and AVX is slower than SSE on Bulldozer.

Sad but true... You can issue SIMD instructions on 4 doubles at once though (and put whatever you want in those 16 registers)...