This thing branched horribly. You had to use instructions that selectively write to distinct memory locations instead of ordinary branches, because mispredictions were expensive.
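To make the "select instead of branch" idea concrete, here is a minimal sketch in portable C. The function names `select_i32` and `clamp_max` are made up for illustration, and this is not code from any 360 SDK. PowerPC's floating-point `fsel` instruction is one example of a hardware select of this kind. Note that a compiler is free to turn this back into a branch or a conditional move, so treat it as a picture of the pattern rather than a guarantee about the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-free select: returns a when cond is nonzero, otherwise b.
 * mask is all ones (-1) or all zeros, so the XOR trick picks one input. */
static int32_t select_i32(int cond, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(cond != 0);
    return b ^ ((a ^ b) & mask);
}

/* Clamp each element to at most `limit` with no data-dependent branch in
 * the loop body (the loop condition itself is highly predictable). */
static void clamp_max(int32_t *v, size_t n, int32_t limit)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        v[i] = select_i32(v[i] > limit, limit, v[i]);
    }
}
```

Both candidate values are always computed, which is the trade: a little wasted work in exchange for never flushing a long in-order pipeline.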
Here's a good explanation of some of the issues w/ common programming patterns at the time colliding with the 360 design. This was a huge problem with the initial ports of older game engines to the 360 that required a lot of rewrites to achieve ok performance.
On top of being Power architecture, both the 360 and the PS3 chose to push the 4 GHz clock limit before everyone else. To do this they sacrificed lots of the CPU's speculative and out-of-order execution features. The thinking of the hardware engineers was that the software is compiled for a fixed architecture doing a fixed job, so the compiler should be able to statically order the instructions to make the best use of the very long, very static, very linear pipeline of the CPU. In practice, that only makes sense for very long stretches of instructions with no branches and no cache misses, which is completely the opposite of every piece of gameplay code ever written. The CPUs were great at large-scale linear algebra. Not great at much else.
One thing that Intel and AMD do better than any other player in the industry is branch prediction. An absolutely stupefying amount of die area is dedicated to it on x86. Combine this with massive speculative execution resources and you can get decent ILP even out of code that's ridiculously hostile to ILP.
Our modern CPU cores have hundreds of instructions in flight at any one moment because of the depth of OoO execution they go to. You can only go that deep on OoO if you have the branch prediction accurate enough not to choke it.
> An absolutely stupefying amount of die area is dedicated to it on x86.
Yep. For example, on this die shot of a Skylake-X core,[0] you can see the branch predictor is about the same area as a single vector execution port (about 8% of the non-cache area).
> One thing that Intel and AMD do better than any other player in the industry is branch prediction. An absolutely stupefying amount of die area is dedicated to it on x86.
Zen in particular combines an L1 perceptron and L2 TAGE[0] predictor[1]. TAGE in particular requires an immense amount of silicon, but it has something like 99.7% prediction accuracy, which is... crazy. The perceptron predictor is almost as good: 99.5%.
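For anyone who hasn't seen one, a perceptron predictor is simple enough to sketch in software. The following is a textbook-style toy in the spirit of Jimenez and Lin's perceptron predictor, not AMD's actual design: the table size, history length, modulo indexing, and all the names (`perceptron_bp`, `bp_predict`, `bp_update`) are assumptions picked for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HIST_LEN   16   /* assumed global-history length */
#define TABLE_SIZE 256  /* assumed number of perceptrons */
/* Training threshold suggested by Jimenez and Lin: floor(1.93 * h + 14). */
#define THETA      ((int)(1.93 * HIST_LEN + 14))

typedef struct {
    int8_t weights[TABLE_SIZE][HIST_LEN + 1]; /* weights[..][0] is the bias */
    int8_t history[HIST_LEN];                 /* +1 = taken, -1 = not taken */
} perceptron_bp;

static void bp_init(perceptron_bp *bp)
{
    memset(bp->weights, 0, sizeof bp->weights);
    for (int i = 0; i < HIST_LEN; i++)
        bp->history[i] = -1;
}

/* Dot product of the selected weight vector with the global history. */
static int bp_output(const perceptron_bp *bp, uint32_t pc)
{
    const int8_t *w = bp->weights[pc % TABLE_SIZE];
    int y = w[0];
    for (int i = 0; i < HIST_LEN; i++)
        y += w[i + 1] * bp->history[i];
    return y;
}

/* Predict taken (1) when the output is non-negative, else not taken (0). */
static int bp_predict(const perceptron_bp *bp, uint32_t pc)
{
    return bp_output(bp, pc) >= 0;
}

/* Keep weights inside the int8_t range instead of wrapping around. */
static int8_t saturate(int x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Train on the real outcome, then shift it into the global history. */
static void bp_update(perceptron_bp *bp, uint32_t pc, int taken)
{
    int8_t *w = bp->weights[pc % TABLE_SIZE];
    int y = bp_output(bp, pc);
    int t = taken ? 1 : -1;

    /* Adjust weights on a misprediction or a low-confidence output. */
    if ((y >= 0) != (taken != 0) || abs(y) <= THETA) {
        w[0] = saturate(w[0] + t);
        for (int i = 0; i < HIST_LEN; i++)
            w[i + 1] = saturate(w[i + 1] + t * bp->history[i]);
    }
    for (int i = HIST_LEN - 1; i > 0; i--)
        bp->history[i] = bp->history[i - 1];
    bp->history[0] = (int8_t)t;
}
```

Call `bp_predict` before the branch resolves and `bp_update` afterwards. A toy like this learns always-taken and strictly alternating branches quickly. Real hardware also has to deal with aliasing between branches, latency, and much longer histories, none of which is modeled here.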
I wrote a software TAGE predictor, but too bad it didn't perform as well as predicted (heh) by the authors of the paper.
Everything is relative. They do things that seemed quite neat in the 90s, but then progress slowed to a crawl.
I'd call the state of the field quite bad. For example, they do embarrassingly little to help you with the 2 main bottlenecks we've had for a long time: concurrency and data layout optimization. And even for the naive model (1 CPU / free memory) there is just so much potentially automatable manual toil in doing semantics-based transformations in perf work that it's not even funny.
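As one hedged illustration of what "data layout optimization" usually means in practice, here is the classic array-of-structures to structure-of-arrays rewrite that people still do by hand. The type and function names are invented for the example. C requires struct members to stay in declaration order, so a C compiler generally can't make this change for you.

```c
#include <stddef.h>

/* Array-of-structures: each particle's fields sit together, so a loop that
 * only reads x still drags y, z and mass through the cache. */
typedef struct { float x, y, z, mass; } particle_aos;

static float sum_x_aos(const particle_aos *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;
    return s;
}

/* Structure-of-arrays: each field is contiguous, so the same loop touches
 * only the bytes it needs and is easier to vectorize. */
typedef struct { float *x, *y, *z, *mass; } particles_soa;

static float sum_x_soa(const particles_soa *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```

Both functions compute the same sum. The difference is only in how much memory traffic the loop generates, and that is the part the programmer currently has to reason about manually.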
A large part is using languages that don't support these kinds of optimizations. It's not just "C compiler improvements hit a wall"; the sentence continues "and we didn't develop & migrate to languages whose semantics allow optimizations". (There's a little of this in the GPU world, but the proprietary infighting there has produced a dev experience and app platform so bad that very few apps outside games venture there.)
There's a whole alternative path of processor history not taken, one in which VLIW panned out instead of failing because of over-optimism about compiler optimizations.
They do a good job but the scheduling aspects are really really fuzzy.
LLVM and GCC both have models of out of order pipelines but other than making sure they don't stall the instruction decoder it's really hazy whether they actually do anything.
The processors themselves are designed around the code the compilers emit and vice versa.
Optimizing compilers aren't that great on x86. Sure, they're good enough to make something 60fps that wasn't before, but they don't really have much specific x86 knowledge.
As said in other comments, not a Power thing. For example, the Nintendo Wii U used three 1.2 GHz (iirc) out-of-order CPU cores that were more like a PC's than the cores in the Xbox 360, even though they sound similar in instruction set and composition.
This led to the Wii U being able to do things like run Mass Effect 3 and Deus Ex better (arguably) than the PS3 and 360 most of the time. The Wii U was probably the better hardware platform in hindsight, but it came too late and the development tools were not as robust, so ports were just kind of afterthoughts.
You're comparing the 2012 Wii U against 2005/2006 360/PS3. The PPE was indeed terrible but it's not clear that IBM had anything better available at that time.
The Wii U also needed to be backwards-compatible with the Wii, which used the bespoke paired singles and locked cache line features of the GameCube's PPC 750 derivative. This almost certainly locked them out of newer PowerPC designs without more engineering work than Nintendo would be willing to put into its systems.
For context, Nintendo has always been weirdly quirky and low-buck when it comes to core silicon engineering. The Switch is a Tegra X1 in a trenchcoat, the SNES used a 65C816 at about half the clockspeed it needed to be[0] and had half the VRAM removed at the last minute, and the NES stole[1] the 6502 masks so they didn't have to pay MOS for legit chips. All of those design decisions were made purely to improve margins and genuinely constrained game developers in the process. "Lateral thinking with withered technology" is kind of just their thing.
At least now they're 100% on board with a silicon vendor with a sane roadmap, so they'll at least have a steady supply of backwards-compatible last-gen chips to repackage.
[0] At least it wasn't as slow as the Apple IIgs they pulled it from
[1] Technically legal as IC maskwork rights did not exist yet. This is also why decimal mode was removed - it was literally the only thing MOS had a patent on in the 6502 design.
I've never heard anything suggesting that video RAM was removed. AFAIK, the SNES was planned to have only 8KB of main RAM, which was increased to 128KB by release. I think any support for 128KB VRAM was for future proofing, like if the SNES's hardware was reused for arcade systems, or something.
The Genesis's video chip can support 128KB video RAM as well, which, besides allowing a larger variety of tiles on screen, doubles DMA bandwidth. It was used in the System C2 arcade board. The Genesis was originally designed to use 64KB video RAM, but after hearing about the SNES, support for 128KB was added. Then they decided that the extra RAM didn't make enough of a difference to justify the cost, so they left it at 64KB.
It wasn't a POWER thing, since POWER has been superscalar since the 604 back in '94. It was because when you're designing for a console and paying by the wafer instead of shopping around for a pre-finished unit, you need to consider die area more strictly. Something had to give in the design, and given that MS controlled the design of the whole stack, they thought they could do parallelism at the thread level in the OS scheduler rather than having to devote massive amounts of die area to it.
The Xenon core was a slightly modified Cell PPE. In the Cell, the PPE was focused on saving gates on the core needed for system management in favor of more SPUs.
The idea behind using it in Xenon without the SPUs was that, this being a console, high-perf code could be tailored specifically for this core's uarch, so the in-order nature wasn't the worst thing, and the gate savings are pretty huge.
https://www.gamasutra.com/view/feature/3687/sponsored_featur...