	by john567 1470 days ago
	This thing branched horribly. You had to use instructions to selectively write to distinct memory locations to avoid typical branching because misprediction was expensive. This was Frostbite/Battlefield 3 era. Good Times.
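
A minimal sketch of the kind of trick described above, not the actual Frostbite code: pick a value, or a store destination, with arithmetic instead of a conditional jump. The helper names `select_u32` and `store_if` are hypothetical. On the real hardware this was typically done with select-style instructions such as PowerPC `fsel` or AltiVec `vsel`, and a compiler is free to turn plain C like this back into a branch, so treat it as an illustration of the idea only.

```c
#include <stdint.h>

/* Returns a when cond is nonzero, b otherwise, using a bit mask
   instead of an if/else. */
static inline uint32_t select_u32(int cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0); /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}

/* Conditional store without a jump: the store always happens, but it
   lands in a throwaway scratch slot when cond is zero. */
static inline void store_if(int cond, uint32_t *dst, uint32_t value) {
    uint32_t scratch;
    uint32_t *targets[2] = { &scratch, dst };
    *targets[cond != 0] = value;
}
```

For example, `select_u32(x > y, x, y)` computes a maximum, and `store_if(hit, &result, damage)` only changes `result` when `hit` is nonzero.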

2 comments

wanderingmoose 1469 days ago

Here's a good explanation of some of the issues w/ common programming patterns at the time colliding with the 360 design. This was a huge problem with the initial ports of older game engines to the 360 that required a lot of rewrites to achieve ok performance.

https://www.gamasutra.com/view/feature/3687/sponsored_featur...

bluedino 1470 days ago

Is that a POWER thing or what was the main cause?

corysama 1470 days ago

On top of being Power architecture, both the 360 and the PS3 chose to push toward the 4 GHz clock limit before everyone else (both shipped at 3.2 GHz). To do this they sacrificed lots of speculative and out-of-order execution features of the CPU. The thinking of the hardware engineers was that the software is compiled for a fixed architecture doing a fixed job, so the compiler should be able to statically order the instructions to make the best use of the very long, very static, very linear pipeline of the CPU. In practice, that only makes sense for very long stretches of instructions with no branches and no cache misses, which is completely the opposite of every piece of gameplay code ever written. The CPUs were great at large scale linear algebra. Not great at much else.

bombcar 1470 days ago

The "compiler will solve everything" theory seemed to last quite a long time; I first heard it with the Itanium and again with these pipelined CPUs.

Narrator: It didn't solve anything.

bee_rider 1470 days ago

This is speculation, but: optimizing compilers are pretty good, right? On x86 at least.

Perhaps they do a good job on popular platforms like x86, because we can encode decades of experience, but not so great on brand new ones.

Veliladon 1470 days ago

One thing that Intel and AMD do better than any other player in the industry is branch prediction. An absolutely stupefying amount of die area is dedicated to it on x86. Combine this with massive speculative execution resources and you can get decent ILP even out of code that's ridiculously hostile to ILP.

Our modern CPU cores have hundreds of instructions in flight at any one moment because of the depth of OoO execution they go to. You can only go that deep on OoO if you have the branch prediction accurate enough not to choke it.

colejohnson66 1469 days ago

> An absolutely stupefying amount of die area is dedicated to it on x86.

Yep. For example, on this die shot of a Skylake-X core,[0] you can see the branch predictor is about the same area as a single vector execution port (about 8% of the non-cache area).

[0]: https://twitter.com/GPUsAreMagic/status/1256866465577394181

delta_p_delta_x 1467 days ago

> One thing that Intel and AMD do better than any other player in the industry is branch prediction. An absolutely stupefying amount of die area is dedicated to it on x86.

Zen in particular combines an L1 perceptron and L2 TAGE[0] predictor[1]. TAGE in particular requires an immense amount of silicon, but it has something like 99.7% prediction accuracy, which is... crazy. The perceptron predictor is almost as good: 99.5%.

I wrote a software TAGE predictor, but too bad it didn't perform as well as predicted (heh) by the authors of the paper.

[0]: https://doi.org/10.1145/2155620.2155635 [1]: https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-...
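
For readers who haven't seen one, here is a rough sketch of a perceptron branch predictor in the textbook style. It is not a model of what Zen actually ships: the history length, the single shared perceptron (real designs index a table by branch address) and the unsaturated integer weights are all simplifying assumptions, and the names `Perceptron`, `perceptron_init`, `perceptron_output` and `perceptron_step` are made up for this example.

```c
#include <stdlib.h>

#define HIST_LEN 8   /* global history length (assumed for this sketch) */
#define THETA    29  /* training threshold, floor(1.93 * HIST_LEN + 14) */

typedef struct {
    int weights[HIST_LEN + 1]; /* weights[0] is the bias weight */
    int history[HIST_LEN];     /* +1 = taken, -1 = not taken; [0] is most recent */
} Perceptron;

static void perceptron_init(Perceptron *p) {
    for (int i = 0; i <= HIST_LEN; ++i) p->weights[i] = 0;
    for (int i = 0; i < HIST_LEN; ++i) p->history[i] = -1;
}

/* Bias plus dot product of weights and history; >= 0 means "predict taken". */
static int perceptron_output(const Perceptron *p) {
    int y = p->weights[0];
    for (int i = 0; i < HIST_LEN; ++i) y += p->weights[i + 1] * p->history[i];
    return y;
}

/* Predict, train on the real outcome, then shift the outcome into the
   history. Returns 1 if the prediction matched the outcome. */
static int perceptron_step(Perceptron *p, int taken) {
    int y = perceptron_output(p);
    int predicted_taken = (y >= 0);
    int t = taken ? 1 : -1;
    /* Train on a misprediction or when the output is not yet confident. */
    if (predicted_taken != (taken != 0) || abs(y) <= THETA) {
        p->weights[0] += t;
        for (int i = 0; i < HIST_LEN; ++i) p->weights[i + 1] += t * p->history[i];
    }
    for (int i = HIST_LEN - 1; i > 0; --i) p->history[i] = p->history[i - 1];
    p->history[0] = t;
    return predicted_taken == (taken != 0);
}
```

Feeding it a strictly alternating taken/not-taken branch, it should lock on after a short warm-up, because the weight on the most recent history bit goes strongly negative. A plain per-branch saturating counter cannot learn that pattern.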

fulafel 1469 days ago

Everything is relative. They do things that seemed quite neat in the 90s, but then progress slowed to a crawl.

I'd call the state of the field quite bad. For example, they do embarrassingly little to help you with the 2 main bottlenecks we've had for a long time: concurrency and data layout optimization. And even for the naive model (1 CPU / free memory) there is just so much potentially automatable manual toil in doing semantics-based transformations in perf work that it's not even funny.

A large part is using languages that don't support these kinds of optimizations. It's not "C compiler improvements hit a wall", it continues "and we didn't develop & migrate to languages whose semantics allow optimizations". (There's a little of this in the GPU world, but the proprietary infighting there has produced a dev experience and app platform so bad that very few apps outside games venture there)

There's a whole alternative path of processor history not taken, one where VLIW panned out instead of failing because of optimism about compiler optimizations.

mhh__ 1470 days ago

They do a good job but the scheduling aspects are really really fuzzy.

LLVM and GCC both have models of out of order pipelines but other than making sure they don't stall the instruction decoder it's really hazy whether they actually do anything.

The processors themselves are designed around the code the compilers emit and vice versa.

astrange 1469 days ago

Optimizing compilers aren't that great on x86. Sure, they're good enough to make something 60fps that wasn't before, but they don't really have much specific x86 knowledge.

wmf 1470 days ago

Nah, in-order just can't be fixed by any compiler.

zenron 1470 days ago

As said in other comments, not a Power thing. For example, the Nintendo Wii U chose to use 3 1.2 GHz (iirc) out-of-order execution CPUs that were more like a PC than the cores in the Xbox 360, even though they sound similar in instruction set and composition.

This led the Wii U to be able to do things like run Mass Effect 3 and Deus Ex better (arguably) than the PS3 and 360 most of the time. The Wii U was probably the better hardware platform in hindsight, but it came too late and the development tools were not as robust, so ports were just kinda afterthoughts.

wmf 1470 days ago

You're comparing the 2012 Wii U against 2005/2006 360/PS3. The PPE was indeed terrible but it's not clear that IBM had anything better available at that time.

kmeisthax 1469 days ago

The Wii U also needed to be backwards-compatible with the Wii, which used the bespoke paired singles and locked cache line features of the GameCube's PPC 750 derivative. This almost certainly locked them out of newer PowerPC designs without more engineering work than Nintendo would be willing to put into its systems.

For context, Nintendo has always been weirdly quirky and low-buck when it comes to core silicon engineering. The Switch is a Tegra X1 in a trenchcoat, the SNES used a 65C816 at about half the clockspeed it needed to be[0] and had half the VRAM removed at the last minute, and the NES stole[1] the 6502 masks so they didn't have to pay MOS for legit chips. All of those design decisions were made purely to improve margins and genuinely constrained game developers in the process. "Lateral thinking with withered technology" is kind of just their thing.

At least now they're 100% on board with a silicon vendor with a sane roadmap, so they'll at least have a steady supply of backwards-compatible last-gen chips to repackage.

[0] At least it wasn't as slow as the Apple IIgs they pulled it from

[1] Technically legal as IC maskwork rights did not exist yet. This is also why decimal mode was removed - it was literally the only thing MOS had a patent on in the 6502 design.

TapamN 1469 days ago

>had half the VRAM removed at the last minute

I've never heard anything suggesting that video RAM was removed. AFAIK, the SNES was planned to have only 8KB of main RAM, which was increased to 128KB by release. I think any support for 128KB VRAM was for future proofing, like if the SNES's hardware was reused for arcade systems, or something.

Source: https://www-chrismcovell-com.translate.goog/secret/sp_sfcpro...

The Genesis's video chip can support 128KB video RAM as well, which, besides allowing a larger variety of tiles on screen, doubles DMA bandwidth. It was used in the System C2 arcade board. The Genesis was originally designed to use 64KB video RAM, but after hearing about the SNES, support for 128KB was added. Then they decided that the extra RAM didn't make enough of a difference to justify the cost, so they left it at 64KB.

Source: https://readonlymemory.vg/shop/book/sega-mega-drive-genesis-...

pinewurst 1469 days ago

Ricoh, not Nintendo, ripped off MOS IP.

djmips 1469 days ago

Nintendo shipped it and I'm sure they knew.

anyfoo 1469 days ago

Super interesting comment.

> which used the bespoke paired singles and locked cache line features of the GameCube's PPC 750 derivative

Where can I find some information on that specifically?

wmf 1469 days ago

http://datasheets.chipdb.org/IBM/PowerPC/Gekko/gekko_user_ma...

Veliladon 1470 days ago

It wasn't a POWER thing, since POWER has been superscalar since the 604 back in '94. This was because when you're designing for a console and paying by the wafer, instead of shopping around for a pre-finished unit, you need to consider die area more strictly. Something had to give in the design, and given MS controlled the design of the whole stack, they thought they could do parallelism at the thread level in the OS scheduler rather than having to devote massive amounts of die area to it.

monocasa 1470 days ago

The Xenon core was a slightly modified Cell PPE. In Cell, the focus was on saving gates on the core needed for system management so that more could go to SPUs.

The idea behind using it in Xenon without the SPUs was that, this being a console, high-perf code could be tailored specifically for this core's uarch, so the in-order nature wasn't the worst thing and the gate savings are pretty huge.