| > x86 is a lousy architecture, but x86-64 isn't as bad; at least it has a good number of registers unlike x86. Funny thing. Whether you boot a modern x86 system in 32-bit mode or 64-bit mode doesn't change the number of registers you're using. You're still using the 32-128 physical registers on the core. That's why x86_64 code isn't particularly faster (sometimes slower) than the same code compiled for x86_32 mode. People have this zany idea that assembly language is a low-level language. It's not. When the CPU executes 32-bit x86 code, it "compiles" it to uops that are totally unrecognizable to us and use dozens to hundreds of registers. The thing about embedded kinda exposes your mental bias. When you scale up the x86 instruction decoder from Intel Atom scale to Xeon scale, the instruction decoder gets a little bit more complicated, but it's given a ton more tools to use. So sure, Atom sucks at embedded, but x86 is still king at desktop and beyond, and ARM will never be able to challenge it. If ARM wanted to challenge x86 in terms of single-threaded performance, it would need to do the same thing x86 does: have a super complicated instruction decoder that maps the 32 logical registers defined by the ISA to its 128 physical ones, reschedules everything, renames stuff where appropriate, identifies loads that can be elided, etc. And all of the advantages of having a simple ISA go out the window, because the ISA is an illusion. Unfortunately Intel's Architecture Code Analyzer is dead. LLVM MCA is almost as good. I recommend you play around with it a bit some time. CPUs kinda don't give a crap if you're speculating four loop iterations ahead and you're just using the same few registers over and over again for multiple purposes. |
Yes, but the x86 code itself can't address hundreds of registers, only a handful, because the ISA assumes that's all there is (and that's all there was in the actual x86 processors way back). So using a complicated instruction decoder to get around this and reach the much more capable hardware underneath looks to me like a big source of inefficiency: surely you'd get more performance if your ISA could use the hardware resources directly, instead of needing a super complicated instruction decoder.
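
For anyone following along, here is a toy sketch in Python of the renaming step we're both talking about. This is purely my own illustration, not how any real core implements it: the `rename` function, the four-register set, and the 16-entry physical file are all made-up assumptions, and it never frees physical registers the way real hardware does at retirement.

```python
from collections import deque

def rename(instructions, num_physical=16,
           arch_regs=("eax", "ebx", "ecx", "edx")):
    """Rewrite (dest, src1, src2) tuples to use physical register names.

    Hypothetical toy model: registers are never freed, so it raises
    IndexError once the physical file is exhausted.
    """
    free = deque(f"p{i}" for i in range(num_physical))
    # Each architectural register starts out mapped to its own physical one.
    table = {reg: free.popleft() for reg in arch_regs}
    renamed = []
    for dest, *srcs in instructions:
        # Sources read whichever physical register currently holds the value.
        phys_srcs = [table[s] for s in srcs]
        # Every write gets a fresh physical register, so reusing an
        # architectural name does not create a false dependency.
        table[dest] = free.popleft()
        renamed.append((table[dest], *phys_srcs))
    return renamed

program = [
    ("eax", "ebx", "ecx"),  # eax = ebx op ecx
    ("edx", "eax", "eax"),  # edx = eax op eax (real dependency on the line above)
    ("eax", "ebx", "ebx"),  # eax reused for an unrelated value
    ("ecx", "eax", "edx"),  # reads the *new* eax and the earlier edx
]

for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

The third instruction writes `eax` again, but after renaming it lands in a different physical register than the first write did, so it doesn't have to wait on the second instruction. That's the trick that lets the core get by with only a handful of architectural names.
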
>So sure, Atom sucks at embedded, but x86 is still king at desktop and beyond, and ARM will never be able to challenge it.
ARM is already challenging it. There are ARM64 servers on the market now. Here's a place selling them, from a quick Google search: https://system76.com/servers/starling
>If ARM wanted to challenge x86 in terms of single-threaded performance, it would need to do the same thing x86 does: have a super complicated instruction decoder that maps the 32 logical registers defined by the ISA to its 128 physical ones, reschedules everything, renames stuff where appropriate, identifies loads that can be elided, etc.
OK, then why not just make a new (or at least extended) ISA that makes direct use of all those things, instead of needing a super complicated instruction decoder? We already have lots of different ISAs for embedded CPUs: ARM has all kinds of variants (ARMv7, ARMv9, etc.), and MIPS does too. For best performance, you have to compile for the exact ISA you're targeting. We don't do this for desktop stuff, mainly because Microsoft isn't going to make 30 different versions of Windows, but for embedded systems it's perfectly normal because everything is compiled from source.