The Worst CPUs Ever Made (2021) (extremetech.com)
88 points by fzliu 1490 days ago
20 comments

This doesn't seem to be the best-researched article out there.

If they thought Itanium was bad, they should have looked into the i860. Itanium was an attempt to fix a bunch of the i860 ideas. i860 quickly went from a supercomputer chip to a cheap DSP alternative (where it had at least the hope of hitting more than 10% of its theoretical performance).

Intel iAPX 432 was preached as the second coming back in the 80s, but failed spectacularly. The i960 was take 2 and their joint venture called BiiN also shuttered. Maybe Rekursiv would be worthy of a mention here too.

We now know that core 2 dropped all kinds of safety features resulting in the Meltdown vulnerabilities. It also partially explains why AMD couldn't keep up as these shortcuts gave a big advantage (though security papers at the time predicted that meltdown-style attacks existed due to the changes).

Rather than an "honorable mention", the Cell processor should have easily topped the list of designs they mentioned. It was terrible in the PS3 (with few games if any able to make full use of it) and it was terrible in the couple supercomputers that got stuck with it.

I'd also note that Bulldozer is maligned more than it should be. There's a lot to like about the concept of CMT and, for the price, the chips weren't the worst. I'd even go so far as to say that if AMD hadn't been so starved for R&D money during that period, they might have been able to make it work. ARM's latest A510 shares more than a few similarities. A big/little or big/little/little CMT architecture seems like a very interesting approach to explore in the future.

I was also surprised the iAPX 432 wasn't on the list. It seems to be the Itanium's granddaddy. It was expensive, targeted at enterprises rather than everyone, tried to push the boundaries (32-bit for the 432, 64-bit for Itanium), and relied on VLIW instruction sets that were beyond the capabilities of compilers. The resemblance is striking.

As for Bulldozer, I was saddled with one for a while. Where it really fell down was (surprise!) its floating point performance. That FPU shared between two integer units makes for some "interesting" performance characteristics when trying to run multiple FP-heavy tasks, but overall, it was merely mediocre rather than terrible. I'm glad AMD hit it out of the park with Zen.

Bulldozer gets too much hate IMO. Okay, the instructions per clock cycle were bad and power consumption was high but you can't forget that the FX-6300 was $100 for a >3-core chip that could be overclocked by another 0.7 GHz without issue. The price-performance ratio was better than anything Intel fielded. I'm still running it today.
Bulldozer has got a lot of hate mostly because of false advertising and because of a series of blog articles written by AMD marketing people before its launch in 2011, which created very wrong expectations about its characteristics.

The wrong expectations and false advertising have centered on the fact that the first Bulldozer was described as an 8-core CPU, which would easily crush its 4-core competition from Intel (Sandy Bridge).

What the AMD bloggers forgot to mention was that the new Bulldozer cores were much weaker than the cores of AMD's previous CPU generations, being able to execute only 2 instructions per cycle, while an Intel core could execute 4 instructions per cycle (and the previous AMD cores could execute 3 instructions per cycle). So, for multi-threaded tasks, a Bulldozer core had only the performance of one of the 2 threads of an Intel core, with the additional disadvantage that the resources of the 2 AMD cores in a module could not be allocated to a single thread when the second core of the module was idle.

So an 8-core Bulldozer could barely match the multi-threaded performance of a 4-core Sandy Bridge, while being much slower on single-thread tasks.
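
To put rough numbers on that, here is a back-of-the-envelope sketch in C. It only multiplies core count by the per-core issue width quoted above (2 for Bulldozer, 4 for Sandy Bridge); `peak_issue_per_cycle` is a name I made up, and it ignores clock speed, caches, and everything else that matters in practice.

```c
#include <assert.h>

/* Peak instructions issued per cycle across the whole chip:
   core count times per-core issue width. A paper number only. */
unsigned peak_issue_per_cycle(unsigned cores, unsigned width_per_core)
{
    assert(cores > 0 && width_per_core > 0);
    return cores * width_per_core;
}
```

Both `peak_issue_per_cycle(8, 2)` and `peak_issue_per_cycle(4, 4)` come out to 16, which is the "barely match" case on paper.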

If it had been known from the beginning that the Bulldozer cores had been intentionally designed to be much weaker than both the old AMD cores and the Intel cores, this would not have been a surprise, and everybody for whom the price/performance ratio mattered more than raw performance would have been happy to buy Bulldozer CPUs.

However, after many months during which AMD claimed that its supposedly 8-core CPU would be better than any other CPU with fewer cores, there was huge disappointment when the first tests after launch immediately revealed the pathetic performance of the new cores, which for single-thread tasks were much slower than the previous AMD CPUs.

So all the hate was caused by the stupid actions of AMD management and marketing, who lied continuously about Bulldozer, even though they should have realized that this was pointless, because independent benchmarks would reveal the truth immediately after launch.

To set expectations correctly for Bulldozer vs. Sandy Bridge, what AMD called a 4-module 8-core CPU should have been called a 4-core 8-thread CPU, one with dynamic allocation inside a core (a module in AMD jargon) only for the FPU, while the integer resources are allocated statically. With this correct description there would have been no surprise about the behavior of Bulldozer.

Part of the hate is also due to some engineering decisions whose reasons are a mystery even now. If you had polled a thousand randomly chosen logic design engineers before 2011, all or almost all would have said that they were bad decisions, so it is hard to understand how they could have been promoted and approved inside the AMD design teams.

For example, since the Opteron launch in 2003 and until Intel launched Sandy Bridge in 2011, the largest advantage in performance of the AMD CPUs was in the computations with large numbers, because the AMD CPUs could do integer multiplications much faster than the Intel CPUs.

The Intel designers recognized that this was a problem, and during the 2006-2011 interval they decreased, year after year, the number of clock cycles required for operations like multiplications and divisions, so that Penryn began to approach the AMD throughput per clock cycle, Nehalem and Westmere matched it, and Sandy Bridge achieved double the throughput of the old AMD CPUs.

While Intel worked diligently to improve the performance of their cores, what did AMD do?

Someone at AMD decided, for an unknown reason, that there was no need for Bulldozer to keep AMD's existing computational performance, and that it was enough to have integer multipliers with half the throughput of the previous AMD cores and only a quarter of the throughput of the Sandy Bridge competitor. (Intel had announced well in advance, more than a year before launch, that Sandy Bridge would double the integer multiplication throughput over Nehalem, and this was in any case an obvious trend in the evolution of their previous cores, so the higher performance of the competition could not have been a surprise for the AMD designers.)

The downgraded integer multipliers have crippled the performance of the new AMD CPUs for certain applications where their previous CPUs had been the best, while enabling only a negligible reduction in the core area.

price-to-performance is the last resort of a company that has failed at taking the performance crown.

Nobody cuts prices more than they have to, but everyone adjusts prices to where they need to go to sell the product. Bulldozer was priced low because it was genuine garbage, it was actually slower than Phenom in a lot of cases (which blows the "it was about price to performance!" thing out of the water - nobody regresses performance on purpose).

(and before people wind up about the obvious counterexample: Ryzen was priced low because a 1800X was genuinely a lot slower than a 5960X in productivity tasks due to latency and poor AVX performance, and got completely smoked in gaming. If they had tried to go head-to-head with Intel at $1000 pricing they wouldn't have sold anything because it would have been a far inferior package to what Intel offered, they had to cut prices by around half to make it a compelling offering. And even then it was not that appealing compared to, say, a 5820K.)

Companies need to make enough of a showing to attract consumers but if a company prices something super aggressively, there's often a catch. And that's bulldozer in a nutshell. Oh shit the product sucks. What can we charge for a mediocre "8-core" (sorta) that underperforms the 4-core i7? Offer it at i5 pricing and see if anyone bites. If they had managed to achieve good performance, they would have priced it appropriately.

(the other thing is - people prefer to make the comparison about the FX-8350, but that's not Bulldozer, that's Piledriver. Bulldozer was the FX-8150/FX-6100, which actually did outright regress performance vs a Phenom X6, and was priced relatively steeply due to "8 real cores". Bulldozer went up against Sandy Bridge, Piledriver was more of an Ivy Bridge/Haswell competitor, and that's where prices really started to drop. It isn't a huge difference but Intel was making some progress too in those days.)

Price chart: https://www.anandtech.com/show/4955/the-bulldozer-review-amd...

But as a consumer, all I really care about is price/perf (and maybe power and a few other variables). Far too much of the tech industry runs around talking about how great the top dog (this week) is because they bin, push the engineering margins and sell some golden chip that ends up being .0000001% of their product line for some crazy $$$$$.

During the early part of the Bulldozer timeframe AMD could provide a competitive part for much of Intel's lineup at a lower cost. It was only at the end, when they kept falling farther and farther behind, that it was a problem. For a few years there, you could actually _SEE_ in Intel's pricing where AMD's top part was, because there would be a bunch of parts all clustered below some number (say $200) and then there would be a big price jump between every part above that line.

And so AMD had a real problem when you went into the $RETAILER looking at a $600 laptop because while their laptop might have been better than the similarly priced intel, what you would hear is "amd sucks" and so people would actually pick the inferior product.

sure, buy what you want, and competition certainly brings down prices, I don't disagree.

But making a low-cost product was not what AMD set out to do at the outset, so that's not really a defense of the technical flaws in Bulldozer's design. Sure, when they realized it was a trainwreck, they cut prices. Everyone does that, though, and that wasn't plan A.

Nobody is going to go through the expense of R&D and design and tapeout and then just not sell the product because it sucks/"missed expectations". You adjust the price to wherever it needs to be to sell the product.

Even in laptop the bulldozer chips were way power-hungry (actually this matters a lot more than in desktop) and just not that good a performer.

It was Intel's CEO's job to smile and sell hyper-clocked 14nm chips going against TSMC 7nm and it was AMD's CEO's job to smile and sell bulldozers going up against sandy bridge. That's what officers of the company do, even when they know it's shit. You go to war with the army you have, not the one you want, and you go to market with the product you have, not the one you want.

Yeah, it's good, but the author forgets to mention some other bad chips from before the late 1990's

- The Intel i432 - too far ahead of its time, an Itanium for the 1980's. https://en.wikipedia.org/wiki/Intel_iAPX_432

- The TI TMS320 series of DSPs. So full of silicon bugs it hurt TI badly.

- The Transputer T9000 - very ambitious, but vapourware for so long it killed its parent company. https://en.wikipedia.org/wiki/Transputer#T9000

the Cell processor in the PS3 was not terrible in the PS3 and I doubt you ever worked on it. So talk about 'not the best-researched'. You can find many people singing its praises, including me.
Haha! I've spent months tuning code to run on the Cell, and I despise that thing.

Sony gave you 6 of the 8 SPE cores to use (I think they reserved two, but it's been ages). They are indeed very fast, however, they have no cache coherent access to main RAM and only 256k of memory for each element. So, you have to meticulously write DMA scheduling code to keep them fed. If you're a simpleton like me, you double buffer your SPE memory, cutting it in half, so 128k to work with, 128k for paging into, and you hope to be done paging before it's needed. Latency to memory is on the order of 2,000 cycles to first byte, but then they arrive fast.

So, what you do is decompose your problem into data streams that can be crunched through, but in such a way that you minimize the need to randomly access much memory. It's often cheaper to recompute things locally than to fetch them from RAM. Random access into your RAM is pointless, so you have to marshal all your input into DMA buffers, do some work, marshal all your output into other DMA buffers, and send back to host CPU.
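
A rough sketch of that double-buffering pattern in plain C, for anyone who hasn't seen it. This is not Cell SDK code: `fake_dma_get` is a hypothetical stand-in that just does a `memcpy` (a real DMA would be asynchronous and you would wait on it before reading the buffer), and the buffer sizes are made-up numbers chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_STORE_FLOATS 1024                      /* hypothetical local store size */
#define HALF ((size_t)(LOCAL_STORE_FLOATS / 2))      /* half to work in, half to page into */

/* Stand-in for an asynchronous DMA "get". Here it is a plain memcpy so the
   sketch runs anywhere; real hardware would queue it and signal completion. */
static void fake_dma_get(float *local, const float *main_ram, size_t n)
{
    memcpy(local, main_ram, n * sizeof(float));
}

/* Sum a large array by streaming it through two half-size local buffers. */
float stream_sum(const float *main_ram, size_t n)
{
    static float local_store[LOCAL_STORE_FLOATS];
    float *buf[2] = { local_store, local_store + HALF };
    float total = 0.0f;
    size_t done = 0;
    int cur = 0;

    assert(main_ram != NULL);

    size_t cur_len = n < HALF ? n : HALF;
    fake_dma_get(buf[cur], main_ram, cur_len);       /* prime the first buffer */

    while (cur_len > 0) {
        size_t next_off = done + cur_len;
        size_t next_len = (n - next_off) < HALF ? (n - next_off) : HALF;

        /* Start fetching the next chunk into the idle buffer first... */
        if (next_len > 0)
            fake_dma_get(buf[cur ^ 1], main_ram + next_off, next_len);

        /* ...then crunch the chunk that is already local. */
        for (size_t i = 0; i < cur_len; ++i)
            total += buf[cur][i];

        done = next_off;
        cur_len = next_len;
        cur ^= 1;                                    /* swap buffers */
    }
    return total;
}
```

The whole game is making the inner loop take at least as long as the fetch it overlaps with; if it doesn't, you stall waiting on memory.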

Anyhow, I got this working. Meshes were being skinned at very high rate, but it was very frustrating. The PPE was really slow, so you had to offload as much as you could to those SPE's. But hey, I may be complaining, but it sure beats dealing with the "Emotion Engine" on the PS2. I can tell you which emotion that engine brings up.

In the early years, the SPUs were not all functional due to the fabbing process. The ones that had all 8 functional ended up in servers, and the ones with about 6 ended up in PS3s. This still happens all the time with clock speeds, and core counts on modern processors today. I’m sure the fabrication process improved over time, but they disabled the 2 cores to maintain backwards compatibility.

Unrelated: Every time I’m reminded about Cell I’m reminded of the OtherOS fiasco. I purchased a PS3 for the processor solely and I was very upset when I only got a $2 check for it. I never cashed it.

Same for me. I was very angry that Sony got away with that.
I still have a launch-edition PS3 on firmware 1.01 that I got on launch day (wife and I were fortunate enough to be able to buy two and stash one). I've lost all kinds of stuff in moves and etc. since, but that thing will have to be pried from my cold dead hands.
Sounds like you learned a lot and are a better programmer from it. Pretty much everything hasn't changed and you can either hope Moore's law bails you out somehow or you can take what you have learned and apply that to the reality of whatever hardware you happen to be optimizing, CPU or GPU. Sorry it's painful but to eke out max performance it's going to be hard at times.
For every person singing its praises, there are dozens of game developers who were singing with gladness when it was gone. The PS3 devs I've spoken with (you aside) universally hated the platform and spoke of how much more dev time it took to launch games on the platform to achieve mediocre results.

If the chip were so wonderful to work on, then it would still be in use today as the theoretical performance per area beats everything else by a wide margin.

Roadrunner was built in 2008. It would still be just barely off the top 500 list in 2021, but was decommissioned just FIVE years later in 2013. Its x86 replacement was already underway in 2010 TWO years after its launch.

I'm glad you got to work with the architecture you loved for so many years, but I think the rest of the world disagrees with your assessment.

It probably was spectacular once you knew how to work with it. Like the Atari Jaguar though, getting the performance needed out of such a highly parallel architecture took a lot of time and investment. With cross-platform games really taking off during that time, it was a strategic mistake IMO.
That's an enthralling tale, but perhaps you could share why you feel it deserved praise-singing to begin with, and also what titles you worked on, considering many developers were complaining about it when it was current console architecture, and you don't even need to do much of a Google Search to find people bitching about it.
>Google Search to find people bitching about it.

Seriously? You can find people bitching about anything on Google Search. The fact is most people just weren't prepared for multi-core data oriented programming in 2006.

List my titles: no you first

> You can find people bitching about anything on Google Search

And in this thread, you can find credible people with specific complaints about the Cell processor.

> List my titles: no you first

That’s unfortunate. If you’re not full of shit, how could anyone possibly know?

> List my titles: no you first

One of the cardinal rules of argumentation is that the burden of proof is upon the person making the claim.

You've made it. Now back it up.

I guess you must have been on a college debate team.
> You can find many people singing its praises, including me.

Until today, I’ve never once seen someone “singing its praises” that’s actually written code for one. At best, they’d curse it under their breath while saying it had its benefits. Usually however it was a full throated rant about how bad the experience was.

It was surprisingly useful for some high performance computing niches. It was in a weird time. FPGAs were available but weren’t as performant as they are today. GPUs were around but not nearly as powerful or flexible given some workloads.
Every single one of them came out a better coder. They might have been dragged kicking and screaming to the multi-core but they would've had to get there in the end.
> Every single one of them came out a better coder.

Sure that may be true, but that does not mean they are singing its praises either.

Just look at this very post on HN where folks who’ve written code for it have commented on the experience, how many would you say are:

- singing its praises (you)

- cursing it under their breath while saying it had its benefits (few)

- full throated rant about how bad the experience was (few)

IDK - I guess most experienced low level coders hate computers, it doesn't matter what the CPU. People are lazy. I understand that it was hard but it doesn't make it a WORST CPU EVER MADE
Mod -1: Rude
Personally I can find something to like in most architectures.

Cell (for example) was an asymmetric/hybrid multicore CPU; Apple Silicon is perhaps a modern example of asymmetric performance vs. efficiency cores, and also features special-purpose accelerator cores such as the neural engine.

The 432 had capability-based addressing. Speed-over-security has had a good run, but with some disastrous consequences. We may be seeing the return of capabilities with CHERI/ARM.

The 960 was an early superscalar design, supported tag bits, and was also a successful product.

Obligatory mention:

"RISC instruction sets I have known and disliked."

https://www.jwhitham.org//2016/02/risc-instruction-sets-i-ha...

https://news.ycombinator.com/item?id=11607119

I might also say that Sun's UltraSPARC was constantly beaten by Fujitsu SuperSPARC. It would have been better to outsource.

SuperSPARC was an earlier TI manufactured part. Fujitsu was (and I think still is) SPARC64, which was a nice series of parts, originally designed by HAL. I used to own a Fujitsu server - fast and built like a brick outhouse.
I remember evaluating the 960 for an embedded router project and it was quite a nice ISA. Plus the 66 Mhz CA part was fast for the price at the time.
The i960CA was the one of the first superscalar microprocessors. (I wrote a third-party commercial instruction scheduler for it, that operated on assembly code.) It was pretty nice, certainly in line with the other 32-bit RISCy ISAs of the time. My impression is that its relative lack of success was due to Intel internal politics.
Yes, within Intel it was thought that management would not push the 960, since if it did so the press would pick it up as validation that RISC is better. But for embedded applications it was very successful; I was shipping hundreds of thousands of them per month at one point.
i860 did well in embedded applications and for awhile was the mainstay in most RAID controllers and network communication processors. Not what Intel wanted from it but it did have a long life in such applications. I spent many years working on the i860 and i960 and learned to live with its oddities.

As for the Cell, it was an overly complex architecture that had remarkable performance under very optimized code. The hope was hand tuned libraries would address this; and compiler optimizations would take care of the rest. Neither happened in a meaningful way. We did two major projects with the Cell using it for real-time HDTV compression/direct broadcast applications.

Another one not on the list was the inmos Transputer. Again similar to the Cell; very complex and fast for its time; but not easy to achieve this performance. That was my first job as an EE - we used it on a GPS receiver ISA card in the early days of GPS. It was a good choice as very fast and could keep up with the signal processing that allowed us to roll code updates to add major features as various changes to GPS signals were rolled out (P-code on L2, SA being turned off, and later CA code on L2 being unencrypted). Our competitors had to redesign ASICS to get these new features which means long product cycles and hardware replacement.

Today I find myself doing a lot on the M1 series, as well as Epyc. Now you can give zero shits about clean optimized code and it still runs amazingly fast. Last time I had to do assembler or intrinsics was many many years ago - and I sort of miss that intimacy with the hardware to get the most out of it.

I think you mean 960 in RAID and comm controllers. The 860 had incredibly bad, almost unbelievably slow context switches. You’d never ever use it in a controller. A dedicated render pipeline is pretty much all it was good for, for some value of ‘good’.
I had the same reaction. The i860 and i960 were very different beasts. I owned an 860-based Oki/Stardent workstation, bought for peanuts at the latter company's fire sale, for a while. Later I found the 960CA (in particular) in many storage/network devices. So I kind of know both, but I would never speak of them as if they were the same. Other than sharing a corporate logo, they had little to do with one another.
At least the 960 was somewhat usable. Many variants were created, and several were widely used in embedded products for quite a few years. The 860, however, was Just Crap. Full stop. End of story. IIRC it had weird double-instruction modes that compilers just couldn't handle, and if you used them anyway (for very necessary performance) then handling exceptions properly was all but impossible. Definitely gets my vote for worst ever.
I worked on an unreleased third-party C compiler for the i860. It wasn't that compilers couldn't handle the double-issue float mode, it was more that it was worthless in real-world code due to the entry/exit latency. It had high performance on paper but not in reality, which was exactly the lesson that Intel did not learn for the Itanium.
Interesting that Intel has such an impressive record of failed designs. Itanium, 860, and iAPX 432 - all anti-classics of their time.
I remember articles from Byte hyping it (the 860), and also adverts for accelerator cards.

It runs rings around workstations!

"We now know that core 2 dropped all kinds of safety features resulting in the Meltdown vulnerabilities."

Curiously, every other out-of-order chip designer except for AMD also designed CPUs with Meltdown flaws. That's per their own documentation ARM, IBM both Power and mainframe, SPARC, and I think MIPS but they weren't entirely clear about it.

Yes, and no mention of the Transmeta Crusoe either.
It seems like Intel was in some ways like Microsoft. Their revenues were so high that they could survive spectacular failures and still keep going.
> The i960 was take 2 and their joint venture called BiiN also shuttered.

I have an old X-11 terminal I believe has an i960 in it. I’m shocked that thing was capable of running CDE desktops when it stutters on FVWM over a network much faster than it ever was intended to see.

What games were able to make full use of the Cell?
A different twist on the Itanium: technically bad but ended up as a strategic win for Intel.

SGI, Compaq and HP mothballed development of their own CPUs (MIPS/Alpha/PA-RISC) as they all settled on Itanium for future products.

After Itanium turned out to be a flop, those companies adopted x86-64 - Intel killed off 3 competing ISAs by shipping a bad product.

Very true, it was the end of the DEC Alpha as Compaq chose the Itanic.
Itanium was the OS/2 of chips, Microsoft used OS/2 to get IBM chasing a dead end while they baked Windows NT & 95 until their lead was secured.
Interesting take!
What does "worst CPU" mean? I think that it means, regardless of market success, the CPU that most hindered, indeed retarded, progress in CPU engineering history. In this regard, #1 and #2 are clearly the 8088 and 80286 respectively.
Agreed. I think Itanium gets a lot of unnecessary flak. It really tried some exciting new ideas and clean concepts. Not all of those concepts were much of a win, but with the first chip arriving years late, it's no wonder it was perceived as underwhelming from the get-go (that would happen to any chip that's late).
Funnily, I feel like SIMD instructions are slowly reinventing what the itanium did out of the box.

I think a modern compiler could likely do a good job with itanium nowadays. However, when it first came out, there simply wasn't the ability to keep those instruction batches full. Compiler tech was too far behind to work well with the hardware.

The problem is, compile-time instruction scheduling for VLIW vs an out-of-order, superscalar processor is inherently unequal because there is an information gap. At compile time you cannot see the actual dependences, and have to statically schedule for the worst case. You can do great on regular, array-based code. But VLIW can never beat OOO superscalar processors on irregular or pointer-chasing code, because there is unequal information. On those codes, the information gap can't ever be overcome, no matter what compiler technology you have. If you don't have access to the at-runtime data values (and you never will have that), there is no static schedule that can compete.
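
A small C illustration of the two kinds of code being contrasted (my own sketch; the `node` struct and both function names are hypothetical). In the first loop every load address is known from `i` alone; in the second, the next address only exists once the previous load has come back.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;   /* address of the next load is unknown until this load returns */
};

/* Regular, array-based code: every address is a simple function of i, so a
   compiler can schedule (or vectorize) many iterations ahead of time. */
long sum_array(const int *a, size_t n)
{
    long total = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

/* Pointer chasing: each iteration's load address comes from the previous load.
   A static schedule has to assume some latency for p->next up front, while the
   real latency (cache hit or miss) is only known at run time. */
long sum_list(const struct node *p)
{
    long total = 0;
    while (p != NULL) {
        total += p->value;
        p = p->next;
    }
    return total;
}
```
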
Both VLIW & OOO superscalar make different tradeoffs to achieve a similar goal - maximize ILP.

With VLIW the compiler unrolls the code and tries to find the parallelization but has no control over runtime stalls and results in larger code size. The complexity is in the compiler while the machine is simpler.

With an OOO superscalar machine, you have to dedicate a significant piece of HW to stuff that would be easily done by the compiler. The advantage is you can get reduced code size and better performance for non-linear code.

I'm not sure I'd say many compilers are even that great with SIMD these days and that is easier than what the itanium was asking of compilers.

There are real gains to be had by using SIMD but it tends to be massively parallel data processing workloads with specially written SIMD code or even hand tuned assembly (image/video processing, neural networks) not just feeding in a source file and compiling with the SIMD flag to then realize meaningful gains.

The reverse is true.

SIMD is harder because you have to have a uniform operation across a set of data.

Imagine a for loop that looks like this

    int[] x, y, z;
    int[] p, d, q;

    for (int i = 0; i < size; ++i) {
       p[i] = x[i] / z[i];
       d[i] = z[i] * x[i];
       q[i] = y[i] + z[i];
    }
For SIMD, this is a complicated mess for the compiler to unravel. What the compiler would LIKE to do is turn this into 3 for loops and use the SIMD instructions to perform those operations in parallel.

The itanium optimization, however, is a lot easier. The compiler can see that none of p, d, or q depend on the results of the previous stage (that is q[i] doesn't depend on p[i]). As a result, the entire thing can be packed into a single operation.

Now, of course, modern OOO processors can do the same optimization so maybe it's not a huge win? Still, would have been something worth exploring more (IMO) but the market forces killed it. Moving that sort of optimization out of the processor hardware and into the compiler software seems like it could lead to some nice power/performance benefits.

That loop is actually nicely vectorizable, at least assuming that you replace int with float (there is no integer division vector instruction on x86).

All of the array accesses are uniform, so the resulting vector code is roughly:

  for (i = 0 .. size by vector width) {
    r0 = vector load x[i..i + vw]
    r1 = vector load y[i..i + vw]
    r2 = vector load z[i..i + vw]
    r3 = r0 / r2
    r4 = r2 * r0
    r5 = r1 + r2
    vector store r3 to p[i..i + vw]
    vector store r4 to d[i..i + vw]
    vector store r5 to q[i..i + vw]
  }
(and probably unroll the loop for good measure). No need to fission the loop to vectorize here.
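
For the curious, here is roughly what that looks like as real C with `float` in place of `int`. The function name and the use of `restrict` are my own assumptions: `restrict` tells the compiler the arrays do not overlap, which is typically what an auto-vectorizer needs before it will keep all three statements in a single vector loop. Whether a given compiler actually vectorizes it depends on the compiler and flags, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Three independent element-wise operations in one loop. With no overlap
   between inputs and outputs, each statement maps onto one vector op. */
void fused_loop(size_t size,
                const float *restrict x, const float *restrict y,
                const float *restrict z,
                float *restrict p, float *restrict d, float *restrict q)
{
    assert(x && y && z && p && d && q);
    for (size_t i = 0; i < size; ++i) {
        p[i] = x[i] / z[i];
        d[i] = z[i] * x[i];
        q[i] = y[i] + z[i];
    }
}
```
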
>> For SIMD, this is a complicated mess for the compiler to unravel

this is trivially vectorizable for simd, would fit nicely in a vliw packet too. The only issue is if there was a runtime memory stall with any access, then the entire pipeline would stall.

with predication, modern SIMD can even vectorize if conditions like the one below.

    int[] x, y, z;
    int[] p, d, q;

    for (int i = 0; i < size; ++i) {
       p[i] = x[i] / z[i];
       d[i] = z[i] * x[i];
       if(i>n) {
         q[i] = y[i] + z[i];
       } else {
         q[i] = y[i];
       } 
    }
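
Here is a hedged C sketch of what predication amounts to for the `q` assignment (the function name is made up, and I've used `float` and left out the `p` and `d` lines for brevity). Both outcomes are computed for every element and a per-element condition selects one, which a vectorizing compiler can lower to a masked or blend operation rather than a branch.

```c
#include <assert.h>
#include <stddef.h>

/* If-converted form of:
     if (i > n) q[i] = y[i] + z[i]; else q[i] = y[i];
   Each element computes both candidates, then a per-element condition picks
   one. That is the per-lane behaviour of a predicated/masked vector op. */
void masked_add(size_t size, size_t n,
                const float *restrict y, const float *restrict z,
                float *restrict q)
{
    assert(y && z && q);
    for (size_t i = 0; i < size; ++i) {
        float taken     = y[i] + z[i];
        float not_taken = y[i];
        q[i] = (i > n) ? taken : not_taken;   /* a compiler can lower this to a select */
    }
}
```
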
VLIW architecture is so bad that AMD and Nvidia couldn't make it work well with embarrassingly parallel graphics code. AMD first moved from VLIW-5 to VLIW-4 because they couldn't find enough data to reliably keep unit 5 busy.

AMD then followed Nvidia into the world of SIMD/SIMT because it offered better real-world performance for the majority of applications.

VLIW has been tried repeatedly only to be replaced with something that worked better.

Also was very amusing when we shipped test boxes...that sucker ran really hot, and I got one call about the Itanium box asking for tech support help and the report was that the box was on fire.
Of course, what constitutes "worst" is a difficult question.

Signetics made the 2650, a nice processor with a highly regular architecture with a condition code register. After every arithmetic operation including loads and stores the ALU updated the condition code register.

The National 32032 processor was a wonderful part with a clarity of design that made it a great choice for a workhorse processor. Unix running on the machine was stable and efficient except that every few weeks there would be a disastrous crash. With a tremendous amount of effort the source of the problem was found: a race condition in the interrupt control logic that returned from the wrong stack and scribbled over memory.

The Intel i860 exposed the internal computational pipeline to the programmer. Context switching was complicated by the conflict of real-time operating performance requirements and a deep pipeline with no way to grab the context and drain the pipeline. Eventually a dedicated team got a Unix OS running on the part, but it performed poorly.

The Maspar MP-1 was a SIMD machine. It was cool to test new library functions by seeing if, say, sqrt(x)*sqrt(x)==x for all floating point numbers. Customers wanted the Maspar machine to be timeshared, but the architecture made it difficult to do since the CPU state was very large and memory was not mapped.

Intel's 8048 (and simplified versions like the 8021 and enhanced versions like the 8051) did not perform as well in terms of speed or code size as many of the competing microcontrollers. The competition offered very simple asymmetric complex architectures which could be programmed (possibly with external hardware assists) to accomplish embedded tasks, but only with several days or weeks of effort. The Intel part was not quite as efficient in memory use and speed, but could be programmed in an afternoon. And another engineer/programmer could look at the code and understand it without much deep thought.

The Motorola 68000 was a wonderful machine with a clear instruction set. But the original 68000 could not support virtual memory.

There have been all sorts of different architectures tried which seem strange today but came about because the architecture was thought to provide an engineering solution to an immediate problem. There was a time when register machines were thought to be a bad architecture, far inferior to a simple stack architecture.

I would vote for the Pentium IV for all the reasons mentioned in the article, but more importantly because it was initially coupled with Rambus memory. Intel pushed that tech so hard to try and squeeze out AMD. Super high frequency, high bandwidth, high expense memory with terrible latency was not the future anyone wanted. Intel's hubris back then was off the charts.

I know intel wanted Itanium to succeed for the same reasons, but the PIV came very close to home since it actually shipped for consumers. Oddly enough, Extreme Tech was a huge shill for Intel back in those days. Funny they don't mention that in this article.

I'm currently building a homebrew system built on the TMS99105A CPU, one of the final descendants of the TMS9900.

It's a nifty little CPU. There's a lot of hidden little features once you dig in. It can actually address multiple separate 64k memory namespaces: data memory, instruction memory, macroinstruction memory, and mapped memory with the assistance of a then-standard chip. Normally these are all the same space and just need external logic to differentiate them. There's also a completely separate serial and parallel hardware interface bus.

The macroinstruction ("Macrostore") feature is pretty fun. There's sets of opcodes that will decode into illegal instructions that, instead of immediately erroring out, will go looking for a PC and workspace pointer (the "registers") in memory and jump there. Their commercial systems like the 990/12 used this feature to add floating point and other features like stack operations.

Yup, there's no stack. Just the 16 "registers," which live in main memory. There are specific branch and return instructions that store the previous PC and register pointer in the top registers of the new "workspace," allowing you direct access to the context of the caller. The assembly language is simple and straightforward with few surprises, but it's also clearly an abstraction over the underlying mechanisms of the CPU. I believe this then classifies this CPU as CISC incarnate.
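The workspace mechanism described above can be sketched in a few lines. This is a toy model, not a faithful emulator: the class name `TMS9900Sketch` is invented, memory is word-addressed for simplicity (the real part is byte-addressed with registers at WP + 2n), and it assumes the TMS9900-family convention that BLWP saves the caller's WP, PC and status in R13, R14 and R15 of the new workspace.

```python
class TMS9900Sketch:
    """Toy model of workspace-pointer context switching (word-addressed)."""

    def __init__(self, memory_words: int = 32768) -> None:
        self.mem = [0] * memory_words
        self.wp = 0   # workspace pointer: where R0..R15 live in memory
        self.pc = 0   # program counter
        self.st = 0   # status register

    def read_reg(self, n: int) -> int:
        return self.mem[self.wp + n]

    def write_reg(self, n: int, value: int) -> None:
        self.mem[self.wp + n] = value & 0xFFFF

    def blwp(self, vector: int) -> None:
        """Branch and load workspace pointer via a two-word vector."""
        new_wp, new_pc = self.mem[vector], self.mem[vector + 1]
        old_wp, old_pc, old_st = self.wp, self.pc, self.st
        self.wp = new_wp
        # The caller's context lands in the top registers of the new workspace.
        self.write_reg(13, old_wp)
        self.write_reg(14, old_pc)
        self.write_reg(15, old_st)
        self.pc = new_pc

    def rtwp(self) -> None:
        """Return with workspace pointer: restore the saved context."""
        self.st = self.read_reg(15)
        self.pc = self.read_reg(14)
        self.wp = self.read_reg(13)


if __name__ == "__main__":
    cpu = TMS9900Sketch()
    cpu.wp, cpu.pc = 0x100, 0x200
    cpu.write_reg(0, 42)                        # caller's R0
    cpu.mem[0x10], cpu.mem[0x11] = 0x300, 0x400  # vector: new WP, new PC
    cpu.blwp(0x10)
    # The callee can reach the caller's R0 through the saved WP in R13.
    print(cpu.mem[cpu.read_reg(13) + 0])
    cpu.rtwp()
```

Because the saved WP is just a memory address, the callee can read or modify the caller's registers directly, which is the "direct access to the context of the caller" mentioned above.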

There are some brilliant and insane people on the Atari Age forums! One of them managed to extract and post the data for a subset of those floating point instructions, and then broke it all down and how it all worked. Some are building new generations of previous TMS9900 systems. One of them is replicating the CPU in an FPGA. A few others are building things like a full-featured text editor and, of course, an operating system.

I've learned a hell of a lot during this project. I've been documenting what I'm doing and am planning to eventually make it into a pretty build log. I think this is a beautiful dead platform that deserved better.

I have a soft spot for that CPU. My first computer was a TI99/4a when I was about 14 or 15. I started with BASIC, then learned assembly language on that machine. I give it a lot of credit for starting the trajectory my future took.
The TI's serial I/O bus takes the prize, for me.
Man, that 6x86 CPU is still getting the short end of the stick nearly three decades later despite being a pretty solid chip.

So, first it generally had a higher IPC than anything else available (ignoring the P6). So, the smart marketing people at cyrix decided they were going to sell it based on a PR rating which was the average performance on a number of benchmarks vs a similar pentium. AKA a Cyrix PR166 (clocked at 133Mhz) was roughly the same perf as a 166Mhz pentium. Now had they actually been selling it for a MSRP similar to a pentium 166 that might have seemed a bit shady, but they were selling it closer to the price of a pentium 75/90.

Then along comes quake which is hand optimized for the pentium's U/V pipeline architecture and happens to use floating point too. And since a number of people had pointed out the Cx86's floating point perf was closer in "PR" ratings to its actual clock speed suddenly you have a chip performing at much less than its PR rating, and certain people then proceeded to bring up the fact that it was more like a 90Mhz pentium in quake than a 166Mhz pentium (something i'm sure made, say intel, really happy) at every chance they get.

So, yah here we are 20 years later putting a chip with what was generally a higher IPC than its competitors on a "shit" list mostly because of one benchmark. While hopefully all being aware that these shenanigans continue to this day, a certain company will be more than happy to cherry pick a benchmark and talk up their product while ignoring all the benchmarks that make it look worse.

Now as far as motherboard compatibility, that was true to a certain extent if you didn't bother to assure your motherboard was certified for the higher bus rates required by the cyrix, and the other being it tended to require more sustained current than the intels the motherboards were initially designed for. So, yah the large print said "compatible with socket7" the fine print later added that they needed to be qualified, and the whole thing paved the way for the super socket7 specs which AMD made use of. And of course lots of people didn't put large enough heatsink/fans on them which they needed to be stable.

So, people are shitting on a product that gets a bad rep because they were mostly ignorant of what we have all come to accept as normal business when you're talking about differing micro architectural implementations.

PS: Proud owner of a 6x86 that cost me about the same as a pentium 75, and not once do I think it actually performed worse than that, while for the most part (compiling code, and running everything else including Unreal) it was significantly better than my roommate's pentium 75.

6x86 PR200 was really fast in Linux of the day. The fact that it had 256 kB cache also helped.
Which brings up another fact, which was that microsoft disabled the cache on cyrix processors in one of the versions of windows NT (3.51 or 4?). And so you had to download a driver from Cyrix to turn it back on. But that didn't keep various people from claiming its perf sucked in windows NT too.

IIRC the official excuse when this became public was that a MS engineer turned it off because one of their test machines couldn't complete a stress test with it enabled, but later it turned out the root cause was a bad motherboard. The curious part being that it didn't result in MS immediately issuing a hotfix to turn the cache back on.

edit: found one of the articles mentioning this. https://www.tomshardware.com/reviews/bananas,9.html

Apparently it was just writeback mode that got disabled, either way that link mentions a 30% perf hit.

OK then, I was very heavily involved in both the item in the Intro (the flaw in the first Pentium, I was the production control guy in the sole source factory) and #1 on the list (Itanium, I was trying to get hardware companies to work with their software suppliers to port to the new architecture using a very significant budget).

The common thread was Intel marketing pushing something that was a dog for marketing reasons

1. It is very amazing not in a good way when you think you have enough inventory but someone from HQ calls up the warehouse and has the older CPUs crushed by a bulldozer (you don't want to throw them out, they are quite usable)

2. Was amazing that sucker ran so hot tech support got a call about test boxes catching on fire

Anyone remember Pentium II and their new <del>sockets</del> cartridges?

That didn't last long. Like what, one generation?

Good.

(saying that, but I remember purchasing a dual Pentium II motherboard for 2 400 MHz CPUs to speed up 3DStudio 4 renderings under Windows NT4... xD)

The reason why they went down the slot route was for packaging reasons.

Cache was still external at that point. There would be performance benefits from bringing it on die, but larger chips are more expensive to make & using two smaller dies (one for CPU & one for cache like the Pentium Pro) is still quite expensive.

The middle ground was to put the CPU and cache on a single PCB, so you end up with a cartridge form factor. By the time the next generation rolled around it was possible to put the CPU and cache on the same die at a reasonable cost (Moore's law), making the cartridge form factor obsolete.

There were Pentium IIIs in slot form as well - I encountered one at work many years ago. AMD also had a "Slot A" version of the original Athlon, which was quickly ditched.
Pretty sure there was a slot Pentium 3.

I thought it was cool at the time, made me think of a NES cartridge.

It was called Slot 1. The first computer I built for myself used it, circa 2001.
Ze Fuji Quicksnap CPUs.

(Single use analog pocket cameras)

Any more details on those? I can't find any info on the CPUs inside
It was an analogy, because of the form factor, which was very similar at the time. Fuji still makes things labeled as quicksnap for the same purpose, but they look very different now.
I'm so ashamed to have owned a Cyrix, a P4, and an AMD Bulldozer.

They were all awful.

I had two Bulldozers. Bulldozer wasn't competitive at the top end, but I always found Athlon chips to be cheaper than their performance equivalent Intel part. So the fastest AMD chip would be cheaper than the third fastest Intel part. Still a good value. Terrible for AMD's bottom-line though.
Fair enough. I feel like I got decent value out of my Athlon. At the time, it sure seemed like a gross power hog. I suspect I would be shocked by its modest TDP if I went and looked back at specs.
I’ve had a P4 and I didn’t consider it “awful”.

It was without a doubt the fastest CPU I had ever had at the time, but boy did it generate heat and need cooling.

That machine sounded like an always-on vacuum cleaner.

I owned a Pentium 4 as well (oh boy did I save for a long time as a teenager to afford that). It wasn't really as bad as what this article claims. On the other hand, the dual-core parts probably really are that bad.
The article did call out in particular the late generation P4s with the super duper extra long pipeline that simply couldn't keep themselves fed when working with anything but synthetic benchmarks.
Intel was too expensive for me, so I ended up buying a Cyrix => performance (floating point) was terrible in Falcon 3, I was sooo sad - but on the other hand that gave me until today the push to really focus on details before taking a decision => thank you Cyrix for having changed my life hehe.
Nothing to be ashamed of on the cyrix and AMD, both were better price/perf than what you would have bought with the same money from intel. The same can't be said of the P4, which was right in the middle of AMD giving intel a good solid whumping.
For the P4 I understand (legend says that the P3 was faster at the same clock rate, and that's why there are no P4s at the same speed as the P3). But Cyrix and Bulldozer?
I visited this page hoping to see the PowerPC 970 top of the list, but all it gets is a "Dishonorable Mention". After going through three PowerMac G5s, all of which had their processors die within 4 years, I still bear a grudge.
Surprising; never knew anyone whose G5s died on them (the systems, sure, but not the CPUs). My dual '04 cpus are still chugging along just fine.
The hotter running G5s had liquid cooling that would inevitably leak and corrode everything.
I'm pretty sure they have a whiskers or wire bonding problem too, and the water blocks clog.

I picked one up that was labeled "crashes while booting" or some such from the goodwill near my house for something like $20 some years back. Brought it home, and noticed that the water block got burning hot when it was turned on, and the tubes feeding the radiator were room temp. I broke the water loop open and flushed it out, and a whole bunch of white crap came out of the block. So, whatever the coolant apple shipped with it, it was clogging the block. Reassembled the whole thing, had a terrible time getting the air out of the system, but in the end it ran pretty good for a while until I left it off for a few months, and it refused to boot. In an act of desperation I hit it with the heat gun and that magically fixed it for a few weeks, and it did the same thing like a year later when I tried to boot it again.

I ran some benchmarks on it to compare with a POWER4 I also have, and yah lots of clock, shitty IPC. It was really cool in 2001, but by the time apple was putting them in mac's they were pretty terrible in comparison to the amd/intel's.

For us non Apple users, how is that possible? I don't think I've ever had a CPU die other than by lightning.
I don't know what OP was running but the G5 iMacs were some of the machines suffering from the early 2000s capacitor plague[0]. The power supplies and power regulation on the logic boards would die on those all the time. If you were lucky it was just the power supply but the problem usually needed a PSU and logic board swap.

[0] https://www.cnet.com/culture/pcs-plagued-by-bad-capacitors/

The processor in a G5 PowerMac came on a card that had the VRMs, capacitors, and a bunch of other stuff on it. It was basically like a tiny motherboard that attached to your motherboard.
I was imagining lightning struck the cpu specifically, leaving the rest intact? Quite the precision.
Oh no, the last time this happened there were definitely other casualties. The motherboard was left in a particular state of undeath, where it wouldn't quite power on. But if you jumped the ATX header it'd sort of attempt to boot and give some beeps.

After that I added a bunch of grounding to my house and I haven't had that much damage in one lightning strike before.

If I remember correctly it didn't have the biendian capability of the G4 so Virtual PC wouldn't run.
Virtual PC for Mac did get an update to run on the G5.
Yeah, but it had some performance issues. The 970 was such a bummer. I read the book The Race for a New Game Machine, and the crap show around Apple, the 970, and especially the Cell was just so infuriating.
I wondered what happened to the head of Cyrix, Jerry Rogers. He died 2 years ago:

https://obits.dallasnews.com/us/obituaries/dallasmorningnews...

For a bit of time, I ran an over clocked FX 8320 and crossfire 7970's. The heat that machine put out was tremendous. I only had a wall mounted AC unit so I had to practically take my shirt off when I loaded it up.
Ah yes, AMD/ATI crossfire. I had nearly forgotten that was a thing...
This was back when I was a student and had the FX and one GPU. Getting an internship meant that I had the money for an upgrade, and the cheapest, most straightforward was to get a second GPU, or so I thought. Wasn't even that cheap because I had to put both the GPUs under water to keep them from overheating when both in the same PC.
My first PC had a cyrix 333Mhz CPU. Ran just fine! But I was learning c in Borland turbo c and djgpp so it didn't have to do much. Running java on it... Well that wasn't fun with the 32MB RAM.

Worked on itanium too. It was more amazing Microsoft actually had support for it.

I've owned 4 or 5 of the CPUs on that list over the years. I'm sure there are worse.
Cyrix wasn't the first company to build SoC, Acorn was.
You are referring to the ARM 250 chip in the Acorn A3010, A3020, and A4000 https://en.wikipedia.org/wiki/Acorn_Archimedes
Yes, thank you!
"Note: Plenty of people will bring up the Pentium FDIV bug here, but the reason we didn’t include it is simple: Despite being an enormous marketing failure for Intel and a huge expense, the actual bug was tiny."

The fact that the fault was tiny and that few people were affected is definitely NOT the point.

The so-called Pentium 'bug' was the result of fundamentally terrible engineering on Intel's part in that the underlying design wasn't fit for purpose - it wasn't just a bug.

It seems to me the authors of this story do not understand that what Intel did was fundamentally wrong, in that its math processing was flawed by design from the outset; otherwise they would have included the Pentium in their list.

In order to achieve increased math processing speed, Intel broke mathematics algorithms down into part algorithm and part lookup tables - that is instead of having mathematics algorithms complete the whole task (which is the logical way of doing things). If the mathematics algorithm were wrong then every calculation would also be wrong and thus the problem obvious from the outset. Adding a lookup table makes calculations faster but one would then have had to test every combination in the lookup table - and Intel didn't.

Look at the problem like this - think of a set of log or trig tables, now think of the implications if one of those table entries is incorrect. What Intel did was deliberate cheating and it failed to get away with it. Intel would have known this from the outset and thus the problem was an integral design fault rather than a bug.

Intel knowingly implemented a design that had flawed data integrity at its most fundamental level. What Intel did was so nasty that it's hard to think of how it could have made matters worse than if it had deliberately tried to introduce a fault.

In my opinion, any company that would stoop to such low ethical tactics as Intel did with the Pentium's design would have demonstrated that it cannot be trusted - and I've never trusted Intel from that point onward.

If anyone ever needs a reason for why processors should have open design architectures that are subject to third-party scrutiny then this is the quintessential example.

This is a bit hyperbolic. Intel implemented a known and popular algorithm (SRT [1]) with a standard LUT for the bit patterns expected in IEEE754 FP numbers. They were not the first, last, or only microprocessor design firm to do so. A fault in a script that copied the LUT values to the machines that program the PLAs as part of the manufacturing process led to 5 missing values in the LUT (set to 0), out of 1066 entries.

There's a great writeup with the results of Intel's internal investigation [2], which outlines the challenge in testing production chips for this sort of bug. A key point:

> The fraction of the total input number space that is prone to failure is 1.14 x 10^-10.

So around 1 in 9 billion possible numerator/denominator pairs exhibit the bug. Testing 9 billion double-precision FDIV divides on a 60MHz Pentium would take almost four days, if my math checks out and the CPU could do 2.5 billion divides per 24 hours.

[1]: https://en.wikipedia.org/wiki/Division_algorithm#SRT_divisio...

[2]: https://users.fmi.uni-jena.de/~nez/rechnerarithmetik_5/fdiv_...
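The arithmetic behind those figures can be checked quickly. The sketch below takes the quoted failure fraction and the assumed 2.5 billion divides per day at face value; the function names are made up for illustration. It also includes the widely circulated 4195835 / 3145727 operand pair, which is not from this thread: on a correct IEEE 754 FPU the residual is zero, while affected Pentiums were reported to return 256.

```python
FAILURE_FRACTION = 1.14e-10   # fraction of input space prone to failure (quoted above)
DIVIDES_PER_DAY = 2.5e9       # assumed throughput of a 60MHz Pentium, per the comment


def pairs_per_failure(fraction: float) -> float:
    """Average number of operand pairs per failing pair."""
    return 1.0 / fraction


def days_to_test(pairs: float, per_day: float) -> float:
    """Days needed to run the given number of divides."""
    return pairs / per_day


def fdiv_residual(x: float = 4195835.0, y: float = 3145727.0) -> float:
    """x - (x / y) * y, which should be zero when division is correct."""
    return x - (x / y) * y


if __name__ == "__main__":
    pairs = pairs_per_failure(FAILURE_FRACTION)
    print(f"about {pairs / 1e9:.2f} billion pairs per failure")
    print(f"about {days_to_test(pairs, DIVIDES_PER_DAY):.1f} days of divides")
    print(f"residual on this machine: {fdiv_residual()}")
```

Rounding up to 9 billion pairs gives the "almost four days" figure; the unrounded fraction comes out closer to three and a half.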

Hyperbolic or otherwise, it happened and I remember it well (it changed my purchasing decisions at the time).

I'm aware of most of those details as I took a keen interest in the matter at the time. I'm also aware of the argument for the use of said algorithm.

Whether one adopts this approach or not is a philosophical argument and I just happen to believe it's bad (and ugly) engineering - and in this case witness the outcome, it cost Intel dearly in both monetary and PR terms.

> In order to achieve increased math processing speed, Intel broke mathematics algorithms down into part algorithm and part lookup tables - that is instead of having mathematics algorithms complete the whole task (which is the logical way of doing things).

Can you expand on this? I thought all FPUs used lookup tables? Even the 8087 had them.

I think fpus still do for things like trig functions. Doing it using a power series potentially gives bad results and takes a long time to get enough accuracy. I think it was pretty common to use lookup tables in various algorithms back then since it was way faster to do a memory access and then some interpolation or just a memory access than to do a bunch of calculations.
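As a rough illustration of the "memory access and then some interpolation" approach, here is a minimal table-driven sine for the first quadrant. It is a software sketch only, not how any particular FPU does it; the table size and the names `SIN_TABLE` and `sin_lut` are arbitrary choices for the example.

```python
import math

TABLE_SIZE = 256                       # number of intervals (arbitrary choice)
STEP = (math.pi / 2) / TABLE_SIZE      # spacing between table entries
# Precomputed samples of sin(x) at 0, STEP, 2*STEP, ..., pi/2.
SIN_TABLE = [math.sin(i * STEP) for i in range(TABLE_SIZE + 1)]


def sin_lut(x: float) -> float:
    """Approximate sin(x) for 0 <= x <= pi/2 by lookup plus linear interpolation."""
    if not 0.0 <= x <= math.pi / 2:
        raise ValueError("x must be in [0, pi/2]")
    pos = x / STEP
    i = min(int(pos), TABLE_SIZE - 1)  # index of the interval containing x
    frac = pos - i                     # position within that interval
    return SIN_TABLE[i] + frac * (SIN_TABLE[i + 1] - SIN_TABLE[i])


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0, 1.5):
        print(x, sin_lut(x), abs(sin_lut(x) - math.sin(x)))
```

The trade-off is the one described above: one or two memory reads and a multiply instead of a long series evaluation, at the cost that every table entry has to be right.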
> In order to achieve increased math processing speed, Intel broke mathematics algorithms down into part algorithm and part lookup tables - that is instead of having mathematics algorithms complete the whole task (which is the logical way of doing things).

This is nonsense. There's no functional difference between "lookup table" and "algorithm" (whatever that means) when it comes to a circuit design. Both are perfectly valid ways, nothing inherently wrong with either.

See links in tadfisher's reply, they provide a summary. (My comment was simplified for the HN post, his links provide a comprehensive description).
As a system builder for a "custom computer shop" back in 1997/98, I came here just to make sure Cyrix was on the list.
No IDT WinChip though, that's mildly surprising
I don't think the Winchip was that well known. But it never pretended to be a high performing design.
The list includes the MediaGX, so I thought it might be comparable in popularity/performance expectations.

To be honest, I remember the WinChip because I thought the printed Windows logo on the CPU was pretty cool. Texas and AMD also had the "designed for Windows 95" logo on some of their CPUs (some 486 and Overdrive designs, iirc)

CTRL+F "transmeta crusoe": Not found

ah well

My vote for worst CPU goes to the iAPX 432 (also not on this list).
Wow, a garbage collector implemented inside of the processor. Chip level support for objects. You can't fault Intel for their ambition here, just their common sense.

And the whole thing is built for a world where everybody is writing code in Ada. I bet some compiler makers were salivating at the prospect of collecting all of those huge license fees from developers.

It was a different time - memory/CPU speed trade-offs were very different - we saw RISC once we were able to move cache on-chip (or very very close) - but at that point CISC made sense and the 432 pushed CISC to the extreme.

IMHO the x86 won out (and is still with us) because of all the CISCs of its time it was the closest to RISC when memory started to get a lot faster (almost all instructions make at most 1 memory access, few esoteric memory operations etc)

All instructions take at least 1 memory access. All instructions that do memory access need at least 2 memory accesses.
I once encountered a note from one of the people working on iAPX 432, claiming that the core idea of a high-level CPU wasn't really the reason it tanked, but rather project mismanagement and a horrible design, which resulted in a chip that would be technologically at home... in the 1960s, just done in VLSI. One of the things I recall were issues with the actual physical implementation of the memory data paths, resulting in horrible IPC.
I was looking for this as well. It should be on there for introducing a completely new architecture, costing more, and underperforming contemporary products from Intel's catalog.
When I saw the headline I expected the iAPX to be number one on the list.
Isn't the i860 the inheritor of iAPX 432 design details?
Oh those were the days when I was young and naive and I thought Linus was going to change the world (again) blurring the lines between 'software' and 'hardware'
I think transmeta was MUCH better than Itanium.

Itanium held the idea that we could accurately predict ILP at compile time (when the halting problem clearly states that we cannot).

Transmeta said VLIW has the best theoretical PPA possible, so let's wrap that in a large, programmable JIT to analyze/optimize stuff to take advantage.

Modern CPUs run quite a bit closer to transmeta, but they largely use fixed-function hardware rather than being able to improve performance at a later time.

If we could nail down that ideal VLIW architecture, we could sell a given chip at various process sizes and then offer various paid "software" upgrades or compatibility packs for various ISAs to run legacy code.

At least there's a pipe dream worth looking into.

> Itanium held the idea that we could accurately predict ILP at compile time (when the halting problem clearly states that we cannot).

I don't know where these notions are coming from.

Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data.

Further, most CPUs now-a-days are doing instruction reordering to try and extract as much instruction level parallelism out as possible.

Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.
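To make the data-dependency framing concrete, here is a small sketch of static scheduling: a greedy pass that packs straight-line instructions into fixed-width bundles while respecting register dependencies. It is a simplified illustration, not how any production compiler or EPIC toolchain works; the `Instr` type, the `schedule` function and the bundle width are invented for the example, and it conservatively treats read-after-write, write-after-write and write-after-read hazards the same way.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    dst: str                 # register written
    srcs: Tuple[str, ...]    # registers read


def depends(later: Instr, earlier: Instr) -> bool:
    """True if `later` must issue in a later bundle than `earlier`."""
    return (earlier.dst in later.srcs      # read-after-write
            or later.dst == earlier.dst    # write-after-write
            or later.dst in earlier.srcs)  # write-after-read (conservative)


def schedule(instrs: List[Instr], width: int = 3) -> List[List[str]]:
    """Greedily place each instruction in the earliest legal bundle with room."""
    bundles: List[List[str]] = []
    cycle_of: List[int] = []
    for i, ins in enumerate(instrs):
        earliest = 0
        for j in range(i):
            if depends(ins, instrs[j]):
                earliest = max(earliest, cycle_of[j] + 1)
        c = earliest
        while c < len(bundles) and len(bundles[c]) >= width:
            c += 1                         # bundle is full, try the next one
        while len(bundles) <= c:
            bundles.append([])
        bundles[c].append(ins.name)
        cycle_of.append(c)
    return bundles


if __name__ == "__main__":
    prog = [
        Instr("a", "r1", ("r0",)),
        Instr("b", "r2", ("r0",)),
        Instr("c", "r3", ("r1", "r2")),
        Instr("d", "r4", ("r3",)),
    ]
    print(schedule(prog, width=2))
```

What a pass like this cannot see is anything known only at run time, such as cache misses or which way a branch goes.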

Side note: the instruction reordering actually poses a problem for parallel code. Language writers and compiler writers have to be extra careful about putting up "fences" to make sure a read or write isn't happening outside a critical section when it shouldn't be.

If your assertion had any weight at all, EPIC would have taken over.

> Compilers can (and do) reorder instructions to extract as much parallelism as possible. Further, SIMD has forced most compilers down a path of figuring out how to parallelize, at the instruction level, the processing of data.

Peephole optimizations are literally just rewrite rules and very limited in what they can accomplish, but we can't find an even moderately reliable way to optimize larger bits of the program. Auto-vectorization is still so bad that even unskilled devs can probably do a better job by hand.

> Further, most CPUs now-a-days are doing instruction reordering to try and extract as much instruction level parallelism out as possible.

This is true and proves my point rather than yours. If the compiler could do the job, then the VLIW output would be faster and not require OoO execution. It's telling that the fastest versions of Itanium were the ones that took the incoming VLIW commands and ripped them apart into a traditional OoO instruction window effectively negating the whole idea while preserving the externally-facing ISA.

> Figuring out what instructions can be run in parallel is a data dependency problem, one that compilers have been solving for years.

If they solved it years ago, then why do we get such MASSIVE ILP boosts from bigger instruction windows? Why is 2-3 instructions of throughput the maximum efficiency we can get from in-order systems?

There are a few issues with Itanium-like architectures.

The first thing to point out is that the dynamic filling of the execution units in superscalar hardware will always do no worse than whatever a pure-compiler solution can do, and will very frequently do better. Hardware can take advantage of dynamic opportunities, such as the ability to fill execution slots from code both before and after a branch (or even across function boundaries!), or being more responsive to instructions with data-dependent execution times. Yes, this does take not-insignificant amounts of hardware. But given the limitations of what compilers can statically do, it's not clear that you can put the savings to better use.

The second issue is that such an arrangement usually ends up with the hardware encoding microarchitectural details into the ISA. And when you do that, and you desire to change microarchitecture, you're stuck with either changing the ISA and dealing with attendant issues, or you have to add the hardware that you're theoretically saving in the first place.

On top of this, you're stuck with practical performance being driven by the availability and adoption of sufficiently smart compilers, which is largely out of your control.

It's worth noting that you can ameliorate these issues to a larger degree if you restrict your inputs to a more structured subset of possible programs, i.e., you try to build an accelerator instead of a general-purpose CPU. And that's why you see more interesting architectures come out in the accelerator space. But for most general-purpose programs, you're not really going to do better than modern superscalar architectures, even with all the space and power they consume.

The critical difference is that EPIC (the architecture model of Itanium) essentially exposed CPU pipelines naked to the code - so you didn't just have to reorder instructions as optimizers do today, you also had to figure out changes that experience so far suggests are doable only in hw with runtime-only data, or in very tight numerical code. This includes the compiler taking the place of the branch predictor as well as OOOE scheduling, with no on-CPU instruction reordering or out-of-order retirement, and IIRC a branch mispredict was quite costly.

Moreover, EPIC pretty much meant that you couldn't apply similar chip-level IPC improvements as you could elsewhere, at least originally.

I'm not sure that branch prediction would need to go to the compiler, but definitely agree it'd likely subsume the OOOE scheduling (at very least, it'd be less effective).

That, though, seems like it might make for a good power/performance tradeoff. Those circuits aren't free. We just didn't get to the point where compilers were doing a good job of that OOOE reordering (not until after EPIC died).

The real reason, though, that itanium died (IMO) is most businesses insisted on emulating their x86 code at a 70% performance cost. So costly that it seems like intel/hp spent most of their hardware engineering budget making that portion fast enough.

The x86 emulator built into Itanium 1 was very bad, yes, but it didn't matter that much outside of workstation use. HP built Itanium 2 without it, and provided software emulators for x86 and HP-PA that worked apparently "well enough".

The real deal breaker was Itanium being ridiculously expensive and quickly destroying any possibility of increased market by pricing itself out of it - and even in the markets that had the money, it was considered overpriced (nicest thing I heard about Itanium was "overpriced DSP masquerading as general purpose CPU"). I remember reading intel's published roadmaps before news about amd64 landed - We would be running 32bit x86 much longer under it, with Itanium being kept at extra premium prices.

Even customers that had Itanium as the only upgrade path available - thanks to HP - found the performance so bad - on natively compiled code! - they effectively forced HP to produce Alpha till Itanium was pretty much confirmed dead and the customers migrated out of HP vendor-locked stack (at one of the largest mobile telcos in Poland we migrated from Alpha to IBM POWER, many OpenVMS customers kept buying/hoarding Wildfire and Marvel architecture servers).

I had a Transmeta Crusoe based PC104 SBC and for the time it was relatively quick for something low power, does it really deserve to be in the "worst cpus ever" list? why for?
You’re right. I was kind of being snarky. But it was a huge disappointment compared against the promises from Transmeta and the tech journalism of the time.
The lack of Alpha seems odd, though maybe that should be the worst ISA rather than merely individual CPU?
The ISA wasn't that bad, but the weak memory-ordering model was a huge pain in the ass. I worked for a while with some of the Alpha folks years later, and they did a lot of really great work, but they did bring that weak memory model along with them. It allowed us to find many Linux kernel bugs that had lain dormant since Alpha because nothing since had repeated the mistake. Fun times ... not.
Why would Alpha be the worst? I’ve owned 2 of them, 21064 and 21264, and they were fast and reliable.
The only two architectural questions that I know of were...

The weak memory model:

https://devblogs.microsoft.com/oldnewthing/20170817-00/?p=96...

Inability to address low-power designs:

https://en.m.wikipedia.org/wiki/StrongARM

"According to Allen Baum, the StrongARM traces its history to attempts to make a low-power version of the DEC Alpha, which DEC's engineers quickly concluded was not possible."

The other major problem with the Alpha was the high license costs of DEC operating systems, which greatly helped put it in the grave.

And incapable of working with unaligned values or values smaller than 4 bytes. Weren’t there also cache coherence issues?

Alpha kinda had you finish the hardware in software.
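
To make "finish the hardware in software" concrete, here is a rough sketch of what a single byte store turns into when the only memory operations available are aligned 64-bit loads and stores, as on early Alphas before the BWX byte/word extension. The helper names `store_byte_rmw` and `load_byte_rmw` are made up for illustration, and the layout assumes little-endian byte numbering; real Alpha code used instructions like LDQ_U, MSKBL, INSBL and STQ_U for the same steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write one byte using only whole, aligned 64-bit
 * loads and stores. `mem` is memory viewed as an array of quadwords and
 * `byte_addr` is a byte offset into it (little-endian numbering assumed). */
static void store_byte_rmw(uint64_t *mem, size_t byte_addr, uint8_t value)
{
    assert(mem != NULL);

    size_t   word  = byte_addr / 8;                  /* containing quadword   */
    unsigned shift = (unsigned)(byte_addr % 8) * 8;  /* bit offset of byte    */

    uint64_t q = mem[word];                 /* 1. load the whole quadword     */
    q &= ~((uint64_t)0xFF << shift);        /* 2. mask out the old byte       */
    q |= (uint64_t)value << shift;          /* 3. insert the new byte         */
    mem[word] = q;                          /* 4. store the quadword back     */
}

/* Hypothetical helper: read one byte by loading the containing quadword
 * and extracting the byte from it. */
static uint8_t load_byte_rmw(const uint64_t *mem, size_t byte_addr)
{
    assert(mem != NULL);

    unsigned shift = (unsigned)(byte_addr % 8) * 8;
    return (uint8_t)(mem[byte_addr / 8] >> shift);
}
```

Note that the load-modify-store sequence is not atomic: if two CPUs update neighbouring bytes of the same quadword without extra synchronization, one of the writes can be lost, which is part of why this was painful for drivers and multithreaded code.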

The influence of Alpha on modern instruction sets like ARM64 and RISC-V is tremendous. It’s just sad it had to die for this to happen.
It didn't die.

Intel bought it from HP, stripped it for parts, then killed it.

HyperTransport and a few other things were essentially just copies of Alpha's stuff, clean-room implemented by ex-Alpha employees. Designs like Sandy Bridge look quite similar to EV8. QuickPath is just Alpha's interconnect with some updates. Even AVX seems inspired by the 1048-bit SIMD planned for EV9/10.

I'd also add that a lot of excellent ex-Alpha engineers (e.g. Jim Keller, Dan Dobberpuhl off the top of my head) ended up designing great chips at other companies.
How did the Alpha ISA influence RISC-V, other than by its counterexample? Does RISC-V lack an integer divide? "Design of the RISC-V Instruction Set Architecture" mainly uses Alpha in the phrase "Unlike Alpha, ..." i.e. as a warning to future people. In fact, the author fairly well excoriates all of the historic RISC architectures for being myopically designed.
Give RISC-V time, it will be somebody's bad example soon enough.
My impression is that Alpha's ISA was mostly fine except for the power draw; DEC just didn't have the R&D budget to keep up with Intel and all of the foundries, and had their lunch eaten by x86 just like every other chip designer in the 80s and 90s.
Alpha was astonishing when it came out. It ran x86 code in emulation faster than any real x86 could go. Its only serious flaw was its chaotic memory bus operation ordering, which came to matter when you had two or more of them. Alpha died because DEC died, not the reverse.
x86, SPARC, Cell, EPIC, iAPX, i860, and even contemporary ARM versions are worse. If we reach into lesser-known ISAs or older ISAs, we could add a TON more to that list.