I saw a tweet once that went something like, "Mankind strayed from god when we invented the IC", and some days I feel that. Economies of scale crush everything, including our nerdy fondness for particularly elegant ISAs.
You're right there. As transistors got cheaper, we got into the habit of using them en masse to prop up horrible architectures with weird (but effective) tricks like enormous look-ahead, stupendous pipelines, and gigantic caches.
I recall with considerable fondness those ISAs that could be understood in ultimate detail by a simple human - the monolithic supercomputers like the Cray-1.
I dunno. There was that strange time in the 1980s when progress in microprocessors hung in the air, when Apple couldn't really find a sequel to the Apple ][, when the TRS-80 Model 4 wasn't much better than the Model 3, which wasn't better than the Model 1, and when Commodore came out with a new machine every year but only a few of them gained any traction.
Coding assembly for the 8088, it was painfully obvious that instruction fetches were competing with data on the bus, which is why the string instructions were so important.
Today people would scoff at that sort of thing because a tight loop can sit in the I-cache and be just as efficient as a microcoded string instruction.
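To put rough numbers on the bus contention point, here is a back-of-the-envelope sketch in C. It counts only bytes crossing the 8088's 8-bit bus when copying n bytes, comparing a LODSB / STOSB / LOOP loop (4 bytes of code, refetched every pass because the taken jump empties the prefetch queue) against REP MOVSB (2 bytes of code, fetched once). The function names are made up for illustration, and the model assumes a minimum 4-clock bus cycle and ignores execution time, prefetch overlap and wasted prefetches, so treat it as a lower bound on bus traffic rather than a cycle-accurate timing.

```c
#include <stddef.h>

/* Rough model of 8088 bus traffic for copying n bytes.
   Assumption: every byte, code or data, costs one bus transfer. */
enum {
    LOOP_CODE_BYTES   = 4, /* LODSB (1) + STOSB (1) + LOOP rel8 (2) */
    REP_CODE_BYTES    = 2, /* REP prefix (1) + MOVSB (1) */
    DATA_BYTES_PER_EL = 2, /* one read plus one write per byte copied */
    CLOCKS_PER_BUS    = 4  /* minimum bus cycle, no wait states */
};

/* Explicit loop: the loop body is refetched on every iteration. */
unsigned long bus_bytes_loop(unsigned long n) {
    return n * (LOOP_CODE_BYTES + DATA_BYTES_PER_EL);
}

/* REP MOVSB: the instruction is fetched once, then only data moves. */
unsigned long bus_bytes_rep(unsigned long n) {
    return REP_CODE_BYTES + n * DATA_BYTES_PER_EL;
}

/* Convert a count of bus transfers into minimum bus clocks. */
unsigned long bus_clocks(unsigned long bus_bytes) {
    return bus_bytes * CLOCKS_PER_BUS;
}
```

Under these assumptions, copying 100 bytes costs 600 bus transfers with the explicit loop and 202 with REP MOVSB, so about two thirds of the loop's bus traffic is instruction fetch.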
I traded my CoCo 3 (which, unlike the PC, had a real multitasking OS) for an 80286 machine, and that was a massive jump in performance because the 80286 was starting to get those complex features that would start the "Moore's Law" period where computers got notably better on a year-by-year basis. The awful truth of memory latency really forces you to do those "horrible" things if you want to get near the performance that is possible.
Today I am an AVR8 fan because it has separate buses for instructions and data and gets awesome performance for something very simple that doesn't use all the tricks that later processors use. It's about the last 8-bit architecture to be designed, so it stands head and shoulders above the rest in terms of clean design, and it's got a mainframe-sized register file (32 general-purpose registers). In assembly language you can frequently keep most of the variables you use all the time in registers, dedicate a few registers to the interrupt handler so you don't need to save and restore registers, etc.
As for the Cray and the IBM 3090's vector units, it was nice that their vector instructions weren't bound to a particular implementation length. Compare the SIMD instructions that Intel has so often fumbled: they require you to rewrite your code every two years if you want to keep up, aren't available across the line (so people other than national labs and Apple don't use them), and are arguably a waste of power and die area at this point.
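The "not bound to an implementation length" idea can be sketched in plain C as a strip-mined loop. This is a scalar simulation, not real vector code: set_vl is a hypothetical stand-in for a "set vector length" operation, hw_max stands in for whatever register length a given machine has, and the inner loop plays the role of a single vector instruction. The point is that the same source gives the same answer whatever hw_max is.

```c
#include <stddef.h>

/* Hypothetical stand-in for "set vector length": the machine grants
   at most hw_max elements for this pass, fewer on the final pass. */
static size_t set_vl(size_t remaining, size_t hw_max) {
    return remaining < hw_max ? remaining : hw_max;
}

/* y[i] += a * x[i] for i in [0, n), strip-mined in chunks of
   whatever length the "hardware" offers. */
void saxpy_vla(size_t n, float a, const float *x, float *y, size_t hw_max) {
    if (hw_max == 0) {
        return; /* a zero-length vector unit could never make progress */
    }
    for (size_t i = 0; i < n; ) {
        size_t vl = set_vl(n - i, hw_max);
        /* This inner loop stands in for one vector instruction
           operating on vl elements. */
        for (size_t k = 0; k < vl; k++) {
            y[i + k] += a * x[i + k];
        }
        i += vl;
    }
}
```

Calling saxpy_vla with hw_max = 4 and with hw_max = 64 (the Cray-1's vector register length) produces identical results from identical source. Fixed-width SIMD code, by contrast, bakes the width into the instructions themselves.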
I find it intriguing that memory is not just A bottleneck these days, but that memory is SO much slower than CPU. A big part of the performance difference must be the move off the main chip, but it just seems that when cache can be so quick, memory should be faster than it is.
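If you want to see the gap for yourself, one common approach is a pointer-chasing loop, where each load depends on the previous one so the CPU can't overlap or prefetch its way out of the latency. Below is a minimal sketch of the building blocks in C. The names build_cycle and chase are made up for illustration, the PRNG is a plain xorshift32 chosen only for determinism, and the timing itself is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32); state must be nonzero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill next[0..n-1] with a random permutation that forms a single
   cycle of length n (Sattolo's algorithm), so a chase visits every
   slot before repeating. */
void build_cycle(size_t *next, size_t n, uint32_t seed) {
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < n; i++) {
        next[i] = i;
    }
    for (size_t i = n; i-- > 1; ) {          /* i runs from n-1 down to 1 */
        size_t j = xorshift32(&state) % i;   /* 0 <= j < i, never j == i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for the given number of steps. Each load depends
   on the result of the previous one. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t idx = start;
    while (steps--) {
        idx = next[idx];
    }
    return idx;
}
```

Timing chase (for example with clock() from <time.h>) over arrays of growing size, and dividing by the step count, should show the time per load stepping up as the working set outgrows each cache level and finally spills to DRAM. The exact numbers depend entirely on the machine.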