Growing up in the 80s, I had the Cray 2 as my poster computer. Then the day came when I realized it was only 800MHz with 4GB of RAM, and my phone was faster than that.
I often thought this too, but a Cray is not a normal computer. From some comment, I learned that it had an 8MB L1 cache, meaning it could crunch all of that in one go, then refill, and so on. That means those little 800MHz were used to their fullest, while today's multicore GHz chips spend a lot of time waiting for new data to arrive in smaller quantities. That would make a Cray still as fast today (for suitable tasks, of course).
So today's processors have higher peak bandwidth, but on average a Cray can sustain more bandwidth.
There are programming techniques where you offload all the computation (via OpenGL!) onto the GPU. I did this for image processing where we needed to do histogram normalization. It was not fun.
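For anyone wondering what the operation itself looks like, here is a minimal CPU-side sketch in plain Python. It assumes "histogram normalization" means classic histogram equalization of 8-bit grey values, which is only one reading of the comment. The function name `equalize_histogram` and its `levels` parameter are made up for illustration, and this is not the OpenGL version described above.

```python
def equalize_histogram(pixels, levels=256):
    """Remap grey values so their cumulative distribution is roughly linear.

    `pixels` is a flat list of ints in [0, levels). Hypothetical helper,
    assuming histogram equalization is what was meant.
    """
    if not pixels:
        return []

    # Count how many pixels fall on each grey level.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Running sum of the histogram (the cumulative distribution).
    cdf = []
    total = 0
    for count in hist:
        total += count
        cdf.append(total)

    # Smallest non-zero CDF value; that grey level maps to output 0.
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        # Only one grey level present, so there is nothing to stretch.
        return list(pixels)

    # Lookup table from input level to equalized output level.
    scale = (levels - 1) / (n - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [lut[p] for p in pixels]


print(equalize_histogram([50, 100, 150, 200]))  # -> [0, 85, 170, 255]
```

The per-pixel lookup at the end is the easy part to parallelize. Building the histogram is a scatter/accumulate step, which is likely the awkward bit to express in a graphics pipeline.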
But the GFlop numbers given are maybe an order of magnitude or two off from the achieved performance for something like Lapack.
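One rough way to see why peak and achieved numbers can differ that much is a roofline-style estimate: attainable throughput is capped by either the compute peak or by memory bandwidth times the flops done per byte moved. The sketch below is only that back-of-the-envelope model; the function name and all the figures in it are invented for illustration, not measurements of any real machine or library.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style estimate of sustained throughput.

    The result is limited by whichever is lower: the compute peak, or
    memory bandwidth multiplied by arithmetic intensity (flops per byte).
    """
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical chip: 100 GFLOPS peak, 10 GB/s sustained memory bandwidth.
# A kernel doing 0.25 flops per byte is memory bound, far below peak.
print(attainable_gflops(100.0, 10.0, 0.25))  # -> 2.5

# A kernel doing 50 flops per byte is compute bound and can reach peak.
print(attainable_gflops(100.0, 10.0, 50.0))  # -> 100.0
```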
I never thought about it that way (I find supercomputer comparch design fascinating for the tradeoffs chosen more than anything else), but the RISC/CISC alternatives can be thought of as an optimization for main memory latency.
If memory access is fast compared to the CPU, RISC designs are the better fit (as you can increase the CPU frequency). If memory accesses take longer than CPU execution, then CISC designs start to make more sense (do more complex things in one go once the CPU is done waiting for the data to finally arrive).
PS: read jojomonkeyboy's comment at https://archive.is/FWzLF