Keep in mind that supercomputers are a lot less specialized than circuits for running neural nets.
12 years ago you could have gotten a stack of 5-8 7800 GTX cards and had 1.5TFLOPS of single precision. 11 years ago you could have had a stack of 5 cards with unified shaders. It's not fair to compare against the significantly more complicated route of getting 100 CPU cores working together with only 1-4 per chip.
But can't you configure the device to do e.g. fast matrix-vector multiplications instead of inference? I can be wrong, but I suspect that's what people do mostly on supercomputers anyway.
12 years ago you could have gotten a stack of 5-8 7800 GTX cards and had 1.5TFLOPS of single precision. 11 years ago you could have had a stack of 5 cards with unified shaders. It's not fair to compare against the significantly more complicated route of getting 100 CPU cores working together with only 1-4 per chip.