You'd be surprised how capable old GPUs are! I've had great success with people running Whisper-Turbo in the browser on really old hardware: https://whisper-turbo.com/
It's not the inference, it's the training. They say in the paper: "We train with a batch size of 256 for a total of 80,000 optimisation steps, which amounts to eight epochs of training." That's a fair chunk of time. Mind you, `small.en` has smaller decoder layers than `medium.en`...
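For a rough sense of scale, here's the arithmetic behind that quote as a quick sketch. It assumes each optimisation step consumes exactly one full batch of 256 samples; the variable names are mine, not the paper's.

```python
# Figures quoted from the paper.
batch_size = 256
optimisation_steps = 80_000
epochs = 8

# Assumption: one optimisation step processes exactly one full batch,
# so this is the total number of samples seen during training.
total_samples_seen = batch_size * optimisation_steps

# Dividing by the epoch count gives the implied size of one pass over the data.
samples_per_epoch = total_samples_seen // epochs

print(f"{total_samples_seen:,} samples seen in total")
print(f"{samples_per_epoch:,} samples per epoch")
```

So that's about 20.5 million samples processed overall, or roughly 2.56 million per epoch, if the assumption holds.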
TL;DR: a six-year-old, ~$100 used GTX 1070 is roughly 5x faster than a Threadripper PRO 5955WX at a fraction of the cost and power.
[0] - https://heywillow.io/components/willow-inference-server/#ben...