I dunno about openAI as a service, but on my M1 mac i think whisper took something on the order of 8x realtime to process with the "large" language model. That is to say... 8 minutes of processing for every 1 minute of audio. It was surprisingly not fast. I assume openAIs servers have more GPU at their disposal to make this go faster.
Are you using whisper.cpp? You really want to be using that if you care about speed. You should be able to get better than real-time transcription on an M1.