Some technical context and where this is headed.

Why streaming matters for dictation
Whisper and most open-source STT models use bidirectional attention, meaning they need the full audio clip before they can transcribe anything. You get your text after you stop talking, usually with a noticeable delay. Voxtral Realtime takes a different approach: it has a causal audio encoder that processes audio left-to-right as it arrives. At 480ms delay it matches offline models on accuracy (FLEURS benchmark), but you see text appearing while you're still mid-sentence. For dictation this changes a lot. You can catch mistakes in real time, and the feedback loop feels natural instead of disconnected.

The app connects to backends via the OpenAI Realtime API WebSocket protocol. It captures audio from your mic, streams it over the WebSocket, and receives partial transcripts that get inserted into your active text field live. Any OpenAI Realtime-compatible server works.

The voxmlx fork
The original voxmlx by Awni Hannun does local Voxtral inference on Apple Silicon via MLX, but it was CLI-only. I added a WebSocket server that speaks the OpenAI Realtime protocol so localvoxtral (or any compatible client) can connect to it. I also added memory management to avoid OOM on longer sessions. Fork is here: https://github.com/T0mSIlver/voxmlx. I'd like to get the server piece upstreamed eventually.

Latency
On M1 Pro with a 4-bit quantized model, first words appear within roughly 200 to 400ms. On RTX 3090 via vLLM it's faster. Both feel responsive enough for natural dictation.

What's next
Right now you have to start the server yourself before using the app. I want to add app-managed local serving (start/stop/model download) so it's truly one-click. If anyone has experience bundling Python/MLX processes into macOS apps cleanly, I'd love to hear your approach.

Happy to answer questions.
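To make the "stream audio over the WebSocket, receive partial transcripts" part concrete, here's a minimal Python sketch of the two message-handling pieces. It is not the app's actual code. The event names (`input_audio_buffer.append` and `conversation.item.input_audio_transcription.delta`) are taken from OpenAI's Realtime API docs and are an assumption here; a given server may name its transcript events differently, so the delta event type is a parameter. The helper names are made up for illustration, and the audio is assumed to be raw 16-bit PCM sent as base64.

```python
import base64
import json

# Assumed event names, following OpenAI's Realtime API docs.
# A particular backend may use different names for transcript events.
APPEND_EVENT = "input_audio_buffer.append"
DELTA_EVENT = "conversation.item.input_audio_transcription.delta"


def audio_append_message(pcm16_chunk: bytes) -> str:
    """Wrap one chunk of raw 16-bit PCM audio as a JSON text frame (hypothetical helper)."""
    audio_b64 = base64.b64encode(pcm16_chunk).decode("ascii")
    return json.dumps({"type": APPEND_EVENT, "audio": audio_b64})


class PartialTranscript:
    """Accumulates transcript deltas so text can be inserted as it arrives."""

    def __init__(self, delta_event: str = DELTA_EVENT) -> None:
        self.delta_event = delta_event
        self.text = ""

    def feed(self, raw_message: str) -> str:
        """Return the newly arrived text, or "" for any other event type."""
        event = json.loads(raw_message)
        if event.get("type") != self.delta_event:
            return ""
        delta = event.get("delta", "")
        self.text += delta
        return delta
```

In a real client, each mic chunk would go through `audio_append_message` before being sent on the socket, and every incoming frame would be passed to `feed`, with the returned string typed into the active text field. The WebSocket connection itself is left out so the sketch stays dependency-free.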
Here's an example Python app wrapped in a native macOS shell using Electrobun: https://github.com/blackboardsh/audio-tts
Can you report how well Voxtral Realtime compares to the other currently supported streaming models? https://rift-transcription.vercel.app/local-setup
- Subjectively, I've found the Web Speech API feels the best (accuracy/latency), followed by Moonshine medium
OpenAI Realtime WS API is on the roadmap, so I might be able to compare via RIFT in the future...