Hacker News
by k9294 107 days ago
You can test Gemini 3.1 Flash Lite's transcription capabilities at https://ottex.ai, the only dictation app supporting Gemini models with native audio input.

We benchmarked it for real-life voice-to-text use cases:

                <10s    10-30s   30s-1m    1-2m    2-3m
  Flash         2548     2732     3177     4583    5961
  Flash Lite    1390     1468     1772     2362    3499
  Faster by    1.83x    1.86x    1.79x   1.94x   1.70x

  (latency in ms, median over 5 runs per sample, non-streaming)
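
For anyone who wants to reproduce numbers like these, here is a minimal sketch of a "median over 5 runs, non-streaming" timing loop. It is not the harness used for the table above. `fake_transcribe` is a hypothetical stand-in for whatever blocking API call you are measuring.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(transcribe: Callable[[bytes], str], audio: bytes, runs: int = 5) -> float:
    """Median wall-clock latency in ms of one blocking call, over `runs` repeats."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)  # non-streaming: returns only once the full transcript is ready
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_transcribe(audio: bytes) -> str:
    """Hypothetical stand-in for a real transcription request; sleeps about 10 ms."""
    time.sleep(0.01)
    return "hello world"


if __name__ == "__main__":
    print(f"{median_latency_ms(fake_transcribe, b'') :.1f} ms")
```

The median is a reasonable choice here because a single slow network round trip would skew a mean over only five runs.
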
Key takeaways:

- 1.8x faster than Gemini 3 Flash on average

- ~1.4 sec transcription time for short recordings (under 30s)

- ~$0.50/mo for heavy users (10h+ transcription)

- Close to SOTA audio understanding and formatting instruction following

- Multilingual: one model, 100+ languages
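
The "1.8x on average" figure can be checked directly from the latency table. A quick sketch, with the numbers copied from the table and the variable names being my own:

```python
from statistics import mean

BUCKETS = ["<10s", "10-30s", "30s-1m", "1-2m", "2-3m"]
FLASH_MS = [2548, 2732, 3177, 4583, 5961]
FLASH_LITE_MS = [1390, 1468, 1772, 2362, 3499]

# Speedup per bucket = Flash latency / Flash Lite latency.
speedups = [flash / lite for flash, lite in zip(FLASH_MS, FLASH_LITE_MS)]

for bucket, speedup in zip(BUCKETS, speedups):
    print(f"{bucket:>7}: {speedup:.2f}x")
print(f"average: {mean(speedups):.2f}x")
```

This is an unweighted mean of the five per-bucket ratios; weighting by how often each recording length occurs in real use would give a slightly different number.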

Gemini is slowly making $15/month voice apps obsolete.

2 comments

You know what would be great? A lightweight wrapper model for voice that can use heavier ones in the background.

That much is easy, but what if you could also speak to and interrupt the main voice model and keep giving it instructions? Like speaking to customer support, but instead of being put on hold, you can ask several questions and get live updates.
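
A rough sketch of that idea, assuming an asyncio-style design: a light front end keeps taking utterances and answers "status" or "cancel" right away, while heavier jobs run in the background. Everything here is hypothetical (`heavy_model`, `front_agent`, the two command words), and the sleep stands in for a real model call.

```python
import asyncio


async def heavy_model(task: str) -> str:
    """Hypothetical slow background model; the sleep stands in for real work."""
    await asyncio.sleep(0.05)
    return f"done: {task}"


async def front_agent(utterances: list[str]) -> list[str]:
    """Light front end: keeps accepting utterances while heavy jobs run."""
    log = []
    jobs = []
    for text in utterances:
        if text == "status":
            # Answered immediately, without waiting for background work.
            finished = sum(job.done() for job in jobs)
            log.append(f"status: {finished}/{len(jobs)} finished")
        elif text == "cancel":
            # Interrupt: stop every job that is still running.
            for job in jobs:
                job.cancel()
            log.append("cancelled pending jobs")
        else:
            jobs.append(asyncio.create_task(heavy_model(text)))
            log.append(f"queued: {text}")
        await asyncio.sleep(0)  # yield so background jobs can make progress
    # Cancelled jobs come back as exceptions and are skipped.
    results = await asyncio.gather(*jobs, return_exceptions=True)
    log.extend(r for r in results if isinstance(r, str))
    return log


if __name__ == "__main__":
    print(asyncio.run(front_agent(["summarize inbox", "status", "cancel"])))
```

The point of the split is that the front loop never awaits a heavy job directly, so the user is never "on hold".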

It's actually a nice idea - an always-on micro AI agent with voice-to-text capabilities that listens and acts on your behalf.

Actually, I'm experimenting with this kind of stuff and trying to find a nice UX to make Ottex a voice command center - to trigger AI agents like Claude, open code to work on something, execute simple commands, etc.

Can you show some WER comparisons against other ASR models? Especially for non-English languages.

I've been experimenting with Gemini 3.1 Flash Lite and the quality is very good.

I haven't found official benchmarks yet, but you can find Gemini 3 Flash word error rate benchmarks here: https://artificialanalysis.ai/speech-to-text/models/gemini — they are close to SOTA.

I speak daily in both English and Russian and have been using Gemini 3 Flash as my main transcription model for a few months. I haven't seen any model that provides better overall quality in terms of understanding, custom dictionary support, instruction following, and formatting. It's the best STT model in my experience. Gemini 3 Flash has somewhat uncomfortable latency though, and Flash Lite is much better in this regard.
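
If you want to measure WER on your own recordings in the meantime, here is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The normalization (lowercasing and whitespace splitting) is a simplifying assumption; published benchmarks usually also strip punctuation and normalize numbers, and whitespace splitting does not work for languages written without spaces.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many more words than the reference.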