by shivekkhurana 126 days ago

The TTS/STT models are actually good and aggressively priced. I personally built a voice-mode AI assistant.

STT time to first token is ~300 ms. A ~20-second audio clip is transcribed in under 1 second.

TTS time to first token is ~700 ms. ~20 seconds of audio is generated in under 2 seconds.
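
For a rough sense of what those numbers imply, here is a back-of-the-envelope sketch of the real-time factor (processing time divided by audio duration). The function name is my own, and the inputs are just the approximate upper bounds quoted above, not measured benchmarks.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Assumed inputs: the approximate upper bounds from the comment above.
stt_rtf = real_time_factor(1.0, 20.0)  # ~20 s of audio transcribed in under 1 s
tts_rtf = real_time_factor(2.0, 20.0)  # ~20 s of audio generated in under 2 s

print(f"STT RTF <= {stt_rtf:.2f} (at least {1 / stt_rtf:.0f}x faster than real time)")
print(f"TTS RTF <= {tts_rtf:.2f} (at least {1 / tts_rtf:.0f}x faster than real time)")
```

For a voice assistant the time to first token matters more than the total, since playback can start while the rest is still streaming.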

alephnerd 126 days ago

Absolutely! The TTS/STT approach that Sarvam and the other Indian firms are taking is more intuitive for a larger share of people and use cases. The "replace an SDR" or "replace a call-center" use case is such an easy win to show POV.

I feel this is also why you don't see the same degree of hype as you would with the other players. When you are taking an application-driven approach to launching AI products, hype matters less than targeting decision-makers and showing that your product directly aligns with their outcomes.

porridgeraisin 126 days ago

One other reason STT and OCR (check out the Sarvam Vision demo on their website, extremely good!) are the focus is that they can be used to build Indian-language datasets, which can then be used to train LLMs larger than the current 105B one. Most training data in Indian languages (you'd know, there are more than just Hindi) is either in speech form or in old books.

If you add in the commercial aspect you pointed out, TTS/STT becomes even more important.