by nowittyusername
148 days ago
I have been playing around with more than 10 TTS systems over the last 25 days, and it's really weird to read this article because my experience is the opposite. TTS models are amazing today. They are stupid fast, sound great, and are very simple to implement, since Hugging Face Spaces code is readily available for any model. What's funny is that the model he was talking about, "supertonic", is exactly the model I would have recommended to people who want to see how amazing the tech has become. The model is tiny, runs at 55x real time on any potato, and sounds amazing. I also think he is implementing his models wrong. He mentions that some models don't have streaming, so you have to wait for the whole chunk to be processed. But that's not a limit in any meaningful way, because you can define the chunk yourself. You can simply make the first n characters within the first sentence the chunk, process that first, and play it immediately while the rest of the text is being processed. TTFS and TTFA on all modern models are well below 0.5, and for supertonic it was 0.05 in my tests.
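
To make the chunking idea concrete, here is a rough Python sketch of what I mean. The names `split_first_chunk` and `stream_tts`, the `max_chars` default, and the `synthesize` callable are all made up for illustration; `synthesize` stands in for whatever non-streaming TTS call your model exposes, and this is not the API of supertonic or any other library.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_first_chunk(text, max_chars=60):
    """Return (first_chunk, rest); the first chunk never leaves the first sentence."""
    text = text.strip()
    # End of the first sentence: first '.', '!' or '?' followed by whitespace or end of text.
    match = re.search(r"[.!?](?=\s|$)", text)
    sentence_end = match.end() if match else len(text)
    if sentence_end <= max_chars:
        cut = sentence_end
    else:
        # First sentence is too long: cut at the last space within the limit.
        space = text.rfind(" ", 0, max_chars + 1)
        cut = space if space > 0 else max_chars
    return text[:cut].strip(), text[cut:].strip()


def stream_tts(text, synthesize, max_chars=60):
    """Yield audio for the short first chunk, then audio for the rest.

    `synthesize` is a hypothetical callable (text -> audio), assumed to be
    a blocking, non-streaming model call.
    """
    first, rest = split_first_chunk(text, max_chars)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the long remainder in the background right away.
        rest_future = pool.submit(synthesize, rest) if rest else None
        if first:
            # The short chunk finishes quickly, so playback can start here.
            yield synthesize(first)
        if rest_future is not None:
            yield rest_future.result()
```

The caller plays each yielded piece as it arrives, so the time to first audio depends only on the short first chunk and not on the length of the whole text. A real setup would probably split the remainder into several sentence-sized chunks as well, but the principle is the same.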