I submitted this comment on ProductHunt too but I wanted to make sure you see it:
Looks great but FYI there's a long-standing healthcare company that's been in business for over a decade with various speech products/features named "Vocera"[0]. I'm not a lawyer but they have many trademarks on Vocera and the standard is generally "likely to cause confusion". You're probably well in that territory with a speech product that sounds almost identical and is one letter off. When Googling "voicera" Google replaces/suggests "Vocera". There's a pretty decent chance you'll be hearing from them.
I was thinking the same. I hear this company name all the time at my work, so when I saw the title, I did a double-take and thought there was a typo. They even have AI voice command, so I briefly thought it was the same company.
We use the Vocera hands free devices where I work. At first glance of this title I thought "oh, they do AI dictation stuff, too? I guess that makes sense." And then I noticed the spelling and had the same thought as you. So, ditto.
The word 'dictation' is confusing here. I think you want 'recitation', 'vocalization', 'narration' or just 'reading'. Dictation is speech-to-text, this is text-to-speech.
> Dictation is the transcription of spoken text: one person who is "dictating" speaks and another who is "taking dictation" writes down the words as they are spoken. Among speakers of several languages, dictation is used as a test of language skill, similar to spelling bees in the English-speaking world.
Anyone have a nice open-source/pretrained TTS model they like using? I use Google's WaveNet TTS heavily to create 'audiobooks' (especially from archive.org, which is great for old books), but it's pretty expensive.
I periodically look for new versions and, while the examples sound better, they fall down really hard on other text.
I'd really encourage you to invest some time into SEO and promotion of your project.
I spent a bunch of time recently looking for exactly this: TTS, offline, an Open Source licence, and with "decent"/"natural" sounding default voices.
The "best" I ended up finding was `espeak-ng` but, really, the "natural"-ness is barely comparable to what Larynx seems to produce--based on a quick listen to the demos here: https://rhasspy.github.io/larynx/#en-us
On first impressions at least, Larynx definitely seems to be a project that deserves a higher profile in this space.
Thanks for sharing the project here, I'll be interested to take a deeper look when I circle back to my side-side project that could benefit from it. :)
(BTW I didn't watch/listen to the YT video all the way through yet but if the narration is generated by Larynx (which it seemed it might be?) it's definitely worth stating that up front.)
Oh, also, really appreciate that there's multiple options for non-male voices too which is something that seems to be sorely lacking in similar projects.
Yes, agreed, this is great! The best I found that could generate faster than real-time without a GPU was speedyspeech (https://github.com/janvainer/speedyspeech). Unfortunately it was only trained using the LJSpeech dataset and I haven't been able to transfer it to a multi-voice model. I have been using it to build a storytelling app for my kids.
So, after actually watching the demo video all the way through, it seems the video is narrated by a Larynx-produced voice "southern_english_female": https://youtu.be/hBmhDf8cl0k?t=387
While I appreciate the cinematic "reveal" at the end of the video, the quality is, ironically, so good that most people, who won't get through a 7-minute video, could easily not realise the narration itself is generated.
So I'd encourage you to consider stating it a little more up front.
(Also, based on the release dates it seems like this project is relatively young which explains why it's not very well known at this point.)
That is pretty nice, one of the best collections of voices I've seen, and the best interface.
Google gives 1 million characters per month free, which I don't often go over, but this will be really useful for when I do.
I don't want to be unappreciative, it's amazing that this is possible, much less free, but when you spend hours listening to it every day, the cracking and warbling do get old. I think there are better models I've heard snippets of, but the truly amazing thing about Google's is how robust it is to very weird words.
(When I tried all the public cloud offerings, IBM's was marginally the nicest AFAICT, but it was the most expensive with the least free quota.)
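For anyone budgeting against a free tier like the 1 million characters per month mentioned above, here is a rough Python sketch of how you might check what share of the quota a book would use, and split it into request-sized pieces at sentence boundaries. The per-request cap (`MAX_CHARS_PER_REQUEST`) and the helper names are my own assumptions for illustration, not taken from any provider's API, and real services may count characters differently, so check the current docs before relying on it.

```python
import re

# Free-tier figure quoted in the comment above; verify against current pricing.
FREE_CHARS_PER_MONTH = 1_000_000
# Hypothetical per-request cap, chosen for illustration only.
MAX_CHARS_PER_REQUEST = 4500


def chunk_text(text, limit=MAX_CHARS_PER_REQUEST):
    """Split text into chunks of at most `limit` characters,
    breaking after sentence-ending punctuation where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit gets hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def quota_fraction(text, quota=FREE_CHARS_PER_MONTH):
    """Fraction of a monthly character quota that `text` would consume."""
    return len(text) / quota
```

A 500,000-character book would come out at `quota_fraction(book) == 0.5`, i.e. about two such books per month before paying anything.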
Mozilla has a pretty good open-source TTS library. In general, high-quality pre-trained TTS models are surprisingly hard to find--I'd also be curious to see if anyone knows any good alternatives.
I don't have an answer to your question but I'm curious about something else: What kind of results would you get if you used one of the paid TTS models to generate a dataset and then trained your model on that dataset? Would it be possible to recreate their model in this way?
Cool idea, but wouldn't it be more useful as a feature for an RSS reader? No user of mine would come to the website and listen, but if they use a reader for my feed and other feeds, that would be useful for them.
I created such an app for myself two years back. It gets fatiguing very quickly to listen to TTS, even when using state-of-the-art stuff like WaveNet or Polly.
I assume there are a number of Youtube channels doing something like this already - I occasionally notice that an otherwise well-researched and presented item has a non-English idiom - such that even a fluent speaker would self-correct.
I guess the idea is to write one and just "release" it in many languages.
The point of all that is, yeah, computer generated voice has gotten to the point I need dumb mistakes to realise ... one of those "the tech has passed an inflection point" moments
And yet apps that leverage this are quasi absent. The Firefox screen reader widget is good but the voices are limited and the functionality limited to well formatted pages. E-book software seems to not integrate this tech either.
I wouldn't use something like that myself. I use NVDA full time. If you actually need a screen reader, you probably want something more advanced than a browser widget.
https://www.nvaccess.org/
I don't _need_ screen readers in so far as I can still read with my eyes, but I can read content quite a lot faster with it, and finish reading stuff I otherwise wouldn't, using spoken text instead.
My hacker news client HACK has both reader mode for articles as well as read text to speech for articles and comments. It's brand new so feedback is welcome:
I think I know the type of video you mean. I always assumed they were not TTS but professional speakers hired on Fiverr and obliged to speak the text verbatim, even though there are weird phrases.
From a website visitor's perspective: I think an interesting direction would be to offer an asynchronous option (podcast?) for all converted content on a site. I think one reason people want to listen to something rather than read it is the additional freedom you get when you don't have to pay attention to a screen, e.g., you could be walking instead.
That said, I'm not sure what your sell to website owners is. "Engagement" is kind of a vague benefit, why would they want to use this service? What problem is it solving? It's not clear from your landing page why anyone would bother to use this - which (just my opinion) is the #1 problem you need to solve.
I made something similar (https://blogreader.com.au) a while back from the other side - where website readers can choose which blogs they want to listen to. There's also something similar to my site offered by Pocket for free (although I don't like their AI voice very much personally).
Hey Bloggers & Content Creators,
We present Voicera, a platform that lets bloggers and content writers embed lifelike voice dictation of their blogs directly into their content, all in one click. They can reach busy readers via interactive blogs that speak, increase retention, and run in the background.
Extremely simple to use, and completely FREE. A new era of blogging is here. Don't miss out, content creators.
Use KJFPIY code to get extra 2000 credits.
Honestly what's the point of these? I mean sure, if you're driving I suppose, but you probably should be concentrating on driving lol. Also, dictation is what I do with Siri.
For people who actually need TTS on a daily, regular basis like me, we have our own, so services like these are pointless.
This might be beneficial for the visually impaired if the performance (closeness to a natural voice) is better than the default text-to-speech software on the OS. I think this is an important factor.
However OS makers can catch up and threaten the business model of this software by integrating a better TTS.
I wouldn't recommend using natural voices to read anything long. They definitely aren't fast enough. Maybe reading short emails and things like that, but when you get to long articles, it takes too much time.
I also think that when a voice pauses to breathe it sounds more creepy than natural, and wastes more time, rather than benefits me in any way.
A question - in the sample dictation on the site, it adds a voice annotation for "features" and "pricing categories". These weren't encoded in the HTML. How does it figure that?
Listen to the bad reading of “Let users listen to your articles while they shop, commute or do something else”. Having your text read by a dumb computer is yet another reason to use the Oxford comma.
[0] https://www.vocera.com/