Hacker News
Ask HN: Need for a human-powered text-to-speech API?
3 points by leahcim 2749 days ago
We are working with a network of US-based people who are good on the phone.

Is anyone interested in an API that would accept text as input and return an MP3 of someone reading the text within a couple of hours? We have a couple of US-based people who could do the job really well in a couple of minutes.

Command: POST /tts { text: "Hello John. Thanks for joining us today.", voice: "female", webhook: "../webhook/response" }

Webhook response (a few minutes later): POST /webhook/response { file: "voice.mp3", cost: 0.07 }
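To make the request/callback shape concrete, here is a minimal Python sketch of the two payloads. The field names follow the post (with `webhook` written as one word); the helper names `build_tts_request` and `parse_webhook_response` are hypothetical, and nothing here calls a real service.

```python
import json
from typing import Tuple


def build_tts_request(text: str, voice: str, webhook: str) -> str:
    """Serialize the body for the proposed POST /tts call (hypothetical helper)."""
    return json.dumps({"text": text, "voice": voice, "webhook": webhook})


def parse_webhook_response(body: str) -> Tuple[str, float]:
    """Pull the MP3 file name and the cost out of the proposed webhook callback."""
    payload = json.loads(body)
    return payload["file"], float(payload["cost"])


# Values taken from the example in the post.
request_body = build_tts_request(
    "Hello John. Thanks for joining us today.", "female", "../webhook/response"
)
file_name, cost = parse_webhook_response('{"file": "voice.mp3", "cost": 0.07}')
```

Returning the result through a webhook rather than the original response fits the stated turnaround of minutes to hours, since the caller does not have to hold a connection open while a person records the audio.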

Cost would be something like $1 per 100 words.
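As a rough check of that rate, the sketch below counts whitespace-separated words and charges one cent per word. The function name `estimate_cost` and the word-counting rule are assumptions for illustration; the post does not say how words would be counted.

```python
def estimate_cost(text: str, dollars_per_100_words: float = 1.0) -> float:
    """Estimate the price in dollars, assuming words are split on whitespace."""
    word_count = len(text.split())
    return round(word_count * dollars_per_100_words / 100, 2)


# The sample sentence from the post has 7 words.
sample_cost = estimate_cost("Hello John. Thanks for joining us today.")
```

Under these assumptions the sample sentence comes to $0.07, which matches the `cost` field shown in the webhook example.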

3 comments

> Cost would be something like $1 per 100 words.

A quick Google search suggests that voice acting rates (pay to the voice actor alone) tend to be in the range of $1/second for short, small-market bits (short bits with larger markets tend to have higher use fees on top). So it sounds like this service relies on finding people willing to work on demand for about 1/100 of market rates, with a much faster turnaround time than is typical, to have any room for profit.

Sure, if you’ve got quality voice talent there's a huge demand for that. OTOH, if you don't have quality voice talent, why would people pay for this instead of today's commercially available machine TTS, which is much lower latency and much cheaper (e.g., Google with their premium WaveNet voices at $16/million characters, or something on the order of $1/8000 words.)
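The per-character to per-word conversion can be sketched as follows. The figure of 8 characters per word (spaces and punctuation included) is an assumption chosen because it roughly reproduces the $1/8000 words estimate; a shorter average word length would make machine TTS look cheaper still. The function name `words_per_dollar` is hypothetical.

```python
def words_per_dollar(dollars_per_million_chars: float, chars_per_word: float) -> float:
    """Convert a per-character price into the number of words one dollar buys."""
    chars_per_dollar = 1_000_000 / dollars_per_million_chars
    return chars_per_dollar / chars_per_word


# $16 per million characters, assuming about 8 characters per word.
machine_words = words_per_dollar(16.0, 8.0)

# The proposed human service: $1 per 100 words.
human_words = 100.0

# How many times more words a dollar buys from machine TTS.
price_ratio = machine_words / human_words
```

With these inputs a dollar buys about 7,800 machine-read words versus 100 human-read words, a gap of roughly 78x.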

I'd wager that a latency of a couple hours is unacceptable for almost all TTS use cases.

Moreover, the current generation of TTS is pretty good and a lot of research is being done to improve it. You'd have a very limited amount of time to build your service and get users before the big players' TTS catches up, without the enormous latency or the cost of paying human wages.

Both Google and AWS have these APIs for pennies per minute. This market is going to be absolutely commoditized in no time. Are you thinking of using one of these APIs and slapping a front end on it?

Absolutely agree, it's a super crowded space. Question is: are you happy with the existing TTS APIs? They sound so robotic.

Honestly, unless you have some crazy tech there is no way you can compete with Google and AWS in this space. The Google API does this in real time too (think of what is backing Google Home, etc.). The new DeepMind WaveNet tech is getting way better at sounding natural [1]. I think your only option would be to use these APIs, slap a front end on them, and try to undercut everyone in the market (and quickly). But it is a race to the bottom, and you likely have a brief window to make some real money. Plus, this is typically a one-time purchase for most folks and not a subscription business. So you'll constantly be chasing customers.

I explored this idea, along with the speech-to-text option, and when you run the numbers you'll need thousands of hours per day just to keep the lights on. Probably not worth it given you'll constantly be tracking new customers down. One option might be to target news companies and try to make automated newscasts or something, and try to get consulting fees plus usage of your custom tech. But I suspect it would need to be the tech plus some other offering to differentiate you from everyone else who will be doing this.

Not trying to dissuade you. Just telling you what I think about it after looking at this and building out a few prototypes.

[1] https://cloud.google.com/text-to-speech/docs/wavenet

I wonder if some tech companies need more human audio samples to train their ML?