Hacker News
by ganfortran 3368 days ago
> How difficult

Very difficult:

1. Machine learning is heavily data-dependent, and building those datasets is very expensive. Google is not likely to give them away for free, because they are its competitive advantage.

2. The infrastructure to train those models is hard to get outside of Google. Pretty sure it is tens or hundreds of GPUs, with InfiniBand-connected parameter servers, running for days or weeks.

Even with the source code published, people will still have to scratch their heads to duplicate Google's performance. Until some GNU-like organization democratizes data access for the public, and some mighty algorithm is discovered that dramatically reduces the computational requirements for training those models, Google succeeding just by being Google is unlikely to change.

3 comments

1. If you can pay for around 24.6 hours of VA speech data, you can get enough data to run this process with the same quality that Google presented (that's from the "Experiments" section). Not cheap (definitely not free, especially considering the amount of quality control you have to apply), but not expensive either.

2. You can rent out a 96GB GDDR5 GPU instance from Google's cloud for pretty cheap. (https://cloud.google.com/compute/docs/gpus/) I don't think you need anything more powerful than that (but feel free to prove me wrong).

I think your last paragraph is totally misguided/uninformed. You can download models for cheap/free (for non-commercial/edu use) from UPenn (https://www.ldc.upenn.edu/language-resources/data/obtaining). People don't give away models for free with 0 strings attached because they're a pain to make.

And if you want something you can run on a home computer for cheap/free, you can try DeepSpeech: https://github.com/mozilla/DeepSpeech. All you need is an Nvidia GPU.

How about feeding it several audiobooks read by a single narrator, coupled with the books in text? Cost would be < $100. There could be legal problems if you tried to sell the resulting voice, but as proof of concept, wouldn't this work?

1. What is your estimate then? Hundreds of dollars? Thousands? Tens of thousands? It probably falls into one of the latter two categories, since paying a professional speaker to sit and work for 24.6 hours is already about 1,000 dollars, if you assume a $40 hourly wage.

2. A 96GB GDDR5 instance on GCE costs 4166.4 dollars per month. That is within an affordable range, but definitely not CHEAP. I don't know whether this is powerful enough or not, but Google used 96 GPUs for their GNMT work. So I am not confident saying a 4-GPU machine is all you need, and it will surely cost much more if you go beyond that.
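For anyone who wants to play with the numbers in this subthread, here is a rough back-of-the-envelope sketch. It only uses figures quoted above (24.6 hours of speech, a $40 hourly wage, 4166.4 dollars per month for the GPU instance). The function names, the 30-day proration, and the example training durations are my own assumptions, not anything from the paper or from Google's pricing page.

```python
# Rough cost sketch using only the figures quoted in this thread.
SPEECH_HOURS = 24.6        # hours of speech data mentioned upthread
SPEAKER_RATE_USD = 40.0    # hourly wage assumed in the comment above
GPU_MONTHLY_USD = 4166.4   # quoted monthly price of the 96GB instance
DAYS_PER_MONTH = 30        # assumption: prorate over a 30-day month


def speaker_cost(hours: float, rate_usd: float) -> float:
    """Wages only; ignores studio time, retakes, and quality control."""
    return hours * rate_usd


def gpu_cost(training_days: float, monthly_usd: float = GPU_MONTHLY_USD) -> float:
    """Prorated rental cost of one instance for the given number of days."""
    return monthly_usd * training_days / DAYS_PER_MONTH


if __name__ == "__main__":
    recording = speaker_cost(SPEECH_HOURS, SPEAKER_RATE_USD)
    print(f"speaker wages: ${recording:,.2f}")
    # Hypothetical training durations, just to show how the total scales.
    for days in (7, 14, 30):
        total = recording + gpu_cost(days)
        print(f"{days:>2} days on one instance: ${total:,.2f} total")
```

This leaves out the expensive parts both sides mention (quality control on the recordings, failed training runs, scaling past one machine), so treat it as a lower bound rather than an estimate.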

You're describing the future: a 100% capitalistic society... I'm not sure I like it...

Right, even a good diphone voice needs lots of data. And I noticed they trained it with the existing Google Home voice actress, from whom they must already have many, many hours of recordings. I was mostly asking about the model itself; whether you could download TensorFlow and put one together based on this paper alone.

I see your points. But it is related. Even if you build what you think the paper describes, it is hard to know whether you did it right, because you cannot easily replicate the result. This already happens with a lot of CV papers, where people reimplement the model but it never gets as good as the paper demonstrated.

But you have a very good idea. Since it is Google Home, could some people just buy hundreds of them and endlessly ask them questions to gather the training data? That would be interesting to see.