About their TTS system: "These models provide speech synthesis with ~0.12 real-time factor on a GPU and ~1.02 on a CPU." The quality of the samples is really impressive, but wow, isn't this computationally too expensive for many applications?
>If, for example, it takes 8 hours of computation time to process a recording of duration 2 hours, the real time factor is 4. When the real time factor is 1, the processing is done in real time. It is a hardware-dependent value.
I think real-time factors smaller than 1 are faster than real-time (not slower) and use less than 100% of a resource's computational power to keep up.
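To make the arithmetic concrete, here is a rough sketch in Python. The function names are my own, not anything from the TTS project, and the 10-second utterance is just an illustrative length. It treats RTF as processing time divided by audio duration, which matches the quoted definition, and applies the ~0.12 and ~1.02 figures from the announcement.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent computing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Compute time needed to produce audio_seconds of speech at a given RTF."""
    return rtf * audio_seconds


if __name__ == "__main__":
    # The quoted example: 8 hours of computation for a 2-hour recording.
    print("quoted example RTF:", real_time_factor(8 * 3600, 2 * 3600))

    # Approximate figures from the announcement, applied to a
    # hypothetical 10-second utterance.
    for label, rtf in [("GPU", 0.12), ("CPU", 1.02)]:
        seconds = synthesis_time(rtf, 10.0)
        verdict = "faster than real time" if rtf < 1 else "slower than real time"
        print(f"{label}: {seconds:.1f} s of compute for 10 s of speech ({verdict})")
```

So at ~0.12 the GPU finishes well before the audio would finish playing, while at ~1.02 the CPU falls slightly behind.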
Not sure what you're quoting because I didn't write that, but
> I think real-time factors smaller than 1 are faster than real-time (not slower) and use less than 100% of a resource's computational power to keep up.
Sure, but who has the necessary GPUs installed? And on CPUs it will apparently take longer to generate speech than the duration of that speech. Unusable for many UIs and it will also drain the batteries of any portable device.
You're not wrong, but with so many chips incorporating some sort of dedicated "AI" or "tensor" functionality, perhaps the issue will resolve itself for most portable devices in a few years. Plus there's always the option of optimizing a little more and/or abusing other available hardware such as DSP chips to get the real-time factor down. Anything over 1 isn't great, but it's not a bad start.