It's hard to say! We don't know exactly how many parameters or minutes of audio are needed to fully describe someone's voice and speaking patterns. Maybe one or two minutes, maybe much more.
I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned heavily. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. But even so, parametric models tend to be much smaller and more flexible, so there may be applications that WaveNet-style systems can handle and concatenative systems can't (high-quality on-device TTS, emotive TTS, speaker synthesis for new unheard speakers, etc.).
A simpler problem could be to identify someone based on voice. Is that problem already solved? And can we use this to solve the problem of generating someone's voice?
That has been possible for years, and is even a typical student assignment in speech processing courses. A quick search gave this example course at Cornell
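For a feel of what such an assignment might look like, here is a minimal sketch of closed-set speaker identification by nearest centroid. It is not taken from any particular course: the two-dimensional feature vectors, the speaker names, and the helpers `centroid` and `identify` are all made up for illustration, and a real system would extract features (e.g. MFCCs) from audio and use a stronger model.

```python
import math

def centroid(vectors):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def identify(features, enrolled):
    """Return the enrolled speaker whose centroid is closest (Euclidean)."""
    return min(enrolled, key=lambda name: math.dist(features, enrolled[name]))

# Hypothetical enrollment data: a few feature vectors per known speaker.
enrolled = {
    "alice": centroid([[1.0, 2.0], [1.2, 1.8]]),
    "bob": centroid([[4.0, 0.5], [3.8, 0.7]]),
}

# Features from a new utterance (also made up) are matched to the closest speaker.
speaker = identify([1.1, 2.1], enrolled)
print(speaker)
```

Note that this only picks among speakers it has already seen, which is a much easier task than generating new speech in someone's voice.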
AFAIK VoCo isn't creating anything from thin air; instead, it scans the available voice data (it reportedly needs a sample of about 20 minutes of a person speaking) and copies fragments of it in a specific order to create a sentence.
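To make the fragment-copying idea concrete, here is a toy sketch of concatenative synthesis in general. It is not VoCo's actual method, which isn't public as far as I know: the `unit_db` dictionary, its sample values, and the `concatenate_units` helper are hypothetical, and real systems pick sub-word units and smooth the joins between them.

```python
def concatenate_units(text, unit_db):
    """Build one waveform (a list of samples) by joining stored per-word fragments."""
    waveform = []
    for word in text.lower().split():
        if word not in unit_db:
            # A purely concatenative system cannot say what it has never recorded.
            raise KeyError(f"no recorded fragment for {word!r}")
        waveform.extend(unit_db[word])
    return waveform

# Hypothetical database: word -> short list of audio samples cut from recordings.
unit_db = {
    "hello": [0.1, 0.3, -0.2],
    "world": [0.0, -0.4, 0.5, 0.2],
}

samples = concatenate_units("Hello world", unit_db)
print(len(samples))
```

The `KeyError` branch shows the main limitation: anything missing from the recordings can't be produced, which is where parametric models have more room to generalize.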