Cool, it's impressive how much can it do with a short sample, although this seems like an easy way for end users to deep fake their friends / enemies saying something.
Maybe the solution is to have a randomly generated paragraph of text to read which expires in short amount of time. So you can't predict it and you don't have enough time to splice together a fake reading from something else.
The problem with any anti abuse measure is someone can create another project which does not have any of this. There are a handful of projects which can do pretty good voice synthesis right now. It would be about as easy as getting a consensus for all photo editing tools to place a watermark on the image to prevent abuse.
In the demo we specifically disallowed bulk uploads to hinder such abuses.
[1] https://github.com/coqui-ai/TTS/discussions/1036