I agree with the comment above. I have not logged into hacker news in _years_ but did so today just to weigh in here. If people are saying that the audio sounds great, then there is definitely something going on with a subset of users where we are only hearing garbled words with a LOT of distortion. This does not sound like natural speech to met at all. It sounds more like a warped cassette tape. And I do not mean to slight your work at all. I am actually incredibly puzzled here to understand why my perception of this is so radically different from others!
Also keep in mind the processing time. The ^ article above used a NVIDIA L4 with 24-GB VRAM. Sopro claims 7.5 second processing time on CPU for 30 seconds of audio!
If you want to get real good quality TTS, you should check out elevenlabs.io