It's interesting to reflect on how quickly TTS/STT research has progressed. I remember reading [0] and thinking we were a long way off. And things will likely get much better with multi-task and multi-modal learning in the coming years (or, really, months).
In fact, just a year after this post was written, CoquiAI started their open-source projects [1].