> Given <text, audio> pairs, the model can be trained completely from scratch with random initialization.