|
Is there a way to get "speech marks" alongside the generated audio?

FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (plus a start/end index into your original source string), delivered as a stream of JSONL objects, like this:

{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}

AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service. The sentence and word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for lip-syncing a facial model.

https://docs.aws.amazon.com/polly/latest/dg/output.html |
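To make the use case concrete, here is a rough sketch of how a client might consume word marks for text highlighting. It is only an illustration: the helper names (`parse_speech_marks`, `mark_at`, `source_span`) are made up for this example, and the `TEXT` string is my guess at the input that would produce the marks above. It also assumes that `start`/`end` are byte offsets into the UTF-8 encoded source, which is how I read the Polly docs; if a service reported character offsets instead, a plain string slice would do.

```python
from __future__ import annotations

import bisect
import json

# Assumed source string; chosen so the offsets in the sample marks line up.
TEXT = "Hello, it's nice to see you today"

MARKS_JSONL = """\
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
"""


def parse_speech_marks(jsonl: str) -> list[dict]:
    """Parse a JSONL payload: one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl.splitlines() if line.strip()]


def mark_at(marks: list[dict], playback_ms: int) -> dict | None:
    """Return the most recent mark at or before playback_ms.

    Assumes marks are sorted by "time". Returns None if playback has not
    yet reached the first mark.
    """
    times = [m["time"] for m in marks]
    i = bisect.bisect_right(times, playback_ms) - 1
    return marks[i] if i >= 0 else None


def source_span(text: str, mark: dict) -> str:
    """Slice the source text using the mark's start/end offsets.

    Assumption: offsets are byte offsets into the UTF-8 encoded text.
    """
    return text.encode("utf-8")[mark["start"]:mark["end"]].decode("utf-8")


marks = parse_speech_marks(MARKS_JSONL)
current = mark_at(marks, 1000)  # playback position: one second in
if current is not None:
    print(current["value"], "->", source_span(TEXT, current))
```

A UI would call something like `mark_at` on each playback-position update and highlight the returned span in the displayed text.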