
by benjismith 462 days ago

Is there a way to get "speech marks" alongside the generated audio?

FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:

{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}

{"time":732,"type":"word","start":7,"end":11,"value":"it's"}

{"time":932,"type":"word","start":12,"end":16,"value":"nice"}

{"time":1193,"type":"word","start":17,"end":19,"value":"to"}

{"time":1280,"type":"word","start":20,"end":23,"value":"see"}

{"time":1473,"type":"word","start":24,"end":27,"value":"you"}

{"time":1577,"type":"word","start":28,"end":33,"value":"today"}

AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...

The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.

https://docs.aws.amazon.com/polly/latest/dg/output.html
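For the highlighting use case, here is a rough sketch of how a player might consume those marks: parse the JSONL, then look up the word being spoken at a given playback position. The function names `parse_speech_marks` and `word_at` are made up for illustration, and `TEXT` is only my guess at the source string, reconstructed from the start/end offsets in the sample above.

```python
import bisect
import json

# Sample marks copied from the post above.
SAMPLE = """
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
"""

# Assumption: the original source string, inferred from the offsets.
TEXT = "Hello, it's nice to see you today"


def parse_speech_marks(jsonl_text):
    """Parse JSONL speech marks and return the word marks sorted by time."""
    marks = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    words = [m for m in marks if m["type"] == "word"]
    return sorted(words, key=lambda m: m["time"])


def word_at(marks, position_ms):
    """Return the mark for the word being spoken at position_ms, or None
    if playback has not reached the first word yet."""
    times = [m["time"] for m in marks]
    i = bisect.bisect_right(times, position_ms) - 1
    return marks[i] if i >= 0 else None


marks = parse_speech_marks(SAMPLE)
current = word_at(marks, 1000)
# Slice the source string with the mark's offsets to get the span to highlight.
print(TEXT[current["start"]:current["end"]])  # nice
```

A real player would call something like `word_at` on each time update from the audio element and highlight the returned span.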

2 comments

minimaxir 462 days ago

Passing the generated audio back to GPT-4o to ask for the structured annotations would be a fun test case.


jeffharris 462 days ago

this is a good solve. we don't support word time stamps natively yet, but are working on teaching GPT-4o that skill


celestialcheese 462 days ago

whisper-1 has this with the verbose_json output. It offers word-level and segment-level timestamps, and works fairly well.

Looks like the new models don't have this feature yet.
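If you go the transcription route, a rough sketch of turning word timestamps into Polly-style speech marks might look like the following. It assumes you have already pulled the word list out of the verbose_json response as plain dicts with `word`, `start`, and `end` keys (times in seconds); check that shape against the actual response. The helper name `words_to_speech_marks` and the sample values are made up for illustration.

```python
import json


def words_to_speech_marks(words, text):
    """Convert word timestamps (seconds) into Polly-style word marks.

    Offsets into `text` are recovered by searching forward for each word,
    so a word the transcriber spelled differently is simply skipped.
    """
    marks = []
    cursor = 0
    for w in words:
        token = w["word"].strip()
        idx = text.find(token, cursor)
        if idx == -1:
            continue  # could not align this word with the source text
        marks.append({
            "time": round(w["start"] * 1000),  # seconds -> milliseconds
            "type": "word",
            "start": idx,
            "end": idx + len(token),
            "value": token,
        })
        cursor = idx + len(token)
    return marks


# Illustrative input only; not real transcription output.
sample_words = [
    {"word": "Hello", "start": 0.006, "end": 0.5},
    {"word": "it's", "start": 0.732, "end": 0.9},
    {"word": "nice", "start": 0.932, "end": 1.1},
]
source_text = "Hello, it's nice to see you today"

speech_marks = words_to_speech_marks(sample_words, source_text)
for mark in speech_marks:
    print(json.dumps(mark, separators=(",", ":")))
```

The forward search is the weak point: transcription can normalize punctuation or numbers, so anything that fails to match is dropped rather than guessed at.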
