by benjismith 462 days ago
Is there a way to get "speech marks" alongside the generated audio?

FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:

{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}

AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...

The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.

https://docs.aws.amazon.com/polly/latest/dg/output.html

2 comments

Passing the generated audio back to GPT-4o to ask for the structured annotations would be a fun test case.
this is a good solve. we don't support word time stamps natively yet, but are working on teaching GPT-4o that skill
whisper-1 has this with the verbose_json output. Has word level and sentence level, works fairly well.

Looks like the new models don't have this feature yet.
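If you go the transcription route, the word timestamps still need to be mapped back onto the original string to get Polly-style marks. Below is a rough Python sketch of that step. It assumes the transcription result exposes a list of word entries shaped like `{"word": ..., "start": ..., "end": ...}` with times in seconds (that is my understanding of the `verbose_json` word-level output, but treat the shape as an assumption), and `words_to_speech_marks` is a hypothetical helper, not an API from any library. Note that it emits character offsets, not byte offsets.

```python
def words_to_speech_marks(text: str, words: list[dict]) -> list[dict]:
    """Align transcribed words to the source text and emit speech-mark-like dicts.

    Each input word is assumed to look like {"word": str, "start": float, "end": float},
    with times in seconds. Words that cannot be found in the text are skipped.
    """
    marks = []
    cursor = 0  # search forward only, so repeated words map to successive occurrences
    for w in words:
        token = w["word"].strip()
        if not token:
            continue
        idx = text.find(token, cursor)
        if idx == -1:
            continue  # transcript differs from the source here; leave a gap
        marks.append({
            "time": round(w["start"] * 1000),  # seconds -> milliseconds
            "type": "word",
            "start": idx,                      # character offsets into text
            "end": idx + len(token),
            "value": token,
        })
        cursor = idx + len(token)
    return marks


if __name__ == "__main__":
    source = "Hello, it's nice to see you today"
    transcribed = [
        {"word": "Hello", "start": 0.006, "end": 0.5},
        {"word": "it's", "start": 0.732, "end": 0.9},
    ]
    for mark in words_to_speech_marks(source, transcribed):
        print(mark)
```

The exact-match lookup is deliberately naive: if the transcript normalizes casing or punctuation differently from the source text, those words are dropped, and a real alignment would need fuzzier matching.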