by oezi 245 days ago
> generated from text via a text-to-speech

Yes, frustratingly we don't have good speech-to-text (STT/ASR) to transcribe such differences.

I recently finetuned a TTS* to emit laughter, and hunting for transcriptions that include non-verbal sounds was the hardest part. Whisper and other popular transcription systems ignore sighs, sniffs, laughs, etc., and can't detect mispronunciations.

* = https://github.com/coezbek/PlayDiffusion
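
For illustration, a minimal sketch of that kind of hunt, assuming the candidate transcripts mark non-verbal sounds with bracketed or parenthesized tags such as `[laughs]` or `(sighs)`. The tag convention, the word list, and the function names are hypothetical and not taken from PlayDiffusion or any particular dataset:

```python
import re

# Assumption: non-verbal events appear as bracketed or parenthesized tags,
# e.g. "[laughs]" or "(sighs)". Real datasets may use other conventions.
NONVERBAL = re.compile(
    r"[\[(]\s*(laugh|sigh|sniff|cough|gasp)\w*\s*[\])]",
    re.IGNORECASE,
)

def has_nonverbal(transcript: str) -> bool:
    """Return True if the transcript contains at least one non-verbal tag."""
    return NONVERBAL.search(transcript) is not None

def filter_nonverbal(transcripts):
    """Keep only the transcripts that include a non-verbal tag."""
    return [t for t in transcripts if has_nonverbal(t)]
```

A filter like this only finds sounds that someone already annotated; it does nothing for audio whose transcript silently dropped the laugh or sigh.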


IIRC, the 15.ai dev was training on fan-made "My Little Pony" transcriptions specifically because they included more emotive cues and supported a syntax to control the emotive aspect of the speech.
Where can I read about this?
> During this phase, 15 discovered the Pony Preservation Project, a collaborative project started by /mlp/, the My Little Pony board on 4chan.[47] Contributors of the project had manually trimmed, denoised, transcribed, and emotion-tagged thousands of voice lines from My Little Pony: Friendship Is Magic and had compiled them into a dataset that provided ideal training material for 15.ai.[48]

From https://en.wikipedia.org/wiki/15.ai#2016%E2%80%932020:_Conce...

I had no idea this existed, the internet is amazing