by pyryt 871 days ago
Knowing when to speak is actually a prediction task in itself. See e.g. https://arxiv.org/abs/2010.10874

It would indeed be great to get something like this integrated with Whisper, an LLM, and TTS.

2 comments

Hard for me to imagine that this could be solved in text space. I think the prediction task needs to be done on the audio.
We thought about doing this in Whisper itself, since it's already working in the audio space.
Yes, this is something we want to look into in more detail. Really appreciate you sharing the research.