by stephensonsco 2744 days ago

Best to say "yes! but only some of the time". It's something we're working on right now. You can be 80% accurate, by some metric, and it's usually still not good enough to pass a human's sniff test. Good speaker-labeled audio in a variety of settings is hard to find.
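To make the "80% accurate, by some metric" point concrete, here is a toy sketch. The metric (per-frame label accuracy), the function names, and the label sequences are all made up for illustration; they are not what any real system or benchmark uses.

```python
def frame_accuracy(ref, hyp):
    """Fraction of frames whose speaker label matches the reference."""
    if len(ref) != len(hyp):
        raise ValueError("label sequences must have the same length")
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)


def count_changes(labels):
    """Number of positions where the label differs from the previous frame."""
    return sum(a != b for a, b in zip(labels, labels[1:]))


# Hypothetical data: 20 frames, speaker A for 10 frames, then speaker B for 10.
ref = list("AAAAAAAAAABBBBBBBBBB")
# Hypothetical system output: only 4 frames are wrong, but they are scattered.
hyp = list("AABAAAABAABBBABBBBAB")

print(frame_accuracy(ref, hyp))  # 0.8
print(count_changes(ref))        # 1 real speaker change
print(count_changes(hyp))        # 9 reported speaker changes
```

Four wrong frames out of twenty scores 80%, yet the output claims nine speaker changes where there was one, which is the kind of thing a person reading the transcript notices immediately.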

There are several ways to look at this problem too.

L1: exact speaker is known (voiceprint) and can be picked from all humans with accuracy, even when others are talking

L2: exact speaker is known from a subset of people, even while talking in a conversation with others

L3: speaker1,2,3,... are identified accurately

L4: speaker changes are identified accurately

L1 is a really hard problem. L2 is fine if you don't care about the time domain (knowing exactly when they spoke), but it gets harder if you have to accurately detect changes. L3 is about as hard as L2, but the big goal isn't who anymore, it's when. And L4 is easier, kind of like putting in line breaks when a human is transcribing a file. Not too bad. All of them need better data sources.
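A rough sketch of how L2 and L4 differ, assuming each short audio segment has already been turned into a fixed-length embedding vector by some model. The 2-D vectors, the names, and the 0.8 threshold are all hypothetical; real embeddings have many more dimensions and the threshold would need tuning.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identify(segment, enrolled):
    """L2-style: pick the enrolled speaker whose voiceprint is most similar."""
    return max(enrolled, key=lambda name: cosine(segment, enrolled[name]))


def change_points(segments, threshold=0.8):
    """L4-style: indices where a segment looks unlike the one just before it."""
    return [i for i in range(1, len(segments))
            if cosine(segments[i - 1], segments[i]) < threshold]


# Hypothetical enrolled voiceprints and per-segment embeddings.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.95, 0.05], [0.1, 0.9], [0.2, 0.8]]

print([identify(s, enrolled) for s in segments])  # ['alice', 'alice', 'bob', 'bob']
print(change_points(segments))                    # [2]
```

Note that `change_points` never needs the enrolled voiceprints: it only compares neighbouring segments, which is part of why L4 is the easier problem. L3 would sit in between, clustering the segments into speaker1, speaker2, ... without knowing who they are.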