by stephensonsco
2744 days ago
Best to say "yes! but only some of the time". It's something we're working on right now. You can be 80% accurate by some metric and still not be good enough to pass a human's sniff test. Good speaker-labeled audio in various settings is hard to find.

There are several ways to look at this problem, too:

L1: exact speaker is known (voiceprint) and can be picked from all humans with accuracy, even when others are talking
L2: exact speaker is known from a subset of people, even while talking in a conversation with others
L3: speakers 1, 2, 3, ... are identified accurately
L4: speaker changes are identified accurately

L1 is a really hard problem. L2 is fine if you don't care about the time domain (knowing exactly when they spoke), but it gets harder if you have to accurately detect changes. L3 is about as hard as L2, but the big goal isn't who anymore, it's when. And L4 is easier, kinda like putting in line breaks when a human transcribes a file. Not too bad. All of them need better data sources.
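
To make the L3 vs. L4 distinction concrete, here's a rough sketch (not our actual pipeline). It assumes you already have segments as (start, end, label) tuples from some diarizer. The `Segment` type and the `change_points` helper are made-up names for illustration. The point is that L4 only needs the times where the label flips, so it doesn't care what the speakers are called.

```python
from typing import List, Tuple

# Hypothetical segment format: (start_sec, end_sec, speaker_label)
Segment = Tuple[float, float, str]


def change_points(segments: List[Segment]) -> List[float]:
    """Return the start times at which the speaker label differs from the previous segment."""
    ordered = sorted(segments)  # sort by start time
    return [
        cur[0]
        for prev, cur in zip(ordered, ordered[1:])
        if cur[2] != prev[2]
    ]


# L3-style output: anonymous but consistent labels.
segments = [
    (0.0, 4.2, "speaker1"),
    (4.2, 9.0, "speaker2"),
    (9.0, 12.5, "speaker2"),   # same speaker continues, so no change here
    (12.5, 15.0, "speaker1"),
]

# Same audio, labels swapped: "who" differs, "when" does not.
relabeled = [
    (start, end, "speaker2" if label == "speaker1" else "speaker1")
    for start, end, label in segments
]

print(change_points(segments))
print(change_points(relabeled))
```

Both calls give the same list of change times, which is why L4 is the easier target: you can get the line breaks right without ever getting the identities right.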