|
We don't know this for sure, but given that things like social group, race and gender are fundamentally sociocultural phenomena (albeit with some physiological basis in some cases), I would assume that humans will have a considerable advantage. We are natively social beings with decades of social knowledge and learning, whereas these sorts of algorithms are at best seeing these things as epiphenomena in large datasets. Plus, we have the advantage of understanding what social cues certain speech traits directly 'index', or serve to mark. For instance, I'll bet you can picture the voice of somebody who you could clearly identify as white and male, but who would be exceedingly unlikely to have a long, bushy beard and wear a camouflage jacket. These associations are not anatomical but social, and they are not coincidence but broadcast social information. Sure, with enough data, a model might be able to pick up on these as sort of emergent stereotypes, but we're attuned to such cues through our social experience. And these things are culturally specific, perhaps more so than a YouTube dataset would be. I view this as a similar situation to using ML for evaluating things like humor, irony, or aesthetic beauty in cloudscapes: they might be able to bootstrap a model which starts with human judgements, or cluster things in such a way that a 'funny' category emerges, but they're a ways off from understanding the categories themselves, and I think that's relevant. |