by wanderfowl 2581 days ago
I'm a speech scientist. This paper is a neat idea, and the results are interesting, but not in the way I'd expected. I had hoped it would be interesting in the domain of how much person-specific information this can deduce from a voice, e.g. lip aperture, overbite, size of the vocal tract, openness of the nares. That would be interesting from a speech perception standpoint. Instead, it's interesting more in the domain of how much social information it can deduce from a voice. This appears to be a relatively efficient classifier for gender, race, and age, taking voice as input.

I'm sure this isn't the first time it's been done, but it's pretty neat to see it in action, and it's a worthwhile reminder: If a neural net is this good at inferring social, racial, and gender information from audio, humans are even better. And the idea of speech as a social construct becomes even more relevant.


Do you think it's more difficult to guess physiological features from a voice or a voice from a picture?

I'm mostly deaf (cochlear implant) and one thing I've noticed is that if I watch things without my processor on (i.e., completely deaf), I can generally "guess" what a voice sounds like fairly accurately... I've wondered for a long time if it's a trick of my mind, a quirk of statistics, or something that's actually possible.

In both cases, there are a lot of hidden variables. With voice, you miss out on non-acoustic things like beards, cheekbones, and other sorts of face-distinguishing features.

With just a face, you miss things like the fundamental frequency (pitch) of the voice, dialect, and other linguistic variables.

In both cases, much is missing, and impossible to reconstruct beyond a stereotype.

I think that's part of the motivation for Blindpad: https://github.com/blindpad

It's a tool for pair programming in interviews. It includes audio (no video), and alters the audio to reduce/eliminate cues that would indicate the interviewee's race, age, etc.

> If a neural net is this good at inferring social, racial, and gender information from audio, humans are even better.

Why would humans automatically be better than machines at that task?

We don't know this for sure, certainly, but given that things like social group, race and gender are fundamentally sociocultural phenomena (albeit with some physiological basis in some cases), I would assume that humans will have a considerable advantage. We are natively social beings with decades of social knowledge and learning, whereas these sorts of algorithms are at best seeing these things as epiphenomena in large datasets.

Plus, we have the advantage of understanding what social cues certain speech traits directly 'index', or serve to mark. For instance, I'll bet you can picture a voice of somebody who you could clearly identify as white and male, but who would be exceedingly unlikely to have a long, bushy beard and wear a camouflage jacket. These associations are not anatomical but social, and they are not coincidence but broadcast social information. Sure, with enough data, we might be able to pick up on these as sort of emergent stereotypes, but we're attuned to such cues through our social experience. And these things are culturally specific, perhaps more so than a YouTube dataset would be.

I view this as a similar situation to using ML for evaluating things like humor, irony, or aesthetic beauty in cloudscapes: They might be able to bootstrap a model which starts with human judgements, or cluster things in such a way that a 'funny' category emerges, but they're a ways off from understanding the categories themselves, and I think that's relevant.

I think that's the scary thing. We don't even know if we know. It's all subconscious.

For example, most people can easily picture a person's gender, race, age, and where they are from based on accent.

But I never realized that I also picture how fat they are, and can do it pretty well! It wasn't until I saw that this project can do it very reliably that I realized that I do it all the time too.

What else are we subconsciously picking up on? And as a counter defense, how can we better hide it? Do I need to change my vocabulary and topic choices to something more posh so they think I am eating healthier? What other info leaks are there?

This is a bit different, but it's also an example that made me realize I unconsciously recognize some things I'm unaware of (the difference between pouring hot and cold water): https://www.youtube.com/watch?v=Ri_4dDvcZeM

Actually, I see no reason for humans to be any better at these tasks.