by kicat 1932 days ago
I only skimmed the paper, so I'm not claiming to know much about it, but one thing to keep in mind here is that a fair coin has a 50% accuracy using the same terminology as the headline. I'm not saying 72% is not an interesting achievement; it's just that "you can do about 50% better than random chance" describes my gut feeling about how much you could actually see in someone's face.
5 comments

They do note the random chance bit, and they also note that it's better than humans could judge on their own and even, surprisingly, better than judged by a personality questionnaire.

> Political orientation was correctly classified in 72% of liberal–conservative face pairs, remarkably better than chance (50%), human accuracy (55%), or one afforded by a 100-item personality questionnaire (66%).

Were the humans experts or just random people though?

The real question is whether the tool can beat a lookup table of age, race, and gender probabilities. The tool isn't going to be winning points for phrenology here. Weight, hair color, and hairstyle would also likely tell you a lot.

I don't have any particular reason to believe this tool wouldn't work, but let's not pretend it's getting there by phrenology-esque topologies of people's faces.

A randomly chosen black individual in the United States has a > 72% chance of leaning Democrat. A randomly chosen Hispanic individual has a ~55-65% chance of leaning Democrat. I don't find it crazy to imagine they've got a few other smaller features to boost it.

It further notes:

> Accuracy remained high (69%) even when controlling for age, gender, and ethnicity.

Did it? If you control for gender but not sex, you can use the difference to predict ideology. And for ethnicity, there are subethnicities that matter too - white Italian and white German have different proclivities.
How does one calculate a metric like that?
Simplistically, let's take the above statistic "A randomly chosen black individual in the United States has a 72% chance of leaning Democrat" at face value. The baseline for that group is then not a 50-50 coin flip, because someone of that race in that country does not have a 50-50 chance. So you would adjust the chance to 72-28 and compare that to the facial recognition results. If you find that the results are the same, then you know that the facial recognition is not picking up on anything beyond race. If the results are different, you know the FR is picking up on something in addition to race.

Really it is more complex than that, but fundamentally you try to say “how accurate can we be using just age, gender, and ethnicity” and use that as your controlled benchmark.
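
To make the "controlled benchmark" idea concrete, here is a minimal sketch. Everything in it is made up for illustration: the groups "A" and "B", the labels, and the model predictions are hypothetical, and this is not the paper's procedure. The baseline always guesses the most common label within each demographic group, and the model only counts as seeing something extra if it beats that.

```python
from collections import Counter, defaultdict

# Hypothetical records: (demographic_group, true_label, model_prediction).
# "L" = liberal, "C" = conservative. All values are invented.
records = [
    ("A", "L", "L"), ("A", "L", "L"), ("A", "L", "L"), ("A", "C", "C"),
    ("B", "C", "C"), ("B", "C", "C"), ("B", "L", "L"), ("B", "C", "L"),
]

# Baseline: for each group, always guess that group's most common label.
by_group = defaultdict(Counter)
for group, label, _ in records:
    by_group[group][label] += 1
majority = {g: counts.most_common(1)[0][0] for g, counts in by_group.items()}

baseline_acc = sum(majority[g] == y for g, y, _ in records) / len(records)
model_acc = sum(pred == y for _, y, pred in records) / len(records)

print(f"demographics-only baseline: {baseline_acc:.3f}")
print(f"model:                      {model_acc:.3f}")
```

Note that the baseline here is fitted and scored on the same records, which flatters it slightly; a real comparison would use held-out data.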

I understand what they're implying by "adjusted accuracy". My point is that I'm not sure that metric really makes sense, because "accuracy" isn't a particularly useful metric to begin with. It depends entirely on the sample distribution. "Always guess not fraud" will be 99.9% "accurate" for most use cases.
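
A quick sketch of the fraud example, assuming a hypothetical 0.1% fraud rate (the numbers are invented, not from the paper): a classifier that ignores its input still scores 99.9% accuracy while catching nothing.

```python
# Hypothetical data: 1,000 transactions, exactly 1 of them fraudulent.
labels = [1] + [0] * 999  # 1 = fraud, 0 = not fraud


def always_not_fraud(_transaction):
    """Constant classifier: ignores its input entirely."""
    return 0


predictions = [always_not_fraud(i) for i in range(len(labels))]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"accuracy     = {accuracy:.3f}")
print(f"fraud caught = {fraud_caught}")
```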

I'm asking what the literal metric is.

edit: and I don't think your explanation really works for accuracy, because accuracy isn't a relative measure, like, say, R².

They explain: they tested predictions on pairs of faces of the same gender, ethnicity and age. The result was 69% instead of 72% apparently.
Hmm, the actual phrasing is:

> The accuracy is expressed as AUC, or a fraction of correct guesses when distinguishing between all possible pairs of faces—one conservative and one liberal.

I've never seen something like this. Maybe this is a normal procedure?

But I would be worried that the number of old black conservative women would be really small. Seems a bit sketchy
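
For what it's worth, "fraction of correctly ordered positive/negative pairs" is a standard way to read AUC. Here is a minimal sketch of that pairwise computation. The scores are made up, and `pairwise_auc` is a name chosen for this example, not something from the paper.

```python
from itertools import product


def pairwise_auc(liberal_scores, conservative_scores):
    """Fraction of (liberal, conservative) pairs in which the liberal face
    receives the higher 'liberal' score. Ties count as half a win."""
    wins = 0.0
    for lib, con in product(liberal_scores, conservative_scores):
        if lib > con:
            wins += 1.0
        elif lib == con:
            wins += 0.5
    return wins / (len(liberal_scores) * len(conservative_scores))


# Invented classifier scores (higher = "more liberal").
liberal = [0.9, 0.6, 0.4]
conservative = [0.7, 0.3, 0.2]

print(pairwise_auc(liberal, conservative))  # 7 of 9 pairs ordered correctly
```

Because every liberal face is compared with every conservative face, the result does not depend on how many of each are in the sample, although a very small subgroup still makes the estimate noisy.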

By performing the analysis within each of those subgroups.
`f(x) = return 'Liberal'` will get you a great accuracy running the analysis within a subset of black women.
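
A sketch of that point, assuming a hypothetical subgroup that is 90% liberal (an invented split): the constant `f` gets 90% plain accuracy, but on the paired liberal-vs-conservative comparison the paper describes, it ties every pair and lands at 50%.

```python
# Hypothetical subgroup: 9 liberals and 1 conservative (made-up split).
labels = ["Liberal"] * 9 + ["Conservative"]
faces = list(range(len(labels)))  # stand-ins for face images


def f(_face):
    """Constant classifier from the comment above."""
    return "Liberal"


# Plain accuracy within the subgroup.
accuracy = sum(f(x) == y for x, y in zip(faces, labels)) / len(labels)

# Pairwise comparison: score each face, then check every
# (liberal, conservative) pair. A tie counts as a coin flip (0.5).
scores = [1.0 if f(x) == "Liberal" else 0.0 for x in faces]
lib_scores = [s for s, y in zip(scores, labels) if y == "Liberal"]
con_scores = [s for s, y in zip(scores, labels) if y == "Conservative"]
pair_results = [
    1.0 if lib > con else 0.5 if lib == con else 0.0
    for lib in lib_scores
    for con in con_scores
]
pairwise = sum(pair_results) / len(pair_results)

print(f"plain accuracy    = {accuracy:.2f}")
print(f"pairwise accuracy = {pairwise:.2f}")
```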
It says in the article that humans got just 55% (so 5 percentage points, or 10% in relative terms, better than random chance) on the same test.
I wonder if that 55% is from mturk or other survey sites that can be somewhat questionable in terms of quality with how much people are paying attention versus maximizing their hourly survey earnings.
It says on a similar test - it's a reference to a different study with a different data set.
The humans are probably overthinking it. You get ~55% by answering "Biden" for everybody.
In fact, the dating site dataset was ~54% conservative according to their explanations of included data, but the point stands.
According to Wikipedia only 51.3% voted for Biden/Harris.
The dataset was not restricted to voters.
Do you have data that includes non-voters? I haven't seen any; most polls are limited to voters.
There were tons of national polls done for Trump's approval that included all adults (instead of likely or registered voters). Trump fared noticeably worse in the polls of all adults throughout his presidency.
72% is a very significant deviation from 50% though, I wasn't expecting such a result.

>The highest predictive power was afforded by head orientation (58%), followed by emotional expression (57%). Liberals tended to face the camera more directly, were more likely to express surprise, and less likely to express disgust.

Emotional expression makes some sense in hindsight, but I wouldn't have thought that head orientation would correlate. It's interesting to know how we betray ourselves with these minute details of body language.

Though, this seems a bit odd; how many people are expressing _disgust_ in dating profiles?
Those categorizations are opposites on a continuum of various kinds of muscle tension, notably brow scrunch muscles. Surprise is literally brow up and jaw slack, disgust involves brow scrunch and lip tightening. Facial expressions also bleed through to our experienced emotions, so going around with face scrunched a lot will MAKE you more suspicious and disgusted with things.
This is an important point - 72% is interesting, but it's 22 percentage points added to the chance of guessing correctly... still interesting, though.
A 72% score on this scale is equivalent to confidently knowing 44% of the answers, and coin-flipping the rest.

It's doing something that untrained humans are not capable of [edited to add: although humans were apparently tested by a different method, so this is not properly comparable], but is still a failing grade by usual methods of assessing human knowledge.
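
The 44% figure above is just arithmetic: if you know a fraction k of the answers for certain and coin-flip the rest, your expected score is k + (1 - k) / 2, so k = 2 * score - 1. A small sketch (the function names are made up for this example):

```python
def expected_score(known_fraction):
    """Expected score when you know `known_fraction` of the answers
    and guess the remainder with a fair coin."""
    return known_fraction + (1 - known_fraction) * 0.5


def known_fraction_for(score):
    """Invert expected_score: the fraction you must know to reach `score`."""
    return 2 * score - 1


print(f"{expected_score(0.44):.2f}")      # score when knowing 44%
print(f"{known_fraction_for(0.72):.2f}")  # the algorithm's 72%
print(f"{known_fraction_for(0.55):.2f}")  # the quoted human 55%
```

By the same measure, the quoted 55% human accuracy corresponds to confidently knowing about 10% of the answers.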