by nkozyra 2805 days ago
1) I'm trying to figure out what NLP has to do with this problem. This is a classic collaborative filtering "problem."

2) I think Google is acutely aware that their results are driven by human behavior and thus are biased. It's the nature of its design.


Can you explain how this is collaborative filtering as opposed to a classic IR ranking problem? CF would suggest they are somehow getting user ratings of these scientists, but either way it's basically going to boil down to a similarity metric. I can't imagine how user data is creating these rankings, and I'm pretty confident that using IR techniques on the datasets they have would not return these either. Ergo, they are likely tweaking the factors themselves to return results that are "less biased", i.e. less representative of the underlying distribution and more normally distributed, aka politically correct.
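To make the "similarity metric" point concrete, here is a rough sketch of how the same cosine similarity can score either a query against a document (IR) or one user against another (CF). The vectors are made-up toy data and `cosine` is just an illustrative helper, not anything Google is known to use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# IR view: query and document as term-count vectors (toy data).
query = {"american": 1, "scientists": 1}
doc = {"american": 2, "scientists": 1, "inventors": 1}

# CF view: two users as item-rating vectors (toy data).
user_a = {"item1": 5, "item2": 3}
user_b = {"item1": 4, "item3": 2}

ir_score = cosine(query, doc)      # how well the document matches the query
cf_score = cosine(user_a, user_b)  # how alike the two users' ratings are
```

The math is identical in both cases; what differs is where the vectors come from (document text versus user behavior), which is really what the CF vs. IR question is about.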

But if you have a better theory of how 10 of the first 20 "american scientists" are black and 5 are women, I'd be interested to hear it.

Check Baidu: http://www.baidu.com/s?ie=utf-8&wd=%22american+scientists%22

Result #6 is the list of African-American inventors and scientists on Wikipedia. Unless Baidu has the same ideological biases as Google (would be strange), the most likely explanation is that it's driven by n-gram frequencies.

Yes, precisely that's what I would expect from an NLP system, because it will find "African American" (and, I would expect, "Chinese American", etc.) in documents more frequently than a plain "American", much like what this article mentions with bananas and no one ever mentioning yellow. Still, the approach would have to be pretty naive not to recognize that "X-American" is a subset of "American". It would be like not recognizing that a query for "anonymous function" is something different from a query for "function".
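As a rough illustration of that naive behavior, here is a sketch of a plain bigram match that counts which word precedes each occurrence of a phrase. The corpus is a made-up toy example and `preceding_words` is a hypothetical helper; this is not a claim about how any real search engine tokenizes or ranks.

```python
import re
from collections import Counter

def tokens(text):
    # Lowercase and split on anything that is not a letter, so
    # "African-American" becomes ["african", "american"].
    return re.findall(r"[a-z]+", text.lower())

def preceding_words(docs, phrase):
    """Count which word comes right before each occurrence of `phrase`."""
    target = tokens(phrase)
    n = len(target)
    counts = Counter()
    for doc in docs:
        toks = tokens(doc)
        for i in range(len(toks) - n + 1):
            if toks[i:i + n] == target:
                counts[toks[i - 1] if i > 0 else "<start>"] += 1
    return counts

docs = [  # made-up toy corpus
    "List of African-American scientists and inventors",
    "Notable African American scientists",
    "Chinese-American scientists in physics",
    "American scientists win prize",
]
hits = preceding_words(docs, "American scientists")
```

In this toy corpus all four documents match the bigram, but only one of them uses "American scientists" without a qualifier in front. A matcher that stops at the bigram cannot tell those cases apart.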

Here's the underlying data at duckduckgo: https://duckduckgo.com/?q=american+scientists&t=h_&ia=list

I'm still interested in a possible technique which could lead to this type of bias without it being explicit (or requiring google to have an extremely naive approach).

I don't see why you can't just accept that that naive approach is their approach? Those two words almost always occur together as part of "-American scientist." This happens to work very well in general for search engines. I don't think Google or DuckDuckGo is hoping their image page for American Scientist just returns African Americans and are therefore subtly changing their algorithm to that end.
> I don't see why you can't just accept that that naive approach is their approach?

I very strongly doubt their approach is based on substring search. They're obviously using a knowledge graph. And if you try a search for "American economists" or "American philosophers", the results look much more like what you'd expect. Either the "American" in this case is not a substring of "African-American", or they simply thought that economics and philosophy aren't as worthy of an equality boost as STEM disciplines.

> I very strongly doubt their approach is based on substring search.

You don't think Google search is using 2-grams? Do you think they're conspiring with other search engines? https://www.bing.com/images/search?q=american+scientist

Have you considered that there may not be as many African American economists as there are scientists, doctors and inventors, and that this is the reason the search behaves differently? Do you have any substantive basis for your claim that Google is racially biasing their results?
> Yes, precisely that's what I would expect from an NLP system

> I'm still interested in a possible technique which could lead to this type of bias without it being explicit (or requiring google to have an extremely naive approach).

I don't understand. If it's what you expect, then what's left to explain?