by chimeracoder 4322 days ago
> If this is the #3 search term, the data must be thin.

I can see why one might think that, but it's actually not necessarily the case.

I'm not sure how they're defining 'highest correlation to our index' (there are a few different ways to interpret that statement), but for example, with latent Dirichlet allocation (LDA)[0][1], a word or phrase that is rare overall but almost exclusively limited to a given subgroup might still end up being one of the top associations for that group.
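
To make that concrete, here's a rough sketch. It is not LDA and not what anyone actually ran; it swaps in a much simpler "lift" score (how often a term shows up in a group's documents relative to how often it shows up overall), just to show the effect. The function name, the group labels, and the toy data are all made up for illustration.

```python
from collections import Counter


def lift_scores(docs_by_group, min_count=2):
    """Rank terms per group by lift = P(term | group) / P(term overall).

    Probabilities are document frequencies: the share of documents that
    contain the term at least once. Terms appearing in fewer than
    `min_count` documents overall are skipped as too noisy.
    This is a simple stand-in for illustration, NOT LDA.
    """
    group_df = {group: Counter() for group in docs_by_group}
    overall_df = Counter()
    total_docs = 0

    for group, docs in docs_by_group.items():
        for doc in docs:
            terms = set(doc.lower().split())  # count each term once per doc
            group_df[group].update(terms)
            overall_df.update(terms)
            total_docs += 1

    result = {}
    for group, docs in docs_by_group.items():
        n_group = len(docs)
        scored = []
        for term, count in group_df[group].items():
            if overall_df[term] < min_count:
                continue
            # (count / n_group) / (overall_df[term] / total_docs)
            lift = (count * total_docs) / (n_group * overall_df[term])
            scored.append((term, lift))
        # Highest lift first; ties broken alphabetically.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[group] = scored
    return result


# Hypothetical toy data: 10 short "profiles" per group.
docs_a = ["music"] * 6 + ["music kayaking"] * 2 + ["movies"] * 2
docs_b = ["music"] * 8 + ["movies"] * 2

scores = lift_scores({"a": docs_a, "b": docs_b})
print(scores["a"])
```

Here "kayaking" appears in only 2 of the 20 documents, but both are in group a, so it gets a lift of 2.0 and tops that group's list. "music" appears in 16 of 20 documents, split evenly, so its lift is 1.0 in both groups. A rare term can rank first without the data being thin.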

Another example using a similar technique: For "The Real Stuff White People Like"[2] we analyzed millions of fairly lengthy profiles, and the results still include some phrases that you might think are relatively uncommon (though not rare) but that are very highly associated with certain groups. We did the same thing for sexual orientation[3], and you can see similar effects there.

One other thing to keep in mind is that search terms are usually a few words, which is much shorter than the typical OkCupid profile.

[0] https://en.wikipedia.org/wiki/Latent_dirichlet_allocation

[1] LDA is only one possible way of analyzing data of this sort, but it's a reasonably common one, so it's easy to find other examples for comparison.

[2] http://blog.okcupid.com/index.php/the-real-stuff-white-peopl...

[3] http://blog.okcupid.com/index.php/gay-sex-vs-straight-sex/