Hacker News
by zhte415 3307 days ago
Not at all. You're assuming first names and last names are independent.

For example, meet a 'Jones'. There's a good probability that his first name is 'Thomas'.

Meet a 'Thomas'. There's a low probability his last name is 'Jones'.

That's because 'Jones' is predominantly Welsh in origin, and 'Thomas' is a popular given name in Wales. This would of course be more true in Wales than among people of Welsh ancestry living far from Wales. And perhaps both despite and because of the great singer Tom Jones, this combination may have become less common over the past few decades.
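To make the asymmetry concrete, here is a minimal sketch with invented head counts (none of these numbers are real statistics; they are assumptions chosen only to illustrate the point). The same pair count is divided by two very different totals, which is why the two conditional probabilities differ.

```python
# Hypothetical head counts for a toy population. All numbers are invented.
TOTAL_PEOPLE = 1_000_000
JONES = 20_000          # people whose family name is 'Jones'
THOMAS = 100_000        # people whose given name is 'Thomas'
THOMAS_JONES = 4_000    # people called 'Thomas Jones'


def conditional(pair_count: int, condition_count: int) -> float:
    """Estimate P(pair | condition) as pair_count / condition_count."""
    return pair_count / condition_count


# P(first = Thomas | last = Jones): the share of Joneses who are Thomases.
p_thomas_given_jones = conditional(THOMAS_JONES, JONES)

# P(last = Jones | first = Thomas): the share of Thomases who are Joneses.
p_jones_given_thomas = conditional(THOMAS_JONES, THOMAS)

# Joint probability of the full name across the whole toy population.
p_joint = THOMAS_JONES / TOTAL_PEOPLE

print(f"P(Thomas | Jones) = {p_thomas_given_jones:.3f}")
print(f"P(Jones | Thomas) = {p_jones_given_thomas:.3f}")
print(f"P(Thomas, Jones)  = {p_joint:.3f}")
```

With these made-up counts the first conditional is several times larger than the second, even though both describe the same 4,000 people.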

However, my family name is pretty location-specific in the UK, and even the diaspora that carried the name to places like North America tended to keep up the tradition, albeit 2-3 centuries later.

Another example of non-independence would be a name like 'Ahmed' as given name and 'Zhang' as family name. 'Zhang' is an extremely common family name on a global scale, as is 'Ahmed' as a given name. However, the probability of 'Ahmed' and 'Zhang' occurring together is slim. Perhaps it could happen in Singapore or Malaysia, but even there 'Zhang' is probably converted to a Hokkien/Hakka/Cantonese equivalent spelling, which is not 'Zhang'. Given the scale of these names, I'm sure there are 'Ahmed Zhang's knocking around, but probably not that many.
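As a rough sketch of what non-independence looks like in numbers, the snippet below compares the observed share of a full name with the share you would predict by multiplying the two marginal shares. The records are an invented toy sample, and `independence_gap` is a hypothetical helper written for this illustration, not anything from a real dataset or library.

```python
from collections import Counter

# Invented toy records of (given name, family name). Not real data.
people = (
    [("Ahmed", "Khan")] * 40
    + [("Wei", "Zhang")] * 40
    + [("Ahmed", "Zhang")] * 1
    + [("Wei", "Khan")] * 1
    + [("Maria", "Silva")] * 18
)


def independence_gap(records, given, family):
    """Return (observed joint share, share expected if names were independent)."""
    n = len(records)
    given_counts = Counter(g for g, _ in records)
    family_counts = Counter(f for _, f in records)
    pair_counts = Counter(records)

    observed = pair_counts[(given, family)] / n
    expected = (given_counts[given] / n) * (family_counts[family] / n)
    return observed, expected


observed, expected = independence_gap(people, "Ahmed", "Zhang")
print(f"observed P(Ahmed, Zhang)    = {observed:.4f}")
print(f"expected under independence = {expected:.4f}")
```

In this toy sample both names are individually common, yet the observed combination is far rarer than the independence assumption would predict.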

The great thing about statistics is that it is about discovery, not assumptions.

And assuming everything is nice easy math, independent, or stochastic, is one of the greatest mistakes we can all make when looking at numbers.

1 comment

So, you're actually saying that something like P(N_l = L | N_f = F) or P(N_f = F | N_l = L) is high (for those sub-populations), while P(N_l = L, N_f = F) is still low globally. That makes more sense, but it is a different statement - the combined names are common within a local population but still uncommon globally.