Hacker News
by radq 5493 days ago
I'm digressing a bit here, but with regard to the last paragraph:

> we were writing a gender classifier to categorize people as males/females based on first name and last name (this forms an important part of any social media monitoring product). The most common way to achieve this is through Machine Learning approaches. Gunaa tested his algorithm on a random data set (~22000 unique names if I recall correctly) and achieved 62% accuracy on the classification (awesome!). although THAT meaning might seem less apparent from the conversation above :P

Isn't 62% ridiculously bad? I would expect a naive Bayesian classifier trained on first names to do much better than that. Am I missing something here?
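
To make that concrete, here is a minimal sketch of the kind of baseline I mean. It is not the article's algorithm: the feature choice (first letter, last letter, last two letters), the class name `NaiveBayesNameClassifier`, and the tiny `TRAIN` list are all made up for illustration, and a real test would need a large, correctly labeled name list.

```python
import math
from collections import Counter, defaultdict


def features(name):
    """Simple letter features of a first name (an assumed, illustrative choice)."""
    n = name.strip().lower()
    return {"first1": n[:1], "last1": n[-1:], "last2": n[-2:]}


class NaiveBayesNameClassifier:
    """Categorical naive Bayes with Laplace smoothing (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength
        self.label_counts = Counter()
        # feature_counts[label][feature_name][value] -> count
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)  # distinct values seen per feature

    def fit(self, pairs):
        for name, label in pairs:
            self.label_counts[label] += 1
            for f, v in features(name).items():
                self.feature_counts[label][f][v] += 1
                self.values[f].add(v)
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            for f, v in features(name).items():
                seen = self.feature_counts[label][f][v]
                vocab = len(self.values[f]) + 1  # +1 bucket for unseen values
                score += math.log((seen + self.alpha) / (count + self.alpha * vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, pairs):
    """Fraction of (name, label) pairs the model labels correctly."""
    return sum(model.predict(n) == y for n, y in pairs) / len(pairs)


# Toy, made-up training data; far too small for a real evaluation.
TRAIN = [
    ("Anna", "F"), ("Maria", "F"), ("Julia", "F"),
    ("Sophia", "F"), ("Emma", "F"), ("Olivia", "F"),
    ("John", "M"), ("Peter", "M"), ("Mark", "M"),
    ("David", "M"), ("Robert", "M"), ("Kevin", "M"),
]

model = NaiveBayesNameClassifier().fit(TRAIN)
print(model.predict("Laura"), model.predict("Victor"))
```

Measuring `accuracy(model, held_out_pairs)` on a clean held-out list would show whether 62% is really the best a first-name model can do on that data.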

2 comments

Yes, it is really bad. It turns out the test data we were using was not clean or correctly categorized, which is why the accuracy was so low.
The paragraph is different now. Did you edit it? It's always good etiquette to note that something is an edit. Good luck with the names, and the girls.
He has definitely given a lower bound for the problem at hand. I'm sure that if he had known which country the "dataset girls" belong to, precision would've been much better.