|
Rebalancing an imbalanced dataset is common in industry and academia. You do it when you focus on accuracy, to make claims like "We were 54% accurate on classifying sexuality of females" easily interpretable without needing a distribution-balanced benchmark (you simply know the baseline is a coin flip). If there is signal in the rebalanced dataset, there should be signal in the imbalanced dataset. If they'd switched to logloss or AUC and an imbalanced dataset, do you think their results would now be as good as random? Because that is what you are implying, and you are basically implying the research is fraudulent. This is a very strong claim to make in the absence of legit discrediting studies that failed to replicate any predictability, and it requires more than guessing the authors' rebalancing act was "clearly" to improve the accuracy (with 7% negative class, you could get 93% accuracy by always predicting positive class, so if they wanted to inflate the accuracy, they shouldn't have rebalanced). The ethical considerations are moot/personal opinion, as they passed the ethics board of Stanford. Those are people who evaluate the ethics of academic research for a living, or are you saying they were also shoddy and wrong to give this a pass? Magical thinking is not wanting something to be true because it would be an uncomfortable truth, and so deeming that something which is objectively true must be false, so you can continue to think happy thoughts in line with your world view. You keep talking about the paper being widely discredited, but can't provide a single academic source for this. Instead, you question my sources (business insider?) while posting articles from The Next Web written by a journalist with a History degree who does not want the concept of binary sexuality to be true, or even allow it in constructing a dataset of gay and straight people by self-classification. It takes more energy and letters to attack a point than to make a point.
You made quite a lot of weak points. |
You quoted The Strength of Weak Learnability and I figured you must have at least a passing acquaintance with computational learning theory. In computational learning theory (such as it is) it's a foundational assumption that the distribution from which training examples are drawn is the same as the true distribution of the data; otherwise there cannot be any guarantees that a learned approximation is a good approximation of the true distribution.
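To make the distribution point concrete, here is a rough sketch of how the same classifier looks on a balanced sample versus a population where the positive class is rare. The true-positive and false-positive rates below are hypothetical numbers I picked for illustration, not figures from the paper; the 7% prevalence is the figure used in this thread.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision (positive predictive value) from Bayes' rule.

    tpr: P(predict positive | actually positive)
    fpr: P(predict positive | actually negative)
    prevalence: P(actually positive) in the population being scored
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def accuracy_at_prevalence(tpr, fpr, prevalence):
    """Overall accuracy for the same classifier at a given prevalence."""
    return tpr * prevalence + (1.0 - fpr) * (1.0 - prevalence)


# Hypothetical classifier: these rates are assumptions for illustration only.
TPR, FPR = 0.8, 0.2

for prevalence in (0.5, 0.07):  # balanced sample vs. the 7% figure from the thread
    acc = accuracy_at_prevalence(TPR, FPR, prevalence)
    prec = precision_at_prevalence(TPR, FPR, prevalence)
    print(f"prevalence={prevalence:.2f}  accuracy={acc:.3f}  precision={prec:.3f}")
```

Under these assumed rates the accuracy is the same in both settings, but the precision drops sharply once the positive class is rare. That is the sense in which a number measured on a rebalanced sample does not carry over to the true distribution.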
The following is a good article on machine learning with unbalanced classes:
http://www.svds.com/learning-imbalanced-classes/
I recommend it as a starting point.
>> This is a very strong claim to make, in the absence of legit discrediting studies that failed to replicate any predictability, and requires more than guessing the authors' rebalancing act was "clearly" to improve the accuracy (with 7% negative class, you could get 93% accuracy by always predicting positive class, so if they wanted to inflate the accuracy, they shouldn't have rebalanced).
The gay class was the positive class and the straight class negative, in this case. If you did what you say and identified everyone as straight, you'd get a very high number of false negatives: you'd identify every gay man and woman as being straight. You'd get high accuracy but zero recall on the positive class, and precision would be undefined because you never predict positive. The authors validated their models using AUC, the area under the ROC curve (true-positive rate plotted against false-positive rate), and that metric would immediately show the weakness of an always-say-straight classifier: a constant prediction scores 0.5, no better than chance.
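If you want to check the always-say-straight case yourself, here is a minimal sketch on a toy dataset of 100 people with 7 positives (the 7% figure from this thread). The data is made up for illustration, and `roc_auc` is my own small pairwise implementation, not a library call.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outscores a random
    negative, with ties counted as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 7 positives, 93 negatives.
y_true = [1] * 7 + [0] * 93
always_negative = [0] * len(y_true)  # the "call everyone straight" classifier

tp, fp, tn, fn = confusion_counts(y_true, always_negative)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else None
auc = roc_auc(y_true, [0.0] * len(y_true))  # constant score for everyone

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision} auc={auc:.2f}")
```

Accuracy comes out high because the majority class dominates, while recall and AUC expose the classifier as useless. That is why accuracy alone on an imbalanced set tells you very little.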
>> You keep talking about the paper being widely discredited, but can't provide a single academic source for this.
An "academic source", like a publication in a peer-reviewed journal, is not always necessary. For example, you won't find any peer-reviewed work debunking Uri Geller. In this case my instinct is that no reputable scientist would want to get anywhere near that controversy (and that was one reason I also stayed away).