by 0xab 2611 days ago
I do research in computer vision and this paper is so bad it's beyond words.

* They give the network a huge advantage: they teach it that it should say "no" 80% of the time. The training data is unbalanced (80% no vs 20% yes) as is the test data. Of course it does well! I don't care what they do at training time, but the test data should be balanced or they should correct for this in the analysis.

* They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

* They measure the wrong thing about humans. What a doctor does is they decide how confident they are and then they refer you to a biopsy. They don't eyeball it and go "looks fine" or "it's bad". They should measure how often this leads to a referral, and they'll see totally different results. There's a long history in papers like this of defining a bad task and then saying that humans can't do it.

* They have a biased sample of doctors that is highly skewed toward people with no experience. Look at figure 1. A lot of those doctors have about as much experience detecting melanoma as you do. They just don't do this task.

* "Electronic questionnaire"s are a junk way of gathering data for this task. Doctors are busy. What tells the authors that they're going to be as careful for this task as with a real patient? Real patients also have histories, etc.

I could go on. The number of problems with this paper is just interminable (54% of their images were non-cancer because a bunch of people looked at them. If people are so wrong, why are they trusting these images? I would only trust biopsies).

This isn't coming to a doctor's office anywhere near you. It's just a publicity stunt by clueless people. Please collaborate with some ML folks before publishing work like this! There are so many of us!

17 comments

Since this is a journal focused on cancer and not machine learning, I can understand why the editors would see this paper as being worthy of publication. Unfortunately, many of the readers will read the paper uncritically.

If possible, you should write a critical response to this paper, focusing on its methodological flaws, and send it to the editors. It doesn't have to be long; critical responses are usually a couple of pages at most. This is likely the most effective way of removing (or at the very least, heavily qualifying) bad science from research journals.

This is a huge problem throughout science, not just ML. As scientists, we're rewarded for publishing cool new things that work, not for pointing out things that don't or for pointing out flaws in existing papers. If the point is to get people to not read one bad paper, it's just a waste of my time. Most papers are false and a lot of them should never have passed review.

If the authors actually wanted to do good ML research, they could always have reached out to a decent ML researcher who could have told them all of this. There's no shortage of us. The journal could have reached out to an ML reviewer. Why wouldn't they? But no one did, because the results look good and so they send it off to press and it's good for both the authors and the journal to have something that is hype-worthy. It's just the sad reality of modern science.

It's amazing that a similar concern was raised and discussed here just a couple of hours ago: https://news.ycombinator.com/item?id=19788088

Any chance we could connect over email or something?

> Most papers are false and a lot of them should never have passed review.

Do you mean this literally or is this a metaphor to illustrate the point? If you actually mean most papers are false it'd be nice to see a link on that!

John Ioannidis claims that "most published research is false" based on some rather dubious assumptions.

https://www.annualreviews.org/doi/abs/10.1146/annurev-statis...

I agree with him although the accuracy of that statement is partially based on how “published research” is defined. Operational definitions and measurement are themselves much of the problem.
How to Publish a Scientific Comment in 1 2 3 Easy Steps

http://frog.gatech.edu/Pubs/How-to-Publish-a-Scientific-Comm...

I agree that a formal comment is best although not necessarily easy. A comment on PubPeer is easier but it will probably only be seen by those with the PubPeer extension.

I do machine learning in computational biology and cancer. The issues described in the parent comment are known among experts. It’s too bad so many others don’t know or care.

Thank you for that link, it was a joy to read
I mean, if it's an interdisciplinary study, you may want to get advisers from all sides to look at it before you publish, no?
Why would you ever balance your test data? If 80/20 is the actual population distribution, the sample that forms your test set should conform to that. Balance all you want in train/validation sets, but never the test set.

Not balancing and using ROC is a terrible combo, but the metric is the problem, not the lack of artificial balance.

I agree, they should do one or the other.

The imbalance is totally artificial and objectionable though. Where's the evidence that doctors see an 80/20 split in real life? If there is going to be an imbalance they should make it reflect the actual statistics of the task that the doctors perform, not some artificial number. It doesn't even reflect the statistics of the dataset they started with (which is 90/10 unbalanced).

Admittedly, the correct analysis for when the data is unbalanced is more annoying and ROC curves are easier to interpret. That's why in something like ImageNet, even though the training set is imbalanced, the test set is balanced.

Comparisons against humans are also harder when the data is imbalanced in a way that reflects the training set, not the task. Humans don't know they are supposed to say "no" 80% of the time. That rewards the machine and that isn't easy to correct (you can correct what you think about the machine results with respect to a baseline, but not what biases the humans had).

> Where's the evidence that doctors see a 80/20 split in real life?

Cause they definitely don’t. Even in a select subpopulation - say, people going to a derm for screening - you’d expect one melanoma per 620 persons screened (as per the SCREEN trial). Since most people have more than one mole for evaluation, and even those with melanoma will have multiple innocent moles... a mole count >50 triggers a referral for screening, though in more cautious docs, possibly as few as 25...

If you wanna be really generous and consider our hypothetical high risk group to have an average of 10 moles per person, that’s roughly 6200:1, not 80:20.
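
To make that arithmetic concrete, here is a minimal sketch. The 1-in-620 figure and the 10-moles-per-person average are the ones quoted above; the function name is made up for illustration, and it assumes a melanoma patient has exactly one malignant lesion.

```python
def benign_to_malignant_ratio(persons_per_melanoma, moles_per_person):
    """Benign lesions per malignant lesion, assuming one melanoma per affected person."""
    total_lesions = persons_per_melanoma * moles_per_person
    malignant = 1  # assumption: a single malignant lesion in the whole group
    return (total_lesions - malignant) / malignant


screening_ratio = benign_to_malignant_ratio(620, 10)
study_ratio = 80 / 20  # benign : malignant in the paper's test set

print(f"screening population: about {screening_ratio:.0f}:1")
print(f"study test set:       {study_ratio:.0f}:1")
```

Under those assumptions a screening population is over a thousand times more skewed than the study's 4:1 split.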

Another reason to balance the test set when the train set is unbalanced is to check if lack of training data for certain classes is a problem. You would use cross-validation, but do different splits for each class. It might well turn out that certain classes are just "easy", and you don't need to find more training samples for them to get the overall accuracy up.
80/20 is not the actual population distribution though.
Do you have an explanation of why ROC is bad for unbalanced datasets? Isn't ROC unaffected by dataset imbalance?
Agreed, I have a hard time believing this person does CV research (though I suppose it could just be a hobby for them) with a statement like that. Especially calling out that they didn't balance the test set, ummm... what?
I would say your criticism is way off base. I've developed and fielded ML-based medical devices and this looks like a reasonable study that suggests they have a system worthy of further testing. There's nothing wrong with using an ROC curve here and they document the experience of the doctors, so they weren't hiding that and around 60 or so doctors had greater than 5 years experience. Also, studies like this generally don't use only biopsy-proven negatives, since that would bias the negatives towards those that were suspicious enough to biopsy. Without knowing more details than what the paper provides, I cannot say the results are valid, but I also don't see any terrible errors after a quick scan. The main weakness is probably the fact that the test set came from the same image archive used for development. As a result, there can be all sorts of biases the CNN is using to inflate its performance unbeknownst to the developers. The best way to eliminate that concern is to use a test set gathered through a different data collection effort using different clinics, but that is expensive and time consuming and not something I would do initially. This looks like a good first step and I would encourage the developers to carry it further.

EDIT: I'll add that the ratio of positives to negatives in the training set is irrelevant and in no way invalidates the study. As far as testing goes, there is always a balance you must strike in a reader study involving doctors. Ideally, you would have the exact ratio a doctor would encounter in practice, but for a screening study, that is typically impractical as you would need a huge number of cases and doctor time is expensive. A ratio of 1 positive to 4 negatives is entirely reasonable, although the doctors (particularly the less experienced ones) will almost certainly have an elevated sensitivity and reduced specificity since they will know it is an enriched set, but this is reasonable for ROC comparison purposes as it mostly just selects a different point on the doctor's personal ROC curve. Note that some studies even tell the doctors beforehand what percentage of cases are positive.

Thank you for posting this; I can see that this evaluation came very easily to you because of your experience and expertise but to me it shows how much knowledge is required to evaluate something like this. There really should be a protocol defined around this kind of study that encodes the criticisms that you make here (and others) and stops publication of this kind of thing in its tracks.
Agreed. See this paper for a reputable reference in this space: https://www.nature.com/articles/nature21056
Why can't you use ROC with an imbalanced dataset?

My understanding is the PR curve is preferable to ROC since the ROC can make it difficult to discern differences between models on imbalanced data; but the ROC is still a valid way to compare/measure models.

I work as an ML engineer, some thoughts:

The train/test data being imbalanced in the same way does give the model an advantage, but I don't think that making the test set 50% would solve the issue completely either. Doctors have been "trained" on the true distribution, which is not 50% (I'd guess that the true distribution is actually extremely unbalanced).

The model isn't simply learning to predict no 80% of the time, it is learning the distribution of the data with respect to the input features. For example, let's say that we have a simple model with only 3 binary features. It may learn that when features X_0, X_1 and X_2 are 1, the probability of cancer is 70%. This isn't a simple multiplication of the true probability by the upscaling factor though--it depends on the percent of negative samples with this feature vector and the percent of positive samples with this feature vector.

If we are to change the test set to be 50% positive and keep the same train distribution, the model no longer has the correct information about cancer rates with respect to feature distributions, but neither does the dermatologist. The specificity and sensitivity continue to not be interpretable as predicted specificity and sensitivity in the real world.
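
A rough sketch of that dependence on the class mix, using Bayes' rule for a single feature vector. The two likelihoods are hypothetical numbers, picked only so that the posterior comes out at 70% when 20% of samples are positive; nothing here is taken from the paper.

```python
def posterior(prior, p_x_given_pos, p_x_given_neg):
    """P(cancer | x) via Bayes' rule for one feature vector x."""
    numerator = prior * p_x_given_pos
    return numerator / (numerator + (1 - prior) * p_x_given_neg)


# Hypothetical likelihoods of seeing x among positives and among negatives.
P_X_POS, P_X_NEG = 0.56, 0.06

for prior in (0.20, 0.50, 0.01):
    p = posterior(prior, P_X_POS, P_X_NEG)
    print(f"prevalence {prior:.2f} -> P(cancer | x) = {p:.3f}")
```

With these made-up likelihoods the same feature vector maps to about 0.70 at 20% prevalence, 0.90 at 50%, and 0.09 at 1%, so a probability learned on one class mix does not carry over to another without a correction.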

There is no issue with reporting specificity/sensitivity if they had used the true distribution of cases. Yes, the curves/AUCs will look better than the precision/recall rates, but they do not mis-represent what the doctors are interested in (what percent of people will be missed, and what percent of healthy people will be subjected to unnecessary procedures).

Anyway, the classifier doesn't actually seem to be that good; there are doctors who did better than the classifier if you check the paper.

Sensitivity and recall are two names for the same thing, Mr Stats 101 :)

Also, please explain the problem with using ROC here. The probabilistic interpretation of ROC's AUC is the probability of correctly ranking a random mixed pair (i.e. ranking the positive example higher than a negative one). How is that metric affected by the 80/20 split of the test data? Genuinely curious here...
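
For what it's worth, the pairwise-ranking reading of AUC can be sketched in a few lines. The scores are made up, and the "more negatives" case simply repeats the same negative scores, i.e. it assumes the score distribution within each class stays fixed while the class ratio changes.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.4]       # made-up scores for malignant lesions
neg = [0.7, 0.3, 0.2, 0.1]  # made-up scores for benign lesions

auc_small = pairwise_auc(pos, neg)
auc_skewed = pairwise_auc(pos, neg * 50)  # same negatives, 50x as many

print(f"3 pos vs 4 neg:   AUC = {auc_small:.3f}")
print(f"3 pos vs 200 neg: AUC = {auc_skewed:.3f}")
```

Both come out the same (11 of 12 pairs ranked correctly), which is the sense in which AUC ignores the class ratio. It says nothing about whether precision at a given threshold is acceptable.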

It does not matter whether the data is balanced or not when you report ROC (AUC), sensitivity and specificity for the purpose of comparison of two ways of image interpretation (e.g. humans vs. machines) as long as the evaluation is done on the same dataset with the same methodology. Obviously, the absolute numbers would not mean much outside of the study.
> test data should be balanced or they should correct for this in the analysis.

Why should it be balanced? It should be the expected natural clinical class distribution, no? The humans have priors about this too. If anything, it should be more imbalanced, as I would guess (I would hope!) that less than 20% of scans are malignant.

Very useful, thanks for this level of critique.

I wish they added this context in the limitations section. The paper only says:

"There are some limitations to this system. It remains an open question whether the design of the questionnaire had any influence on the performance of the dermatologists compared with clinical settings. Furthermore, clinical encounters with actual patients provide more information than that can be provided by images alone. Hänßle et al. showed that additional clinical data improve the sensitivity and specificity of dermatologists slightly [5]. Machine learning techniques can also include this information in their decisions. However, even with this slight improvement, the CNN would still outperform the dermatologists."

Your points hit on validity issues. Where would it fit on the errors of omission/commission scale?

While I agree that there are problems with the paper, I think you are confused about suitability of ROC, PR and how test set class imbalance affects them.

Your first two suggestions combined together are very wrong. If you made the test dataset balanced and then measured PR curve the precision would be way too optimistic as it is directly affected by the class imbalance. ROC curve on the other hand is invariant to the test set imbalance.
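
A small sketch of why that is, holding one operating point fixed and varying only the class balance. The 90%/90% sensitivity and specificity are hypothetical and not taken from the paper.

```python
def precision(sensitivity, specificity, prevalence):
    """Positive predictive value at a fixed operating point."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


SENS, SPEC = 0.90, 0.90  # hypothetical operating point

for prev in (0.50, 0.20, 0.01):
    print(f"prevalence {prev:.2f}: precision = {precision(SENS, SPEC, prev):.3f}")
```

The ROC point (sensitivity, 1 - specificity) is identical in all three cases, while precision falls from about 0.90 on a balanced set to 0.69 at 20% prevalence and 0.08 at 1%.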

You may find this short article I wrote about the problem interesting: https://arxiv.org/abs/1812.01388

> * They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

A̶F̶A̶I̶K̶,̶ ̶a̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶c̶a̶n̶ ̶b̶e̶ ̶m̶i̶s̶l̶e̶a̶d̶i̶n̶g̶ ̶f̶o̶r̶ ̶a̶n̶ ̶i̶m̶b̶a̶l̶a̶n̶c̶e̶d̶ ̶d̶a̶t̶a̶s̶e̶t̶,̶ ̶b̶u̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶i̶s̶ ̶s̶t̶i̶l̶l̶ ̶o̶k̶a̶y̶ ̶f̶o̶r̶ ̶s̶e̶l̶e̶c̶t̶i̶n̶g̶ ̶m̶o̶d̶e̶l̶s̶.̶ Edit: This is incorrect, a PR curve + PR AUC should be used for model selection if imbalanced. I agree it would be really misleading if they (say) just reported accuracy (since the null classifier of always guess negative would give 80% overall accuracy). I̶ ̶t̶h̶o̶u̶g̶h̶t̶ ̶t̶h̶a̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶f̶o̶r̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶s̶h̶o̶u̶l̶d̶ ̶s̶t̶i̶l̶l̶ ̶b̶e̶ ̶a̶ ̶v̶a̶l̶i̶d̶ ̶m̶e̶a̶s̶u̶r̶e̶ ̶s̶i̶n̶c̶e̶ ̶i̶t̶'̶s̶ ̶s̶h̶o̶w̶i̶n̶g̶ ̶h̶o̶w̶ ̶m̶u̶c̶h̶ ̶b̶e̶t̶t̶e̶r̶ ̶t̶h̶e̶ ̶m̶o̶d̶e̶l̶ ̶p̶e̶r̶f̶o̶r̶m̶s̶ ̶t̶h̶a̶n̶ ̶r̶a̶n̶d̶o̶m̶ ̶g̶u̶e̶s̶s̶i̶n̶g̶.̶

How do you usually handle imbalanced data? I've had some success with SMOTE or weighted loss for imbalanced datasets, but I'm embarrassed to say I've been using AUC with ROC curves as the default - if this gives inferior model selection than AUC with PR curve I'll have to start doing that instead.
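
As an illustration of the weighted-loss option, here is a minimal sketch in plain Python. The inverse-frequency weighting is one common choice and an assumption here, not something from the paper; the function names are made up.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights, n / (2 * class_count), for binary 0/1 labels."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}


def weighted_log_loss(labels, probs, weights):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -weights[y] * math.log(p if y == 1 else 1 - p)
    return total / len(labels)


labels = [1] * 20 + [0] * 80  # the 20/80 split discussed in the thread
weights = class_weights(labels)
unweighted = {0: 1.0, 1: 1.0}

for p in (0.2, 0.5):
    probs = [p] * len(labels)  # a constant predictor, for illustration only
    print(p, weighted_log_loss(labels, probs, unweighted),
          weighted_log_loss(labels, probs, weights))
```

For a constant predictor the unweighted loss is lowest at the base rate of 0.2 (roughly 0.50 vs 0.69), while the weighted loss prefers 0.5 (roughly 0.69 vs 0.92). In other words, weighting trains the model as if the classes were balanced, so its output probabilities need recalibrating before they are read as real-world risk.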

Thanks for the comments, this is a great summary. Curious what you'd think of a Kappa score given the imbalance?

https://en.wikipedia.org/wiki/Cohen%27s_kappa
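
For reference, a quick sketch of Cohen's kappa on a 2x2 confusion matrix, applied to an always-"no" classifier on a 20/80 test set. The second confusion matrix is made up for comparison.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa between predictions and ground truth (2x2 case).

    Not guarded against the degenerate case where chance agreement is 1.
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginals of truth and prediction.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)


# Always predict "no" on 20 positives and 80 negatives: 80% accuracy.
print(f"null classifier: kappa = {cohens_kappa(tp=0, fp=0, fn=20, tn=80):.2f}")

# A made-up classifier with 85% accuracy on the same split.
print(f"made-up model:   kappa = {cohens_kappa(tp=15, fp=10, fn=5, tn=70):.2f}")
```

The null classifier gets 80% accuracy but a kappa of 0, and the made-up model scores about 0.57, so kappa does at least remove the credit for simply matching the base rate.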

>you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve

But sensitivity and recall are the same thing...
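
Spelled out on a confusion matrix (the counts are made up), the four quantities being argued about look like this:

```python
def confusion_metrics(tp, fp, fn, tn):
    """The metrics behind ROC and PR curves, from 2x2 confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # ROC y-axis (true positive rate)
        "recall": tp / (tp + fn),       # PR x-axis, same formula as sensitivity
        "specificity": tn / (tn + fp),  # ROC x-axis is 1 - specificity
        "precision": tp / (tp + fp),    # PR y-axis, the only one mixing both classes
    }


metrics = confusion_metrics(tp=15, fp=10, fn=5, tn=70)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

So the real difference between the two curves is specificity versus precision; the sensitivity/recall axis is shared.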

There is nothing wrong with using ROC for imbalanced data. It is also perfectly reasonable to use an enriched dataset for a reader study, this is the standard practice.
It's almost as if publishing the thing was more important for the authors than the scientific value of the content.