by rahulnair23 1804 days ago
From the full paper[1]:

> All models were trained on 70% of the data and tested on the remaining 30% of the data. Note that each month record for each participant was considered an independent data point.

I'm almost certain that this is a data leak.

Data from the same patient will end up in both the training and the test set, just from different months, and result in fantastic accuracy. The correct way to do this is to cross-validate/split across the population, so that all months from a given participant land on the same side of the split. It seems unlikely, from this description, that the authors have done so.
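To make "split across the population" concrete, here is a minimal sketch of a participant-level split. The record layout (participant_id, features, label), the function name split_by_participant, and the toy data are assumptions made for illustration; nothing here is taken from the paper.

```python
import random

def split_by_participant(records, test_fraction=0.3, seed=0):
    """Split records so that no participant appears in both sets.

    `records` is a list of (participant_id, features, label) tuples;
    this layout is an assumption made for the example.
    """
    ids = sorted({pid for pid, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Hold out whole participants, not individual monthly rows.
    n_test = max(1, round(len(ids) * test_fraction))
    held_out = set(ids[:n_test])
    train = [r for r in records if r[0] not in held_out]
    test = [r for r in records if r[0] in held_out]
    return train, test

# Toy data: 10 hypothetical participants with 12 monthly records each.
records = [(pid, [pid * 0.1, month], pid % 2)
           for pid in range(10) for month in range(12)]
train, test = split_by_participant(records)
train_ids = {r[0] for r in train}
test_ids = {r[0] for r in test}
print(len(train), len(test), train_ids & test_ids)
```

If you are using scikit-learn, GroupShuffleSplit and GroupKFold in sklearn.model_selection do the same job when you pass the participant IDs as groups.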

[1] https://alzres.biomedcentral.com/articles/10.1186/s13195-021...

2 comments

Definitely.

It's a very common error and it should be easy to catch, but... I've even seen a study that treated individual slices of an MRI as independent, which is laughably wrong.

I think part of the problem is that the "analysts" are increasingly uninvolved in the data collection, and just treat it as a tuple of (X, y). If you thought about what they mean, even for a second, ("Oh, Mr. Smith is always an awful driver"), the problem is obvious.

I'm somewhat unfamiliar with the problem, do you think you could explain why this is bad? Or maybe just point me in the right direction? Thanks!

One way to tell if a Machine Learning model is any good is to see how it does on unseen/new patients.

Of course, we don't wait to try it on real patients, so typically you'd partition the data you already have into (a) what you show to the machine learner (training data), and (b) what you hide from the learner (test data). The latter is only used to evaluate, i.e. you get the answer from the ML model and compare it to the real answer you already have. If information about the test data somehow makes it into the training data, it's referred to as data leakage [1].

In this paper, they treat each month of a patient as an independent observation. However, GPS driver behaviour will be very similar from one month to the next for the same person. Genetic information is exactly the same. The split is typically done randomly over rows, so for every month that the model is tested on (test data), the learner has very likely already seen very similar data: the other months for the same person that happen to be in the training set. So it will do well.

The test results are therefore optimistic and do not support the conclusions.
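A small simulation can illustrate the effect. This is a sketch on synthetic data, not a reproduction of the paper: each made-up participant gets a random "signature" feature vector, monthly rows are that signature plus a little noise, and the label is deliberately unrelated to the features. A 1-nearest-neighbour classifier stands in for the model. All names and parameters are assumptions chosen for the example.

```python
import random

def make_data(n_participants=200, n_months=6, dim=5, noise=0.05, seed=0):
    """Synthetic rows of (participant_id, features, label)."""
    rng = random.Random(seed)
    rows = []
    for pid in range(n_participants):
        signature = [rng.gauss(0, 1) for _ in range(dim)]
        label = pid % 2  # unrelated to the features by construction
        for _ in range(n_months):
            x = [s + rng.gauss(0, noise) for s in signature]
            rows.append((pid, x, label))
    return rows

def one_nn_accuracy(train, test):
    """Predict each test row with the label of its nearest training row."""
    correct = 0
    for _, x, y in test:
        nearest = min(
            train,
            key=lambda r: sum((a - b) ** 2 for a, b in zip(r[1], x)),
        )
        correct += nearest[2] == y
    return correct / len(test)

def row_split(rows, test_fraction=0.3, seed=1):
    """Leaky split: each monthly row is treated as independent."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def participant_split(rows, test_fraction=0.3, seed=1):
    """Leak-free split: whole participants are held out."""
    ids = sorted({r[0] for r in rows})
    random.Random(seed).shuffle(ids)
    held_out = set(ids[:round(len(ids) * test_fraction)])
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test

rows = make_data()
row_acc = one_nn_accuracy(*row_split(rows))
group_acc = one_nn_accuracy(*participant_split(rows))
print(f"row-wise split accuracy:         {row_acc:.2f}")
print(f"participant-wise split accuracy: {group_acc:.2f}")
```

With the row-wise split the classifier mostly just recognises the person, so accuracy should come out close to 1.0 even though there is nothing real to learn. With the participant-wise split it should fall back to roughly chance.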

[1] https://en.wikipedia.org/wiki/Leakage_(machine_learning)

Suppose that you have 3 data points, on June 14, 15, and 16, that due to personal driving quirks all appear to belong to the same person. If the 14th and 16th are in your dataset, and both correspond to Alzheimer-free Bob, that may be a strong hint that the data from the 15th is also Alzheimer-free.

But this doesn't help you in the real world where you won't necessarily have near neighbors corresponding to the same person, with a known diagnosis.

It overfits to the study participants. It will not necessarily give the same results on the general population.