Hacker News
by TheCoelacanth 2608 days ago
Only if your test data is free of sample bias.

Given how incredibly hard it is to avoid sample bias, you can't take it for granted that your training data doesn't have any sample bias.


If the sample is "all the gas turbines I own", I don't particularly CARE about the bias...
If the training data is all gas turbines that you own, why do you care about having the ML model at all? You already have complete knowledge of the state of all your gas turbines.

There's no point to having an ML model unless you are applying it to something outside of the training data.

If you plan on applying the model to different turbines, then there is potential for sample bias in which turbines you selected. If you apply it to the same turbines at some point in the future, then you sampled points in time so there is a potential for sample bias based on which points in time you selected.

There is no way of completely avoiding the potential for sample bias unless you completely abandon ML as a useful concept.

Well, I might care about predicting the next turbine to fail. If Siemens sensors are truly unrelated to the issues, that'll average out eventually - but I'd be highly skeptical of someone asserting that it's completely unrelated to the failures and not just covarying with something we're not using as a model input.

Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.

Next turbine to fail means you sample based on time points, so you still could have sample bias.

Say that turbines have an average lifespan of X years. From year 0 to 10 you bought 90% Siemens, from year 10 to 20 you bought 10% Siemens, and then you measure failure rates from year X to year X+10.

Based on that data you would predict that Siemens turbines will be the most likely to fail next, but they are probably actually less likely to fail because most of the ones that are likely to fail soon are already gone.
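
To make that concrete, here is a minimal sketch of the scenario. The numbers are assumptions for illustration only: X = 10, ten turbines bought per year, and every turbine failing at exactly age X (real lifespans vary, so this is a deliberate simplification).

```python
# Hypothetical numbers, not real fleet data.
LIFESPAN = 10   # assumption: every turbine fails at exactly this age (X = 10)
PER_YEAR = 10   # assumption: turbines bought per year


def siemens_bought(year):
    """90% Siemens in years 0-9, 10% Siemens in years 10-19."""
    return 9 if year < 10 else 1


# Build the fleet as (brand, purchase_year) pairs for years 0..19.
fleet = []
for year in range(20):
    n_siemens = siemens_bought(year)
    fleet += [("siemens", year)] * n_siemens
    fleet += [("other", year)] * (PER_YEAR - n_siemens)


def failures_in(start, end):
    """Count failures per brand with failure year in [start, end)."""
    counts = {"siemens": 0, "other": 0}
    for brand, bought in fleet:
        if start <= bought + LIFESPAN < end:
            counts[brand] += 1
    return counts


observed = failures_in(10, 20)   # the window you measured (year X to X+10)
upcoming = failures_in(20, 30)   # the window you want to predict

print("observed:", observed)
print("upcoming:", upcoming)
```

Under these assumptions the measured window is dominated by Siemens failures, while the window you are trying to predict is dominated by the other brand, simply because of when each cohort was bought.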

The spurious correlation between Siemens sensors and turbine failures will not average out eventually if you have a sampling bias in your dataset.
You keep saying that I have a sampling bias, but there really isn't any evidence for that. I'm sampling 100% of the population. You can't have a sampling bias when sampling 100% of the population.

It could be a spurious correlation, sure - but that'll go away as the amount of data increases.
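
Whether more data makes the correlation go away depends on where it comes from. A rough sketch, assuming a hypothetical hidden factor (turbine age) that drives failures and also happens to covary with having Siemens sensors. All probabilities here are made up for illustration, and the sensor itself has no effect on failure in this model.

```python
import random


def failure_rate_gap(n, seed=0):
    """Return failure rate with Siemens sensors minus failure rate without.

    Assumed (hypothetical) model: half the turbines are old; old turbines
    fail more often and are also more likely to carry Siemens sensors.
    """
    rng = random.Random(seed)
    stats = {True: [0, 0], False: [0, 0]}  # has_siemens -> [failures, count]
    for _ in range(n):
        old = rng.random() < 0.5
        siemens = rng.random() < (0.8 if old else 0.2)
        failed = rng.random() < (0.30 if old else 0.05)
        stats[siemens][0] += failed
        stats[siemens][1] += 1

    def rate(has_siemens):
        failures, count = stats[has_siemens]
        return failures / count

    return rate(True) - rate(False)


for n in (1_000, 10_000, 100_000):
    print(n, round(failure_rate_gap(n), 3))
```

With these assumed numbers the expected gap is 0.25 - 0.10 = 0.15, and it settles near that value as n grows rather than shrinking toward zero. Pure chance correlations fade with more data; ones caused by an unmodeled input do not.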

You really should. If the sample is "all the gas turbines you own" and you disproportionately use Siemens sensors, your turbine failure forecast will (with high likelihood) reduce to a Siemens sensor forecast. This is easily plausible even if your sample's correlation between Siemens sensors and gas turbine failures is completely spurious.
You can't have a sampling bias when 'sampling' the entire population, because the definition of 'sampling bias' includes 'some members are not included in the sample'.
Precisely, yes. I'm talking about a sample including all representative gas turbine failures, across all sensor vendors.
You can't make predictions when sampling the entire population.
You should, because you might make worse decisions for the business, for the system, or for the people who are impacted by the system. If you don't have the right data to decide, don't decide using the data.