Hacker News
by TheCoelacanth 2608 days ago
If the training data is all gas turbines that you own, why do you care about having the ML model at all? You already have complete knowledge of the state of all your gas turbines.

There's no point to having an ML model unless you are applying it to something outside of the training data.

If you plan on applying the model to different turbines, then there is potential for sample bias in which turbines you selected. If you apply it to the same turbines at some point in the future, then you sampled points in time, so there is potential for sample bias based on which points in time you selected.

There is no way of completely avoiding the potential for sample bias unless you completely abandon ML as a useful concept.


Well, I might care about predicting the next turbine to fail. If Siemens sensors are truly unrelated to the issues, that'll average out eventually - but I'd be highly skeptical of someone asserting that it's completely unrelated to the failures and not just covarying with something we're not using as a model input.

Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.

Next turbine to fail means you sample based on time points, so you still could have sample bias.

Say that turbines have an average lifespan of X years. From year 0 to 10 you bought 90% Siemens, from year 10 to 20 you bought 10% Siemens, and then you measure failure rates from year X to year X+10.

Based on that data you would predict that Siemens turbines will be the most likely to fail next, but they are probably actually less likely to fail because most of the ones that are likely to fail soon are already gone.
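
A minimal sketch of that scenario, to make the arithmetic concrete. The numbers are assumptions for illustration only: X = 15 years, every turbine fails exactly X years after purchase, and 100 turbines are bought per year. The function names are made up for this example. Brand has no effect on lifespan here, so any difference in observed failure rates comes purely from when each brand was bought.

```python
from collections import Counter

LIFESPAN = 15         # assumed: every turbine fails exactly 15 years after purchase
UNITS_PER_YEAR = 100  # assumed purchase volume per year


def build_fleet():
    """Return (purchase_year, brand) tuples for the 20-year purchase history."""
    fleet = []
    for year in range(20):
        siemens = 90 if year < 10 else 10  # 90% Siemens in years 0-9, 10% in years 10-19
        fleet += [(year, "siemens")] * siemens
        fleet += [(year, "other")] * (UNITS_PER_YEAR - siemens)
    return fleet


def failure_rate_by_brand(fleet, start, end):
    """Per brand: share of units in service during [start, end) that fail in that window."""
    at_risk, failed = Counter(), Counter()
    for bought, brand in fleet:
        fails_at = bought + LIFESPAN
        if bought < end and fails_at >= start:  # in service at some point in the window
            at_risk[brand] += 1
            if fails_at < end:
                failed[brand] += 1
    return {brand: failed[brand] / at_risk[brand] for brand in at_risk}


fleet = build_fleet()
print("observed, years 15-25:", failure_rate_by_brand(fleet, 15, 25))
print("next,     years 25-35:", failure_rate_by_brand(fleet, 25, 35))
```

Under these assumptions the observation window shows a 90% failure rate for Siemens against 10% for everything else, even though brand does nothing. In the following window both brands fail at the same rate, and the surviving fleet is only 10% Siemens, so most of the next failures are not Siemens.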

The spurious correlation between Siemens sensors and turbine failures will not average out eventually if you have a sampling bias in your dataset.

You keep saying that I have a sampling bias, but there really isn't any evidence for that. I'm sampling 100% of the population. You can't have a sampling bias when sampling 100% of the population.

It could be a spurious correlation, sure - but that'll go away as the amount of data increases.