Hacker News
by DuskStar 2610 days ago
Well, I might care about predicting the next turbine to fail. If Siemens sensors are truly unrelated to the issues, that'll average out eventually - but I'd be highly skeptical of someone asserting that it's completely unrelated to the failures and not just covarying with something we're not using as a model input.

Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.

2 comments

Predicting the next turbine to fail means you are sampling at particular points in time, so you could still have sampling bias.

Say that turbines have an average lifespan of X years. From year 0 to 10, 90% of what you bought was Siemens; from year 10 to 20, only 10% was. Then you measure failure rates from year X to year X+10.

Based on that data you would predict that Siemens turbines will be the most likely to fail next, but they are probably actually less likely to fail because most of the ones that are likely to fail soon are already gone.
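
A rough sketch of that scenario, in case it helps. Everything here is an assumption made up for illustration: the fleet size, X = 15 years, the normally distributed lifespans, and the function names. Both brands get the identical lifespan distribution, so by construction Siemens has nothing to do with failures.

```python
import random

def simulate_fleet(n=20000, mean_life=15.0, sd=2.0, seed=0):
    """Return (is_siemens, failure_time) pairs for a hypothetical fleet."""
    rng = random.Random(seed)
    fleet = []
    for _ in range(n):
        bought = rng.uniform(0, 20)                # purchase year
        p_siemens = 0.9 if bought < 10 else 0.1    # purchasing mix flips at year 10
        is_siemens = rng.random() < p_siemens
        # Same lifespan distribution regardless of brand.
        life = max(0.0, rng.gauss(mean_life, sd))
        fleet.append((is_siemens, bought + life))
    return fleet

def siemens_share_of_failures(fleet, start, end):
    """Fraction of failures in [start, end) that were Siemens units."""
    failed = [is_siemens for is_siemens, t in fleet if start <= t < end]
    return sum(failed) / len(failed)

X = 15.0  # assumed average lifespan
fleet = simulate_fleet(mean_life=X)
share_observed = siemens_share_of_failures(fleet, X, X + 10)   # the measured window
share_next = siemens_share_of_failures(fleet, X + 10, X + 20)  # what happens afterwards

print(f"Siemens share of failures, year X to X+10:    {share_observed:.2f}")
print(f"Siemens share of failures, year X+10 to X+20: {share_next:.2f}")
```

Under these assumptions, most failures in the measured window are Siemens units simply because they are the oldest ones, and in the following window the share flips toward the other brand.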

The spurious correlation between Siemens sensors and turbine failures will not average out eventually if you have a sampling bias in your dataset.

You keep saying that I have a sampling bias, but there really isn't any evidence for that. I'm sampling 100% of the population. You can't have a sampling bias when sampling 100% of the population.

It could be a spurious correlation, sure - but that'll go away as the amount of data increases.
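
A minimal sketch of what I mean, assuming the sensor flag and the failures really are independent. The 50% sensor rate, the 20% failure rate, and the function name are made up for illustration. It only covers correlation that shows up by chance, not a biased sample.

```python
import random

def sample_correlation(n, seed=0):
    """Pearson correlation between two independent 0/1 variables of length n."""
    rng = random.Random(seed)
    has_sensor = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
    failed = [1.0 if rng.random() < 0.2 else 0.0 for _ in range(n)]
    mx = sum(has_sensor) / n
    my = sum(failed) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(has_sensor, failed)) / n
    vx = sum((x - mx) ** 2 for x in has_sensor) / n
    vy = sum((y - my) ** 2 for y in failed) / n
    return cov / (vx * vy) ** 0.5

# The chance correlation typically shrinks roughly like 1/sqrt(n).
for n in (100, 10_000, 100_000):
    print(n, round(sample_correlation(n), 4))
```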