by eanzenberg 2608 days ago
>>Now, suppose that 75% of the bad turbines use a Siemens sensor and only 12% of the good turbines use one (and suppose this has no connection to the failure). The system will build a model to spot turbines with Siemens sensors. Oops.

Given a statistically large enough sample, there are two possible outcomes: 1) The Siemens sensor actually is at fault. 2) The Siemens sensor is part of a larger system, which differs in non-Siemens turbines, and that system is failing.

Either way, the model's prediction of turbine failures is improved by that Siemens feature. But to even get to this granularity, you are diving into model explainability, that is, which features were important for each prediction. Here, you try to understand the black box to find reasons for a particular input->output mapping.
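
To put rough numbers on how much the Siemens feature moves a prediction, here is a minimal sketch using Bayes' rule. The 75% and 12% figures come from the quoted scenario; the 5% overall failure rate is my own assumption, picked only for illustration.

```python
# Figures from the quoted scenario.
P_SIEMENS_GIVEN_BAD = 0.75
P_SIEMENS_GIVEN_GOOD = 0.12


def failure_prob_by_sensor(base_failure_rate):
    """Return (P(bad | Siemens), P(bad | not Siemens)) via Bayes' rule."""
    p_bad = base_failure_rate
    p_good = 1.0 - p_bad
    # Total probability that a turbine carries a Siemens sensor.
    p_siemens = P_SIEMENS_GIVEN_BAD * p_bad + P_SIEMENS_GIVEN_GOOD * p_good
    p_bad_given_siemens = P_SIEMENS_GIVEN_BAD * p_bad / p_siemens
    p_bad_given_other = (1 - P_SIEMENS_GIVEN_BAD) * p_bad / (1 - p_siemens)
    return p_bad_given_siemens, p_bad_given_other


# ASSUMPTION: a 5% overall failure rate, not stated anywhere in the thread.
with_siemens, without_siemens = failure_prob_by_sensor(0.05)
print(f"P(bad | Siemens)     = {with_siemens:.3f}")
print(f"P(bad | not Siemens) = {without_siemens:.3f}")
```

Under that assumed base rate, a Siemens turbine comes out at roughly a 25% failure probability versus about 1.5% for the rest, which is why any model fit to this data will lean on the feature, whatever the underlying cause.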

I think you assume here that the historical effects that led to Siemens sensors correlating with failure will continue to be true in the future. And I think that is the key fallacy that makes AI bias a problem.

We aren't just looking for patterns. We are looking for patterns so that we can take action and affect the future. If the patterns, which are real enough in the historical data, don't correctly predict the impact of a choice, then they are anti-helpful bias.

For example, it may be that the company bought Siemens sensors years ago and then switched to another brand later. Unsurprisingly, older turbines fail more than newer ones. So, really, it's age that is the causative factor and the concrete action you want to take is to pay closer attention to older turbines. Even though the correlation to Siemens is real, if the action you take is "replace all the Siemens sensors with another brand", that won't make those old turbines work any better.

In other words, understanding data doesn't just mean "see which bits are correlated with which other bits". In order to be useful, we need to understand which changes to those bits in the future will be correlated with which desired outcomes. Anything less than that and you don't yet have information, just data.
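
A toy sketch of the difference between observing a pattern and acting on it. Everything here is hypothetical: the fleet counts, the risk numbers, and the premise that failure risk depends only on age.

```python
# ASSUMPTION: failure risk depends only on age group (invented numbers).
FAILURE_RISK = {"old": 0.30, "new": 0.05}

# Hypothetical fleet: (age group, sensor brand, number of turbines).
FLEET = [
    ("old", "siemens", 80),
    ("old", "other", 20),
    ("new", "siemens", 20),
    ("new", "other", 80),
]


def expected_failures(fleet):
    """Expected number of failures across the whole fleet."""
    return sum(FAILURE_RISK[age] * count for age, _sensor, count in fleet)


def failure_rate_for(fleet, brand):
    """Expected failure rate among turbines carrying the given brand."""
    rows = [(age, count) for age, sensor, count in fleet if sensor == brand]
    total = sum(count for _age, count in rows)
    return sum(FAILURE_RISK[age] * count for age, count in rows) / total


def replace_sensors(fleet, old_brand, new_brand):
    """The intervention: swap every old_brand sensor for new_brand."""
    return [
        (age, new_brand if sensor == old_brand else sensor, count)
        for age, sensor, count in fleet
    ]


observed_siemens = failure_rate_for(FLEET, "siemens")
observed_other = failure_rate_for(FLEET, "other")
before = expected_failures(FLEET)
after = expected_failures(replace_sensors(FLEET, "siemens", "other"))

print(f"Observed rate: Siemens {observed_siemens:.2f}, other {observed_other:.2f}")
print(f"Expected failures before swap: {before:.0f}, after swap: {after:.0f}")
```

In this made-up fleet the Siemens turbines fail at 25% versus 10% for the others, yet swapping every sensor leaves the expected failure count at 35. The correlation is real and the action it suggests is useless.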

> I think you assume here that the historical effects that led to Siemens sensors correlating with failure will continue to be true in the future.

Yes, AI systems presume induction to be true. But so does... uh, science and most other things we do?

Science has trained experts thinking about the data.

If you set a team of scientists to find a way of predicting failure of turbines, they might notice a correlation between Siemens sensors and failure. They would then look for and attempt to prove theories to explain this discrepancy. In doing so, they would likely discover that, not only can they not find a causative theory, but the correlation goes away when they control for age.
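
"Control for age" can be as simple as comparing failure rates within each age group instead of across the whole fleet. A minimal sketch on invented counts (the table is hypothetical, constructed so that both brands fail at the same rate within an age group):

```python
from collections import defaultdict

# ASSUMPTION: invented counts, not real turbine data.
# Each row: (age group, sensor brand, turbines, failures).
OBSERVED = [
    ("old", "siemens", 80, 24),
    ("old", "other", 20, 6),
    ("new", "siemens", 20, 1),
    ("new", "other", 80, 4),
]


def failure_rates(rows, key):
    """Aggregate failures / turbines for each group produced by key(age, sensor)."""
    turbines = defaultdict(int)
    failures = defaultdict(int)
    for age, sensor, count, failed in rows:
        group = key(age, sensor)
        turbines[group] += count
        failures[group] += failed
    return {group: failures[group] / turbines[group] for group in turbines}


# Raw comparison: group by sensor brand only.
raw = failure_rates(OBSERVED, key=lambda age, sensor: sensor)

# Controlled comparison: group by age first, then brand.
by_age = failure_rates(OBSERVED, key=lambda age, sensor: (age, sensor))

print("raw:", raw)
for group, rate in sorted(by_age.items()):
    print(group, f"{rate:.2f}")
```

Grouped by brand alone, Siemens looks 2.5 times worse (0.25 vs 0.10). Within each age group the two brands are identical (0.30 for old, 0.05 for new), so the brand effect disappears once age is held fixed.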

AI systems stop after the first step, yet somehow are perceived as better than expert humans.

That's an interesting way to frame it. AI may stop at proximate causes rather than finding root causes.

Or: AI shows correlation which we then implicitly treat as causation.

No, that's incorrect. Note the part of your quote which says, "and suppose this has no connection to the failure."

The point is that the Siemens sensor is a spurious correlate of turbine failure, because the underlying dataset is biased towards Siemens sensors. The scenario suggested by the author is one in which your turbine failure dataset does not match reality.

No amount of sample enlargement will correct sample bias. You have a variable which is disproportionately represented in your underlying dataset despite being independent from a collection of variables correlated to failure, and the algorithm is learning that one instead.

Real world ways this is plausible and cannot be corrected by increased sampling:

1. Your telemetry data is accurate, but your logging service providing that data is faulty and only consumes data from a subset of meaningful publishers.

2. Whoever provided this dataset fat-fingered a SQL query which joined too few tables including the sensor vendors, but correctly returned only the failing turbines.

3. Your data has (unnormalized) duplicates, because more than one system is providing telemetry data for Siemens sensors without the older systems being retired.

4. You use mostly Siemens sensors, and simply didn't correct for this in your sample.

Just to point out:

1. Not a spurious correlation - Siemens sensors are in fact associated with increased failure rates in the dataset and if you continue to sample data with the same methodology this correlation will continue. You need to fix your data collection methodology, but it's not a spurious correlation.

2. See #1.

3. See #1.

4. The original problem statement said that a low percentage of unfailed turbines used Siemens sensors, and a high percentage of failed turbines used Siemens sensors. So 'you use mostly Siemens sensors' would imply that most of your turbines have failed, which seems a little unlikely to me.
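
To spell out the arithmetic behind point 4: the overall share of Siemens sensors is a weighted mix of the 75% and 12% figures from the original problem statement, so we can solve for the failed fraction that a Siemens majority would require. A rough sketch (function names are mine):

```python
# Figures from the original problem statement.
P_SIEMENS_GIVEN_FAILED = 0.75
P_SIEMENS_GIVEN_OK = 0.12


def siemens_share(failed_fraction):
    """Overall share of turbines with Siemens sensors, given the failed fraction."""
    return (
        P_SIEMENS_GIVEN_FAILED * failed_fraction
        + P_SIEMENS_GIVEN_OK * (1 - failed_fraction)
    )


def failed_fraction_for_share(share):
    """Invert siemens_share: the failed fraction that yields the given share."""
    return (share - P_SIEMENS_GIVEN_OK) / (
        P_SIEMENS_GIVEN_FAILED - P_SIEMENS_GIVEN_OK
    )


threshold = failed_fraction_for_share(0.5)
print(f"Siemens majority requires more than {threshold:.1%} of turbines to have failed")
```

That works out to 0.38 / 0.63, or about 60% of the fleet having failed before Siemens sensors could be the majority.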

Only if your test data is free of sample bias.

Given how incredibly hard it is to avoid sample bias, you can't take it for granted that your training data doesn't have any sample bias.

If the sample is "all the gas turbines I own", I don't particularly CARE about the bias...

If the training data is all gas turbines that you own, why do you care about having the ML model at all? You already have complete knowledge of the state of all your gas turbines.

There's no point to having an ML model unless you are applying it to something outside of the training data.

If you plan on applying the model to different turbines, then there is potential for sample bias in which turbines you selected. If you apply it to the same turbines at some point in the future, then you sampled points in time so there is a potential for sample bias based on which points in time you selected.

There is no way of completely avoiding the potential for sample bias unless you completely abandon ML as a useful concept.

Well, I might care about predicting the next turbine to fail. If Siemens sensors are truly unrelated to the issues, that'll average out eventually - but I'd be highly skeptical of someone asserting that it's completely unrelated to the failures and not just covarying with something we're not using as a model input.

Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.

Next turbine to fail means you sample based on time points, so you still could have sample bias.

Say that turbines have an average lifespan of X years, and from year 0 to 10 you bought 90% Siemens and then from year 10 to 20 you bought 10% Siemens and then you measure failure rates from year X to year X+10.

Based on that data you would predict that Siemens turbines will be the most likely to fail next, but they are probably actually less likely to fail because most of the ones that are likely to fail soon are already gone.
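
A deliberately crude sketch of that scenario. It assumes every turbine fails exactly LIFESPAN years after purchase (real lifespans vary) and that the same hypothetical number of turbines is bought each year; the 90% and 10% purchase mix is from the example above.

```python
LIFESPAN = 15   # ASSUMPTION: every turbine fails exactly this many years after purchase
PER_YEAR = 100  # ASSUMPTION: hypothetical number of turbines bought each year


def siemens_bought(year):
    """Siemens turbines bought in a year: 90% in years 0-9, 10% in years 10-19."""
    return 90 if year < 10 else 10


def siemens_share_of_failures(start, end):
    """Share of the failures occurring in [start, end) that are Siemens turbines."""
    siemens = 0
    total = 0
    for purchase_year in range(20):
        failure_year = purchase_year + LIFESPAN
        if start <= failure_year < end:
            siemens += siemens_bought(purchase_year)
            total += PER_YEAR
    return siemens / total


# The window we measured: year X to X+10.
training_window = siemens_share_of_failures(LIFESPAN, LIFESPAN + 10)
# The window we actually want to predict: year X+10 to X+20.
next_window = siemens_share_of_failures(LIFESPAN + 10, LIFESPAN + 20)

print(f"Siemens share of failures, measured window: {training_window:.0%}")
print(f"Siemens share of failures, next window:     {next_window:.0%}")
```

The measured window is 90% Siemens failures, while the following decade is only 10% Siemens, because the Siemens-heavy cohort has already failed by then.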

The spurious correlation between Siemens sensors and turbine failures will not average out eventually if you have a sampling bias in your dataset.

You keep saying that I have a sampling bias, but there really isn't any evidence for that. I'm sampling 100% of the population. You can't have a sampling bias when sampling 100% of the population.

It could be a spurious correlation, sure - but that'll go away as the amount of data increases.

You really should. If the sample is "all the gas turbines you own" and you disproportionately use Siemens sensors, your turbine failure forecast will (with high likelihood) reduce to a Siemens sensor forecast. This is easily plausible even if your sample's correlation between Siemens sensors and gas turbine failures is completely spurious.

You can't have a sampling bias when 'sampling' the entire population, because the definition of 'sampling bias' includes 'some members are not included in the sample'.

Precisely, yes. I'm talking about a sample including all representative gas turbine failures, across all sensor vendors.

You can't make predictions when sampling the entire population.

You should, because you might make worse decisions for the business, for the system or for the people that are impacted by the system. If you don't have the right data to decide, don't decide using the data.