Hacker News
by throwawaymath 2618 days ago
No, that's incorrect. Note the part of your quote which says, "and suppose this has no connection to the failure."

The point is that the Siemens sensor is a spurious correlate of turbine failure, because the underlying dataset is biased towards Siemens sensors. The scenario suggested by the author is one in which your turbine failure dataset does not match reality.

No amount of sample enlargement will correct sample bias. You have a variable which is disproportionately represented in your underlying dataset despite being independent of the variables that actually correlate with failure, and the algorithm is learning that one instead.
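
A minimal simulation sketch of that claim. Every number here is an assumption made up for illustration (30% Siemens share, 10% failure rate independent of vendor, and a logging pipeline that keeps only 20% of non-Siemens failures); none of it comes from the article.

```python
import random

def observed_failure_rates(n, seed=0):
    """Simulate n turbines and return the failure rate seen in the
    biased dataset, keyed by whether the sensor is Siemens."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # siemens -> [records, failures]
    for _ in range(n):
        siemens = rng.random() < 0.30   # assumed true vendor share
        failed = rng.random() < 0.10    # assumed rate, independent of vendor
        # Assumed collection bias: 80% of non-Siemens failures never
        # make it into the dataset.
        if failed and not siemens and rng.random() >= 0.20:
            continue
        counts[siemens][0] += 1
        counts[siemens][1] += failed
    return {vendor: f / r for vendor, (r, f) in counts.items()}

for n in (5_000, 200_000):
    rates = observed_failure_rates(n)
    print(n, round(rates[True], 3), round(rates[False], 3))
```

Under these assumptions the Siemens rows keep showing a higher failure rate than the rest at both sample sizes, even though vendor and failure are independent in the simulated population. A larger n only makes the biased estimate more precise.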

Real-world ways this is plausible and cannot be corrected by increased sampling:

1. Your telemetry data is accurate, but your logging service providing that data is faulty and only consumes data from a subset of meaningful publishers.

2. Whoever provided this dataset fat-fingered a SQL query that joined too few tables, including the sensor vendors, but correctly returned only the failing turbines.

3. Your data has (unnormalized) duplicates, because more than one system is providing telemetry data for Siemens sensors without the older systems being retired.

4. You use mostly Siemens sensors, and simply didn't correct for this in your sample.
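
To make scenario 3 concrete, here is a toy sketch with invented records (the turbine IDs, vendors, and system names are all hypothetical). Two telemetry systems both report the Siemens turbines, so each Siemens failure is counted twice until the rows are deduplicated by turbine.

```python
# Hypothetical failure records: (turbine_id, sensor_vendor, failed, source_system)
records = [
    ("T1", "Siemens", True, "legacy"),
    ("T1", "Siemens", True, "current"),  # same failure, reported twice
    ("T2", "Siemens", True, "legacy"),
    ("T2", "Siemens", True, "current"),  # same failure, reported twice
    ("T3", "Other", True, "current"),
    ("T4", "Other", True, "current"),
]

def siemens_share(rows):
    """Fraction of rows whose sensor vendor is Siemens."""
    return sum(1 for row in rows if row[1] == "Siemens") / len(rows)

def dedupe(rows):
    """Keep one row per turbine, ignoring which system reported it."""
    seen = {}
    for turbine_id, vendor, failed, _source in rows:
        seen.setdefault(turbine_id, (turbine_id, vendor, failed))
    return list(seen.values())

print(siemens_share(records))          # share among raw rows
print(siemens_share(dedupe(records)))  # share among distinct turbines
```

In this toy data the raw rows are 4/6 Siemens while the distinct failed turbines are 2/4 Siemens. Pulling more rows from the same two systems would reproduce the inflated share rather than fix it.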

1 comment

Just to point out:

1. Not a spurious correlation: Siemens sensors are in fact associated with increased failure rates in the dataset, and if you continue to sample data with the same methodology, this correlation will persist. You need to fix your data collection methodology, but the correlation is not spurious.

2. See #1.

3. See #1.

4. The original problem statement said that a low percentage of unfailed turbines used Siemens sensors, and a high percentage of failed turbines used Siemens sensors. So 'you use mostly Siemens sensors' would imply that most of your turbines have failed, which seems a little unlikely to me.
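
The arithmetic behind point 4 can be sketched with the law of total probability. The percentages below are assumptions, since the problem statement only says "low" and "high": the overall Siemens share is P(S) = P(S|failed) * p + P(S|ok) * (1 - p), where p is the fraction of turbines that failed.

```python
def overall_siemens_share(p_s_given_failed, p_s_given_ok, failure_rate):
    """Law of total probability for the overall share of Siemens sensors."""
    return (p_s_given_failed * failure_rate
            + p_s_given_ok * (1 - failure_rate))

def min_failure_rate_for_majority(p_s_given_failed, p_s_given_ok):
    """Failure rate at which Siemens sensors reach half of all turbines.
    Only meaningful when p_s_given_ok < 0.5 < p_s_given_failed."""
    return (0.5 - p_s_given_ok) / (p_s_given_failed - p_s_given_ok)

# Assumed figures: 90% of failed and 10% of unfailed turbines use Siemens.
threshold = min_failure_rate_for_majority(0.9, 0.1)
print(threshold)
print(overall_siemens_share(0.9, 0.1, threshold))
```

With these assumed figures the threshold works out to a failure rate of one half, so a fleet that is "mostly Siemens" would need more than half of its turbines to have failed. Other choices of "low" and "high" move the threshold, but it stays large as long as the low percentage is genuinely low.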