
In machine learning the "bias" that relates to the bias-variance tradeoff is inductive bias, i.e. the bias that a learning system has in selecting one generalisation over another. A good quick introduction to that concept is in the following article:

Why We Need Bias in Machine Learning Algorithms

https://towardsdatascience.com/why-we-need-bias-in-machine-l...

The article is a simplified discussion of an early influential paper on the need for bias in machine learning by Tom Mitchell:

The need for bias in learning generalizations

http://dml.cs.byu.edu/~cgc/docs/mldm_tools/Reading/Need%20fo...

The "dataset bias" that you and the other poster are discussing is better described in terms of sampling error: when we sample data for a training dataset, we are sampling from an unknown real distribution, and our sampling distribution has some error with respect to the real one. This error manifests as generalisation error (with respect to real-world data, rather than a held-out test set), because the learning system learns the distribution of its training sample. Unfortunately this kind of error is difficult to measure, and it is masked by the powerful modelling abilities of systems like deep neural networks, which are very good at modelling their training distribution (and whose accuracy is typically measured on a held-out test set that is sampled with the same error as the training sample). It is this kind of statistical error that is the subject of articles discussing "bias in machine learning".
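To make the point about held-out test sets concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the target concept (positive only for 3 < x < 7), the "real" distribution (uniform on [0, 10]), the skewed sample (drawn only from [0, 5]) and the choice of a 1-nearest-neighbour learner as a stand-in for a flexible model.

```python
import random

def true_label(x):
    # Hypothetical target concept: positive only inside the interval (3, 7).
    return 1 if 3.0 < x < 7.0 else 0

def draw(n, low, high, rng):
    # Sample n labelled points with x uniform on [low, high].
    return [(x, true_label(x)) for x in (rng.uniform(low, high) for _ in range(n))]

def predict_1nn(train, x):
    # 1-nearest-neighbour: copy the label of the closest training point.
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, test):
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)

# The sample we managed to collect only covers [0, 5] (sampling error).
biased = draw(500, 0.0, 5.0, rng)
train, held_out = biased[:400], biased[400:]

# The "real world" covers [0, 10].
real_world = draw(2000, 0.0, 10.0, rng)

held_out_acc = accuracy(train, held_out)
real_acc = accuracy(train, real_world)
print(f"held-out accuracy:   {held_out_acc:.2f}")
print(f"real-world accuracy: {real_acc:.2f}")
```

With this setup the held-out score should come out close to perfect, while the real-world score is noticeably lower: every point in [7, 10] is labelled by the nearest training point, which sits near x = 5 and is positive. The held-out set cannot reveal this because it was drawn with the same sampling error as the training set.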

Inductive bias has nothing to do with such "dataset bias" and is in fact independent of it. Rather, inductive bias is a property of the learning system (e.g. a neural net architecture). Consequently, it is not possible to "eliminate" inductive bias - machine learning is impossible without it! The two should absolutely not be confused: they are distinct concepts and should not be interpreted as similar in any way.
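A toy sketch of what "property of the learning system" means, again in Python. It is not taken from Mitchell's paper; the three learners and the data points are hypothetical and chosen only for illustration. All three learners see exactly the same data and fit it perfectly, so any difference in what they predict for an unseen input comes from the learner, not from the dataset.

```python
def fit_rote(data):
    # No preference for any generalisation: just memorise the examples.
    table = dict(data)
    return lambda x: table.get(x)  # None for anything not seen in training

def fit_nearest(data):
    # Bias: nearby inputs have the same output.
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def fit_linear(data):
    # Bias: the output is a straight-line function of the input
    # (ordinary least squares for one variable).
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learners = {
    "rote": fit_rote(data),
    "nearest": fit_nearest(data),
    "linear": fit_linear(data),
}

for name, predict in learners.items():
    fits_training = all(predict(x) == y for x, y in data)
    print(name, "fits training data:", fits_training, "| prediction at x=10:", predict(10.0))
```

The rote learner has nothing to say about x = 10, and the other two disagree with each other even though the training data is identical. Changing or "fixing" the dataset would not remove that difference, which is the sense in which inductive bias is independent of dataset bias.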