| HN Mirror



	by jacquesm 258 days ago
	I think the biggest problem with such classifiers is actually knowing what is good data and what is bad data: taking a sample of the data and recognizing whether or not it is a general enough representation of both true and false examples (for a binary classifier) to be usable for training a model. It isn't rare at all to have data sets that are biased 100 to 1 or more toward one of the classes, or that contain hints about which class an object is in that aren't in the object itself, and so on. You can train until the cows come home on such data but it will never lead to satisfactory results.
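	To make the 100 to 1 point concrete, here is a minimal sketch of how a skewed label set can make a useless model look good. The labels and the helper names (`imbalance_ratio`, `accuracy_and_recall`) are made up for illustration, not taken from any particular dataset or library.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def accuracy_and_recall(labels, preds, positive=1):
    """Plain accuracy, plus recall on the class treated as positive."""
    correct = sum(y == p for y, p in zip(labels, preds))
    true_pos = sum(y == positive and p == positive for y, p in zip(labels, preds))
    n_pos = sum(y == positive for y in labels)
    return correct / len(labels), true_pos / n_pos

# Hypothetical data: 1000 negatives and 10 positives (100 to 1).
labels = [0] * 1000 + [1] * 10
# A "model" that ignores its input and always predicts the majority class.
preds = [0] * len(labels)

ratio = imbalance_ratio(labels)
acc, rec = accuracy_and_recall(labels, preds)
print(f"imbalance {ratio:.0f}:1, accuracy {acc:.3f}, rare-class recall {rec:.3f}")
```

	The always-negative predictor scores about 99% accuracy while never finding a single rare example, so a quick check like this on a sample is worth doing before any training.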

1 comment

ghm2180 251 days ago

The bias issue can be handled in a variety of ways. One that I know works is to put weights on your rarer class when training. You could also use larger margins to make sure you definitely don't misclassify the rare class, at the cost of mislabeling your dominant class, presuming you are OK with that. An example is when doctors order breast biopsies: biopsies happen a lot more often than the cancer itself, and they are ordered based on a noisy technique, the physical exam.
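A rough sketch of the class-weight idea, assuming the common inverse-frequency heuristic (weight = n_samples / (n_classes * class_count)); that heuristic, the helper names and the toy labels are my assumptions, not something specified above. The weights go into the loss so that each rare example counts for more.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def weighted_log_loss(labels, probs, weights):
    """Weighted binary cross-entropy; probs[i] is the predicted P(label == 1)."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Hypothetical data: 100 negatives and a single positive.
labels = [0] * 100 + [1]
weights = balanced_class_weights(labels)

# A model that always says "1% chance of positive".
probs = [0.01] * len(labels)
unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)
print(weights, unweighted, weighted)
```

With these weights the single positive carries as much total weight as the 100 negatives combined, so the loss stops rewarding a model that simply ignores the rare class. Many training libraries accept per-class or per-sample weights that play the same role.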
