by uoaei
1623 days ago
You are considering only the technical aspects of the model. While those are of course important to understand, they are less interesting when considering potential harms than the downstream effects of the inference pipeline, particularly when it comes to interpreting outputs. The worst possible approach is to offload the interpretation step of a pipeline to a machine using proxy metrics, without an exceptional model that unequivocally justifies doing so. For instance, if we put an MSE loss function on a classification NN with sigmoid outputs and trained it on a classification dataset, we could generate an entire zoo of "many, many very accurate models" as measured by MSE. But once your model returns outputs, how do you interpret them to predict a label for some input data? You could hack some algorithm together (e.g. take the argmax of the outputs) whose predicted labels are indistinguishable from those of the "correct" procedure, but the probabilities it reports are so wrong that no ML professional would be comfortable trusting anything it says, not least because independent sigmoid outputs need not sum to one and so do not form a valid probability distribution over the classes. Being able to explain why we use MSE or cross-entropy or any other loss function, and which output activations (hint: and probability distributions) they are typically associated with, actually has a very deep origin in the foundations of probability theory. That origin blows open a whole new way of thinking about statistical modelling that is not made available in any of the programs whose materials I've been exposed to.
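To make the argmax point concrete, here is a minimal sketch in plain Python. The logits are made-up numbers, not output from any real model, and the helper names (`sigmoid`, `softmax`, `argmax`) are just illustrative. It shows that per-class sigmoid outputs and softmax outputs can agree on the predicted label while only the softmax outputs sum to one.

```python
import math

def sigmoid(z):
    # Squashes a single score into (0, 1), independently of the other classes.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Normalizes all scores jointly so the results sum to one.
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    # Index of the largest element.
    return max(range(len(xs)), key=lambda i: xs[i])

# Hypothetical raw scores for a three-class problem (assumed for illustration).
logits = [2.0, 1.0, -1.0]

sigmoid_outputs = [sigmoid(z) for z in logits]
softmax_outputs = softmax(logits)

label_from_sigmoid = argmax(sigmoid_outputs)
label_from_softmax = argmax(softmax_outputs)
sigmoid_total = sum(sigmoid_outputs)
softmax_total = sum(softmax_outputs)

print("sigmoid outputs:", sigmoid_outputs, "sum:", sigmoid_total)
print("softmax outputs:", softmax_outputs, "sum:", softmax_total)
print("labels agree:", label_from_sigmoid == label_from_softmax)
```

Both procedures pick the same class here because sigmoid and softmax each preserve the ordering of the logits. The difference only shows up if you try to read the numbers themselves as class probabilities.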
What is the "very deep origin"? What is this "new way of thinking"? And what's so wrong with using argmax to make a classifier, if I don't care about estimating probabilities and just want the answer?