

	by uoaei 1623 days ago
	You are considering only the technical aspects of the model. Those are important to understand, but when considering potential harms they are less interesting than the downstream effects of the inference pipeline, particularly the interpretation of outputs. The worst possible MO is to offload the interpretation portion of a pipeline to a machine using proxy metrics, without an exceptional model that unequivocally justifies the approach. For instance, if we put an MSE loss function on a classification NN with sigmoid outputs and trained it on a classification dataset, we could generate an entire zoo of "many, many very accurate models" as measured by MSE. But once your model returns outputs, how do you interpret them to predict a label for some input data? You could hack some algorithm together (e.g. argmax over the outputs) whose predicted labels are indistinguishable from those of the "correct" procedure, but the implied probabilities are so incorrect that no ML professional would be comfortable trusting anything it says, not least because independent sigmoid outputs violate the condition that class probabilities sum to one. But being able to explain why we use MSE or cross-entropy or any other loss function and which output activations (hint: and probability distributions) they are typically associated with actually has a very deep origin in the foundations of probability theory which blows open a whole new way of thinking about statistical modelling that is not made available in any of the programs whose materials I've been exposed to.
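To make the sigmoid-versus-argmax point concrete, here is a minimal sketch using only the Python standard library. The logits are made-up numbers for a hypothetical 3-class problem, not outputs of any trained model. It shows that independent sigmoid outputs and a softmax can agree on the argmax label while only the softmax values form a valid probability distribution.

```python
import math

def sigmoid(z):
    # Squashes one score into (0, 1), independently of the other scores.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Normalizes all scores jointly so that they sum to one.
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical raw scores (logits) for a 3-class problem.
logits = [2.0, 1.0, -1.0]

sigmoid_scores = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

print("sigmoid sum:", sum(sigmoid_scores))  # well above 1 for these logits
print("softmax sum:", sum(softmax_probs))   # 1, up to floating-point error
print("labels:", argmax(sigmoid_scores), argmax(softmax_probs))
```

Both activations are monotonic in the logits, so the predicted label is the same either way. The difference only shows up when something downstream treats the numbers as probabilities.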

2 comments

spekcular 1623 days ago

"But being able to explain why we use MSE or cross-entropy or any other loss function and which output activations (hint: and probability distributions) they are typically associated with actually has a very deep origin in the foundations of probability theory which blows open a whole new way of thinking about statistical modelling that is not made available in any of the programs whose materials I've been exposed to. "

What is the "very deep origin"? What is this "new way of thinking"? And what's so wrong with using argmax to make a classifier, if I don't care about estimating probabilities and just want the answer?


uoaei 1623 days ago

A lot of processes downstream of inference benefit from having a minimum of care put into the system design. We're talking 80/20 rule stuff here. It's a simple reorientation compared with a janky argmax-classifier, but it results in assumptions being obeyed broadly, in a max-entropy sense.

The key insight is that all prediction models can equally be framed as energy-based models (y = f(x) -> E = g(x, y)), and the job of ML is to estimate the joint distribution of x and y with a suitable max-entropy surrogate distribution, then perform MLE on this variational distribution against some training data. All the math in the theory follows from this (perhaps excluding causal stuff, but I am not familiar enough with those techniques to say for sure). Things get a little more complicated when you consider e.g. autoencoders, but the above still holds.
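As a rough illustration of how loss functions fall out of MLE on a chosen surrogate distribution, the sketch below checks two standard facts. With a fixed-variance Gaussian surrogate, the negative log-likelihood differs from MSE only by a scale and an additive constant. With a categorical surrogate in the Boltzmann form p(y | x) proportional to exp(-E(x, y)), the negative log-likelihood is the usual softmax cross-entropy. All function names, data values, and energies here are hypothetical and chosen for illustration only.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def gaussian_nll(y_true, y_pred, sigma=1.0):
    # Negative log-likelihood of y_true under N(y_pred, sigma^2), summed over points.
    n = len(y_true)
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

def energies_to_probs(energies):
    # Boltzmann form: lower energy means higher probability.
    lowest = min(energies)  # shift for numerical stability
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def categorical_nll(label, energies):
    # Cross-entropy against a one-hot label.
    return -math.log(energies_to_probs(energies)[label])

# Made-up regression targets and two candidate prediction vectors.
y = [1.0, 2.0, 3.0]
pred_a = [1.1, 1.9, 3.2]
pred_b = [0.5, 2.5, 2.0]

# With sigma = 1 the NLL gap equals (n / 2) times the MSE gap,
# so ranking models by Gaussian likelihood is ranking them by MSE.
nll_gap = gaussian_nll(y, pred_a) - gaussian_nll(y, pred_b)
mse_gap = mse(y, pred_a) - mse(y, pred_b)
print(nll_gap, 0.5 * len(y) * mse_gap)

# Made-up energies for three classes; class 0 has the lowest energy.
energies = [0.5, 2.0, 3.0]
print(energies_to_probs(energies), categorical_nll(0, energies))
```

Read this way, picking MSE for a classifier amounts to assuming a Gaussian over the labels, which is the kind of mismatched surrogate the comment is warning about.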

Obviously, with the choice of a poor surrogate distribution, your predictions will on average be worse. Yes, even if you don't care about probabilities and just want max-likelihood predictions -- your predictions will on average be worse. By construction, the analysis proceeds by framing the problem this way and following through. A janky argmax-classifier is not exempt from this -- it, too, already implies a surrogate distribution, but, statistically speaking, it's probably a pretty bad one. So it makes sense to put in a tiny bit more effort to get way closer to representing the space that your data lives in.

Naturally, you could easily find a janky model that outperforms some relatively unoptimized principled model on a specific use case, and many do get lucky with this. But the principled model has a lot more headroom, specifically in terms of the information it can hold: if the design is more or less correct for the problem specification, then the inductive bias built into the model closely matches the structure of the observed data.


borroka 1623 days ago

Very little of ML is "principled" (e.g., taking into account the probability distributions, priors, bounds on the values of parameters, etc.). Most of the time it is actually a brute-force approach that lets modelers avoid "thinking" about probability distributions, transformations, etc.

I did a lot of the "principled" modeling you talk about, in Stan, TMB, and JAGS back in the day. But outside of the need for an "explanation" of model behavior, which is a scientific need much more than an engineering need, I would almost always favor a "brutish" approach for prediction in industry. (Mind you, not having an explanation here does not mean having no idea what the model does; it is relative to the relationship between x and y, both in how we reach the estimates of the parameters and in the interpretation of the parameters themselves.) My reasons are (1) convenience; (2) accuracy, which is almost always better for ML models even when using unprincipled methods; and (3) outside of proper causal inference, predictions are what matter, and even when people demand an "interpretation", causality is just a guess anyway when the data and model are not up to that kind of analysis.


uoaei 1623 days ago

Scientific vs engineering needs is a false dichotomy. Explanation of model behavior matters a lot in many, many matters of engineering, but my point is trying to go further.

You may be thinking narrow-mindedly about what is meant by "interpretation". Or rather, conflating "interpretation of predictions of ML system", which is the common understanding in professional circles, with "interpretation of the real system whose aspects we are predicting with ML", which is a more colloquial frame. I hold you to no fault as I have been ambiguous in my usage and the two overlap quite substantially, particularly at the outputs of the ML system.

An alleged association between homosexuality and passport photos, for instance, is an interpretation of the ways humans exist and what they are fundamentally (read: physiognomy). Automating this association encodes a specific human-level interpretation about what is true about people into the ML system. But this joint distribution between homosexuality and the way a face looks when you record a picture of it is bogus in ways that are hard to put into words. The principle is lacking completely. And this kind of system can very easily be used for extreme harm in the wrong hands.

Nevertheless, surely someone motivated would (1) consider this approach convenient, (2) have an accurate (vs data) model after training completes, and (3) use the raw predictions, as they think those "are what matters".

I find, not only for myself but others as well, that being aware of the technical foundations opens the space of cognition to other perspectives of thinking about these issues which find synthesis between the technical and the social impacts of design decisions.


spekcular 1622 days ago

Do you have a reference to a paper that demonstrates the empirical superiority of energy-based models to well-tuned "janky argmax-classifiers"? I find it a little hard to believe there's a free lunch here given the relative popularity of basic argmax stuff – if energy-based models were obviously better, it seems like they'd be used more. But I am open to evidence on this point!


borroka 1623 days ago

What you described seems to me pretty standard in ML, and even more so in statistical modeling. Maybe that is because I come from applied math and statistics.
