Hacker News new | ask | show | jobs
by ComplexSystems 49 days ago
Good article, but

"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1, it technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."

Why is this a "pseudo-probability distribution?"

2 comments

Mathematically, it is literally a probability distribution, because it fits the definition of a measure whose total mass is one, so I think the language is just imprecise. What they may be trying to say is that semantically it doesn't arise in a principled way from an uncertainty model, such as from Bayesian or frequentist statistics.
Hogwash. If you get into deriving maximum entropy distributions via the calculus of variations, the multinomial is the maximum entropy distribution among categorical distributions.

This is exactly the sense that it comes up for old school LMs and why it appears in thermodynamics.

Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - i.e. see article.

That definitely could be the case. I was also a bit surprised by what the article said, so I was simply trying to interpret it, but I'm not extremely well versed in ML so I could be missing some details. My main point was that contrary to what the article said, they do in fact have a probability distribution on their hands.
This is literally the probability distribution ML models are trained on.

https://docs.pytorch.org/docs/2.11/generated/torch.nn.CrossE...

You have a relatively small dictionary of tokens, each prediction has a neural network score that goes into the final token prediction layer, and they are trained based on a log-softmax (i.e. the above function) to predict their next token.

This is exactly how anyone in any field does conditional multinomial/categorical (i.e. one of a bunch of distinct tokens) distributions, and AFAIK what LLMs generally use as their loss functions on the output layer, though I have not deeply investigated all of them, since this has been how you do that since time immemorial.

I am extremely confused by all of the people screaming it's not a probability distribution?!?!?

I have seen computer vision tasks use binomial training objectives (one-vs-all) and then use the multinomial only at inference time, and that could be fair that that is not a probability distribution induced by training (while technically a probability distribution only in the sense it is \ge 0 and sums to 1).

But afaik token prediction LLMs that I am aware of use the softmax for the probability in their loss function, i.e. the maximize log softmax.

The comment in parenthesis mentions "they're not derived from a probability space" [1]. I don't know about probability spaces nor softmax to know what part of a probability space this is missing compared to other probability distributions, nor how other probability distributions satisfy probability spaces.

[1] https://en.wikipedia.org/wiki/Probability_space

Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.
The soft max is the probability of the next token being whatever in the training data conditioned on the inputs. The author just doesn't know that apparently and thinks it was an arbitrary choice.

The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.

The softmax, after the network has been trained, yields an estimate of the probability in the training data, but it is not that probability itself.
Which models are not trained with the log softmax as the loss function?
Softmax isn't a loss function. It is used to transform model outputs into positive numbers that sum to 1, so that they can be interpreted as probabilities, and then those numbers are passed into (typically) the cross entropy loss function. I think you mean, which models are trained using some function other than softmax to transform the model outputs. There are a number of alternatives to softmax, such as the ones described here https://www.emergentmind.com/topics/sparsemax
iirc, there is a bunch of formal machinery you need to define probability distributions for situations such as infinite outcomes (eg what is the probability that a random real number between 0 and 10 is less than 3?)