Hacker News new | ask | show | jobs
by boto3 3245 days ago
There's a reason for that. NB needs to fit more params than LR (and other linear discriminant learners), e.g. for binary classification, NB needs to fit 2*N params while LR just N.
1 comments

Not true. Let's derive it. Here's NB for a binary classification:

    p(c|w)*p(w) = p(w|c) * p(c)

    p(c|w) = p(w|c) * p(c) / p(w)

    p(c|w) = p(w|c) * p(c) / [sum_i p(w|c_i) * p(c_i)]
Let's look at the probability of class 1.

    p(c_1|w) = p(w|c_1) * p(c_1) / [sum_i p(w|c_i) * p(c_i)]
Notice how the numerator is going to show up in the denominator. We can simplify that by bringing it into the denominator:

    p(c_1|w) = 1 / [sum_i p(w|c_i) * p(c_i) / p(w|c_1) / p(c_1]
Then cancel it out:

    p(c_1|w) = 1 / [1 + p(w|c_0) / p(w|c_1) * p(c_0) / p(c_1)]
Not let's apply the NB assumptions:

    p(w|c_0) / p(w|c_1) = prod_i p(w_i | c_0) / p(w_i | c_1)
Now, if you take the log of the final p(c_1|w) I derived, the product in p(w|c_0) / p(w|c_1) turns into a exponentiated sum, giving you one parameter per word, plus an intercept for the log of p(c_0) / p(c_1). You end up with exactly the same 1/(1+exp(linear stuff)) with the same parametrization and form you have in logistic regression [0].

This is a broader thing that a given graphical model with a given fixed parametrization can be generatively or discriminatively trained. They will end up learning different models, but that's because they make different assumptions, not because of different parametrizations.

[0] linear stuff = intercept + sum_i (coefficient of word_i)

I may be missing something, but how do you take a log on the RHS of p(c_1|w)? The denominator has a sum i.e. "1+ something", which the log doesn't distribute over.
Heh, nope you're right I was writing to quickly. I meant to say to take a log of of the probability term in p(c_1|w). So you take p(c_1|w) = 1/(1+stuff) and turn it into p(c_1|w) = 1/(1+exp(log(stuff))). stuff is a product, so log(stuff) is a sum. Good catch.
Oh, I see it now. Thanks for clarifying!

In that case, you needn't have kept the denominator around at all. Since P(w) is not based on a class i.e. its not class conditioned, the classifier could have directly calculated P(C_1|w)/P(C_0|w). The P(w) term cancels out, and you end up with the product of ratios of feature probabilities conditioned on the classes.

Note though that, for K-classes, K>2, the number of parameters you need to store would blow up. You would need have these N ratios: P(w_j|C_x)/P(w_j|C_y) for all possible classes x,y. Here N is the number of features. So, in all N*C(K,2) values. On the contrary, multiclass softmax regression (the discriminative analogue for NB) would need NK parameters.

In the multiclass problem, they have the same number of degrees of parameters. You can see this by choosing a reference class c_y. The NB parameter for class c_x and feature w_j is f_jxy = log(p(w_j|c_x)/p(w_j|c_y)). If you want a new reference class c_w, you can see that f_jxw = f_jxy - f_jwy. No new learned parameters needed. You need one of those parameters per N features and (K-1) classes that aren't c_y. So you get N(K-1) features. This is the same as for multiclass softmax regression: N(K-1) instead of the NK you wrote (which you can see by working it out with a reference class c_y). It really is the same parametrization. The analogue with NK parametrized softmax may be more straightforward if you just use NK naive bayes features of the form f_jx = log(p(w_j|c_x)) for each class and check the equivalence with softmax.

It really is just a special case of a more general rule that a given PGM with a fixed parametrization can be trained discriminatively or generatively.

I'm sorry I didn't have a chance to respond before -you've been very patient with your responses! Thanks!

Yes, this is quite neat!

I will look up the rule/theorem you mention. It seems "kind of" reasonable in the sense the same number of independent parameters should show up in any representation, but (and this is probably me just being dumb) I need to think about why wouldn't the form of the representations not affect the number of parameters.

Although, if m parameters in a representation can be mapped to n (m>n) parameters in another, the m parameters aren't really independent.

OK, now I am just talking to myself :-)

Also, in this context, the results around discriminative classifiers learning better than their generative counterparts (asymptotically) is something I need to think about. [1]

Well, there goes my next few evenings.

[1] https://ai.stanford.edu/~ang/papers/nips01-discriminativegen...