I may be missing something, but how do you take a log on the RHS of p(c_1|w)? The denominator has a sum i.e. "1+ something", which the log doesn't distribute over.
Heh, nope you're right I was writing to quickly. I meant to say to take a log of of the probability term in p(c_1|w). So you take p(c_1|w) = 1/(1+stuff) and turn it into p(c_1|w) = 1/(1+exp(log(stuff))). stuff is a product, so log(stuff) is a sum. Good catch.
In that case, you needn't have kept the denominator around at all. Since P(w) is not based on a class i.e. its not class conditioned, the classifier could have directly calculated P(C_1|w)/P(C_0|w). The P(w) term cancels out, and you end up with the product of ratios of feature probabilities conditioned on the classes.
Note though that, for K-classes, K>2, the number of parameters you need to store would blow up. You would need have these N ratios: P(w_j|C_x)/P(w_j|C_y) for all possible classes x,y. Here N is the number of features. So, in all N*C(K,2) values. On the contrary, multiclass softmax regression (the discriminative analogue for NB) would need NK parameters.
In the multiclass problem, they have the same number of degrees of parameters. You can see this by choosing a reference class c_y. The NB parameter for class c_x and feature w_j is f_jxy = log(p(w_j|c_x)/p(w_j|c_y)). If you want a new reference class c_w, you can see that f_jxw = f_jxy - f_jwy. No new learned parameters needed. You need one of those parameters per N features and (K-1) classes that aren't c_y. So you get N(K-1) features. This is the same as for multiclass softmax regression: N(K-1) instead of the NK you wrote (which you can see by working it out with a reference class c_y). It really is the same parametrization. The analogue with NK parametrized softmax may be more straightforward if you just use NK naive bayes features of the form f_jx = log(p(w_j|c_x)) for each class and check the equivalence with softmax.
It really is just a special case of a more general rule that a given PGM with a fixed parametrization can be trained discriminatively or generatively.
I'm sorry I didn't have a chance to respond before -you've been very patient with your responses! Thanks!
Yes, this is quite neat!
I will look up the rule/theorem you mention. It seems "kind of" reasonable in the sense the same number of independent parameters should show up in any representation, but (and this is probably me just being dumb) I need to think about why wouldn't the form of the representations not affect the number of parameters.
Although, if m parameters in a representation can be mapped to n (m>n) parameters in another, the m parameters aren't really independent.
OK, now I am just talking to myself :-)
Also, in this context, the results around discriminative classifiers learning better than their generative counterparts (asymptotically) is something I need to think about. [1]