by imh 3251 days ago
In the multiclass problem, they have the same number of free parameters. You can see this by choosing a reference class c_y. The NB parameter for class c_x and feature w_j is f_jxy = log(p(w_j|c_x)/p(w_j|c_y)). If you want a new reference class c_w, you can see that f_jxw = f_jxy - f_jwy. No new learned parameters needed. You need one of those parameters for each of the N features and each of the (K-1) classes that aren't c_y. So you get N(K-1) parameters. This is the same as for multiclass softmax regression: N(K-1) instead of the NK you wrote (which you can see by working it out with a reference class c_y). It really is the same parametrization. The analogy with the NK-parameter softmax may be more straightforward if you just use NK naive Bayes features of the form f_jx = log(p(w_j|c_x)) for each class and check the equivalence with softmax.
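If it helps, here is a rough numerical check of the above, not a proof. It assumes a multinomial naive Bayes model with made-up random probabilities, and it adds class priors as bias terms (the count above ignores those). All names (`likelihood`, `log_ratio`, `softmax_reference`, and so on) are just illustrative.

```python
import math
import random


def random_distribution(n, rng):
    """Return n strictly positive numbers that sum to 1."""
    raw = [rng.random() + 0.1 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


rng = random.Random(0)
N, K = 5, 3  # N features (words), K classes; arbitrary toy sizes
prior = random_distribution(K, rng)  # p(c_x)
likelihood = [random_distribution(N, rng) for _ in range(K)]  # likelihood[x][j] = p(w_j|c_x)


def log_ratio(j, x, y):
    """f_jxy = log(p(w_j|c_x) / p(w_j|c_y))."""
    return math.log(likelihood[x][j] / likelihood[y][j])


def nb_posterior(counts):
    """Posterior over classes from Bayes' rule applied directly."""
    joint = [
        prior[x] * math.prod(likelihood[x][j] ** counts[j] for j in range(N))
        for x in range(K)
    ]
    norm = sum(joint)
    return [p / norm for p in joint]


def softmax_full(counts):
    """Softmax with NK weights f_jx = log p(w_j|c_x), plus log-prior biases."""
    scores = [
        math.log(prior[x])
        + sum(counts[j] * math.log(likelihood[x][j]) for j in range(N))
        for x in range(K)
    ]
    return softmax(scores)


def softmax_reference(counts, y):
    """Softmax with N(K-1) weights f_jxy; the reference class c_y scores 0."""
    scores = [
        math.log(prior[x] / prior[y])
        + sum(counts[j] * log_ratio(j, x, y) for j in range(N))
        for x in range(K)
    ]
    return softmax(scores)


# Changing the reference class needs no new parameters: f_jxw = f_jxy - f_jwy.
identity_error = max(
    abs(log_ratio(j, x, w) - (log_ratio(j, x, y) - log_ratio(j, w, y)))
    for j in range(N)
    for x in range(K)
    for w in range(K)
    for y in range(K)
)

counts = [2, 0, 1, 3, 1]  # toy word counts for one document
p_bayes = nb_posterior(counts)
p_full = softmax_full(counts)
p_ref = softmax_reference(counts, y=0)
print(identity_error)
print(p_bayes)
print(p_full)
print(p_ref)
```

The three printed posteriors should agree up to floating-point error, whichever class you pass as `y`.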

It really is just a special case of a more general rule that a given PGM with a fixed parametrization can be trained discriminatively or generatively.

1 comment

I'm sorry I didn't have a chance to respond before. You've been very patient with your responses! Thanks!

Yes, this is quite neat!

I will look up the rule/theorem you mention. It seems "kind of" reasonable in the sense that the same number of independent parameters should show up in any representation, but (and this is probably me just being dumb) I need to think about why the form of the representation wouldn't affect the number of parameters.

Although, if m parameters in a representation can be mapped to n (m>n) parameters in another, the m parameters aren't really independent.
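A quick sanity check of that, as a sketch only: in the NK-weight softmax, subtracting one class's weight vector from every class leaves the output unchanged, so N of the NK weights are redundant. The weights and input below are arbitrary made-up numbers, and `class_scores` and `pinned` are illustrative names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def class_scores(weights, x):
    """One linear score per class: dot product of that class's weights with x."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]


# K = 3 classes, N = 3 features: NK = 9 weights, all arbitrary.
weights = [
    [0.5, -1.0, 2.0],
    [1.5, 0.3, -0.7],
    [-0.2, 0.8, 0.1],
]
x = [1.0, 2.0, 0.5]

# Subtract the last class's weights from every class. The last row becomes
# all zeros, leaving N(K-1) = 6 weights that actually matter.
reference = weights[-1]
pinned = [[w_j - r_j for w_j, r_j in zip(w, reference)] for w in weights]

p_full = softmax(class_scores(weights, x))
p_pinned = softmax(class_scores(pinned, x))
print(p_full)
print(p_pinned)
```

Both parametrizations should print the same probabilities up to floating-point error, since every class score is shifted by the same amount.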

OK, now I am just talking to myself :-)

Also, in this context, the results around discriminative classifiers learning better than their generative counterparts (asymptotically) are something I need to think about. [1]

Well, there goes my next few evenings.

[1] https://ai.stanford.edu/~ang/papers/nips01-discriminativegen...