| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by data_scientist 3245 days ago
	It doesn't have a good accuracy. I have yet to see a real-life dataset where it's better than to just call LogisticRegressionCV from scikit-learn. For bigger datasets, you may use vowpal wabbit or Fasttext. It may be a little bit slower for the training (but not so much), and as fast as LR for the training. What is the purpose of using an algorithm when another one is just better?

1 comments

boto3 3245 days ago

There's a reason for that. NB needs to fit more params than LR (and other linear discriminant learners), e.g. for binary classification, NB needs to fit 2*N params while LR just N.

link

imh 3245 days ago

Not true. Let's derive it. Here's NB for a binary classification:

    p(c|w)*p(w) = p(w|c) * p(c)

    p(c|w) = p(w|c) * p(c) / p(w)

    p(c|w) = p(w|c) * p(c) / [sum_i p(w|c_i) * p(c_i)]

Let's look at the probability of class 1.

    p(c_1|w) = p(w|c_1) * p(c_1) / [sum_i p(w|c_i) * p(c_i)]

Notice how the numerator is going to show up in the denominator. We can simplify that by bringing it into the denominator:

    p(c_1|w) = 1 / [sum_i p(w|c_i) * p(c_i) / p(w|c_1) / p(c_1]

Then cancel it out:

    p(c_1|w) = 1 / [1 + p(w|c_0) / p(w|c_1) * p(c_0) / p(c_1)]

Not let's apply the NB assumptions:

    p(w|c_0) / p(w|c_1) = prod_i p(w_i | c_0) / p(w_i | c_1)

Now, if you take the log of the final p(c_1|w) I derived, the product in p(w|c_0) / p(w|c_1) turns into a exponentiated sum, giving you one parameter per word, plus an intercept for the log of p(c_0) / p(c_1). You end up with exactly the same 1/(1+exp(linear stuff)) with the same parametrization and form you have in logistic regression [0].

This is a broader thing that a given graphical model with a given fixed parametrization can be generatively or discriminatively trained. They will end up learning different models, but that's because they make different assumptions, not because of different parametrizations.

[0] linear stuff = intercept + sum_i (coefficient of word_i)

link

abhgh 3245 days ago

I may be missing something, but how do you take a log on the RHS of p(c_1|w)? The denominator has a sum i.e. "1+ something", which the log doesn't distribute over.

link

imh 3245 days ago

Heh, nope you're right I was writing to quickly. I meant to say to take a log of of the probability term in p(c_1|w). So you take p(c_1|w) = 1/(1+stuff) and turn it into p(c_1|w) = 1/(1+exp(log(stuff))). stuff is a product, so log(stuff) is a sum. Good catch.

link

abhgh 3245 days ago

Oh, I see it now. Thanks for clarifying!

In that case, you needn't have kept the denominator around at all. Since P(w) is not based on a class i.e. its not class conditioned, the classifier could have directly calculated P(C_1|w)/P(C_0|w). The P(w) term cancels out, and you end up with the product of ratios of feature probabilities conditioned on the classes.

Note though that, for K-classes, K>2, the number of parameters you need to store would blow up. You would need have these N ratios: P(w_j|C_x)/P(w_j|C_y) for all possible classes x,y. Here N is the number of features. So, in all N*C(K,2) values. On the contrary, multiclass softmax regression (the discriminative analogue for NB) would need NK parameters.

link

imh 3244 days ago

In the multiclass problem, they have the same number of degrees of parameters. You can see this by choosing a reference class c_y. The NB parameter for class c_x and feature w_j is f_jxy = log(p(w_j|c_x)/p(w_j|c_y)). If you want a new reference class c_w, you can see that f_jxw = f_jxy - f_jwy. No new learned parameters needed. You need one of those parameters per N features and (K-1) classes that aren't c_y. So you get N(K-1) features. This is the same as for multiclass softmax regression: N(K-1) instead of the NK you wrote (which you can see by working it out with a reference class c_y). It really is the same parametrization. The analogue with NK parametrized softmax may be more straightforward if you just use NK naive bayes features of the form f_jx = log(p(w_j|c_x)) for each class and check the equivalence with softmax.

It really is just a special case of a more general rule that a given PGM with a fixed parametrization can be trained discriminatively or generatively.

link