Hacker News new | ask | show | jobs
by a-dub 2721 days ago
So the pitch is that you don't have to do feature engineering... but then instead it seems people do network structure engineering with featurish things like convolutions.

The performance is still better in most cases but I often have to wonder, are people just doing feature engineering once removed and is the better performance just the result of having WAY more parameters in the model?

2 comments

I guess one upshot to the SVM approach is that there's math for quantifying how well a given model will generalize, subject to some assumptions.

Is there anything like that in the ANN world?

In short, no. Not for practically large models used in common tasks like image classification or speech to text.
are people just doing feature engineering once removed and is the better performance just the result of having WAY more parameters in the model?

Not really, or sort of, depending on how you think.

A deep neural network does work - at least to some extent - because of the large number of parameters. However, it is practical because it can be trained in a reasonable amount of time.

Things like ResNets are useful because they allow us to train deeper networks.

You can create a SVM with the same number of parameters[1], and in theory it could be as accurate (this is basically the no free lunch theorem[2]). But you won't be able to train it to the same accuracy.

[1] Of course there are practical concerns about what you do for features, since hand created features just aren't as good as neural network ones. One thing people do now is use the lower layers of a deep neural network as a feature extractor and then put a SVM on top of them as the classifier. This works quite well, and is reasonably fast to train.

[2] https://en.wikipedia.org/wiki/No_free_lunch_theorem

But you won't be able to train it to the same accuracy.

I'm not sure I agree with this bit in theory. A Neural Network is a stack of basis functions; and this stack can also be seen as a bunch of basis functions. And basis functions are what kernels represent. Trivially, you could then "copy" the weights that a ANN would learn into a kernel and obtain the same accuracy.

The reason this doesn't work in practice is, in SVMs, you tend not to learn kernels from scratch but use (possibly a combination of) standard parameterized kernels - [1]. The learning step in the SVM adapts this standard kernel to your dataset as much as the parameters allow, but this would be sub-optimal compared to learning a kernel (or the corresponding basis functions) from scratch that's built just for your data. With a well trained ANN the latter is what you get.

[1] there has been a fair amount of work on learning kernels too, but its not as mainstream as using standard kernels.

Trivially, you could then "copy" the weights that a ANN would learn into a kernel and obtain the same accuracy.

Sure.

I'm not sure I agree with this bit in theory.

No one really agrees with it in theory - I'm not aware of a good theoretical explanation as to why some deep networks are easier to train. And yet there is a growing body of real, generalized practical hints which work pretty reliably.

This is pretty exciting! There is undiscovered ideas here. But it is unsatisfactory from the theoretical sense at the moment.