Hacker News
by salty_biscuits 2724 days ago
It's still a problem, but in practice not much of one. Going downhill on the loss is always an improvement, so you may not end up at the global minimum, but you can still land somewhere good enough. Using SGD gives you a bit of local-minimum-hopping ability as well, thanks to the noise in its gradient estimates. It's an interesting question what the loss surface actually looks like: it is so high dimensional that it might be saddle points nearly everywhere rather than true local minima. The difference between a local minimum and the global one might be practically nothing in terms of classifier performance.

The convexity of the SVM loss is a mixed blessing. You are guaranteed to reach the global minimum, but the optimization problem doesn't scale well to large training sets or big feature vectors. Hence the historical effort put into feature engineering, or the choice to give up convexity in exchange for a more scalable optimization method. So in short, the ability to use high-dimensional features in a NN means you get no guarantee that the minimum you find is the best one, but you lose less because you don't have to cut down the size of the input vector by hand (i.e. you can work directly with an image rather than some embedding built from keypoint descriptors etc.).
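To make the convex vs. non-convex point concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: the 1-D loss `w**4 - 3*w**2 + w`, the convex bowl `(w - 1)**2`, the step size, and the starting points are my assumptions, not anything from a real SVM or NN. It only shows that plain gradient descent on a non-convex loss ends up in a different minimum depending on where it starts, while on a convex loss every start reaches the same point.

```python
def gradient_descent(grad, w0, lr=0.01, steps=2000):
    """Plain (full-batch) gradient descent on a scalar parameter."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


# Hypothetical non-convex loss with two minima (illustration only).
def nonconvex_loss(w):
    return w**4 - 3 * w**2 + w


def nonconvex_grad(w):
    return 4 * w**3 - 6 * w + 1


# Hypothetical convex loss with a single minimum at w = 1.
def convex_grad(w):
    return 2 * (w - 1)


# Same optimizer, two different starting points.
w_from_right = gradient_descent(nonconvex_grad, w0=2.0)   # settles near w = 1.13 (local)
w_from_left = gradient_descent(nonconvex_grad, w0=-2.0)   # settles near w = -1.30 (global)

c_from_right = gradient_descent(convex_grad, w0=2.0)      # settles near w = 1
c_from_left = gradient_descent(convex_grad, w0=-2.0)      # settles near w = 1

print("non-convex:", w_from_right, nonconvex_loss(w_from_right))
print("non-convex:", w_from_left, nonconvex_loss(w_from_left))
print("convex:    ", c_from_right, c_from_left)
```

In this toy the two non-convex minima differ noticeably in loss. The claim above is that in very high-dimensional networks the gap is often small in practice, and a 1-D example can't show that part.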