|
In the course, in lecture "Reducing Loss: Gradient Descent" is "Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges." The first sentence is flatly wrong: E.g., for positive integer n and the set of real numbers R, function f: R^n --> R where for all x in R^n f(x) = 0, f is convex, concave, and linear, and for all x in R^n x is a minimum and a maximum of f. Can there be uncountably infinitely many alternative minima for the Google ML problems? Yes, e.g., just enter one of the independent variables twice. The second sentence is nonsense. Grotesque, outrageous incompetence!!!! It has long been known that minimizing a convex function, even a differentiable convex function, with just gradient descent can be just horribly inefficient. A LOT is known about how to do much better than just gradient descent. E.g., there is Newton iteration (right, that Newton, hundreds of years ago) and quasi-Newton. And there's more. Why so inefficient? Well, draw a picture like Google did except use just two independent variables instead of just the one in the Google picture. Then see that the resulting, convex "bowl" can be like a long, narrow boat with a very gentle slope in one direction and a very steep slope in an orthogonal direction. Yes the cross section of the bowl can be, first cut, an ellipse with one short axis and one long one. Sure, the axes are eigenvectors, etc. and the ellipse is part of a local quadratic approximation. Well, gradient descent keeps going back and forth nearly parallel to the short axis of the ellipse and making nearly no progress on the long axis. People have known this and known good things to do about it for, uh, at least half a century. For the Google ML problems, might (1) tweak Newton iteration to improve the rate of convergence to a minimum or (2) at each iteration don't get a gradient descent but get a supporting hyperplane of the epigraph of the convex function, as the iterations proceed, accumulate these hyperplanes, notice that they lead to an approximation of the full epigraph, and use linear programming or some tweak of that to minimize the hyperplane approximation to the convex function -- there is much more that can be said here. By the way, the convex function for that ML problem is quite special, e.g., quadratic. A lot has long been known and well polished in regression and classification, e.g., with TeX markup: N.\ R.\ Draper and
H.\ Smith,
{\it Applied Regression Analysis,\/}
John Wiley and Sons,
New York,
1968.\ \ Leo Breiman,
Jerome H.\ Friedman,
Richard A.\ Olshen,
Charles J.\ Stone,
{\it Classification and Regression Trees,\/}
ISBN 0-534-98054-6,
Wadsworth \& Brooks/Cole,
Pacific Grove, California,
1984.\ \ C.\ Radhakrishna Rao,
{\it Linear Statistical Inference and
Its Applications:\ \
Second Edition,\/}
ISBN 0-471-70823-2,
John Wiley and Sons,
New York,
1967.\ \ A good start on convexity is Wendell H.\ Fleming,
{\it Functions of Several Variables,\/}
Addison-Wesley,
Reading, Massachusetts,
1965.\ \ Total, fun, dessert ice cream on convexity is Jensen's inequality; right away can use it to prove a lot of classic inequalities. Gee, look on the upside!!!! From this sample, the claims of machine learning (ML) revolutionizing the economy are nonsense!!!! And for startups, don't much have to worry about serious competition from Google!!!! |
i am certain that you are familiar with the "usual" statistics sequence. (for others: there are lower-division courses that use calculus in a few places but otherwise avoid it, focusing instead on memorizing procedures. there is the upper-division probability/math stat sequence that uses calculus heavily but avoids analysis. and there is an intro phd sequence that finally gets into measure theory.) if you look at a coursera course that gives an high-level overview of practical statistics and simplifies its presentation to be accessible to people who never took calculus, you can criticize the very idea of a course that does not explain the measure-theoretic issues, but it makes no sense to use it to criticize the field of statistics, or to criticize the competence of others in the institution that produced it.
here, google is producing introductory training materials for developers. many developers have never taken calculus, let alone optimization, statistics, or analysis, and when i took MLCC internally, you were supposed to go through this whole thing (lectures and coding) in two days. it's supposed to give you enough understanding of the concepts to understand the API and apply it.