

by graycat 3031 days ago

In the course, in lecture

"Reducing Loss: Gradient Descent"

"Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges."

The first sentence is flatly wrong: E.g., for positive integer n and the set of real numbers R, function f: R^n --> R where for all x in R^n f(x) = 0, f is convex, concave, and linear, and for all x in R^n x is a minimum and a maximum of f.

Can there be uncountably infinitely many alternative minima for the Google ML problems? Yes, e.g., just enter one of the independent variables twice.
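A minimal sketch of the "enter a variable twice" point, in plain Python with made-up data (the names `sse`, `rows`, `targets`, and `candidates` are mine for illustration, not from the Google course). With a duplicated column, every weight pair with the same sum gives the same predictions, so the minimizers form a whole line:

```python
def sse(weights, rows, targets):
    """Sum of squared errors for a linear model with no intercept."""
    total = 0.0
    for row, y in zip(rows, targets):
        pred = sum(w * x for w, x in zip(weights, row))
        total += (y - pred) ** 2
    return total

# Hypothetical data: the second column is an exact copy of the first.
xs = [1.0, 2.0, 3.0, 4.0]
rows = [(x, x) for x in xs]
targets = [2.0 * x for x in xs]  # exactly y = 2x

# Any (w1, w2) with w1 + w2 = 2 fits perfectly.
candidates = [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0), (5.0, -3.0), (-0.5, 2.5)]
losses = [sse(w, rows, targets) for w in candidates]

# A weight pair off that line does strictly worse.
off_line_loss = sse((1.0, 0.0), rows, targets)
```

The loss is still convex; it just has a flat valley along w1 + w2 = 2 instead of a single lowest point.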

The second sentence is nonsense.

Grotesque, outrageous incompetence!!!!

It has long been known that minimizing a convex function, even a differentiable convex function, with just gradient descent can be just horribly inefficient. A LOT is known about how to do much better than just gradient descent. E.g., there is Newton iteration (right, that Newton, hundreds of years ago) and quasi-Newton. And there's more.

Why so inefficient? Well, draw a picture like Google did except use just two independent variables instead of just the one in the Google picture. Then see that the resulting, convex "bowl" can be like a long, narrow boat with a very gentle slope in one direction and a very steep slope in an orthogonal direction. Yes the cross section of the bowl can be, first cut, an ellipse with one short axis and one long one. Sure, the axes are eigenvectors, etc. and the ellipse is part of a local quadratic approximation. Well, gradient descent keeps going back and forth nearly parallel to the short axis of the ellipse and making nearly no progress on the long axis. People have known this and known good things to do about it for, uh, at least half a century.
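To make the long, narrow boat concrete, here is a small sketch in plain Python. The function f(x, y) = (x^2 + 100 y^2)/2, the step size 0.019, and the starting point are my own choices for illustration. The steep direction forces a small step, so the gentle direction crawls, while one Newton step on this quadratic lands on the minimum:

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)

def gradient_descent(p, step, iters):
    """Plain fixed-step gradient descent; returns every iterate."""
    path = [p]
    for _ in range(iters):
        gx, gy = grad(p)
        p = (p[0] - step * gx, p[1] - step * gy)
        path.append(p)
    return path

def newton_step(p):
    """One Newton step; the Hessian here is the constant diag(1, 100)."""
    gx, gy = grad(p)
    return (p[0] - gx / 1.0, p[1] - gy / 100.0)

start = (1.0, 1.0)
# The step must stay below 2/100 or the steep direction diverges.
path = gradient_descent(start, step=0.019, iters=100)
gd_end = path[-1]
newton_end = newton_step(start)
```

In `path` the y coordinate flips sign on every step (the back and forth across the short axis), and after 100 steps the x coordinate has still covered only part of the distance to 0.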

For the Google ML problems, might (1) tweak Newton iteration to improve the rate of convergence to a minimum or (2) at each iteration don't get a gradient descent but get a supporting hyperplane of the epigraph of the convex function, as the iterations proceed, accumulate these hyperplanes, notice that they lead to an approximation of the full epigraph, and use linear programming or some tweak of that to minimize the hyperplane approximation to the convex function -- there is much more that can be said here. By the way, the convex function for that ML problem is quite special, e.g., quadratic.
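Option (2) is usually called the cutting-plane method (Kelley's method). A rough one-variable sketch in plain Python follows. The test function, the interval, and all names are my own assumptions, and in place of a real linear programming solver it minimizes the piecewise-linear model by checking breakpoints, which only works in one dimension:

```python
def f(x):
    # Convex but not differentiable at 0 (hypothetical test function).
    return (x - 1.0) ** 2 + 0.5 * abs(x)

def subgradient(x):
    # One valid subgradient of f at x; at the kink x = 0 we pick sign 0.
    sign = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    return 2.0 * (x - 1.0) + 0.5 * sign

def minimize_model(cuts, lo, hi):
    """Minimize max_i (a_i * x + b_i) over [lo, hi].

    The minimum of a piecewise-linear convex function on an interval is at
    an endpoint or where two lines cross, so we just try all such points.
    """
    def model(x):
        return max(a * x + b for a, b in cuts)

    points = [lo, hi]
    for i in range(len(cuts)):
        for j in range(i + 1, len(cuts)):
            (a1, b1), (a2, b2) = cuts[i], cuts[j]
            if a1 != a2:
                x = (b2 - b1) / (a1 - a2)
                if lo <= x <= hi:
                    points.append(x)
    best = min(points, key=model)
    return best, model(best)

def kelley(lo, hi, iters=60, tol=1e-9):
    """Accumulate supporting lines of the epigraph and minimize the model."""
    cuts = []
    x = lo
    best_x, best_val = x, f(x)
    for _ in range(iters):
        fx, g = f(x), subgradient(x)
        if fx < best_val:
            best_x, best_val = x, fx
        cuts.append((g, fx - g * x))      # the line y = f(x_k) + g * (t - x_k)
        x, lower = minimize_model(cuts, lo, hi)
        if best_val - lower <= tol:       # model is a lower bound on f
            break
    return best_x, best_val

best_x, best_val = kelley(-2.0, 3.0)
```

For this f the true minimizer is x = 0.75 with value 0.4375, and the gap between the best value seen and the model's lower bound gives a stopping test that plain gradient descent does not have.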

A lot has long been known and well polished in regression and classification, e.g., with TeX markup:

N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis,\/} John Wiley and Sons, New York, 1968.\ \

Leo Breiman, Jerome H.\ Friedman, Richard A.\ Olshen, Charles J.\ Stone, {\it Classification and Regression Trees,\/} ISBN 0-534-98054-6, Wadsworth \& Brooks/Cole, Pacific Grove, California, 1984.\ \

C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition,\/} ISBN 0-471-70823-2, John Wiley and Sons, New York, 1967.\ \

A good start on convexity is

Wendell H.\ Fleming, {\it Functions of Several Variables,\/} Addison-Wesley, Reading, Massachusetts, 1965.\ \

Total, fun, dessert ice cream on convexity is Jensen's inequality; right away can use it to prove a lot of classic inequalities.
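As one example of a classic inequality that drops out of Jensen's inequality, here is a sketch of the arithmetic mean / geometric mean inequality, again with TeX markup. The notation (\varphi, weights w_i) is mine:

```latex
% Jensen: for convex \varphi and weights w_i \ge 0 with \sum_i w_i = 1,
%   \varphi\Bigl(\sum_i w_i x_i\Bigr) \le \sum_i w_i \varphi(x_i).
% Take \varphi(t) = -\ln t (convex on t > 0), x_i > 0, and w_i = 1/n:
\[
  -\ln\Bigl(\frac{1}{n}\sum_{i=1}^n x_i\Bigr)
  \le -\frac{1}{n}\sum_{i=1}^n \ln x_i
  = -\ln\Bigl(\prod_{i=1}^n x_i\Bigr)^{1/n} .
\]
% Negate and exponentiate (both preserve the direction after the sign flip):
\[
  \frac{1}{n}\sum_{i=1}^n x_i \;\ge\; \Bigl(\prod_{i=1}^n x_i\Bigr)^{1/n} .
\]
```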

Gee, look on the upside!!!! From this sample, the claims of machine learning (ML) revolutionizing the economy are nonsense!!!! And for startups, don't much have to worry about serious competition from Google!!!!

3 comments

aoki 3030 days ago

i would request that you stop critiquing "machine learning" based on the presentation in introductory online materials like this and the ng coursera course. you provide a lot of signal in general but i think these critiques do decrease your SNR.

i am certain that you are familiar with the "usual" statistics sequence. (for others: there are lower-division courses that use calculus in a few places but otherwise avoid it, focusing instead on memorizing procedures. there is the upper-division probability/math stat sequence that uses calculus heavily but avoids analysis. and there is an intro phd sequence that finally gets into measure theory.) if you look at a coursera course that gives a high-level overview of practical statistics and simplifies its presentation to be accessible to people who never took calculus, you can criticize the very idea of a course that does not explain the measure-theoretic issues, but it makes no sense to use it to criticize the field of statistics, or to criticize the competence of others in the institution that produced it.

here, google is producing introductory training materials for developers. many developers have never taken calculus, let alone optimization, statistics, or analysis, and when i took MLCC internally, you were supposed to go through this whole thing (lectures and coding) in two days. it's supposed to give you enough understanding of the concepts to understand the API and apply it.


graycat 3030 days ago

If you find something wrong mathematically or otherwise with something I write, then by all means let me know. So far you have found nothing. Details:

The Google statement I quoted was flatly wrong. It is really important for students to be told that.

I gave some references to more in statistics.

> the measure-theoretic issues

I didn't mention measure theory, and the statistics references I gave don't mention measure theory either. I referenced Fleming only as background on convexity, and there is no measure theory in that part of Fleming.

You mentioned the role of calculus for the math of regression: That use of calculus, for deriving the normal equations, is neither necessary nor, really, sufficient. There is another derivation, nicer, fully detailed mathematically, with no calculus at all. The core of the idea is that the minimization of the squared error has to be an orthogonal projection and, then, presto, bingo, get the normal equations. And there are more advantages to that derivation. To keep my post simple, I omitted that derivation, but a good treatment of regression without calculus could use it.
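A small numerical check of the projection idea, in plain Python with made-up data (the variable names and the four data points are hypothetical). For the line y = b0 + b1 x, the fitted values are the orthogonal projection of y onto the span of the columns [1, x], so the residual must be perpendicular to both columns, and that perpendicularity is exactly the normal equations:

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]
n = len(xs)

# Entries of X^T X and X^T y for the design matrix with columns [1, x].
sx = sum(xs)
sxx = sum(x * x for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 normal equations (X^T X) b = X^T y by Cramer's rule.
det = n * sxx - sx * sx
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

# Residual of the fit and its inner products with each column of X.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
dot_ones = sum(residuals)                              # residual . column of 1s
dot_x = sum(r * x for r, x in zip(residuals, xs))      # residual . column x

def sse(c0, c1):
    """Squared error of any other line c0 + c1 * x, for comparison."""
    return sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(xs, ys))
```

Both inner products come out as zero up to rounding, and by the Pythagorean theorem any other line in the same span, say `sse(b0 + 0.1, b1)`, has a strictly larger squared error than `sse(b0, b1)`.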

I omitted the standard definition of convexity, but apparently in this discussion we need that. By omitting the definition, the Google material was not so good.

Definition (convex function): For the set of real numbers R and a positive integer n, a function f: R^n --> R is convex provided for any u, v in R^n and any a in [0,1] we have

f(au + (1-a)v) <= af(u) + (1-a)f(v)

So, for a picture, on the graph of f(x), we have two points (maybe not distinct) (u, f(u)) and (v, f(v)). Then we draw the line between these two points. The number a determines where we are on that line. With a = 0, we are at point (v, f(v)). With a = 1, we are at point (u, f(u)). Then as we move a from 0 to 1, we move along that line. We also have on the graph the point, say, P

(au + (1 - a)v, f( au + (1-a)v ) )

And on the line we drew, we have point, say, Q

(au + (1 - a)v, af(u) + (1-a)f(v))

Well, we are asking that point Q be the same as point P or directly above point P. That is, the line we drew is on or above the graph of (x, f(x)). The line is sometimes called a secant line and is said to overestimate the function.
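The secant-line picture can be spot-checked numerically. A rough sketch in plain Python for n = 1 (the helper names, the tolerance, and the sampling scheme are my own). Random sampling can refute convexity by finding a point P above its point Q, but passing the check proves nothing:

```python
import random

def violates_convexity(f, u, v, a, tol=1e-9):
    """True if the graph point P lies above the secant point Q."""
    p = f(a * u + (1 - a) * v)          # height of P, on the graph
    q = a * f(u) + (1 - a) * f(v)       # height of Q, on the secant line
    return p > q + tol

def looks_convex(f, lo, hi, trials=10000, seed=0):
    """Sample u, v in [lo, hi] and a in [0, 1]; report any violation."""
    rng = random.Random(seed)
    for _ in range(trials):
        u = rng.uniform(lo, hi)
        v = rng.uniform(lo, hi)
        a = rng.random()
        if violates_convexity(f, u, v, a):
            return False
    return True

zero_ok = looks_convex(lambda x: 0.0, -5.0, 5.0)        # f(x) = 0
square_ok = looks_convex(lambda x: x * x, -5.0, 5.0)    # f(x) = x^2
neg_square_ok = looks_convex(lambda x: -x * x, -5.0, 5.0)
```

The zero function and x^2 pass, while -x^2 is caught right away since its secant lines lie below the graph.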

Definition (concave): The function -f is concave if and only if the function f is convex.

As in Fleming, a convex function is continuous. Intuitively, the proof is based on two cones, and as we approach a point we get herded between the two cones. The cones are from the convexity assumption. Draw a picture.

IIRC, there is a result in Rockafellar that a convex function is differentiable almost everywhere with respect to Lebesgue measure, but this is the only connection I would make with convexity and measure theory.

For convex f, the set of all (x, y) where y >= f(x) is the epigraph of f, that is, the region on or above the graph of f.

Definition (convex set): A subset C of R^n is convex provided for any u, v in C and any a in [0,1] the point

au + (1-a)v

is also in set C.

Well, as in Fleming, the epigraph is convex. It is also closed in the usual topology of R^(n+1), where the epigraph lives.

Definition (closed set): A subset C of R^n is closed (in the usual topology of R^n) provided for any sequence x_n, n = 1, 2, ... in C that converges to y in R^n, y is also in C.

In particular, if we define the boundary of C, set C contains its boundary. So, the interval [0,1] is closed and the interval (0,1) is not closed.

Well, for any closed convex set C and point x on its boundary, there exists a hyperplane that passes through point x and where set C is a subset of the closed half space on one side of the hyperplane.

Such a hyperplane is said to be supporting for set C at point x.

Intuitively, in R^3, push convex set C to be in contact with a wall. Suppose point x on the boundary of set C is in contact with the wall. Then the wall is a supporting plane for set C at x, and set C is a subset of the room side of the wall.

Or think of a big, solid, irregular piece of cheese and John Belushi as his Samurai Tailor swinging his sword: John keeps swinging his sword in arcs that are in flat planes and cuts down the irregular cheese to a convex hunk. So, the convex C has been determined (formed) from the supporting hyperplanes from Belushi's sword.

For another way to make a convex set, take a piece of wood and press it against a belt sander until the boundary consists of only flat sides.

Let a solid rock roll around in a stream of water for a few thousand years, and you may end up with a smooth, shiny, convex rock.

Faster, a chicken egg is convex.

In general, a closed convex set (other than all of R^n) is the intersection of the closed half spaces determined by its supporting hyperplanes.

Then, we can approximate a convex set with some of its supporting hyperplanes. In particular, we can approximate a convex function with some supporting hyperplanes of its epigraph. At times, this can be useful -- it's the main idea behind Lagrangian relaxation in constrained optimization (I used that once).

In particular, the epigraph of a convex function is the intersection of the closed half spaces on or above its supporting hyperplanes. In that case, a supporting hyperplane corresponds to a subgradient. The function is differentiable at the point of contact if and only if the subgradient is unique. If the subgradient is unique, then it is just the tangent hyperplane from the gradient of the function.
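For a picture in one variable, take f(x) = |x|. A small sketch in plain Python (the helper `supports`, the grid of test points, and the candidate slopes are my own choices): a slope g is a subgradient at x0 when the line through (x0, f(x0)) with slope g stays on or below the graph everywhere, here checked only on a finite grid:

```python
def supports(f, x0, g, test_points, tol=1e-12):
    """True if the line through (x0, f(x0)) with slope g never rises
    above the graph of f at any of the test points."""
    return all(f(y) >= f(x0) + g * (y - x0) - tol for y in test_points)

points = [i / 10.0 for i in range(-50, 51)]   # grid on [-5, 5]
slopes = (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)

# At the kink x0 = 0 many slopes work: the subgradient is not unique.
at_kink = [g for g in slopes if supports(abs, 0.0, g, points)]

# At x0 = 2 the function is differentiable and only the tangent slope works.
at_two = [g for g in slopes if supports(abs, 2.0, g, points)]
```

At the kink every candidate slope in [-1, 1] passes, while at x0 = 2 only the slope 1 of the tangent line does.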

In R^3, a cube is a convex set. Each of its sides is part of a supporting hyperplane. For a point on the boundary of the cube that is not on an edge, the supporting hyperplane at that point is unique. The corners and edges of the cube also have supporting hyperplanes, but they are not unique.

In R^3 a solid sphere (a ball) is convex. Then it is also the intersection of the closed half spaces determined by its supporting hyperplanes. At each point on the boundary of the sphere, the supporting hyperplane is unique.

Similarly for epigraphs.

So, the function f: R*n --> R where for each x f(x) = 0 is convex, concave, and linear, and each x is both a maximum and a minimum of f. So, the minimum of a convex function need not be unique. In this case, the epigraph is just a closed half space.

"Look, Ma! No calculus!" And no measure theory.

Exercise: Derive the regression normal equations via perpendicular projections and without calculus.

Exercise: Argue the role of perpendicular projections in the minimization in regression.


aoki 3030 days ago

my only goal in mentioning the statistics sequence at all was to give a familiar example where the standard sequences vary in depth depending on audience. a trivial point, yes, but i wanted to be concrete because it's the internet. apparently that was a terrible choice, as it was far too close to the topic at hand; my apologies for making you search so hard for a connection.

i made my request because it's jarring for me as a reader when you punctuate your (often delightful) expository writing with conclusions about entire fields and large organizations that seem (on the face of it) to be justified by old and/or very limited data.

but that's a selfish request, and you are of course free to tell me to get lost and post whatever you want (and i'll still read it); i'm certainly not going to pursue this further, aside from the apology and clarifying comment above.


graycat 3030 days ago

You do understand that for all the talk and new terminology and claims of "learning" in the "machine learning" (ML) in the Google OP, what is in the OP is a poor introduction to some highly polished material in "regression analysis" in 50 year old books. So, the ML stuff is adulterated old wine in new bottles with new labels. That is essentially intellectual theft and corruption, and without references essentially academic plagiarism. You should be offended.

If they are going to plagiarize, even just teach, regression analysis, then at least they shouldn't make a mess out of it, and a mess is what they made. Google should "get that MESS OFF the Internet".

Students should be told the truth: Regression is powerful stuff. Sometimes the results can be valuable. The Google OP is an introduction to regression and does have some value. But the Google material is a MESS, and students should be informed that they are getting really low quality material and should see some references to some beautifully polished material.

So, I helped any students who would be the target audience for the Google OP.

You should know this; I believe you do.

I'm offended by the mess and passing that out to students trying to learn. You should also be offended.


diyseguy 3030 days ago

Google dumbs everything down because they think everyone is dumb. I have learned to avoid their documentation and attempts to teach the populace.


graycat 3030 days ago

Correction of a typo:

> So, the function f: R*n --> R where for each x f(x)

should read

So, the function f: R^n --> R where for each x f(x)

Excuse: Just now I'm using a keyboard on a laptop, and I'm not used to the keyboard yet.


stared 3031 days ago

http://p.migdal.pl/2017/04/30/teaching-deep-learning.html -> "What mathematicians think I do"

(Full disclaimer - I did theoretical physics, so understand both sides. :))


telchar 3031 days ago

You should recheck your definitions on convexity.

>function f: R^n --> R where for all x in R^n f(x) = 0

This hyperplane is not convex. A convex curve by definition can not be equal to its tangent at any point.

Edit: I should specify, I mean a convex curve cannot be completely equal to any of its tangents, obviously it will equal each tangent at a single point.


graycat 3030 days ago

You don't want to consider just "tangents" and, instead, consider what I defined as supporting hyperplanes of the epigraph and subgradients of the function. If the gradient exists, that is, if the function is differentiable, then the subgradient really is a tangent. Otherwise can have many different subgradients supporting at one point on the curve and its epigraph.

It's simple: A cube has supporting planes at each point that is on an edge or at a corner, but those points do not have tangents.


tzs 3031 days ago

It sounds like you are describing curves that are strictly convex. Curves that are convex, but not strictly convex, can intersect their tangents at more than one point, or even at every point.

I'm going by the definition of convex function given in Rudin's "Principles of Mathematical Analysis", Apostol's "Calculus", Wikipedia, and MathWorld.


telchar 3031 days ago

Fair enough. I suppose pointing out that the authors merely omitted "strictly" wouldn't have served GP's point as well.


graycat 3030 days ago

Strictly convex need have no role in this Google ML material. Just convex is enough.


graycat 3030 days ago

> This hyperplane is not convex. A convex curve by definition can not be equal to its tangent at any point.

No, my math is fully correct, and your claim is wrong.

For a lecture Convexity 101, see my

https://news.ycombinator.com/item?id=16498564
