|
|
|
|
|
by aoki
3030 days ago
|
|
i would request that you stop critiquing "machine learning" based on the presentation in introductory online materials like this and the ng coursera course. you provide a lot of signal in general but i think these critiques do decrease your SNR. i am certain that you are familiar with the "usual" statistics sequence. (for others: there are lower-division courses that use calculus in a few places but otherwise avoid it, focusing instead on memorizing procedures. there is the upper-division probability/math stat sequence that uses calculus heavily but avoids analysis. and there is an intro phd sequence that finally gets into measure theory.) if you look at a coursera course that gives an high-level overview of practical statistics and simplifies its presentation to be accessible to people who never took calculus, you can criticize the very idea of a course that does not explain the measure-theoretic issues, but it makes no sense to use it to criticize the field of statistics, or to criticize the competence of others in the institution that produced it. here, google is producing introductory training materials for developers. many developers have never taken calculus, let alone optimization, statistics, or analysis, and when i took MLCC internally, you were supposed to go through this whole thing (lectures and coding) in two days. it's supposed to give you enough understanding of the concepts to understand the API and apply it. |
|
The Google statement I quoted was flatly wrong. It is really important for students to be told that.
I gave some references to more in statistics.
> the measure-theoretic issues
I didn't mention measure theory, and the statistics references I gave don't mention measure theory either. I referenced Fleming only as background on convexity, and there is no measure theory in that part of Fleming.
You mentioned the role of calculus for the math of regression: That use of calculus, for deriving the normal equations, is neither necessary nor, really, sufficient. There is another derivation, nicer, fully detailed mathematically, with no calculus at all. The core of the idea is that the minimization of the squared error has to be an orthogonal projection and, then, presto, bingo, get the normal equations. And there are more advantages to that derivation. To keep my post simple, I omitted that derivation, but a good treatment of regression without calculus could use it.
I omitted the standard definition of convexity, but apparently in this discussion we need that. By omitting the definition, the Google material was not so good.
Definition (convex function): For the set of real numbers R and a positive integer n, a function f: R^n --> R is convex provided for any u, v in R^n and any a in [0,1] we have
f(au + (1-a)v) <= af(u) + (1-a)f(v)
So, for a picture, on the graph of f(x), we have two points (maybe not distinct) (u, f(u)) and (v, f(v)). Then we draw the line between these two points. The number a determines where we are on that line. With a = 0, we are at point (v, f(v)). With a = 1, we are at point (u, f(u)). Then as we move a from 0 to 1, we move along that line. We also have on the graph the point, say, P
(au + (1 - a)v, f( au + (1-a)v ) )
And on the line we drew, we have point, say, Q
(au + (1 - a)v, af(u) + (1-a)f(v))
Well, we are asking that point Q be the same as point P or directly above point P. That is, the line we drew is on or above the graph of (x, f(x)). The line is sometimes called a secant line and is said to over estimate the function.
Definition (concave): The function -f is concave if and only if the function f is convex.
As in Fleming, a convex function is continuous. Intuitively, the proof is based on two cones, and as we approach a point we get herded between the two cones. The cones are from the convexity assumption. Draw a picture.
IIRC, there is a result in Rockafellar that a convex function is differentiable almost surely with respect to Lebesgue measure, but this is the only connection I would make with convexity and measure theory.
For convex f, the set of all (x, y) where y >= f(x) is the epigraph of f, that is, the region on or above the graph of f.
Definition (convex set): A subset C of R^n is convex provided for any u, v in C and any a in [0,1] the point
au + (1-a)v
is also in set A.
Well, as in Fleming, the epigraph is convex. It is also closed in the usual topology of R^n.
Definition (closed set): A subset C of R^n is closed (in the usual topology of R^n) provided for any sequence x_n, n = 1, 2, ... in C that converges to y in R^n, y is also in C.
In particular, if we define the boundary of C, set C contains is boundary. So, the interval [0,1] is closed and the interval (0,1) is not closed.
Well, for any closed convex set and point u on its boundary, there exists a hyperplane that passes through point x and where set C is a subset of the closed half space on one side of the hyperplane.
Such a hyperplane is said to be supporting for set C at point x.
Intuitively, in R^3, push convex set C to be in contact with a wall. Suppose point x on the boundary of set C is in contact with the wall. Then the wall is a supporting plane for set C at x, and set C is a subset of the room side of the wall.
Or think of a big, solid, irregular piece of cheese and John Belushi as his Samurai Tailor swinging his sword: John keeps swinging his sword in arcs that are in flat planes and cuts down the irregular cheese to a convex hunk. So, the convex C has been determined (formed) from the supporting hyperplanes from Belushi's sword.
For another way to make a convex set, take a piece of wood and press it against a belt sander until the boundary consists of only flat sides.
Let a solid rock roll around in a steam of water for a few thousand years, and may end up with a smooth, shiny, convex rock.
Faster a chicken egg is convex.
In general, a closed convex set is the intersection of its supporting hyperplanes.
Then, we can approximate a convex set with some of its supporting hyperplanes. In particular, we can approximate a convex function with some supporting hyperplanes of its epigraph. At times, this can be useful -- it's the main idea behind Lagrangian relaxation in constrained optimization (I used that once).
In particular, the epigraph of a convex function is the intersection of the supporting hyperplanes. In that case, a supporting hyperplane is called a subgradient. The function is differentiable at the point of contact if and only if the subgradient is unique. If the subgradient is unique, then it is just the tangent hyperplane from the gradient of the function.
In R^3, a cube is a convex set. Each of its sides is part of a supporting hyperplane. For a point on the boundary of the cube that is not on an edge, the supporting hyperplane at that point is unique. The corners and edges of the cube also have supporting hyperplanes, but they are not unique.
In R^3 a sphere is convex. Then it is also the intersection of its supporting hyperplanes. At each point on the boundary of the sphere, the supporting hyperplane is unique.
Similarly for epigraphs.
So, the function f: R*n --> R where for each x f(x) = 0 is convex, concave, and linear, and each x is both a maximum and a minimum of f. So, the minimum of a convex function need not be unique. In this case, the epigraph is just a closed half space.
"Look, Ma! No calculus!" And no measure theory.
Exercise: Derive the regression normal equations via perpendicular projections and without calculus.
Exercise: Argue the role of perpendicular projections in the minimization in regression.