| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by stellalo 2036 days ago
	That is essentially an online subgradient descent, with fixed stepsize 1, for minimizing the L1 loss.

2 comments

srean 2036 days ago

Yes indeed.

I was quite startled when I saw that article published as I had been using this very method for years. Once you make the connection that quantiles (and not just a median) are the minimum of a suitably chosen loss function the rest is very straightforward.

Then there is expectiles too.

link

bollu 2036 days ago

Can you offer an ELIUndergrad of what an expectile is and where I can read more about them?

link

DavidSJ 2036 days ago

The introduction here appears to explain: https://projecteuclid.org/download/pdfview_1/euclid.ejs/1473...

Starting from the observation that the expectation of X is the constant c which minimizes the squared loss E[(X - c)^2], we can now generalize expectation by generalizing the loss function we aim to minimize.

They do this by asymmetrically weighting over- or under-estimates, unlike the squared loss which is symmetric.

This apparently has nice properties which the paper goes into.

link

srean 2034 days ago

I think everyone has left the building. Just in case you are still here let me try. BTW am a fan of your popular math stuff.

TLDR expectiles are to mean what quantiles are to median.

A longer explanation follows.

Mean can be looked upon as a location that minimizes a scheme of penalizing your 'prediction' of (many instances of) a random quantity. You can assume that the instances will be revealed after you have made the prediction. If your prediction is over/larger by e you will be penalized by e^2. If your prediction is lower by e then also the penalty is e^2. This makes mean symmetric. It punishes overestimates the same way as underestimates.

Now if you were to be punished by absolute value |e| as opposed to e^2 then median would be your best prediction. Lets denote the error by e+ if the error is an over-estimate and -e- if its under. Both e+ and e- are non-negative. Now if the penalties were to be * e+ + a e- * that would have led to the different quantiles depending on the values of a > 0. Note a \neq 1 introduces the asymmetry.

If you were to do introduce a similar asymmetric treatment of e+^2 and e-^2 that would have given rise to expectiles.

link

bollu 2034 days ago

Fascinating, thanks a lot! This is a great introduction :)

link

tadkar 2036 days ago

And would the 2 memory algorithm be equivalent to a gradient descent with momentum?

I used to know what a sub gradient was, but I think there must be something more to the ideas in the paper because I’m struggling to see the analogy between gradient descent where you take steps probabilistically and the algorithm described. Perhaps I need to think about how you could potentially recast the quantile estimation problem as an optimisation problem and then apply what is effectively the machinery developed the train neural nets. Very interesting connection!

link

stellalo 2036 days ago

Recasting quantile estimation as an optimization problem is trivial: the q-quantile minimizes the “pinball” loss (see first eqn in http://statweb.stanford.edu/~owen/courses/305a/lec18.pdf) with parameter q. What they do in the paper is to take subgradient steps with respect to the latest observation (just think about subgradients as gradients, since the loss function is everywhere differentiable except for one point)

link

zaroth 2036 days ago

I hate it when the complexity of the lingo dramatically exceeds the complexity of the algorithm. Language shouldn’t be the barrier to understanding.

This seems to be particularly true in computer learning. We’re taking about a conditional step function here, right?

link

eru 2036 days ago

The lingo is complex here, because it's general enough to be used for much more complicated cases.

Think of it as a 'hello world' program. The typically 'hello world' program in eg Java teaches you more about the lingo of Java than about solving the problem of putting 'hello world' on the screen.

(Of course, there are still plenty of bad reasons to describe simple things in complex lingo. But the above is one good reason.)

link

stellalo 2036 days ago

Actually, it looks like in the paper something else is going on other than subgradient steps: there is some more randomization going on, that can prevent some steps from being taken. So yeah, there is a connection with online subgradient, but also more to it :-)

link

tadkar 2036 days ago

Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking steps with some probability. Definitely something to think about, I can’t imagine this idea hasn’t been explored before. Thanks a lot for the insightful comments, I’ve definitely seen that work in a very new light after knowing about it for years!!

link

ppereira 2036 days ago

See quantile regression and hinge loss functions.

link