I was quite startled when I saw that article published as I had been using this very method for years. Once you make the connection that quantiles (and not just a median) are the minimum of a suitably chosen loss function the rest is very straightforward.
Starting from the observation that the expectation of X is the constant c which minimizes the squared loss E[(X - c)^2], we can now generalize expectation by generalizing the loss function we aim to minimize.
They do this by asymmetrically weighting over- or under-estimates, unlike the squared loss which is symmetric.
This apparently has nice properties which the paper goes into.
I think everyone has left the building. Just in case you are still here let me try. BTW am a fan of your popular math stuff.
TLDR expectiles are to mean what quantiles are to median.
A longer explanation follows.
Mean can be looked upon as a location that minimizes a scheme of penalizing your 'prediction' of (many instances of) a random quantity. You can assume that the instances will be revealed after you have made the prediction. If your prediction is over/larger by e you will be penalized by e^2. If your prediction is lower by e then also the penalty is e^2. This makes mean symmetric. It punishes overestimates the same way as underestimates.
Now if you were to be punished by absolute value |e| as opposed to e^2 then median would be your best prediction. Lets denote the error by e+ if the error is an over-estimate and -e- if its under. Both e+ and e- are non-negative. Now if the penalties were to be * e+ + a e- * that would have led to the different quantiles depending on the values of a > 0. Note a \neq 1 introduces the asymmetry.
If you were to do introduce a similar asymmetric treatment of e+^2 and e-^2 that would have given rise to expectiles.
And would the 2 memory algorithm be equivalent to a gradient descent with momentum?
I used to know what a sub gradient was, but I think there must be something more to the ideas in the paper because I’m struggling to see the analogy between gradient descent where you take steps probabilistically and the algorithm described. Perhaps I need to think about how you could potentially recast the quantile estimation problem as an optimisation problem and then apply what is effectively the machinery developed the train neural nets. Very interesting connection!
Recasting quantile estimation as an optimization problem is trivial: the q-quantile minimizes the “pinball” loss (see first eqn in http://statweb.stanford.edu/~owen/courses/305a/lec18.pdf) with parameter q. What they do in the paper is to take subgradient steps with respect to the latest observation (just think about subgradients as gradients, since the loss function is everywhere differentiable except for one point)
The lingo is complex here, because it's general enough to be used for much more complicated cases.
Think of it as a 'hello world' program. The typically 'hello world' program in eg Java teaches you more about the lingo of Java than about solving the problem of putting 'hello world' on the screen.
(Of course, there are still plenty of bad reasons to describe simple things in complex lingo. But the above is one good reason.)
Actually, it looks like in the paper something else is going on other than subgradient steps: there is some more randomization going on, that can prevent some steps from being taken. So yeah, there is a connection with online subgradient, but also more to it :-)
Thanks for the loss function reference! I wonder if there’s something waiting to be discovered here about doing gradient descent but only taking steps with some probability. Definitely something to think about, I can’t imagine this idea hasn’t been explored before.
Thanks a lot for the insightful comments, I’ve definitely seen that work in a very new light after knowing about it for years!!
I was quite startled when I saw that article published as I had been using this very method for years. Once you make the connection that quantiles (and not just a median) are the minimum of a suitably chosen loss function the rest is very straightforward.
Then there is expectiles too.