by wdabney 2349 days ago
First author of the paper here. If the article piques your interest, you can read the paper in question here: http://rdcu.be/b0mtA
2 comments

Hi Will, thanks for the reply. One thing I'm curious about: this paper discusses how a neuron that uses a different slope for positive and negative updates will converge to an expectile of the reward distribution. And the behavior is very interpretable, in that a slope of 3 for positive updates and a slope of 2 for negative updates will lead the neuron to converge to the 60th expectile (3 / (3 + 2) = 0.6), if my understanding is correct.

But it seems that the more common approach in reinforcement learning is to estimate quantiles via regression, rather than expectiles via asymmetric updates as in mouse neurons. Do you have any intuition for why this performs better? And is there an analog to the asymmetrically updating neuron in the quantile case?
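For readers who want to check the arithmetic, here is a minimal sketch of the asymmetric update described above. It assumes the convention that positive errors are scaled by one slope and negative errors by another, in which case the fixed point is the expectile at level alpha_pos / (alpha_pos + alpha_neg). The reward values, function names, and step sizes are made up for illustration, and the update is averaged over the whole reward set to keep the example deterministic; it is not the paper's code.

```python
def asymmetric_fixed_point(rewards, alpha_pos, alpha_neg, lr=0.05, steps=2000):
    """Iterate the expected asymmetric update until it settles.

    Positive errors are scaled by alpha_pos, negative errors by alpha_neg.
    """
    v = sum(rewards) / len(rewards)  # start from the mean
    for _ in range(steps):
        total = 0.0
        for r in rewards:
            delta = r - v  # prediction error for this reward
            slope = alpha_pos if delta > 0 else alpha_neg
            total += slope * delta
        v += lr * total / len(rewards)
    return v


def expectile(rewards, tau, tol=1e-12):
    """Reference tau-expectile by bisection on its first-order condition."""
    lo, hi = min(rewards), max(rewards)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        grad = sum((tau if r > mid else 1 - tau) * (r - mid) for r in rewards)
        if grad > 0:
            lo = mid  # estimate is too low
        else:
            hi = mid  # estimate is too high
    return (lo + hi) / 2


rewards = [1.0, 2.0, 5.0, 10.0]  # hypothetical, equally likely rewards
v = asymmetric_fixed_point(rewards, alpha_pos=3.0, alpha_neg=2.0)
print(v, expectile(rewards, 3.0 / (3.0 + 2.0)))
```

Swapping the two slopes moves the fixed point to the 0.4 expectile instead, which sits below the mean.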

Hi, thanks for posting the news story.

We can think about asymmetric regression more generally. If you have an error and apply some 'response' function f to that error, you change the estimator you learn. In quantile regression f is the sign function; in expectile regression it is the identity.
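A rough sketch of that idea, assuming the asymmetry enters as a weight of tau on positive errors and 1 - tau on negative ones (that weighting, the function names, and the toy reward set are assumptions for illustration). The only thing that changes between the two estimators is the response function passed in.

```python
def sign(x):
    """Response function for quantile regression."""
    return (x > 0) - (x < 0)


def identity(x):
    """Response function for expectile regression."""
    return x


def fit_statistic(rewards, tau, response, lr=1.0, steps=5000):
    """Repeat the expected update v += lr * mean(weight * response(error))."""
    v = sum(rewards) / len(rewards)
    for _ in range(steps):
        total = 0.0
        for r in rewards:
            delta = r - v
            weight = tau if delta > 0 else 1.0 - tau  # asymmetric weighting
            total += weight * response(delta)
        v += lr * total / len(rewards)
    return v


rewards = [float(r) for r in range(101)]  # hypothetical rewards 0, 1, ..., 100
q = fit_statistic(rewards, 0.25, sign)      # settles near the 25th percentile
e = fit_statistic(rewards, 0.25, identity)  # settles at the 0.25 expectile
print(q, e)
```

With the sign response the estimate only sees the direction of each error, so it hovers around the quantile in small steps; with the identity response it also sees the magnitude and lands on the expectile, which here is noticeably closer to the mean.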

In my opinion, and this is entirely speculation, further experiments that study the effect we found in our paper more completely will show that the response function f in the brain is not linear but some kind of saturating function, like a smoothed-out sign function. We repeated the experiments in the paper using such a function, which has been proposed for dopamine neuron responses before, and the analysis continues to hold. That is because the rewards are all quite small and likely fall in the linear region of a non-linear response function (we know firing rate saturates eventually, so this isn't much of a surprise).

Regarding quantiles being more commonly used, it's actually the other way around. The Huber-quantiles we saw perform best in the QR-DQN paper, and which most often get used in the follow-on RL work, are actually more like the type of saturating non-linearity you might expect in the brain (although the Huber loss is not as smooth as you probably would expect the neuron response to be).
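To make the "saturating non-linearity" point concrete, here is a small sketch of a Huber-style quantile loss and the response function it implies. It is written from the standard definition of the Huber loss with an asymmetric quantile weight; the threshold name kappa and the function names are illustrative, and this is not code from the QR-DQN paper. The derivative of the Huber loss with respect to the error is the error clipped to [-kappa, kappa]: linear for small errors and flat for large ones.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic for |u| <= kappa, linear beyond it."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def huber_quantile_loss(u, tau, kappa=1.0):
    """Huber loss weighted by tau for u >= 0 and by 1 - tau for u < 0."""
    weight = abs(tau - (1.0 if u < 0 else 0.0))
    return weight * huber(u, kappa)


def huber_response(u, kappa=1.0):
    """Derivative of the Huber loss in u: identity near zero, then saturated."""
    return max(-kappa, min(kappa, u))


for u in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(u, huber_response(u), huber_quantile_loss(u, tau=0.25))
```

A large kappa makes the response behave like the identity over the range of errors that actually occur (expectile-like), while a small kappa makes it behave like a scaled sign function (quantile-like). The kink at |u| = kappa is the lack of smoothness mentioned above.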

Hey Will, congratulations on the publication. It's very cool to see an algorithm reflected in the brain, and the turnaround from idea to implementation to experiment (in what, three years?) is astonishingly fast.

On a more technical note, I was curious about whether the distribution is being approximated "properly" (in the sense of probabilities summing to one, no negative probabilities) via expectile regression. Does that hold here? I'm less than an expert on neurology so I am unsure if that would be necessary in vivo (since it's clearly not needed for good performance in silico).

Thanks! It felt like a very long time, but yes, for neuroscience it is extremely fast, and it was only possible because Naoshige Uchida and his lab had already done the rodent experiments around probabilistic reward delivery.

There are a lot of open questions here, so anything I could say about the brain itself would be more of a guess. That said, for our proposed model no negative probabilities are needed, as the distribution is represented by a population of estimators for different predictors of value (in the general sense).

Hope that makes sense and helps clarify.