by stpn 381 days ago
(post author here)

I was curious about this since it kind of makes sense, but I offer a few reasons why I think this isn't the case:

- In the 10% noise case at least, the second descent eventually finds a minimum that's better than the original local minimum, which suggests to me the model really is finding a better fit rather than just reducing itself to a similar smaller model

- If that were the case, I think we'd also expect the error of the larger models to converge to the performance of the smaller models. Instead, they converge to a lower error

- I checked the logged gradient histograms I had for the runs. While I'm still learning how to interpret the results, I didn't see signs of vanishing gradients where dead neurons later in the model prevented earlier layers from learning. Gradients do get smaller over time, but that seems expected, and we don't see big waves of neurons dying, which is what I'd expect if the larger network were converging on the size of the smaller one.
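
For anyone who wants to check the dead-neuron point more directly than by eyeballing histograms, here is a rough sketch of one way to count them. This is not the code from the post: `dead_unit_fraction` and the toy activations are made up for illustration, and it assumes you can dump a layer's post-ReLU activations for a batch as a list of rows (one row per input, one column per hidden unit).

```python
def dead_unit_fraction(activations, eps=0.0):
    """Fraction of hidden units that never fire on the given batch.

    activations: list of rows, one per input example; each row holds the
    post-ReLU output of every hidden unit in one layer (assumed layout).
    A unit counts as "dead" here if its output is <= eps for every example.
    """
    if not activations:
        raise ValueError("need at least one example")
    n_units = len(activations[0])
    dead = sum(
        1
        for j in range(n_units)
        if all(row[j] <= eps for row in activations)
    )
    return dead / n_units


# Toy batch: 4 inputs, 3 hidden units. The middle unit never fires.
acts = [
    [0.5, 0.0, 1.2],
    [0.0, 0.0, 0.3],
    [0.9, 0.0, 0.0],
    [0.1, 0.0, 2.0],
]
print(dead_unit_fraction(acts))
```

If the larger network were shrinking itself down to the smaller one, I'd expect this fraction to jump at some point during training when computed per checkpoint on a fixed batch. A single batch can undercount live units, so a larger held-out sample is safer.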

1 comments

Thanks for the analysis. I'm a seasoned ML researcher but I wasn't aware of the phenomenon. Not sure yet how to make sense of it, but the blog post was great.