Hacker News
by dekhn 1491 days ago
It took me 20 years to learn this body of knowledge, and now it can just sort of be summed up in a paragraph.

When I learned and used gradient descent, you had to analytically determine your own gradients (https://web.archive.org/web/20161028022707/https://genomics....). I went to grad school to learn how to determine my own gradients. Unfortunately, in my realm, loss landscapes have multiple minima, and gradient descent just gets trapped in local minima.
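
A minimal sketch of that trapping behavior, assuming a hypothetical one-dimensional loss f(x) = x^4 - 3x^2 + x (chosen only for illustration, not taken from the linked work) with a gradient worked out by hand. Plain gradient descent settles into whichever basin it happens to start in:

```python
def f(x):
    # Hypothetical toy loss with two minima (illustration only).
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Gradient derived by hand: d/dx (x^4 - 3x^2 + x).
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=1000):
    # Fixed-step gradient descent; lr and steps are arbitrary choices.
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


from_right = gradient_descent(2.0)   # ends in the shallower minimum
from_left = gradient_descent(-2.0)   # ends in the deeper minimum
print(from_right, f(from_right))
print(from_left, f(from_left))
```

Both runs converge, but only the start on the left reaches the lower of the two minima; the start on the right has no way to climb over the hump in between.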

2 comments

This is the case for most contemporary neural networks as well. It turns out that for many domains, a "good" local minimum generalizes well across many tasks.
Huh. I talked to some experts and they told me NN loss functions are bowl-shaped and have a single minimum, but that minimum takes a very long time to navigate to in high-dimensional spaces.
For higher feature counts the real concern is not minima but saddle points, where the gradient is so small that you barely move at all each iteration and get "stuck".
To add here: for a local minimum to occur, the loss needs to increase along all of those dimensions (or features). This is highly unlikely for modern NNs, where you have millions of dimensions. If the loss goes down along one of the dimensions but up along the rest, you have a saddle point. Since you can go down along only one (or a few) dimensions, it takes longer.
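
A rough sketch of the saddle-point slowdown, assuming the hypothetical two-dimensional function f(x, y) = x^2 - y^2 (up along x, down along y, saddle at the origin; it is unbounded below, so it only illustrates the escape, not a real loss). The starting points and learning rate are arbitrary:

```python
import math


def grad(x, y):
    # Gradient of the hypothetical saddle f(x, y) = x^2 - y^2.
    return 2 * x, -2 * y


def descend_until_escape(x, y, lr=0.1, max_steps=1000):
    # Run gradient descent until |y| > 1, i.e. until we have clearly
    # slid off the saddle along the single descending direction.
    smallest_grad = float("inf")
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        smallest_grad = min(smallest_grad, math.hypot(gx, gy))
        x, y = x - lr * gx, y - lr * gy
        if abs(y) > 1.0:
            return step, smallest_grad
    return max_steps, smallest_grad


# Start almost exactly on the ridge that leads into the saddle.
slow_steps, slow_grad = descend_until_escape(1.0, 1e-6)
# Start well off the ridge for comparison.
fast_steps, fast_grad = descend_until_escape(1.0, 0.5)
print(slow_steps, slow_grad)
print(fast_steps, fast_grad)
```

The first run spends most of its iterations near the origin, where the gradient norm is tiny, before the one descending direction finally takes over.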
What's your realm?
Protein folding and structure prediction. Protein simulations typically define an energy function, similar to a loss function, over all the atoms in the protein. There are many terms: at least one per bonded atom pair, at least one per bonded atom triple, at least one per bonded atom quadruple, and one per non-bonded pair (although atoms that are distant can be excluded, sometimes making this a sparse matrix). If you start with a proposed model (say, random coordinates for all the atoms) and apply gradient descent, you'll end up with a mess. All those energy terms end up creating a high-dimensional surface that is absurdly spiky in the details and, at coarse grain, extremely wavy with many local minima.
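
A toy sketch of how such an energy function is assembled, under heavy assumptions: only harmonic bonded-pair terms and Lennard-Jones non-bonded pair terms with a distance cutoff are included (the triple and quadruple terms are omitted for brevity), and every parameter value is hypothetical rather than taken from any real force field:

```python
import math
from itertools import combinations


def toy_energy(coords, bonds, k_bond=100.0, r0=1.0,
               epsilon=0.2, sigma=1.0, cutoff=3.0):
    # coords: list of (x, y, z) tuples; bonds: list of (i, j) index pairs.
    # Assumes no two atoms share the same position (r > 0).
    bonded = {frozenset(b) for b in bonds}
    energy = 0.0

    # One harmonic term per bonded atom pair.
    for i, j in bonds:
        r = math.dist(coords[i], coords[j])
        energy += 0.5 * k_bond * (r - r0) ** 2

    # One Lennard-Jones term per non-bonded pair, skipping distant pairs.
    for i, j in combinations(range(len(coords)), 2):
        if frozenset((i, j)) in bonded:
            continue
        r = math.dist(coords[i], coords[j])
        if r > cutoff:
            continue  # excluded: this is what makes the pair list sparse
        sr6 = (sigma / r) ** 6
        energy += 4 * epsilon * (sr6 ** 2 - sr6)
    return energy


bonds = [(0, 1), (1, 2), (2, 3)]
relaxed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
print(toy_energy(relaxed, bonds), toy_energy(stretched, bonds))
```

Even in this four-atom chain the number of non-bonded pairs grows quadratically with atom count, which is why the cutoff matters in practice.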

Instead of using gradient descent, we used molecular dynamics (I'm unaware whether this has a direct equivalent) to sample the space by moving along various isocontours (constant energy, or constant temp, or usually constant pressure). Even so, you have to do a lot of sampling (in my day, it was years of computer time; now it's months) to get a good approximation to the total landscape, and to measure transition frequencies between areas of the landscape that correspond to energy barriers (local maxima) that are smaller than the thermal energy available to the system.
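
A minimal sketch of the constant-energy case, assuming a single particle in a hypothetical one-dimensional double-well potential U(x) = (x^2 - 1)^2 and a velocity Verlet integrator (a standard MD integration scheme, though not necessarily the one used in the work described here). It uses total energy rather than thermal energy as the yardstick against the barrier, since there is no thermostat:

```python
def potential(x):
    # Hypothetical double well: minima at x = -1 and x = +1,
    # barrier of height 1 at x = 0.
    return (x * x - 1.0) ** 2


def force(x):
    # Force is the negative gradient of the potential.
    return -4.0 * x * (x * x - 1.0)


def simulate(x, v, dt=0.01, steps=2000, mass=1.0):
    # Velocity Verlet at (approximately) constant total energy.
    # Returns the number of barrier crossings and the worst energy drift.
    e0 = potential(x) + 0.5 * mass * v * v
    max_drift = 0.0
    crossings = 0
    f = force(x)
    for _ in range(steps):
        x_new = x + v * dt + 0.5 * (f / mass) * dt * dt
        f_new = force(x_new)
        v = v + 0.5 * ((f + f_new) / mass) * dt
        if x * x_new < 0:
            crossings += 1  # moved from one well to the other
        x, f = x_new, f_new
        e = potential(x) + 0.5 * mass * v * v
        max_drift = max(max_drift, abs(e - e0))
    return crossings, max_drift


# Total energy 0.125: below the barrier, so the particle stays in one well.
low_crossings, low_drift = simulate(1.0, 0.5)
# Total energy 1.125: above the barrier, so it hops between wells.
high_crossings, high_drift = simulate(1.0, 1.5)
print(low_crossings, low_drift)
print(high_crossings, high_drift)
```

Counting crossings over a long trajectory is a crude stand-in for the transition frequencies mentioned above; a real simulation would add a thermostat or barostat and many more degrees of freedom.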

It's complicated. Also, DeepMind obviated all my work by proving that sequence data (which is cheap to obtain) can be used to predict very accurate structures with little or no simulation.