Hacker News new | ask | show | jobs
by cs702 2178 days ago
NeurIPS, 2020: "We need more sample-efficient algorithms for finding better hyperparameters that specify how to train computationaly expensive deep learning models."

Rich Sutton, 2019: "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." (https://news.ycombinator.com/item?id=23781400)

I wonder if in the end simply throwing more and more computation at the problem of finding good hyperparameters will end up working better as computation continues to get cheaper and cheaper.

3 comments

It certainly does look that way for certain classes of problems, as witnessed by the evolution of GPT language models, where the model gets better through sheer use of compute resources.

For many combinatorial problems however, improvements in algorithms can often produce bigger strides than just throwing brute force compute at the problem. Take Mixed Integer Programs (MIPs) -- roughly the optimization-equivalent of SATs -- used for airline scheduling, optimal assignment problems and such. In slide 12 [1] (there are other sources that corroborate), the author notes that MIP solver performance between 1988-2017 had improved 2,527,768,000x.

17,120x was due to machine improvements (single core). 147,650x was due to algorithmic improvements. Multiple cores can also provide a performance boost up to a point, before saturating due to coordination costs. The author notes that "A typical MIP that would have taken 124 years to solve in 1988 will solve in 1 second now".

The biggest improvements in MIP algorithm performance have been due to improvements in solver heuristics (!), because the fastest computations are those that don't have to be performed at all -- i.e. that are eliminated via heuristics.

[1] http://www.focapo-cpc.org/pdf/Linderoth.pdf

> 17,120x was due to machine improvements (single core). 147,650x was due to algorithmic improvements.

That is just... insane. I knew only vaguely that performance had improved significantly for many NP-hard/complete problems in practice, but I did not realize the magnitude of improvement, especially due to better algorithms.

> The biggest improvements in MIP algorithm performance have been due to improvements in solver heuristics (!), because the fastest computations are those that don't have to be performed at all -- i.e. that are eliminated via heuristics.

That is also... remarkable. Thank you for sharing.

I can't help but agree with you :-)

EDIT: Given that most of the "algorithmic" improvements have been due to better solver heuristics, I imagine it should be possible to train meta DL/RL models that learn how to find good heuristics for training DL models with high sample efficiency. Come to think of it, this competition seems to be asking precisely for such "black-box heuristic-guessing" models, so clearly there are people working on it.

Yes, when Bill Bixby (founder of CPLEX and Gurobi, companies that made and still make the fastest MIP solvers in the world) revealed similar numbers a few years ago, many of us practitioners were astounded too.

But those improvements were also a product of lots of smart people funded by cash-rich industries (i.e. oil & gas, airlines... MIPs are big bucks commercially. I recall at one point a commercial CPLEX license was $100k list) poking at the problem for over 30 years. Many Ph.D.s in operations research and mathematical optimization were generated on this topic alone.

Makes sense.

Note that now we have lots of smart people funded by new cash-rich industries (search, social networks, SaaS, etc.) poking at the problem with "black-box" approaches. I will be interesting to see what comes out of it.

Great example of the fact that different circumstances require different approaches, and the fact that brute force is increasingly impractical for combinatorially complex problems. Thanks for this reference.
Agree. For classes of problems that human have poor understandings (e.g biology) even better algorithms can not provide better answers. The optimisation might need to happen elsewhere such as re-phrasing the problem or better understanding of related fields (neighbour area). However sample efficiency is indeed very helpful in data deprived area such as biomedical research.
This sounds like the reason Bertsimas at MIT published his new book.
A very reasonable point and, certainly, the direction that parts of the computational community have embraced over the years. I will use integration as an example: classic computational methods were focused on trying to make strong assumptions about the integrand and significantly reduce the number of integrand evaluations (Gauss quadrature is the main thing that comes to mind). As computation became more accessible/parallelizable, and problems became less analytic, Monte Carlo methods have become more fundamental.

In some distributed computational settings, memory traffic is actually the main bottleneck and redundant computations are executed to reduce the need to send data (a similar situation to the one you aptly describe).

I think that, in the case of hyperparameter/meta-learning optimization (or search, depending on how you think about it) we are at a time right now where the complexity of models which can effectively be put into production is a function of our ability to, at least partially, analyze the space of possible modeling decisions. Will we escape that, and have models whose training cost is less significant than the cost of executing an "intelligent" hyperparameter search process? Maybe ... I am a GP person so I see potential in clever analysis of circumstances so that RKHS methods (for instance) can be leveraged and simplify the training process. But the current trajectory of the community has been to work on increasingly expensive models, which makes the ability to effectively use them with limited tuning/search cost still relevant.

Yes. Upon further reflection, this competition actually looks like another step in that direction (see this thread: https://news.ycombinator.com/item?id=23784239).

Otherwise I agree, Gaussian Processes are nice and friendly, and work quite well for low-dimensional search (e.g., from a few to hundreds of hyperparameters) under very natural, general assumptions :-)

to do level-2 optimization you still have to run the level-1 problem repeatedly. So you'll only ever spend a small fraction of your resources on meta-optimization. The levels become exponentially more costly.