Hacker News
by metakermit 3047 days ago
I like the parallel between optical lens systems and deep learning. I'm also kind of disappointed by the "arcane lore" status hyper-parameters have in different ML domains. I think it would be healthier for the community to make it a habit to explicitly document why a certain topology and layer sizes were selected. It's like providing documentation with your open source project: yes, knowledgeable people could use it without documentation, but doing so is much more difficult and beginner-unfriendly.
3 comments

I wonder how documentable the space of hyperparameters really is (which I think is what the OP is poking at), given the way we currently conceive of them and the way experiments currently happen.

Often, people either reuse other people's architectures, or simply try 2 or 3 and stick with the best one, only changing the learning rate and such.

I also wonder if there's a computation issue (training is long, we can only try so many things), or if it really is that we are working in the wrong hyperparameter space. Maybe there is another space we could be working in, where the HPs that we currently use (learning rate, L2 regularization, number of layers, etc.) are a projection from that other HP space where "things make more sense".

In this regard, it is similar to how natural sciences are done. The hyperparameter space of possible experiments is immense, they are expensive, so one has to go with intuition and luck. Reporting this is difficult.

[edit:] In this analogy, deep learning currently lacks any sort of general theory (in the sense of theories explaining experiments).

>In this regard, it is similar to how natural sciences are done. The hyperparameter space of possible experiments is immense, they are expensive, so one has to go with intuition and luck. Reporting this is difficult.

I'd agree it's done in a sort-of scientific way. But I don't think you can say it's done the way natural science is done. A complex field, like oceanography or climate science, may be limited in the kinds of experiments it can do and may require luck and intuition to produce a good experiment. But such a science is always aiming to describe an underlying reality, and its experiments aim to confirm or refute a given theory.

The process of hyperparameter optimization doesn't involve any broader theory of reality. It is essentially throwing enough heuristics at a problem and tuning enough that they more or less "accidentally" work.

You use experiments to show that this heuristic approximation "works", but this sort of approach can't be based on a larger theory of the domain.

And it's logical that there can't be a settled theory of how any approximation to any domain works. You can have a bunch of ad-hoc descriptions of approximations, each of which works with a number of common domains, but it seems logical that these will forever remain not-a-theory.

Exploring even a tiny, tiny, tiny part of the hyperparam space takes thousands of GPUs. And that is for a single dataset and model; change anything and you have to redo the entire thing.

I mean, maybe some day, but right now, we're poking at like 0.00000000001% of the space, and that is state-of-the-art progress.
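
To put rough numbers on that, here is a minimal sketch of how fast a plain grid over a handful of hyperparameters grows. The hyperparameter names and the number of candidate values per axis are made-up assumptions for illustration; they are not from any paper and they do not reproduce the percentage quoted above.

```python
from math import prod

# Hypothetical search space: how many candidate values we would try per
# hyperparameter. These counts are illustrative assumptions only.
grid = {
    "learning_rate": 10,
    "l2_regularization": 10,
    "num_layers": 8,
    "layer_width": 8,
    "batch_size": 6,
    "dropout": 5,
    "optimizer": 4,
    "activation": 4,
}


def grid_size(space):
    """Number of distinct configurations in a full grid over `space`."""
    return prod(space.values())


def fraction_explored(space, num_trials):
    """Fraction of the full grid covered by `num_trials` distinct runs."""
    return num_trials / grid_size(space)


if __name__ == "__main__":
    print("configurations in the grid:", grid_size(grid))
    print("fraction covered by 1000 runs:", fraction_explored(grid, 1000))
```

Each extra axis multiplies the total, so even coarse grids get out of reach quickly, and every configuration is a full training run on one dataset and one model.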

A DNN might be more effective at exploring the hyperparameter space than people are with their intuition and luck. Rumor is Google has achieved this.
Google simply has the computational resources to cover thousands of different hyperparameter combinations. If you don't have that, you won't ever be able to do systematic exploration, so you might as well rely on intuition and luck.
This is not accurate. Chess alone is so complex that brute force would still take an eternity, and they certainly don't have a huge incentive to waste any money just to show off (because that would reflect negatively on them).

But how does it work? It's enough to outpace other implementations, alright. But the model even works on a consumer machine, if I remember correctly.

I have only read a few abstract descriptions and I have no idea about deep learning specifically. So the following is more musing than summary:

They use the Monte Carlo method to generate a sparse search space. The data structure is likely highly optimized to begin with. And it's not just a single network (if you will, any abstract syntax tree is a network, but that's not the point), but a whole architecture of networks: modules from different lines of research pieced together, each probably with different settings. I would be surprised if that works completely unsupervised; after all, it took months to get from beating Go to chess. They can run it without training the weights, but likely because the parameters and layouts are optimized already, and to the point of the OP, because some optimization is automatic. I guess what I'm trying to say is, if they extracted features from their own thought process (i.e. domain knowledge) and mirrored that in code, then we are back at expert systems.

PS: Instead of letting processors run small networks, take advantage of the huge neural network experts have in their heads and guide the artificial neural network in the right direction. Mostly, information processing follows insight from other fields, and doesn't deliver explanations. The explanations have to be there already. It would be particularly interesting to hear how the chess play of the developers involved has evolved since, and how much they actually understand the model.

I'm curious why you believe you can tell that my comment is not accurate when you yourself admit that you have no idea about deep learning?

Note that I'm not saying that Google is doing something stupid or leaving potential gains on the table. What I'm saying is that their methods make sense when you are able to perform enough experiments to actually make data-driven decisions. There is just no way to emulate that when you don't even have the budget to try more than one value for some hyperparameters.

And since you mentioned chess: The paper https://arxiv.org/pdf/1712.01815.pdf doesn't go into detail about hyperparameter tuning, but does say that they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.

Google did do this. It took ungodly amounts of computing power and only did slightly better than random search. They didn't even compare to old fashioned hill climbing or Bayesian optimization.
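
For anyone who hasn't seen the two baselines side by side, here is a minimal sketch of random search and old-fashioned hill climbing under the same evaluation budget. Everything in it is an assumption for illustration: `toy_loss` is a made-up smooth bowl standing in for a validation loss, the two axes are imagined log10 learning-rate and log10 L2 values, and the bounds and step size are arbitrary. It says nothing about what Google's experiments did or found.

```python
import random

BOUNDS = (-6.0, 0.0)  # assumed search range for both (log10) hyperparameters


def toy_loss(lr_exp, l2_exp):
    """Hypothetical stand-in for a validation loss; minimum at (-3, -4)."""
    return (lr_exp + 3.0) ** 2 + 0.5 * (l2_exp + 4.0) ** 2


def sample_point(rng):
    """Draw one configuration uniformly from the box defined by BOUNDS."""
    return (rng.uniform(*BOUNDS), rng.uniform(*BOUNDS))


def random_search(loss, budget, rng):
    """Evaluate `budget` independent random points and keep the best."""
    best_val, best_x = None, None
    for _ in range(budget):
        x = sample_point(rng)
        val = loss(*x)
        if best_val is None or val < best_val:
            best_val, best_x = val, x
    return best_val, best_x


def hill_climb(loss, budget, rng, step=0.5):
    """Start at a random point, then accept only improving Gaussian moves."""
    x = sample_point(rng)
    val = loss(*x)
    for _ in range(budget - 1):  # the starting point used one evaluation
        cand = tuple(
            min(BOUNDS[1], max(BOUNDS[0], xi + rng.gauss(0.0, step)))
            for xi in x
        )
        cand_val = loss(*cand)
        if cand_val < val:
            x, val = cand, cand_val
    return val, x


if __name__ == "__main__":
    rng = random.Random(0)
    print("random search:", random_search(toy_loss, 50, rng))
    print("hill climbing:", hill_climb(toy_loss, 50, rng))
```

On a smooth single-minimum toy like this, hill climbing has an easy job; real loss surfaces are noisy and change with the dataset, which is exactly why the comparison would have been worth running.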
I've considered using gradient descent to optimize hyperparameters on toy problems at university a few times. I never actually did it, though: it's a lot of hassle, and the advantage of less interaction comes at the cost of no longer building some intuition.
A step in the right direction would be to encourage sharing negative results. It's important to know what to avoid too.
It is not often the case that someone actually knows why a hyperparam or architecture choice works. We pretend, sometimes, but frankly, it's mostly made up junk to cover the fact that most ML research involves a huge amount of intuitive guesswork and trial-and-error.

And the loss surfaces vary. Even just changing the dataset or even the input size alters the loss surface and can easily break a model.

It's not called Gradient Descent by Grad Student for nothing.

>I think it would be healthier for the community to make it a habit to explicitly document why a certain topology and layer sizes were selected.

Also, which other topologies were tried and failed to produce good results. It's amazing that this information is missing from most modern ML papers.
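
One lightweight way to capture this would be a per-experiment log that records rejected configurations next to the one that was kept. Below is a minimal sketch, assuming a JSON-lines file as the format; the `Trial` fields and every value in the example are hypothetical placeholders, not results from any real experiment.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Trial:
    """One attempted configuration, whether or not it made it into the paper."""
    topology: str
    layer_sizes: list
    learning_rate: float
    val_accuracy: float
    kept: bool
    rationale: str


def to_jsonl(trials):
    """Serialize trials as JSON lines, one configuration per line."""
    return "\n".join(json.dumps(asdict(t), sort_keys=True) for t in trials)


# Placeholder entries only; the numbers are invented to show the shape.
trials = [
    Trial("mlp", [256, 256], 1e-3, 0.0, False,
          "placeholder: baseline, dropped after underfitting"),
    Trial("cnn", [32, 64, 128], 1e-3, 0.0, True,
          "placeholder: kept, widths doubled per stage by convention"),
]

if __name__ == "__main__":
    print(to_jsonl(trials))
```

The `rationale` field is the point: even "picked by convention" or "first thing that worked" is more than most papers currently tell you.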