Hacker News
by chriszhang 2012 days ago
We trained a deep learning model to look at about 20 system parameters and predict an output. The parameters were binary. So one curious engineer decided to brute-force the trained model with all possible inputs, about 2^20 of them, to see what the model does. He found that, for the problem we were solving, only 4 of the 20 parameters had any effect on the results. The remaining 16 or so parameters did not affect the results.

So he replaced the model with a single line of code with one boolean expression made with those 4 parameters connected with logical operators.
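
A minimal sketch of that kind of brute-force check, for anyone curious. The real model and its four parameters are not given above, so `stand_in_model` and the bit positions 2, 7, 11 and 18 are made up purely for illustration. The idea is to evaluate every one of the 2^20 inputs once, then call a parameter relevant if flipping it changes the output for at least one input.

```python
N_PARAMS = 20

def stand_in_model(bits):
    """Hypothetical stand-in for the trained network.

    `bits` is an int whose i-th bit is the i-th binary parameter.
    Only bits 2, 7, 11 and 18 are used (an assumption for this sketch).
    """
    b2, b7 = (bits >> 2) & 1, (bits >> 7) & 1
    b11, b18 = (bits >> 11) & 1, (bits >> 18) & 1
    return (b2 & b7) | (b11 ^ b18)

def relevant_parameters(predict, n_params):
    """Return indices of parameters whose flip changes the output somewhere."""
    n_inputs = 2 ** n_params
    # One prediction per possible input; each output is 0 or 1.
    outputs = bytes(predict(x) for x in range(n_inputs))
    relevant = []
    for i in range(n_params):
        mask = 1 << i
        # x ^ mask is the same input with parameter i flipped.
        if any(outputs[x] != outputs[x ^ mask] for x in range(n_inputs)):
            relevant.append(i)
    return relevant

relevant = relevant_parameters(stand_in_model, N_PARAMS)
print(relevant)  # [2, 7, 11, 18]
```

Once the relevant set is known, the truth table over just those parameters has only 2^4 = 16 rows, which is small enough to read off a boolean expression by hand.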

6 comments

That kind of problem, with such a limited number of parameters, really shouldn't be thrown into a neural network. A decision tree (or variant) might have been the ideal ML technique, and you may have been able to quickly see which parameters mattered and reduce the four parameters to code if needed.

Neural networks make sense with a huge number of input parameters, where feature selection is really tricky to reason about and decision boundaries are very non-linear, such as in image classification.

Edited: slight clarification

When I studied AI in uni, in the cold cold (cold) winter of AI, this kind of input was really significant, but most problems we consider ML now are vastly more complex, and other problems can be addressed by things that are no longer considered AI at all (though they were back then).

It is funny how my uni's top research neural nets (on $1M computers) are considered to make no sense anymore. That went a lot faster than programming.

There's no rule that when the number of parameters is small deep learning shouldn't be used. The one time where deep learning maybe shouldn't be attempted at all is when the number of samples is very limited. While it excels with high-dimensional hierarchical data, it can do well on other problems as well. It differs from problem to problem, and usually multiple solutions are tried and compared, starting with EDA and linear regression.
> There's no rule that when the number of parameters is small deep learning shouldn't be used

I would be genuinely interested in examples of problems with a very low number of predictors (say two to five) where a neural net would be appropriate (where, as you say, less complex methods have been tried and failed).

I just can't think of one.

Suppose you have to fit a fairly non-linear curve to make interpolated predictions. A NN could do that with fewer parameters than most other models.

I can't think of a method that would use fewer parameters. If nothing else, it's a decent way to compress the data set for interpolation (on nearby averages) as a use case, no?

For interpolation (just to be clear, not regression, i.e. interpolation means the curve has to pass through every point in the data exactly), polynomial interpolation gives a unique polynomial of lowest possible degree [1]; I'm not sure a NN would have fewer parameters than this for interpolation, strictly speaking.
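
To make the "passes through every point exactly" part concrete, here is a small sketch of Lagrange-form polynomial interpolation. The four data points are made up for illustration. With n points the interpolating polynomial has degree at most n - 1, so it is described by n coefficients.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique lowest-degree interpolating polynomial at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Basis factor: equals 1 at x == xi and 0 at x == xj.
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up sample data: 4 points, so a cubic (4 coefficients) fits exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]

fitted = [lagrange_interpolate(xs, ys, x) for x in xs]
print(fitted)
```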

To your point, I believe you meant "rough interpolation", and it's true in many cases NN's might produce a less overfitted approximating function if one has no prior knowledge of the generating function.

But if one can exploit prior knowledge, one can select an optimal set of basis functions and fit a more parsimonious model than a NN. For instance, if you knew that a nonlinear function was a function of sin, cos and logs, selecting these as basis functions and finding the correct functional form [2] would likely help an optimizer find a more parsimonious model than a NN using standard activation functions (ReLU, sigmoid, etc.). As a thought experiment, suppose the generating function was this: (5 parameters)

  y = a1*log(a2*x)/cos(a3*x) + a4*sin(a5*x)
If one attempted to fit this with log, cos and sin basis functions, one is likely to recover this form with ~5 parameters. But suppose we tried to fit this with an NN with the stipulation that the approximation error is under some ε -- I suspect we'll need quite a bit more than 5 parameters.
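
For a rough sense of scale, here is a sketch that only counts parameters, assuming a plain fully connected net with one scalar input, one hidden layer and one scalar output (an assumption for illustration, not a claim about what error any of these nets would reach). Each layer contributes (inputs + 1) * outputs weights and biases, so a hidden layer of width h costs 3h + 1 parameters.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected net.

    layer_sizes lists the width of each layer, input first,
    e.g. [1, 4, 1] is 1 input, 4 hidden units, 1 output.
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Parameter count for a few hidden widths h with 1 input and 1 output.
counts = {h: mlp_param_count([1, h, 1]) for h in (1, 2, 4, 8)}
print(counts)  # {1: 4, 2: 7, 4: 13, 8: 25}
```

So a budget of 5 parameters only pays for a single hidden unit, and two hidden units already cost 7.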

NN's tend to generalize better (assuming proper regularization) than polynomial approximations and have fewer numerical problems like Runge's phenomenon, but I don't think NNs aim for (or have results that demonstrate) parsimony in parameters.

[1] https://en.wikipedia.org/wiki/Polynomial_interpolation

[2] If the functional form is unknown, there are techniques like "symbolic regression" that attempt to do a structure search to find a well-fitting structure. https://en.wikipedia.org/wiki/Symbolic_regression

For online learning, multi-output, non-negative output, unlabeled data, etc., neural networks work well. The power of deep learning lies in how you can shape the problem and loss function for specific purposes. And even if these circumstances do not exist they can do well; it's all problem-specific.
Historically the XOR function has been the simple example that many ML algorithms can't handle. Just imagine a higher dimensional XOR with outliers, and you have a pretty good use case for DL with limited predictors.
Historically, this was solved in the 80's with the multi-layer perceptron, but it seems it still gets repeated.
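
For reference, a minimal sketch of how a multi-layer perceptron handles XOR. A single linear threshold unit cannot represent XOR, but two hidden units (one acting as OR, one as AND) plus an output unit can. The weights below are hand-picked for illustration rather than trained.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """Two-layer perceptron with hand-set weights that computes XOR."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # OR and not AND

table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
print(table)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```
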
20 parameters is WAY too small for a deepnet. Deepnets are better suited for very high-dimensional spaces where the data has sparsity and a hierarchical structure, and can take advantage of the net's rotational, translational invariance, etc. (if that architecture is used).

You can't pick completely the wrong tool and then complain about how unsuitable it was.

Was there any rationale to using a "deep learning model" to train a 20 parameter model? Sounds like some amateur DS convinced the team this is a good idea because he thought deep learning was cool?
For up to 100ish parameters, even mixed with floating point I recommend trying the midaco solver a friend of mine develops. MINLP, ant colony method (i.e. gradient descent with many restarts). From my experience this runs circles around NNs for this class of problems (parameter optimization with relatively low complexity and/or limited amount of training data available).
Yep, there are well-established and powerful tools for various problems: linear programming, boolean satisfiability, analytical solutions, etc. Neural networks and co. are for a very specific, yet large, class of problems.
Wouldn't principal component analysis have done the same thing without the brute forcing?
The engineer wouldn't have been able to waste the week feeling fancy fiddling with GPUs in that case.
Since the variables are categorical, won't PCA have to be modified to use it effectively? To my knowledge, PCA can only be used for continuous variables.
Yes. I think so. Or maybe some correlation plots. I doubt that a proper EDA was performed pre-modelling.
Have you tried training the network using dropout?
Yes, dropout and other regularization techniques were used.