Hacker News
by asknemo 5217 days ago
This happens all the time in machine learning applications, and in many other engineering disciplines, I dare say. If theorems and laws never needed a tweak here and there in the real world, what would we need hackers and engineers for?
5 comments

Often the tweaks are then used to inspire more solid mathematical footing. An interesting example of this is going on with the recent surge of interest in neural networks and deep learning at machine learning conferences. What used to be hacks and heuristics are being given a more rigorous narrative. Of course, as soon as we have a better model for neural networks, someone immediately finds a non-rigorous tweak to improve its performance. And the cycle goes on...
A good example is regularization. You have nice proofs saying that your classifier is optimal, then you tack on a regularization term to it, which breaks your optimality proof but improves your classification accuracy. It seems unexpected, but it's not really all that surprising when you get down to the details of it.
Oops hit the down-arrow without intending to, my bad, hope someone will fix that.

There is nothing tacked on about a regularizer though; it is very sound even in theory. There are several ways to look at it. One way is to see it as a natural consequence of Bayes' law: it is just the (negative) log of the prior probability. There are certain things we know or assume about the model even before looking at the data. For example, we expect the predictions to have a certain smoothness. All this knowledge can be incorporated into the prior model, and that is what the regularizer is. Another way is to look at it from the stability of the parameter estimates. I find the former more convincing.
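
To spell out the Bayes' law view, here is a short derivation sketch. The symbols are my own notation, not anything from the thread: theta is the parameter vector, D the data, and the zero-mean Gaussian prior with variance sigma^2 is just one assumed choice of prior.

```latex
\begin{align*}
\hat{\theta}_{\mathrm{MAP}}
  &= \arg\max_{\theta}\; p(\theta \mid D)
   = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta) \\
  &= \arg\min_{\theta}\; \bigl[ -\log p(D \mid \theta) \;-\; \log p(\theta) \bigr].
\end{align*}
% Assumed prior: \theta \sim \mathcal{N}(0, \sigma^2 I), so that
\begin{align*}
-\log p(\theta) &= \frac{1}{2\sigma^2}\,\lVert \theta \rVert_2^2 + \text{const},
\end{align*}
% which is an L2 penalty \lambda \lVert \theta \rVert_2^2 with \lambda = 1/(2\sigma^2).
```

The first term is the usual negative log-likelihood and the second is the regularizer. Under this assumed prior, a tighter prior (smaller sigma) means a larger penalty weight.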

Absolutely, there's a pretty clear mathematical justification for regularization. However, it is very literally tacked on at the end. Take logistic regression: if you minimize the cost function without regularization, you get a maximum-likelihood estimate (MLE) of the regression parameters. But what we do is add a regularization term to that cost function. Minimizing the new cost function will no longer give an MLE solution, but it will (likely) give a better solution. It all comes down to understanding that the optimality of the MLE is an asymptotic result, so it promises little on a finite dataset. The same goes for covariance matrix estimates, where you have regularization procedures that are guaranteed to never be worse than the plain MLE solution.
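
A minimal sketch of the logistic regression case, in plain Python. Everything here is assumed for illustration: the toy dataset, the helper name `fit_logistic`, the penalty weight `lam`, and the step size. It fits the same model with and without an L2 term on a tiny, perfectly separable sample, which is one situation where the finite-sample MLE behaves badly.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lam=0.0, lr=0.5, steps=5000):
    """Gradient descent on mean negative log-likelihood + (lam / 2) * w**2.

    lam = 0 is the plain maximum-likelihood fit; lam > 0 tacks an L2 penalty
    onto the slope w (the intercept b is left unpenalized).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log-loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * (gw / n + lam * w)
        b -= lr * (gb / n)
    return w, b

# Tiny, perfectly separable toy data (made up for illustration).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w_mle, b_mle = fit_logistic(xs, ys, lam=0.0)
w_reg, b_reg = fit_logistic(xs, ys, lam=0.1)
print("unregularized slope:", w_mle)
print("regularized slope:  ", w_reg)
```

On separable data like this the likelihood has no finite maximizer, so the unregularized slope just keeps growing the longer you run it, while the penalized fit settles at a finite value. Whether that finite value is actually "better" still has to be checked on held-out data.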
As an engineer, you should also be aware of when discovering the basis of the tweak is crucial. Discovering that tweaking the beam bending equations gives a much better fit to your test results on the beams you would like to use for your building is one example.

In some cases, these tweaks provide better results for a small range of conditions. That small range may be big enough for you (given your task at hand), but without understanding the tweak, you can't actually know. So care must be taken.

This kind of thing is awesome in a way. I get the sense that machine learning really feeds on people attacking problems from both ends: the elegant probabilistic side and the practical optimisation hacks inspire each other.

There are some potential downsides to a hack that isn't backed by any theory, though. They show why it might be worth doing some theory after spotting one of these hacks, from a practical and not just an aesthetic perspective:

- It may have an impact on the convergence properties and numerical stability of any optimisation algorithm you're using to fit the model: convergence speed, quality of the local optima attained, whether it even converges to a local optimum of your cost function at all, whether there are any guarantees that it doesn't sometimes blow up numerically in a horrible way...

- In general it may be brittle, with the circumstances under which it works well poorly understood. Will it break as your dataset grows? Will it work on slightly different kinds of datasets?

- Too many arbitrary parameters to tweak can be expensive unless you have a smart way to optimise them (smarter than grid search + cross-validation)

- Maintainability. It can be frustrating trying to re-use work when people have been less than completely honest in documenting things like "this term/factor/constant was pulled out of my arse and seems to work well on this one dataset, caveat emptor".

One should often consider applying tweaks to a final system. If there's an obvious place to introduce a free parameter, it seems silly not to do so and cross-validate the parameter against application performance.
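
As a rough sketch of what cross-validating one free parameter looks like, here is a plain-Python grid search over a single penalty weight. The model (a ridge-penalized line through the origin), the synthetic data, the grid values, and the helper names `fit_ridge_slope`, `cv_error` and `choose_lambda` are all assumptions made up for this example; held-out squared error stands in for "application performance".

```python
import random

def fit_ridge_slope(xs, ys, lam):
    # Closed-form minimizer of sum((y - w*x)**2) + lam * w**2
    # for a line through the origin.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    """Mean squared held-out error over k contiguous folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        w = fit_ridge_slope(train_x, train_y, lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in test_idx)
    return total / n

def choose_lambda(xs, ys, grid, k=5):
    # Plain grid search: score every candidate, keep the lowest CV error.
    scores = {lam: cv_error(xs, ys, lam, k) for lam in grid}
    return min(scores, key=scores.get), scores

# Synthetic data (made up): y = 2x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam, scores = choose_lambda(xs, ys, grid)
print("chosen lambda:", best_lam)
```

With one parameter this is cheap. The cost grows multiplicatively with every extra parameter you add to the grid, which is where it starts to get out of hand.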

Things get out of hand if there are many such possible tweaks, or if multiple components are combined, each with interacting tweaks. Then some principles behind the tweaks need identifying, or at least a differentiable cost function to target.