by jj12345 3079 days ago
Maybe an expert will chime in here, but I recall learning that several concepts in ML are derived from, or were developed in parallel to, the BIC. Many people use AIC/BIC measures for feature selection.

On a separate note, AIC/BIC are prevalent in linear regression models because they are extremely well-behaved there compared with some other model classes. This has generated an enormous amount of field-specific empirical literature. For instance, economists can form expectations for AIC/BIC values depending on the type of study.
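To make the linear-regression case concrete, here is a minimal sketch of comparing two candidate models by AIC/BIC. It assumes Gaussian errors, so the maximized log-likelihood can be written in terms of the residual sum of squares; the data and the helper names (`gaussian_ic`, `fit_line`) are made up for illustration.

```python
import math

def gaussian_ic(y, fitted, n_coef):
    """Return (AIC, BIC) for a least-squares fit, assuming Gaussian errors."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    # Maximized Gaussian log-likelihood, with the variance estimated as rss / n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coef + 1  # regression coefficients plus the noise variance
    return 2 * k - 2 * log_lik, k * math.log(n) - 2 * log_lik

def fit_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data: y is roughly 2x plus small noise.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8]

# Candidate 0: intercept only. Candidate 1: intercept plus the feature x.
mean_y = sum(y) / len(y)
aic0, bic0 = gaussian_ic(y, [mean_y] * len(y), n_coef=1)
intercept, slope = fit_line(x, y)
aic1, bic1 = gaussian_ic(y, [intercept + slope * xi for xi in x], n_coef=2)

print("intercept only:", aic0, bic0)
print("with feature x:", aic1, bic1)
```

The lower score wins, so here the model that includes x should be preferred by both criteria. Feature selection over more candidates would repeat the same comparison across subsets.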

Additionally, we already have some decent tools for regularization (albeit model-dependent ones). Counting the nodes of a NN to establish an AIC may not lead to a proper understanding of the free parameters.

This wasn't meant to be an apologist's take on why I'm not using AIC/BIC for model selection. If we're talking about "deep learning", then it's definitely a worthwhile goal to examine how to optimize network topologies. The link below mentions using reinforcement learning to learn architecture.

https://research.googleblog.com/2017/05/using-machine-learni...

1 comment

It looks like some work has been done on using AIC for neural networks:

- https://www.sciencedirect.com/science/article/pii/B978044489...

- https://www.sciencedirect.com/science/article/pii/S095219760...

- https://waseda.pure.elsevier.com/en/publications/network-inf...

In principle it would be straightforward, right? AIC = 2k - 2ln(L). So set k = # weights + # biases, and use the log-likelihood as the objective function so you can just read off L from there.
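A rough sketch of that calculation, assuming a plain fully connected network and a loss that is the negative log-likelihood with no regularization term added. The layer sizes and loss value below are made up, and `count_parameters` and `aic` are hypothetical helper names. One thing to watch: most frameworks report the mean loss per example, so it has to be multiplied back up by the number of training examples to get -ln(L).

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def aic(total_nll, k):
    """AIC = 2k - 2 ln(L), where total_nll is -ln(L) summed over the training set."""
    return 2 * k + 2 * total_nll

# Hypothetical network: 4 inputs, one hidden layer of 8 units, 3 outputs.
layer_sizes = [4, 8, 3]
k = count_parameters(layer_sizes)  # (4*8 + 8) + (8*3 + 3) = 67

# Hypothetical training result: mean negative log-likelihood per example.
mean_nll = 0.35
n_examples = 1000
total_nll = mean_nll * n_examples  # convert the mean back to a sum

print("k =", k)
print("AIC =", aic(total_nll, k))
```

Whether k counted this way reflects the effective number of free parameters is exactly the concern raised in the parent comment, so treat the result as a naive baseline.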

I wonder if the reason why AIC is unpopular is that it's harder to explain to your boss than accuracy, precision, recall, or even proper scoring. This is perhaps more true now that statistical literacy in management is increasing -- the notion that you can't use training data to estimate performance is becoming popular. Now here comes a magic formula, calculated on the training set, that supposedly tells me how well the model will perform... that's not gonna fly in a lot of settings.

There's also the question of whether it even tells us what we want to know. AIC is an estimate of the Kullback-Leibler information of a probability model, under somewhat restrictive conditions [0]. The question of "why don't we use AIC?" might be the same as the question of "why don't we use proper scoring rules?" -- people want to know accuracy, so they just go ahead and estimate accuracy by brute-force resampling. I'm not saying it's right, but until people see tangible value in thinking about their models from a probabilistic perspective, they won't be motivated to do so.

[0]: http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf