Hacker News
by mlthoughts2018 2672 days ago
This sounds like parody to me. There are so many problems in applied statistics, and neural networks are not helpful for most of them. Consider Bayesian analysis for very small data sets as an example (just the tip of the iceberg).

In financial services in particular, there are tons of time series and regression problems on small data such that a neural network (beyond perhaps some super small MLP) would be a ridiculous thing to try.

I think the breakdown of workload you described will only happen in business departments where there is a need for large scale embedding models, enhanced multi-modal search indices, computer vision and natural language applications, and maybe a handful of things that eventually productize reinforcement learning. I could also see this happening in businesses that can benefit from synthetically generated content, like stock photography, essays / news summaries / some fiction, website generators, probably more.

What I described above is a tiny drop in the ocean of applied statistics problems that businesses have to solve.


It's another example of the FAANG + Bay Area Startups world versus the other 99% of Corporate America. In the latter world, most of the "machine learning" in production is traditional stuff like Random Forest, SVM, and more recently Gradient Boosting. Hell, Marketing departments across the country are still running old school decision tree (CART and CHAID) models and logistic regression models written in SAS 20+ years ago. DL/NN is a minuscule proportion of production ML in the enterprise space.
I think there is good reason that "old" machine learning models are more popular than DNNs in the enterprise space. Most of the data is in tabular format. What is more, "old" and simple models such as decision trees or linear models are very easy to understand and deploy, and they are fast. There is certainly a clear advantage to having even a simple decision tree implemented in the system over making decisions at random.
The main reason though is that these other methods outperform neural nets in tons of different situations. Even just from an accuracy / business success metric point of view, many problems are just better solved with other classes of models, domain-specific feature engineering, etc. It will probably remain so for many decades at least.
DNNs make good features though, especially if you have time series data or lots of text.

I agree that the final model should be a randomforest/xgboost/lightgbm for typical tabular data.

I meant that extracting an intermediate layer as a feature embedding and then sticking a classical model on top of it performs worse than curating features through domain-specific expert tuning, for a ton of diverse application domains.
Deep Learning also works on very small data sets by means of embeddings. A large model trained on large data sets can be used as a feature extraction tool when training on small data sets.
Re-using an existing model to generate embeddings doesn’t work well for auxiliary tasks with very small data. Even if you do no fine-tuning at all, you need to have big data sets in terms of the auxiliary task too.

For example, consider needing to train hundreds of unique small models every day, based on new customer inputs affecting causality effects for that day (I had to do this for ad forecasting in a past job).

Generating embeddings via pre-trained models essentially produced gibberish and performed far worse than custom feature engineering + simple logistic models.

I’ve seen this mentioned before, including a blog post by the fast.ai folks. Any idea where I can get details? If my tabular data set is small, what kind of embedding can I get out of it? Or is the idea that a larger data set is used for embeddings of categorical data?
Pre-trained embeddings are only helpful if they are trained on a different (ideally larger) dataset or even a different task, but with the same kind of input data. So you would need to find out where else something similar to the data in your tables appears. If some of the data is text, word embeddings may be applicable. Or if you're trying to analyze user activity by time and location, you might try to transfer what can be learned about the influence of holidays from publicly observable activity e.g. on Twitter (just a random idea that popped into my head, no guarantee that it can actually work).

Of course if all you have are numbers without context, there isn't a lot you can do to improve the situation.
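To make the word-embedding idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `PRETRAINED` vectors are made up (a real setup would load vectors trained on a much larger corpus), the labeled rows are toy data, and `embed`, `fit_centroids` and `predict` are hypothetical helper names. It only shows the mechanics of freezing the embeddings and fitting a very simple model on a handful of labeled rows. Whether this beats hand-built features on a given small data set is exactly what is disputed elsewhere in this thread.

```python
# Hypothetical "pre-trained" word vectors, made up for illustration.
PRETRAINED = {
    "refund": [0.9, 0.1],
    "broken": [0.8, 0.2],
    "late": [0.7, 0.1],
    "great": [0.1, 0.9],
    "fast": [0.2, 0.8],
    "thanks": [0.1, 0.7],
}
DIM = 2


def embed(text):
    """Average the frozen vectors of known words; unknown words are skipped."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(DIM)]


def fit_centroids(rows):
    """Nearest-centroid classifier: one mean feature vector per label."""
    grouped = {}
    for text, label in rows:
        grouped.setdefault(label, []).append(embed(text))
    return {
        label: [sum(v[d] for v in vecs) / len(vecs) for d in range(DIM)]
        for label, vecs in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest to the embedded text."""
    x = embed(text)

    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))

    return min(centroids, key=sq_dist)


# Toy labeled set standing in for the "small" task-specific data.
small_labeled_set = [
    ("refund broken item", "complaint"),
    ("late delivery", "complaint"),
    ("great service thanks", "praise"),
    ("fast shipping", "praise"),
]
centroids = fit_centroids(small_labeled_set)
```

Calling `predict(centroids, "broken and late")` embeds the new text with the same frozen vectors and picks the nearest label. The embedding step never sees the small data set, which is the point of the approach.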

I think this is mainly a thing for perception (images and sounds). Tabular data would have to match up with the training dataset, and "most" interesting tabular models are the sorts of things guarded like piles of gold by the businesses that build them...
The parent did not specifically talk about NNs. As I understand it, AutoML could apply to all statistical endeavours that involve estimation (classical or Bayesian).
> “AutoML could apply to all statistical endeavours that involve estimation”

Yes, this is the part that sounds like parody to me. At least, as a working statistician, I can tell you that the concept of AutoML could not apply to the vast majority of things I work on.

Could you give an example? I have a hard time understanding what you could mean, as Algorithm Configuration & Selection is such a general framework. If you are solely talking about the current state of the art, I would agree that techniques from AutoML do not have the generality and autonomy of an expert human.
For example, look into Chapter 5 on logistic regression from the Gelman & Hill book on hierarchical models & regression.

It walks through an example with arsenic data in wells and a problem of estimating how distance, education and some other factors relate to a person’s willingness to travel to a clean well for water.

The work there consists of deciding how to standardize the input features, how to rescale so that regression coefficients are interpretable in meaningful human units, how to interpret statistics of the fitted model to decide whether adding a feature is helping or hurting (since this cannot be deduced from raw accuracy metrics alone), how to interpret deviance residual plots for outlier analysis, etc.
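As a small illustration of the rescaling point, here is a sketch in Python. The intercept and coefficient are made-up values for illustration, not the fitted values from the book, and `inv_logit`, `p_switch_per_meter` and `p_switch_per_100m` are hypothetical names. It shows that expressing distance in 100-meter units multiplies the coefficient by 100 without changing any prediction, and that a logistic coefficient divided by 4 bounds the change in probability per unit of input.

```python
import math


def inv_logit(z):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up coefficients for illustration only (not taken from the book).
INTERCEPT = 0.6
BETA_PER_METER = -0.006


def p_switch_per_meter(dist_m):
    """Probability of switching wells, with distance in meters."""
    return inv_logit(INTERCEPT + BETA_PER_METER * dist_m)


# Same model with distance in units of 100 meters: the coefficient
# is multiplied by 100 and the predictions are unchanged.
BETA_PER_100M = BETA_PER_METER * 100


def p_switch_per_100m(dist_100m):
    """Probability of switching wells, with distance in 100-meter units."""
    return inv_logit(INTERCEPT + BETA_PER_100M * dist_100m)


# The logistic curve is steepest at p = 0.5, where its slope is beta / 4,
# so |beta| / 4 bounds the change in probability per unit of input.
max_change_per_100m = abs(BETA_PER_100M) / 4

if __name__ == "__main__":
    print(p_switch_per_meter(250.0), p_switch_per_100m(2.5))
    print(max_change_per_100m)
```

The two functions describe the same fitted model. Only the second has a coefficient a person can read off directly: each extra 100 meters lowers the probability of switching by at most `max_change_per_100m`. Choosing that unit is a judgment about what is meaningful to the reader, not something an accuracy metric can pick.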

None of those things has anything to do with changing the architecture of the model, except possibly including or excluding features. In that example there were no hyperparameters to tune, and tuning hyperparameters against raw accuracy would not make sense for the inference problem anyway: the goal was not optimizing prediction but understanding the impact of features that have semantic meaning in the context of possible policy choices that could be adopted.

By way of contrast, applying an automated subset selection algorithm to automatically choose the features would be a naive idea with likely bad results in that case, and setting up an optimization framework that would optimize over possible transformations or standardizations of the inputs seems equally dubious compared with expert, context-aware human judgment.

And this is a very trivial example. If you modify a problem like this to address causal inference goals, or add some type of cost optimization on top of it, it becomes more and more complex, but exactly in a way that a tool like AutoML can’t help with.

In other words, making an AutoML that can truly apply to all types of estimation or inference problems is no easier than solving strong AI computer vision and natural language problems entirely, since you need contextual reasoning and creative proposals for inventing features and sleuthing the goodness of fit of a certain model architecture in light of the human-level inference goal you’re trying to reach.

So you never tune hyperparameters or try different models to see which works better?
I do plenty of that, and AutoML could help with a small fraction of that.
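For reference, the mechanical part of that loop (the algorithm configuration and selection framing mentioned upthread) can be sketched in a few lines of Python. This is a toy under stated assumptions: the data is synthetic, the only "configuration" is the `k` of a hand-written k-nearest-neighbour regressor, and `knn_predict`, `validation_mse` and `select_config` are hypothetical names, not any AutoML library's API.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def validation_mse(train, valid, k):
    """Mean squared error of a k-NN regressor on held-out points."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)


def select_config(train, valid, candidates):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: validation_mse(train, valid, k) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Toy data: a linear trend plus a deterministic alternating wiggle.
data = [(i / 10, 2.0 * (i / 10) + (0.3 if i % 2 else -0.3)) for i in range(40)]
valid = data[::3]                                      # every third point held out
train = [pt for i, pt in enumerate(data) if i % 3]     # the remaining points

best_k, scores = select_config(train, valid, candidates=[1, 3, 5])
```

The search only ranks candidates by a single held-out score. It has no notion of which features are meaningful, which units are interpretable, or what the inference goal is.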
The problem is "Applied Statistics" became "Machine Learning" which became "AI" which became "Deep Learning".

Throw away all the BS and, yes, it's obvious.