Hacker News
by cyocum 1475 days ago
I occasionally work with data in the Humanities. The data here is often very, very small. I talk to other Humanities researchers and I often find that they really want to get on the ML bandwagon but they do not realize the sheer amount of data that they need to make ML as practiced today work. I have not looked into small dataset techniques in a long time (I have a day job so I do not get much chance to do this often) but I hope that one day we can find a technique that will work.

One side note, when I speak to other Humanities researchers about this, I always tell them that I have yet to find a technique that will give them novel insights. These techniques almost always tell the researchers things that they already know. I usually follow this up with a note that even formalizing Humanities knowledge in statistical or other computational terms is highly valuable and worth doing. Maybe someone else can take that formalism and build on top of it something truly new.

2 comments

> These techniques almost always tell the researchers things that they already know.

Yes, but sometimes in surprising ways.

I built a simple decision-tree model for a medical study, looking at outcomes for acute pneumonia. Went with a single tree over a forest because the model had to be interpretable. Statistically it was almost as good as the forest; I built it using fields with high feature importance values. Thus there is a chance that any 'improvement' by the forest was overfitting. But I digress.

The tree said that blood CO2 levels were the most important factor. The doctors weren't surprised by this (though they had some internal debate if this was more or less important than some other factors). What did surprise them was the cutoff level.

They said they would be concerned if CO2 was above 7. My model had the cutoff at 9.5. Sorry, I forget the units.

Point is, it confirmed what they knew (CO2 levels matter when assessing lung function), but still surprised them (CO2 levels have to be much higher than normal before this becomes discriminant over other factors, such as age).
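To make the "cutoff" idea concrete, here is a minimal sketch of how a single tree split picks a threshold on one feature by minimizing Gini impurity. Everything in it is an assumption for illustration: the readings and outcomes are invented (not the study's data), and `best_cutoff` is a hypothetical helper, not what any particular library does internally.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_cutoff(values, labels):
    """Return (cutoff, weighted_impurity) for the best single split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between identical values
        cutoff = (lo + hi) / 2.0  # candidate: midpoint between neighbours
        left = [y for x, y in pairs if x <= cutoff]
        right = [y for x, y in pairs if x > cutoff]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (cutoff, score)
    return best


# Invented readings (arbitrary units) and outcomes (1 = poor outcome).
co2 = [5.0, 6.0, 6.5, 7.5, 8.0, 9.0, 10.0, 11.0, 12.0]
poor_outcome = [0, 0, 0, 0, 0, 0, 1, 1, 1]

cutoff, impurity = best_cutoff(co2, poor_outcome)
print(f"learned cutoff: {cutoff}, weighted Gini impurity: {impurity}")
```

The point of the sketch is only that the threshold comes out of the data, so it can land somewhere other than where domain experts would have drawn the line.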

What is even small dataset machine learning and how is it different from statistical forecasting?
You can fit a linear regression with just a few points; technically, if you have one more data point than regressors, it works. Because you've assumed a linear relationship with normally distributed errors. And you can interpret the output, because the values of the regression coefficients tell you something: "having a high score on X doubles the odds of outcome Y", for example.
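As a rough illustration of fitting with very few points, the sketch below runs ordinary least squares for one regressor plus an intercept on three points. The data are invented and `fit_line` is a hypothetical helper written out by hand; with three points and two fitted parameters there is a single residual degree of freedom left.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Three invented points: enough to fit a line and still have residuals.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# The slope is directly readable: the expected change in y per unit of x.
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

The fit works, but with so little data the coefficient estimates are very uncertain, which is where the distributional assumptions do the heavy lifting.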

Also, because you've assumed a structure to the data, you can more easily test if the data has deviated from that structure. This can be data drift or single outliers. For example, GARCH models (a type of regression) allow the normal distribution of the error to have a varying variance, so you can detect different variance regimes.
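A minimal sketch of the varying-variance idea, using the standard GARCH(1,1) recursion for the conditional variance. The parameter values (`omega`, `alpha`, `beta`) and the residual series are assumptions picked for illustration; in real use the parameters would be estimated from data (typically by maximum likelihood) rather than fixed by hand.

```python
def garch_variance(residuals, omega, alpha, beta):
    """Conditional variance path of a GARCH(1,1):
    s2[t] = omega + alpha * e[t-1]**2 + beta * s2[t-1].
    """
    # Start from the unconditional variance (needs alpha + beta < 1).
    s2 = [omega / (1.0 - alpha - beta)]
    for e in residuals[:-1]:
        s2.append(omega + alpha * e * e + beta * s2[-1])
    return s2


# Invented residuals: a calm stretch followed by a turbulent one.
residuals = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]

# Hypothetical parameters, not estimated from anything.
s2 = garch_variance(residuals, omega=0.01, alpha=0.2, beta=0.7)

for t, (e, v) in enumerate(zip(residuals, s2)):
    print(f"t={t}  residual={e:+.1f}  conditional variance={v:.3f}")
```

The conditional variance stays low through the calm stretch and rises once the large residuals arrive (with a one-step lag), which is the regime shift the model makes visible.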

In short, they help a human understand and interpret data.

From what little I know, ML is not so good at that. But it has other advantages, and you don't always need or want the understanding. If you want to, e.g., detect ground cover in satellite images, then all you care about is valid outputs, not necessarily the importance of near-infrared vs red band.

And ML (can) beat regression models by providing better interpolation, by better handling regions of the data space which violate the assumptions of the regression model, etc.

So it is a tradeoff. Both approaches are highly performant, just at different tasks.

An example I have used a few times to highlight the difference between "understanding how a model gives predictions" and "model uses predictors/features in a way that makes sense" is a linear regression model from Bondarchuk, a famous throwing coach of the former USSR. The linear regression was used to predict the distance expected in one of the throws, given the performances in other athletic tasks, say max squat, max power clean, max long jump and a few others.

Some of the regression coefficients were negative, which means that increasing performance in, say, long jump, leads to shorter shot-put (I don't remember which throw the model was for) distances.

The model and approach looked understandable and "weird" at the same time. From a purely statistical perspective it makes sense, since those were the results coming from, I assume, maximum likelihood estimation. From a predictive performance perspective, retrospectively it surely worked because it gave good predictions on past data, assuming there were training and test data (most likely there were not, but let's assume).

But from a future prediction perspective, i.e. the forecasting and thus the manipulation of training to obtain a certain performance, did it make sense? I am very confident it did not, because, among other things, the performances of auxiliary lifts/feats were not independent (you cannot work on a heavier one rep max in the power clean and hope or work toward a shorter long jump performance).

The model by itself might have been accurate, but considering it interpretable, and thus using it to guide changes in the training program, would have been quite a naive mistake. This kinda mistake is quite common among many who think too much about the machinery of the model and way too little about the domain.
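A small sketch of how a negative coefficient can show up without meaning "more of this hurts". All of it is assumed for illustration: the numbers are invented, the latent variables `strength` and `quirk` are hypothetical, and this is not the coach's model or data. Throw distance here depends only on strength; squat and long jump both rise with strength and share a second factor unrelated to throwing.

```python
def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_two_predictors(x1, x2, y):
    """OLS slopes for y ~ x1 + x2 (with intercept), via the 2x2 normal
    equations on centered data."""
    a, b, t = center(x1), center(x2), center(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, t), dot(b, t)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Hypothetical latent factors (invented numbers).
strength = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
quirk = [0.3, -0.2, 0.1, -0.1, 0.2, -0.3]  # affects the tests, not the throw

squat = [s + q for s, q in zip(strength, quirk)]
jump = [0.5 * s + q for s, q in zip(strength, quirk)]
throw = list(strength)  # throw depends on strength only

b_squat, b_jump = fit_two_predictors(squat, jump, throw)
jump_throw_cov = dot(center(jump), center(throw))

print(f"coefficient on squat: {b_squat:+.2f}")
print(f"coefficient on jump:  {b_jump:+.2f}")
print(f"jump/throw covariance (unnormalised): {jump_throw_cov:+.2f}")
```

In this toy setup better jumpers throw farther on average, yet the regression assigns long jump a negative coefficient, because the model uses jump to subtract out the shared factor from squat. Reading that sign as training advice would be exactly the mistake described above.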

> But from a future prediction perspective, i.e. the forecasting and thus the manipulation of training to obtain a certain performance, did it make sense?

Those are two different questions. For forecasting without manipulation of training it would still make sense. But it wouldn’t make sense for causal analysis.

That is a true difference and I should have been clearer. I should have specified that I don't believe they used any test data; the "study" had probably been done with no test data set, simply using all data for the estimation of regression parameters.

I believe that (1) the model made little "mechanistic" sense, (2) the forecasting accuracy of the model would have been low, (3) the model had good hind-casting accuracy through overfitting by modeling the "noise" in the data with too many predictors, and (4) the model could not have guided any training.
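Points (2) and (3) can be sketched with a toy comparison of hind-casting versus forecasting error. This is an illustration under stated assumptions only: the data are invented, and a polynomial with as many coefficients as training points stands in for "a regression with too many predictors". `fit_line`, `lagrange_predict` and `mean_abs_error` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lagrange_predict(xs, ys, x_new):
    """Evaluate the unique degree-(n-1) polynomial through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for k, xk in enumerate(xs):
            if k != i:
                weight *= (x_new - xk) / (xi - xk)
        total += yi * weight
    return total


def mean_abs_error(predict, xs, ys):
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Invented data: true relation y = x, with alternating +/-0.5 "noise"
# on the training points and noise-free held-out points.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]
test_x, test_y = [6.0, 7.0], [6.0, 7.0]

intercept, slope = fit_line(train_x, train_y)


def line(x):
    return intercept + slope * x


def poly(x):
    return lagrange_predict(train_x, train_y, x)


line_train = mean_abs_error(line, train_x, train_y)
poly_train = mean_abs_error(poly, train_x, train_y)
line_test = mean_abs_error(line, test_x, test_y)
poly_test = mean_abs_error(poly, test_x, test_y)

print(f"2-parameter line: train error {line_train:.3f}, test error {line_test:.3f}")
print(f"6-parameter poly: train error {poly_train:.3f}, test error {poly_test:.3f}")
```

The flexible model reproduces the past perfectly because it has fitted the noise, and then falls apart on new points, while the simple line has worse hind-casting error but far better forecasts. Without a held-out set, only the first half of that picture is visible.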

> Because you've assumed a linear relationship with normally distributed errors.

Normality will give you stronger results, but generally linear regression doesn’t require normality. For OLS to be the best linear unbiased estimator, you only need the errors to have zero mean, constant variance, and no correlation with each other (the Gauss-Markov conditions).

> how is it different from statistical forecasting

In statistical learning/forecasting, the researcher typically specifies the statistical model.

In machine learning, the statistical model is approximated by the algorithm.

Since an ML model needs to learn both the model form and the model parameters, it takes more data, and it also does not allow for understanding (since it does not output the form of the model it learned).