by cyocum
1475 days ago
I occasionally work with data in the Humanities. The data here is often very, very small. I talk to other Humanities researchers and I often find that they really want to get on the ML bandwagon, but they do not realize the sheer amount of data they need to make ML as practiced today work. I have not looked into small-dataset techniques in a long time (I have a day job, so I do not get much chance to do this), but I hope that one day we can find a technique that will work. One side note: when I speak to other Humanities researchers about this, I always tell them that I have yet to find a technique that will give them novel insights. These techniques almost always tell the researchers things that they already know. I usually follow this up with a note that even formalizing Humanities knowledge in statistical or other computational terms is highly valuable and worth doing. Maybe someone else can take that formalism and build something truly new on top of it.
Yes, but sometimes in surprising ways.
I built a simple decision-tree model for a medical study, looking at outcomes for acute pneumonia. I went with a single tree over a forest because the model had to be interpretable. Statistically it was almost as good as the forest; I built it using the fields with high feature-importance values. Thus there is a chance that any 'improvement' from the forest was overfitting. But I digress.
The tree said that blood CO2 levels were the most important factor. The doctors weren't surprised by this (though they had some internal debate about whether it was more or less important than some other factors). What did surprise them was the cutoff level.
They said they would be concerned if CO2 was above 7. My model had the cutoff at 9.5. Sorry, I forget the units.
Point is, it confirmed what they knew (CO2 levels matter when assessing lung function), but still surprised them (CO2 levels have to be much higher than normal before this becomes discriminant over other factors, such as age).
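For anyone wondering where a cutoff like that comes from, here is a minimal sketch of how a single tree node picks its threshold: it tries every midpoint between adjacent values and keeps the one that leaves the two sides least mixed (lowest weighted Gini impurity). This is only an illustration, not the model from the study. The numbers are made-up toy data in arbitrary units, and the names `gini`, `best_cutoff`, `co2` and `bad_outcome` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_cutoff(values, labels):
    """Return (threshold, weighted_impurity) for the best single split.

    Candidate thresholds are midpoints between adjacent distinct values,
    which is the usual approach in CART-style trees.
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical values
        threshold = (xs[i] + xs[i - 1]) / 2.0
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy data (NOT from the study): a CO2-like reading per patient,
# and 1 if the outcome was bad. One bad outcome occurs at a low reading.
co2 = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 10.0, 10.5, 11.0, 12.0]
bad_outcome = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_cutoff(co2, bad_outcome)
print(threshold, round(impurity, 4))
```

On this toy data the split lands at 9.5 even though one bad outcome occurs at 7.0. The tree is not asking "where do readings start to look worrying?" but "where does this one variable separate the outcomes best?", and those can be quite different numbers.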