Hacker News
by JPKab 4520 days ago
Just a bit of a counterpoint (taken from a comment on the Data Tau site):

"Data kiddies like me are coming. I just ran multiple passes of the Broyden–Fletcher–Goldfarb–Shanno algorithm with a 100-layer neural network on a tfidf-vectorized dataset. I have no clue what that all exactly means, all I know is that it took under an hour and it gives a higher (top 10%) AUC score. Kaggler amateurs are beating the academics by brute force or smarter use of the many tools that are currently freely available. Show a regular Python dev some examples and library docs and she can compete in ML competitions. I was getting good results with LibSVM before I even understood how SVM's work on the surface. Feed the correct input format and some parameters and you are good to go. Random Forests can be applied to nearly anything and get you 75%+ accuracy. Maybe I am just a engineer looking for pragmatic and practical use of techniques from ML and data science. Hard data scientists will be the statisticians, the algorithmic theory experts, the experimental physicists. It takes me 7 years to understand a complex mathematical paper. It takes me 7 minutes to train a model and predict a 1 million test set with Vowpal Wabbit."

The point is that a Data Scientist is really a blend of statistician and software engineer. Sure, there are brilliant people who will invent new ML algorithms, but you don't need to invent that stuff to be of tremendous value to a business that has data it isn't currently getting much value out of. A software engineer at a small business doesn't need to write a database either; she just needs to be able to implement one somebody else wrote to add tremendous value.

2 comments

Sure, most anyone can throw some data into an SVM and get a result out, maybe even a good one. The problem comes when someone like this has to answer questions beyond a simple 90% accuracy rate. What does the computed separation direction tell me? Could I improve accuracy by using some a priori information like how often one class occurs in relation to the other? What 10% of the population am I failing on? Is it an important part? Is there some easy way I could do better? Is my data so high dimensional that I'm getting some trivial separation and not anything driven by the data itself?
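
To make the "simple 90% accuracy" worry concrete, here is a minimal sketch with made-up labels (90 of one class, 10 of the other; these numbers are an assumption for illustration, not from any real data set). A "model" that ignores its input entirely still reports 90% accuracy, and the 10% it fails on is exactly the minority class:

```python
# Hypothetical labels: 90 negatives, 10 positives (made-up data for illustration).
y_true = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

def accuracy(truth, pred):
    # Fraction of all samples that were predicted correctly.
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def recall(truth, pred, cls):
    # Fraction of the true members of `cls` that were predicted as `cls`.
    hits = sum(1 for t, p in zip(truth, pred) if t == cls and p == cls)
    total = sum(1 for t in truth if t == cls)
    return hits / total

acc = accuracy(y_true, y_pred)
rec_neg = recall(y_true, y_pred, 0)
rec_pos = recall(y_true, y_pred, 1)
print(acc, rec_neg, rec_pos)
```

The overall accuracy looks fine, while the recall on the minority class is zero. Knowing how often each class occurs is what tells you whether a headline number means anything.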

And what happens when this person gets a new data set and suddenly gets garbage out of some standard SVM? Is it just that the data isn't well separated by a linear model, so that throwing some simple kernel at the SVM will do the trick?
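
A rough sketch of the linear-versus-kernel question, with loud caveats: this is not an SVM. It uses a plain perceptron as a stand-in linear classifier, a made-up data set of two concentric rings, and an explicit feature (the squared radius) in place of a real kernel. All names and numbers here are assumptions for illustration. The point is only that the same linear method fails on the raw coordinates and succeeds once a nonlinear feature is added:

```python
import math
import random

random.seed(0)

def ring(radius, n, label):
    # n points at random angles on a circle of the given radius, all with one label.
    pts = []
    for _ in range(n):
        a = random.uniform(0.0, 2.0 * math.pi)
        pts.append(((radius * math.cos(a), radius * math.sin(a)), label))
    return pts

# Inner ring is class -1, outer ring is class +1: no straight line separates them.
data = ring(1.0, 100, -1) + ring(3.0, 100, +1)

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_perceptron(samples, epochs=1000):
    # Plain perceptron with a bias term; stops early after a mistake-free epoch.
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * score(w, b, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def accuracy(samples, w, b):
    return sum(1 for x, y in samples if y * score(w, b, x) > 0) / len(samples)

# Same points, plus one nonlinear feature: the squared distance from the origin.
lifted = [((x[0], x[1], x[0] ** 2 + x[1] ** 2), y) for x, y in data]

acc_raw = accuracy(data, *train_perceptron(data))
acc_lifted = accuracy(lifted, *train_perceptron(lifted))
print(acc_raw, acc_lifted)
```

On the raw coordinates no line can do much better than roughly 70% on these rings, while the lifted version separates them perfectly. Knowing why that happens is what lets you pick the right kernel instead of trying them at random.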

Even something as simple as taking a mean can fall apart when you are dealing with data that doesn't live in a Euclidean space, let alone something like PCA or SVM, which also make assumptions of linearity.
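
A small illustration of the mean falling apart, assuming the data are compass headings (angles live on a circle, not a line). The three headings below are made up for the example. The arithmetic mean lands far from all of them, while averaging the corresponding unit vectors gives a sensible direction:

```python
import math

# Hypothetical compass headings in degrees, all pointing roughly north.
angles_deg = [350.0, 10.0, 30.0]

# Treating the angles as ordinary numbers ignores the wrap-around at 360.
naive_mean = sum(angles_deg) / len(angles_deg)

def circular_mean_deg(degs):
    # Average the unit vectors, then convert the resulting direction back to an angle.
    s = sum(math.sin(math.radians(d)) for d in degs)
    c = sum(math.cos(math.radians(d)) for d in degs)
    return math.degrees(math.atan2(s, c)) % 360.0

circ_mean = circular_mean_deg(angles_deg)
print(naive_mean, circ_mean)
```

The naive mean is 130 degrees (roughly southeast), while the circular mean comes out at about 10 degrees, which is where the headings actually cluster.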

The point is, it isn't just about being able to invent new methods. Things like SVM make assumptions about your data and applying them in cases when these assumptions don't hold can give completely worthless information, even if it looks good on the surface. Using something you don't understand, even if it is at a (much) more basic level than someone with a PhD in statistics, is just asking for trouble.
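
Here is a sketch of "looks good on the surface" in the high-dimensional case raised earlier, again with a plain perceptron standing in for a linear classifier (an assumption; it is not an SVM) and purely synthetic data. The features are Gaussian noise and the labels are coin flips, so there is nothing to learn, yet with 200 dimensions and only 20 samples a perfect separation of the training set is still found:

```python
import random

random.seed(0)

def random_dataset(n, d):
    # Gaussian noise features with coin-flip labels: nothing here is learnable.
    return [([random.gauss(0.0, 1.0) for _ in range(d)], random.choice([-1, 1]))
            for _ in range(n)]

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_perceptron(samples, epochs=1000):
    # Plain perceptron with a bias term; stops early after a mistake-free epoch.
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if y * score(w, b, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def accuracy(samples, w, b):
    return sum(1 for x, y in samples if y * score(w, b, x) > 0) / len(samples)

train = random_dataset(20, 200)   # far more dimensions than samples
test = random_dataset(200, 200)   # fresh noise from the same process

w, b = train_perceptron(train)
acc_train = accuracy(train, w, b)
acc_test = accuracy(test, w, b)
print(acc_train, acc_test)
```

Training accuracy is perfect and accuracy on fresh data hovers around chance. Without held-out data, or some idea of why this happens, the first number is all you would ever see.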

I absolutely agree, and in that sense I'm sort of living the dream already. But even skilled people don't always know what they don't know, and that can show through somewhat more easily in this field.