by mjw 5165 days ago
As an engineer who's investing in developing "deep expertise in statistics and machine learning" I can only stand to benefit from it, but something about the current wave of Big Data hype makes me instinctively a bit wary.

Does this skills shortage really exist to the extent claimed? Are there really enough people out there who would know what to do with a 'data scientist' if they were able to hire one? I see more talk than action. I see vendors circling around looking to flog freshly-buzzword-compliant BI tools, and prognosticators trying to push nervous businesses into engaging in an arms race over data.

Of course there's real value there too, for some at least. I hope my concerns prove unfounded, but worth retaining a healthy skepticism I feel :-)


As someone on the ground in the big data field (VP of Engineering), let me give you my thoughts on it.

Your impression about the hype is correct. There are a lot of vendors offering BIG solutions, if you pay them BIGGER money. Where I used to translate the word enterprise to $$, now I translate Big Data to $$$$$$$.

When I'm hiring, I don't go looking for Big Data people, because generally they don't exist. Statistics is a really great general addition to a programmer's toolkit. Machine Learning is valuable as well, although in my experience the application is more limited. What this article doesn't mention is a whole host of other skills required.

Modeling, and not just a formal mathematical model, but applying any type of model to your data to get insight. Check out the model-thinking class on Coursera.

Exploratory Data Analysis, a much different skill than confirmatory statistics.

Design of Experiments, a specialized subfield within statistics.

Logistics, how to set up, maintain, and maximally utilize an efficient distributed cluster and build a pipeline getting your data to the cluster, cleaning it, building it into a model, and then extracting insight and delivering that end value.

Those are a couple of the skills at a high level. At a more nuts and bolts level, Hadoop is the de facto standard for Big Data. Learning how to build a data pipeline out of the Linux tool chain is very common in the data science world.

The overall value stream for Big Data is deep and wide. Most companies don't have expertise in many of these areas, and so at the current time you have to learn them yourself or find a company focused on building a team around it.

If you are just learning this yourself, you'll probably get an academic knowledge. If you want to make yourself valuable in the marketplace, you'll really want to get hands-on experience. Knowing a z-score is one thing; building a process to gather data and compute a model against it is a whole different ball game. As the article mentions, if you have nice clean data it's easy to apply a model. If you have messy ugly data from 20 different vendors and 200 clients with various failures and anomalies, and you have to figure out what type of model is helpful, oh and you have a deadline because for the 500th time someone promised something impossible to the client, then you have something closer to what Big Data is today.

* grammar edits

deadline because for 500th time someone promised something impossible to the client

This is a killer in machine learning applications. The toolsets rarely cover the entire extent of what needs to be done, so at least some custom code needs to be written. But results aren't deterministic - you don't really know if it's going to work until you run it. Several iterations are often needed to get to the first useable results. It has all the problems of building any piece of software, plus another layer of risk that the accuracy just won't be there with the first thing(s) you try.

My point is... actually agreeing to be the machine learning guy on a project totally sucks because time estimates are almost meaningless, and the modern business culture is to label anything late as a failure.

I couldn't agree more. Accuracy is a problem, variation is another problem. Dealing with layers in the business who have no math or statistics background but very strong opinions is yet another complication.

These types of conversations aren't uncommon.

Other - "I need you to prove our stuff does X, Y, and Z".

Me - "Ok.."

<time elapses>

Me - "Ok the data shows our stuff does X but Y and Z are just random noise"

Other - "We ran it once before with this other guy and it showed our stuff did X,Y and Z. We've been promising it to our clients for a year. He gave us several examples, but when the clients asked to see the underlying data he couldn't produce it. So we just need you to prove it does X,Y, and Z."

Me - "The data only shows it does X. Y and Z are impacted positively through X, but once you condition on X, Y and Z are not causally affected by our stuff"

Other - "Yeah...well I promised client we would give them a report by {{insert random ridiculous date here}} proving it did X, Y and Z. We are going to lose them if we don't deliver a report saying that"

Me - trying for the 50th time to explain they shouldn't promise a positive result when we've never looked at the data.

There are hundreds of variations on this conversation. "Your code is wrong" is one variant (which, depending on the timeline, is hard to dispute). Of course, if you take long enough that your code is correct, then you are too slow. "This isn't a science experiment, just make it work" is another. Watching someone go slack jawed and start drooling because you accidentally used a math term is always interesting.

I have a whole new perspective on being on the cutting edge. It seems like it mostly means you are on the cutting edge of comments from people who don't know how ridiculously hard what you are doing is.

Man, this is statistics. You should be able to get any result you want!

I'm only half kidding. I can remember writing my first report (project summary) when I did a contract right after grad school. I put in maybe 5 graphs. Two looked good, three looked bad. The project manager just deleted the bad looking graphs and sent it on to the client.

"Some people use statistics like a drunk uses a lamp post - for support rather than for illumination."
The company I used to work for had a performance-based product. They only got paid if they actually showed improved accuracy against a given evaluation set. Then they got a fraction of the cost savings (say 1 year's worth).

This seems like it could be a good model for machine learning consulting, and one that I would certainly be willing to explore.

It would work something like this :

  1) You show me your problem and your data.  

  2) We come to an agreement on how accuracy would translate into financial results and on a fair split of the savings or earnings.

  3) I develop a model.

  4) You evaluate it based on 2.

  5) I get paid based on 2.

If my model doesn't meet minimum performance criteria I don't get paid. If it does very well, and assuming the problem was economically interesting in the first place, you save a lot of money and I get a fair sized chunk of it.
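
To make the payout in steps 2 and 5 concrete, here is a rough sketch. Everything in it is an assumption for illustration: the function name `performance_fee`, the minimum-improvement threshold, the 25% share, and the idea that savings scale linearly with the error removed. A real agreement would define these terms itself.

```python
def performance_fee(baseline_error, model_error, savings_per_point,
                    min_improvement=0.01, share=0.25):
    """Fee under a hypothetical pay-for-performance agreement.

    baseline_error, model_error: error rates on the agreed evaluation set.
    savings_per_point: agreed first-year savings (dollars) per 1.0 of error removed.
    min_improvement, share: made-up contract terms, not taken from any real deal.
    """
    improvement = baseline_error - model_error
    if improvement < min_improvement:
        return 0.0  # minimum performance criteria not met, so no payment
    first_year_savings = improvement * savings_per_point
    return share * first_year_savings


# Example: error drops from 20% to 15%, each full point of error is worth $10M/year.
fee = performance_fee(0.20, 0.15, 10_000_000)
```

With these made-up terms the fee is a quarter of the first year's savings, and a model that improves the error by less than one percentage point earns nothing.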

Feel free to explain why this business model wouldn't work.

Edited for formatting.

Most business people aren't interested in model accuracy as a term. They want something that provides benefit, e.g. cost savings, increased revenue, increased profits, etc.

The sales process of convincing someone they need an accurate model is tough, especially because robust models are time consuming and expensive to build.

If you can come up with a model that shows good results, and people know they need those results, then you can start a company selling either a service or product to get those results. If people don't know they need your results - then you have to educate them, in which case it's a much more difficult business to start.

I don't know many business people with the temperament, understanding, or the pocket book to deal with general research type problems.

I think there will be a day when this will work. But right now my concerns would be:

1) The people looking for outside help probably don't have ANY model, so baselines are difficult. (I use synthetic baselines like just predicting the average of the predicted variable every time, and that's a very valuable tool, but I don't think you could get paid by beating them.)

2) When would you cut your losses on a failed project and move on? That would be incredibly difficult to do on a project that you had spent weeks or months on and not been paid. It's like cutting your losses on a failed trade in the stock market... it seems like it would be easy and obvious until you actually experience it yourself.

3) Once a company was in a position to take you up on your offer, they might as well post it on Kaggle and get a hundred people to work on it for peanuts. I'm really rooting for Kaggle b/c one of their long term goals is to let people like you (and me) make a living doing analytics work like you describe. But right now they just don't have the volume and all the projects pay out just a few thousand dollars (and only if you beat the other hundred participants).

4) If I was a company, and didn't have the expertise in house to build the model myself, I'd be wary I was really getting what I paid for. If I'm paying for a 10% boost in accuracy, how do I measure it rather than just taking your word for it?
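
For anyone who hasn't used one, the synthetic baseline from point 1 is easy to sketch. The numbers below are made up purely for illustration, `model_pred` stands in for the output of whatever model is being judged, and RMSE is just one possible error measure.

```python
from math import sqrt
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


def mean_baseline(train_targets, n):
    """Predict the training-set average for every one of n test rows."""
    avg = mean(train_targets)
    return [avg] * n


# Made-up data: targets seen in training, held-out targets, and model output.
train_y = [10.0, 12.0, 14.0, 16.0]
test_y = [11.0, 15.0]
model_pred = [11.5, 14.0]  # hypothetical predictions from some model

baseline_rmse = rmse(test_y, mean_baseline(train_y, len(test_y)))
model_rmse = rmse(test_y, model_pred)
print(f"baseline RMSE = {baseline_rmse:.3f}, model RMSE = {model_rmse:.3f}")
```

Scoring both on held-out rows that the client keeps to themselves is also one way to address point 4, since the comparison doesn't rely on the consultant's word.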

Another thing that increases difficulty is that depending on people's background and experience there are very different views on what is most important.

For example, in my view exploratory data analysis and visualization are less important than using strong models and figuring out how to apply them to problems. I say this because I haven't seen any visualization methods that really tell you much about how hard or easy it will be to develop a predictive model. Sure you can do a 2-D LDA projection and if there's a huge amount of overlap you know you're not looking at a trivial problem. But if the problem is linearly separable someone's probably already got a good solution in Excel.

As for the "Big Data" buzzword it applies well to some problems like NLP or web analytics where massive datasets are available. In these cases it's clear that the more densely your data samples the problem space the better your performance will be and even very simple models will perform well.

However there are many applications where the amount of available training data is not so large and you need to use models which are powerful enough to discover non-obvious patterns. Applying such models and adequately evaluating them, which is critical to avoiding over-fitting with relatively small data sets, requires developing quite complex processes.

Thanks for the explanation. I'm trying to get into that business myself, and it's good to know where the remaining gaps in my knowledge are. (For me it's the "logistics" part.)
Some years ago, there was a similar wave of enthusiasm for "data mining." Plus ça change...
@NyxWulf: where do you work / email me / curious
I felt a similar kind of skepticism when I saw it took ~3 years to improve the Netflix recommendation system by just ~10% - in the context of the Netflix Prize, with great minds (data scientists and practitioners) participating and collaborating.

Maybe the initial system was quite good and it had no space for easy-and-fast enhancements, I don't know.

But a 10% overall improvement in 3 years (just as a quantitative ratio, esp. if it translates directly to the same growth pattern in financial revenues) is something that makes the business types yawn.

There's a fantastic paper, Hand 2006, which notes the strong tendency for simple models to get nearly all of the performance possible out of solvable problems.

Hard problems do better with complex algorithms but there's also just less to be gained.

The best solution tends to be simple models applied to the right kind of data such that the problem has become easy. This is sometimes pretty difficult though since the simple models are designed on simple data, which might not always be what you've got.

But what if 10% improvement means 10 M$/year ?

Anyway I think there are many applications where getting the absolute best performance isn't as important as finding the problem, figuring out how to apply a machine learning model to it (which includes getting the necessary training data), then training an off-the-shelf model. The latter of these may take a day or less; the other phases may well require both more thought and more effort.

A 10% increase in 3 years translates to around a 3.3% yearly growth rate. So in your example that 3.3% increase would be the 10M$, which means 1% of your annual business revenue is 10/3.3, or just above 3M$.
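
A quick way to check this arithmetic, as a sketch (the function name is invented for the example): dividing 10% by 3 gives the ~3.3% used above, while compounding gives a slightly lower annual rate.

```python
def annualized_growth(total_growth, years):
    """Compound annual rate implied by a total growth over `years` years."""
    return (1.0 + total_growth) ** (1.0 / years) - 1.0


simple = 0.10 / 3                       # straight division, about 3.33%
compound = annualized_growth(0.10, 3)   # with compounding, about 3.23%

# Annual revenue (in M$) implied if a `simple` share of it equals 10 M$.
implied_revenue_musd = 10.0 / simple

print(f"simple = {simple:.2%}, compound = {compound:.2%}")
print(f"implied revenue = {implied_revenue_musd:.0f} M$")
```

Either way the implied business is on the order of 300 M$ per year.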

But that means you already have a really significant business that makes ~300 M$ per year. And you manage to increase it just by peanuts (relatively speaking).

And there is inflation in the economy, and the opportunity cost of not investing such a huge sum, or part of it, in Apple stock (for example) during those years.

My point explained better:

The startup success of getting from zero to millions just because of clever ML/data-science/statistics is something to be respected and admired. But for an already big business, all this big-data buzz might provide just minor enhancement opportunities at best.

Of course all these numbers are hypothetical, I have no idea what the actual Netflix numbers are, but aren't you assuming 100% profit margin ?

If you actually have a machine learning application that increases annual revenue by 10% and your initial annual revenue is $300M (like in the example) and your profit margin is 50% then (neglecting the $1M cost of the model because it's small and amortized over many years) your annual profits go from $150M to $180M which is a 20% increase. I don't think that is a number to yawn about.
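
Spelled out as a sketch, with one assumption made explicit: costs are held fixed, so all of the extra revenue falls through to profit. The function name is made up for the example.

```python
def profit_after_revenue_lift(revenue, margin, lift):
    """Return (profit_before, profit_after, relative_change).

    Assumes costs stay fixed at revenue * (1 - margin) when revenue rises.
    """
    costs = revenue * (1.0 - margin)
    before = revenue - costs
    after = revenue * (1.0 + lift) - costs
    return before, after, (after - before) / before


# Figures from the example above, in millions of dollars.
before, after, change = profit_after_revenue_lift(300.0, 0.50, 0.10)
print(f"profit {before:.0f}M -> {after:.0f}M ({change:.0%} increase)")
```

If costs instead grew in proportion to revenue, profit would rise by the same 10% as revenue, so the 20% figure depends on that fixed-cost assumption.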

On your last point I actually think the opposite is true. The larger a company's operations the more potential cost savings there are. If profit margins are slim, as they are in many industries, the effect on earnings of relatively small cost savings or revenue increases can be large indeed.

Well, the widely-known real example I mention is the Netflix Prize result -> 3.3% CAGR (compound annual growth rate; the 10% is for the whole period of 3 years, not for 1 year).

The numbers you mention are the hypothetical ones. Go and find a real, publicly documented example that is close in values and margins to what you describe and I might agree.

NB. #1) Google and web-search, or some similar startup success story, as example doesn't count - they're the 2-person startup gone wildly successful, not a previously big entity that hired 2 ML-geniuses to open their eyes.

NB. #2) If I could bring revolutionary increased value to a company - through vastly enhanced data processing and analysis - rather than doing a consultancy and educating somebody I'd rather enter the industry as a competitor and prove the "old" guys don't understand the business anymore.

Last month Forbes reported that Netflix said 75% of what its customers watch comes from recommendations. Definitely some bang there.

Also, note that the 10% improvement was far from linear: year 1: +8.43%, year 2: +1% and year 3: +0.6% (!!).

An interesting observation, indeed - so the enhancement opportunities were effectively explored within the first year or so. The next two years count for far less, though I guess the cleverest approaches started to emerge just then.
At a conference I attended last month, one of the keynotes estimated that there might be 250 people in the country with the skills needed to build non-trivial, ontology-based data systems. Even if that is a wild exaggeration, it is at least evidence of a perceived shortage.

Also note that an ability to transfer domain experts' knowledge into working models is at least as important as the Stats+ML bits.

I'd say the current academic research in ML is not oriented towards producing people who can use ML in real applications.

I've hovered around the periphery of a world-leading ML research group, and the first takeaway I have is that 7 years ago I thought the stuff they were working on was going to take the world by storm, but looking back, I can say it hasn't.

This group does a number of research projects on narrowly defined topics. 4 out of 5 of these projects try out some refinement of the method that doesn't really work. Maybe 1 out of 5, if that, points to a real improvement.

The big thing that's lacking is serious attempts to push the state of the art by attacking a problem holistically and "taking no prisoners" -- yet this is exactly the kind of thinking necessary to commercialize ML.

The leader of the group got tenure so he thinks everything is going OK. He won't even offer an analysis of why this technology hasn't been widely commercialized. PhD students from this group usually interview at Google, Microsoft and Facebook but these three employers are the only ones they consider as an alternative to academic employment.

There's definitely a need for people to continue working on developing new models and ideas. I think that's where academic research fits. That said the academic world could do a much better job of effectively evaluating and comparing models so that practitioners have a clearer view of what works where. There also seem to be a lot of biases that get perpetuated in the academic world, like the bias against neural networks.

However I think what's really needed for this technology to develop to its true potential is figuring out how to apply it to real problems and I think that's more a role for industry practitioners than academics. The problem is that for people to make a living at this there needs to be a market. I think what we are seeing in this area is a shortage of both supply and demand with the supply side hindering the demand side and vice versa.

I agree with Paul's comment. The 250 number feels low to me, but that is applying a specific model. Typically people come with some set of favorite models, and many of them provide the vast majority of the benefit a business needs. Especially when the current model in use is slipshod and busted at best.
The 250 number refers to the medical field. So the requisite background includes at least biology, and more preferably clinical experience. That diminishes the pool somewhat.

A similar phenomenon certainly manifests in other highly specialized fields though. There are far fewer people with both the skills we're talking about, and deep water E&P or big 3 audit experience, for instance.