HN Mirror


by mjw 5165 days ago

As an engineer who's investing in developing "deep expertise in statistics and machine learning", I can only stand to benefit from it, but something about the current wave of Big Data hype makes me instinctively a bit wary.

Does this skills shortage really exist to the extent claimed? Are there really enough people out there who would know what to do with a 'data scientist' if they were able to hire one? I see more talk than action: vendors circling around looking to flog freshly-buzzword-compliant BI tools, and prognosticators trying to push nervous businesses into an arms race over data.

Of course there's real value there too, for some at least. I hope my concerns prove unfounded, but it's worth retaining a healthy skepticism, I feel :-)

3 comments

NyxWulf 5165 days ago

As someone on the ground in the big data field (VP of Engineering), let me give you my thoughts on it.

Your impression about the hype is correct. There are a lot of vendors offering BIG solutions, if you pay them BIGGER money. Where I used to translate the word enterprise to $$, now I translate Big Data to $$$$$$$.

When I'm hiring, I don't go looking for Big Data people, because generally they don't exist. Statistics is a really great general addition to a programmer's toolkit. Machine learning is valuable as well, although in my experience its application is more limited. What this article doesn't mention is the whole host of other skills required.

Modeling, and not just a formal mathematical model, but applying any type of model to your data to get insight. Check out the model-thinking class on coursera.

Exploratory Data Analysis, a much different skill than confirmatory statistics.

Design of Experiments, a specialized subfield within statistics.

Logistics: how to set up, maintain, and maximally utilize an efficient distributed cluster, and how to build a pipeline that gets your data to the cluster, cleans it, builds it into a model, and then extracts insight and delivers that end value.

Those are a few of the skills at a high level. At a more nuts-and-bolts level, Hadoop is the de facto standard for Big Data. Learning how to build a data pipeline out of the Linux tool chain is also very common in the data science world.

The overall value stream for Big Data is deep and wide. Most companies don't have expertise in many of these areas, so at the current time you have to learn them yourself or find a company focused on building a team around them.

If you are just learning this yourself, you'll probably get an academic knowledge. If you want to make yourself valuable in the marketplace, you'll really want to get hands-on experience. Knowing a z-score is one thing; building a process to gather data and compute a model against it is a whole different ball game. As the article mentions, if you have nice clean data it's easy to apply a model. If you have messy, ugly data from 20 different vendors and 200 clients with various failures and anomalies, and you have to figure out what type of model is helpful, oh, and you have a deadline because for the 500th time someone promised something impossible to the client, then you have something closer to what Big Data is today.
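To make the z-score point a bit more concrete, here is a minimal Python sketch of the gap between the formula and the process around it. Everything in it is an assumption for illustration: the `clean` and `z_scores` function names and the sample `raw` feed are hypothetical, and a real pipeline would need far more care about why values are missing in the first place. The z-score itself is one line; most of the code is about surviving messy input.

```python
import math
from statistics import mean, stdev


def clean(raw_values):
    """Keep only entries that parse as finite floats.

    Blanks, None, placeholder strings such as 'N/A', and NaN are dropped.
    (Silently dropping bad rows is itself an assumption; a real process
    would probably log or count them.)
    """
    cleaned = []
    for v in raw_values:
        try:
            x = float(v)
        except (TypeError, ValueError):
            continue  # not a number at all
        if math.isfinite(x):
            cleaned.append(x)
    return cleaned


def z_scores(values):
    """Return (x - mean) / sample standard deviation for each x."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute a z-score")
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical feed, the kind of thing that arrives from several vendors.
raw = ["10.0", "12.5", "", "N/A", None, "11.0", "nan", "9.5", " 13.0 "]

values = clean(raw)
scores = z_scores(values)

for value, score in zip(values, scores):
    print(f"{value:6.2f}  z = {score:+.3f}")
```

Even in this toy version, the cleaning step is longer than the statistics, which is roughly the proportion you should expect in practice.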

* grammar edits