I work in the big data field on the ground (VP of Engineering), so let me give you my thoughts on it.

Your impression about the hype is correct. There are a lot of vendors offering BIG solutions, if you pay them BIGGER money. Where I used to translate the word "enterprise" to $$, I now translate "Big Data" to $$$$$$$.

When I'm hiring, I don't go looking for Big Data people, because generally they don't exist. Statistics is a really great general addition to a programmer's toolkit. Machine learning is valuable as well, although in my experience its application is more limited.

What this article doesn't mention is a whole host of other skills required:

Modeling: not just a formal mathematical model, but applying any type of model to your data to get insight. Check out the model-thinking class on Coursera.

Exploratory data analysis: a much different skill than confirmatory statistics.

Design of experiments: a specialized subfield within statistics.

Logistics: how to set up, maintain, and maximally utilize an efficient distributed cluster, and how to build a pipeline that gets your data to the cluster, cleans it, builds it into a model, and then extracts insight and delivers that end value.

Those are a couple of the skills at a high level. At a more nuts-and-bolts level, Hadoop is the de facto standard for Big Data. Building a data pipeline out of the Linux tool chain is also very common in the data science world.

The overall value stream for Big Data is deep and wide. Most companies don't have expertise in many of these areas, so at the current time you have to learn them yourself or find a company focused on building a team around them.

If you are just learning this yourself, you'll probably get an academic knowledge. If you want to make yourself valuable in the marketplace, you'll really want to get hands-on experience. Knowing a z-score is one thing; building a process to gather data and compute a model against it is a whole different ball game.

As the article mentions, if you have nice clean data, it's easy to apply a model. If you have messy, ugly data from 20 different vendors and 200 clients, with various failures and anomalies, and you have to figure out what type of model is helpful, oh, and you have a deadline because for the 500th time someone promised something impossible to the client, then you have something closer to what Big Data is today.

* grammar edits
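To make the z-score point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `raw_feed` values stand in for a messy vendor feed, and `parse_value` and `z_scores` are hypothetical helper names, not part of any library. The z-score itself is three lines; most of the code is deciding what to do with garbage.

```python
import math

def parse_value(raw):
    """Return a finite float, or None if the record is unusable."""
    try:
        # Tolerate whitespace and thousands separators; anything else fails.
        value = float(str(raw).strip().replace(",", ""))
    except ValueError:
        return None
    # Reject NaN and infinity, which float() happily accepts.
    return value if math.isfinite(value) else None

def z_scores(values):
    """Population z-scores; [] if fewer than 2 points or zero spread."""
    n = len(values)
    if n < 2:
        return []
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return []
    return [(v - mean) / std for v in values]

# Hypothetical feed: stray whitespace, placeholders, missing and NaN values.
raw_feed = ["10", " 12 ", "N/A", None, "1,000", "11", "", "nan"]

clean = [v for v in (parse_value(r) for r in raw_feed) if v is not None]
scores = z_scores(clean)

for value, score in zip(clean, scores):
    print(f"{value:>8.1f}  z = {score:+.2f}")
```

Half of the records get dropped before any statistics happen, and whether "1,000" is a real reading or a vendor error is a judgment call the code can't make for you.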
This is a killer in machine learning applications. The toolsets rarely cover the entire extent of what needs to be done, so at least some custom code needs to be written. But results aren't deterministic: you don't really know if it's going to work until you run it. Several iterations are often needed to get to the first usable results. It has all the problems of building any piece of software, plus another layer of risk that the accuracy just won't be there with the first thing(s) you try.
My point is... actually agreeing to be the machine learning guy on a project totally sucks because time estimates are almost meaningless, and the modern business culture is to label anything late as a failure.