Hacker News
by NyxWulf 5165 days ago
As someone on the ground in the big data field (VP of Engineering), let me give you my thoughts on it.

Your impression about the hype is correct. There are a lot of vendors offering BIG solutions, if you pay them BIGGER money. Where I used to translate the word enterprise to $$, now I translate Big Data to $$$$$$$.

When I'm hiring, I don't go looking for Big Data people, because generally they don't exist. Statistics is a really great general addition to a programmer's toolkit. Machine Learning is valuable as well, although in my experience the application is more limited. What this article doesn't mention is a whole host of other skills required.

Modeling, and not just a formal mathematical model, but applying any type of model to your data to get insight. Check out the model-thinking class on Coursera.

Exploratory Data Analysis, much different skill than confirmatory statistics.

Design of Experiments, specialized subfield within statistics.

Logistics, how to set up, maintain, and maximally utilize an efficient distributed cluster and build a pipeline getting your data to the cluster, cleaning it, building it into a model, and then extracting insight and delivering that end value.

Those are a couple of the skills at a high level. At a more nuts and bolts level, Hadoop is the de facto standard for Big Data. Learning how to build a data pipeline out of the Linux tool chain is very common in the data science world.

The overall value stream for Big Data is deep and wide. Most companies don't have expertise in many of these areas, so at the current time you have to learn them yourself or find a company focused on building a team around them.

If you are just learning this yourself, you'll probably get an academic knowledge. If you want to make yourself valuable in the marketplace, you'll really want to get hands-on experience. Knowing a z-score is one thing; building a process to gather data and compute a model against it is a whole different ball game. As the article mentions, if you have nice clean data it's easy to apply a model. If you have messy, ugly data from 20 different vendors and 200 clients with various failures and anomalies, and you have to figure out what type of model is helpful, oh, and you have a deadline because for the 500th time someone promised something impossible to the client, then you have something closer to what Big Data is today.
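To make the gap between "knowing a z-score" and "building a process" concrete, here is a minimal sketch in Python. It is only an illustration: the `feed` values, the cleaning rules, and the function names are hypothetical and do not come from any real vendor data. Most of the code deals with rejecting bad input, and only a few lines compute the statistic itself.

```python
import math

def parse_number(raw):
    """Return a float, or None if the raw value is missing or not a finite number."""
    try:
        # Assumed cleaning rule: strip whitespace and thousands separators.
        value = float(str(raw).strip().replace(",", ""))
    except ValueError:
        return None
    return value if math.isfinite(value) else None

def z_scores(raw_values):
    """Clean the input, then return (cleaned values, z-scores, number rejected)."""
    cleaned = [v for v in map(parse_number, raw_values) if v is not None]
    rejected = len(raw_values) - len(cleaned)
    if len(cleaned) < 2:
        raise ValueError("need at least two usable values")
    mean = sum(cleaned) / len(cleaned)
    # Sample standard deviation (n - 1 in the denominator).
    variance = sum((v - mean) ** 2 for v in cleaned) / (len(cleaned) - 1)
    std = math.sqrt(variance)
    if std == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return cleaned, [(v - mean) / std for v in cleaned], rejected

# Hypothetical feed: mixed types, a thousands separator, blanks, and junk.
feed = ["10", " 12 ", "1,014", None, "n/a", "11", 9, "nan"]
cleaned, scores, rejected = z_scores(feed)
print(f"kept {len(cleaned)} values, rejected {rejected}")
for value, score in zip(cleaned, scores):
    print(f"{value:>8.1f}  z = {score:+.2f}")
```

In a real pipeline the rejected records would be logged and investigated rather than silently dropped, since a spike in rejects is often the first sign that a vendor changed a format.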

* grammar edits

5 comments

deadline because for the 500th time someone promised something impossible to the client

This is a killer in machine learning applications. The toolsets rarely cover the entire extent of what needs to be done, so at least some custom code needs to be written. But results aren't deterministic - you don't really know if it's going to work until you run it. Several iterations are often needed to get to the first useable results. It has all the problems of building any piece of software, plus another layer of risk that the accuracy just won't be there with the first thing(s) you try.

My point is... actually agreeing to be the machine learning guy on a project totally sucks because time estimates are almost meaningless, and the modern business culture is to label anything late as a failure.

I couldn't agree more. Accuracy is one problem; variation is another. Dealing with layers in the business who have no math or statistics background but very strong opinions is yet another complication.

These types of conversations aren't uncommon.

Other - "I need you to prove our stuff does X, Y, and Z".

Me - "Ok.."

<time elapses>

Me - "Ok the data shows our stuff does X but Y and Z are just random noise"

Other - "We ran it once before with this other guy and it showed our stuff did X, Y, and Z. We've been promising it to our clients for a year. He gave us several examples, but when the clients asked to see the underlying data he couldn't produce it. So we just need you to prove it does X, Y, and Z."

Me - "The data only shows it does X. Y and Z are impacted positively through X, but once you condition on X, Y and Z are not causally affected by our stuff"

Other - "Yeah...well, I promised the client we would give them a report by {{insert random ridiculous date here}} proving it did X, Y, and Z. We are going to lose them if we don't deliver a report saying that."

Me - trying for the 50th time to explain they shouldn't promise a positive result when we've never looked at the data.

There are hundreds of variations on this conversation. "Your code is wrong" is one variant (which, depending on the timeline, is hard to dispute). Of course, if you take long enough to make sure your code is correct, then you are too slow. "This isn't a science experiment, just make it work" is another. Watching someone go slack-jawed and start drooling because you accidentally used a math term is always interesting.

I have a whole new perspective on being on the cutting edge. It seems like it mostly means you are on the cutting edge of comments from people who don't know how ridiculously hard what you are doing is.

Man, this is statistics. You should be able to get any result you want!

I'm only half kidding. I can remember writing my first report (project summary) when I did a contract right after grad school. I put in maybe 5 graphs. Two looked good, three looked bad. The project manager just deleted the bad looking graphs and sent it on to the client.

"Some people use statistics like a drunk uses a lamp post - for support rather than for illumination."

The company I used to work for had a performance-based product. They only got paid if they actually showed improved accuracy against a given evaluation set. Then they got a fraction of the cost savings (say 1 year's worth).

This seems like it could be a good model for machine learning consulting, and one that I would certainly be willing to explore.

It would work something like this:

  1) You show me your problem and your data.

  2) We come to an agreement on how accuracy would translate into financial results and on a fair split of the savings or earnings.

  3) I develop a model.

  4) You evaluate it based on 2.

  5) I get paid based on 2.

If my model doesn't meet minimum performance criteria, I don't get paid. If it does very well, and assuming the problem was economically interesting in the first place, you save a lot of money and I get a fair-sized chunk of it.
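The agreement in steps 2 and 5 could be written down as a simple payout rule. The sketch below is one hypothetical way to do it in Python, not the scheme any actual company used: the parameter names, the linear "savings per accuracy point" assumption, and all the numbers are invented for illustration.

```python
def payout(accuracy_pct, baseline_pct, min_accuracy_pct,
           savings_per_point, share):
    """Fee owed under a hypothetical pay-for-performance contract.

    accuracy_pct      -- model accuracy on the agreed evaluation set, in percent
    baseline_pct      -- accuracy of whatever the client uses today, in percent
    min_accuracy_pct  -- minimum performance criterion; below it the fee is zero
    savings_per_point -- agreed savings per percentage point of improvement
                         (assumes savings scale linearly, which is a simplification)
    share             -- consultant's fraction of the savings, between 0 and 1
    """
    if not 0 <= share <= 1:
        raise ValueError("share must be between 0 and 1")
    if accuracy_pct < min_accuracy_pct:
        return 0.0  # missed the minimum criterion: no payment
    improvement_points = max(0, accuracy_pct - baseline_pct)
    return improvement_points * savings_per_point * share

# Hypothetical numbers: 10 points of improvement, 50,000 saved per point, 25% share.
fee = payout(accuracy_pct=90, baseline_pct=80, min_accuracy_pct=85,
             savings_per_point=50_000, share=0.25)
print(fee)  # 125000.0
```

The hard part is not this arithmetic but agreeing on `baseline_pct` and `savings_per_point` up front, and on who holds the evaluation set.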

Feel free to explain why this business model wouldn't work.

Edited for formatting.

Most business people aren't interested in model accuracy as a term. They want something that provides benefit, e.g. cost savings, increased revenue, increased profits, etc.

The sales process of convincing someone they need an accurate model is tough, especially because robust models are time-consuming and expensive to build.

If you can come up with a model that shows good results, and people know they need those results, then you can start a company selling either a service or product to get those results. If people don't know they need your results - then you have to educate them, in which case it's a much more difficult business to start.

I don't know many business people with the temperament, understanding, or pocketbook to deal with general research-type problems.

I think there will be a day when this will work. But right now my concerns would be:

1) The people looking for outside help probably don't have ANY model, so baselines are difficult. (I use synthetic baselines like just predicting the average of the predicted variable every time, and that's a very valuable tool, but I don't think you could get paid by beating them.)

2) When would you cut your losses on a failed project and move on? That would be incredibly difficult to do on a project that you had spent weeks or months on and not been paid. It's like cutting your losses on a failed trade in the stock market... it seems like it would be easy and obvious until you actually experience it yourself.

3) Once a company was in a position to take you up on your offer, they might as well post it on Kaggle and get a hundred people to work on it for peanuts. I'm really rooting for Kaggle b/c one of their long-term goals is to let people like you (and me) make a living doing analytics work like you describe. But right now they just don't have the volume, and all the projects pay out just a few thousand dollars (and only if you beat the other hundred participants).

4) If I was a company, and didn't have the expertise in house to build the model myself, I'd be wary I was really getting what I paid for. If I'm paying for a 10% boost in accuracy, how do I measure it rather than just taking your word for it?
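Points 1 and 4 above can be illustrated together: a synthetic baseline that always predicts the average, scored on a held-out set that the client keeps to itself. This is a minimal Python sketch under those assumptions. The function names, the choice of mean absolute error as the metric, and the toy numbers are all hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def compare_to_mean_baseline(train_targets, holdout_targets, model_predictions):
    """Score a model against 'always predict the training average' on held-out data.

    Returns (baseline_mae, model_mae, relative_improvement).
    """
    # The baseline is fitted on training data only, never on the holdout.
    baseline_value = sum(train_targets) / len(train_targets)
    baseline_mae = mean_absolute_error(
        holdout_targets, [baseline_value] * len(holdout_targets))
    model_mae = mean_absolute_error(holdout_targets, model_predictions)
    if baseline_mae == 0:
        raise ValueError("baseline is already perfect; improvement is undefined")
    improvement = (baseline_mae - model_mae) / baseline_mae
    return baseline_mae, model_mae, improvement

# Toy numbers, purely illustrative.
train_targets = [10, 20, 30, 40]   # data shared with the consultant
holdout_targets = [15, 35]         # data the client never hands over
model_predictions = [17, 31]       # consultant's predictions for the holdout rows

baseline_mae, model_mae, improvement = compare_to_mean_baseline(
    train_targets, holdout_targets, model_predictions)
print(f"baseline MAE {baseline_mae:.1f}, model MAE {model_mae:.1f}, "
      f"improvement {improvement:.0%}")
```

Because the client computes the score on data the consultant never saw, neither side has to take the other's word for the result.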

Another thing that increases difficulty is that depending on people's background and experience there are very different views on what is most important.

For example, in my view exploratory data analysis and visualization are less important than using strong models and figuring out how to apply them to problems. I say this because I haven't seen any visualization methods that really tell you much about how hard or easy it will be to develop a predictive model. Sure, you can do a 2-D LDA projection, and if there's a huge amount of overlap you know you're not looking at a trivial problem. But if the problem is linearly separable, someone's probably already got a good solution in Excel.

As for the "Big Data" buzzword, it applies well to some problems like NLP or web analytics where massive datasets are available. In these cases it's clear that the more densely your data samples the problem space, the better your performance will be, and even very simple models will perform well.

However, there are many applications where the amount of available training data is not so large and you need to use models that are powerful enough to discover non-obvious patterns. Applying such models and adequately evaluating them, which is critical to avoiding over-fitting with relatively small data sets, requires developing quite complex processes.
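One common building block of such an evaluation process is k-fold cross-validation, where every sample is used for testing exactly once and never in the same round it was trained on. The sketch below is a bare-bones Python version using only the standard library. The `fit`/`predict` callables and the toy "predict the mean" model are placeholders for illustration, not a recommendation of any particular model.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # fixed seed for repeatability
    folds = [indices[i::k] for i in range(k)]   # k roughly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean absolute error of each fold."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_error = sum(abs(predict(model, xs[i]) - ys[i]) for i in test) / len(test)
        errors.append(fold_error)
    return errors

# Placeholder model: ignore the inputs and always predict the training mean.
def fit_mean(train_xs, train_ys):
    return sum(train_ys) / len(train_ys)

def predict_mean(model, x):
    return model

xs = list(range(10))
ys = [2 * x for x in xs]
fold_errors = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
print("per-fold MAE:", [round(e, 2) for e in fold_errors])
print("mean MAE:", round(sum(fold_errors) / len(fold_errors), 2))
```

With small data sets the spread across folds matters as much as the average: a model whose per-fold error swings widely is probably fitting noise.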

Thanks for the explanation. I'm trying to get into that business myself, and it's good to know where the remaining gaps in my knowledge are. (For me it's the "logistics" part.)

Some years ago, there was a similar wave of enthusiasm for "data mining." Plus ça change...

@NyxWulf: where do you work / email me / curious