Hacker News
by rm999 4083 days ago
Meh. The more I do machine learning in industry, the more I realize how little the ML part matters compared to everything else. A typical project I've seen takes 3-6 months and contains thousands of lines of code, but the machine learning part will take a week or two and be 100 lines of code. What Amazon ML is doing would probably take an hour and 30 lines of R code you can easily find online.

And here's the not-too-hidden secret: the ML part is the fun part. It's a big reason we spend months creating banking.csv. Josh Willis did a very funny presentation at MLconf partly about this. It's like waiting in line at a theme park for an hour, and then paying someone to cut in line at the last minute and record the ride for you. https://www.youtube.com/watch?v=4Gwf5zsg4vI&feature=youtu.be...

9 comments

The hardest part of machine learning is not training the model but debugging the model. How do you improve precision/recall after the first cut? Do you need more training data? Is some of your training data bad? Is it properly distributed? Do your features have bugs? Are you missing features to cover some cases? Is your feature selection effective? Did you tune the parameters carefully?

All these scenarios are difficult to debug because it's "statistical debugging". There are no breakpoints to set or watch windows to look at. There is no stack trace and there are no exceptions. Any Joe can train a model given training data, but it takes a fair bit of genius to debug these issues and push model performance to the next level. Unfortunately all these new and old "frameworks" almost completely ignore this debugging part. I think the first framework that has great debugging tools will revolutionize ML like Borland revolutionized programming with its visual IDEs.
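To make the "statistical debugging" idea a bit more concrete, here is a minimal sketch of one common first step: scoring precision and recall separately per slice of the evaluation data, so a weak segment shows up instead of hiding in the aggregate. The slice names (`web`, `mobile`), the rows, and the helper names are all made up for illustration; this is not taken from any particular framework.

```python
from collections import defaultdict

def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, using 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def metrics_by_slice(rows):
    """Group (slice_key, y_true, y_pred) rows and score each slice separately."""
    groups = defaultdict(lambda: ([], []))
    for key, t, p in rows:
        groups[key][0].append(t)
        groups[key][1].append(p)
    return {key: precision_recall(ts, ps) for key, (ts, ps) in groups.items()}

# Hypothetical evaluation rows: (segment, true label, predicted label).
rows = [
    ("web", 1, 1), ("web", 0, 0), ("web", 1, 1), ("web", 0, 1),
    ("mobile", 1, 0), ("mobile", 1, 0), ("mobile", 0, 0), ("mobile", 1, 1),
]
report = metrics_by_slice(rows)
for segment, (p, r) in sorted(report.items()):
    print(f"{segment}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the `mobile` slice has much lower recall than `web`, which is the kind of hint that points toward missing features or bad training data for one segment rather than a global tuning problem.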

This. The pity is that as soon as we get the results, after a week, the project is over and we move back to data wrangling hell!
You hit the nail on the head. This completely agrees with all my experience at Kaggle and applying machine learning across a broad number of industries.
Maybe it's just me, but the "tedious" feature design and extraction IS the fun part. Am I the only one?

I mean, it's time consuming and frustrating, but it's also the essence of ML work and the place where I get to apply creativity and gain insight.

Agree 100%. In that light, does anyone know how far away we are from having data wrangling be more automated? I saw a demo for a product called Paxata a few weeks ago, and it looked like a good start. Anyone know more about things like that?
There are lots of new attempts at data wrangling approaches/tools, each with different caveats: Datameer, Platfora, Trifacta...
I can say that this is my day job now.
I think the "1 part fun 9 parts of perspiration" ratio is typical of most software fields - especially fields working in established industries. That's why dealing with software in a professional context is called a job and not an enjoyable hobby which it otherwise would be :)
I think this is one of the advertised advantages of deep learning: it will find useful and unobvious features in your data corpus without much effort on your part.
I think that works in theory, but in many real world cases it actually takes a human to map the data into a subset of salient features. It's not simply a matter of excluding irrelevant dimensions.
Actually, with deep learning the fact of success is leading the theory. We don't know why deep learning works as much as we know it does work.

Edit: in certain domains such as images and speech

Isn't the point here that you can do it on huge datasets that don't work nicely with R?
There are plenty of tools for that already. The point here is to make it as easy as possible.

I guess this could be useful for some people, but it seems rudimentary to me. If I'm reading their FAQ right they're just fitting a logistic regression to everything. I'm hoping this is just a starting point. Also, not being able to export the actual model seems like a huge dealbreaker to me.

My guess is they're using liblinear or vowpal wabbit under the hood. Both support SGD-based learning and work well in a streaming setting where data could be on disk or in memory.
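For anyone wondering what "SGD-based learning in a streaming setting" looks like, here is a rough sketch of logistic regression trained one example at a time in plain Python. It is only an illustration of the general technique; it is not how Amazon ML or either of the libraries mentioned above is implemented, and `sgd_logistic`, `toy_stream`, and the data are hypothetical names and values.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logistic(stream, n_features, lr=0.1, epochs=1):
    """Fit weights with one SGD update per example, never holding the full dataset."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in stream():
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            b -= lr * g
    return w, b

def toy_stream():
    # Hypothetical data source; in practice this could read rows lazily from disk.
    data = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.1, 0.9], 0), ([0.9, 0.2], 1)]
    yield from data

w, b = sgd_logistic(toy_stream, n_features=2, lr=0.5, epochs=200)

def predict(x):
    """Return the estimated probability that x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

print(predict([1.0, 0.0]), predict([0.0, 1.0]))
```

Because `stream` is a callable that returns a fresh iterator, each epoch can re-read the data from wherever it lives, and memory use stays proportional to the number of features rather than the number of rows.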
Do you mean that the work surrounding the machine learning, like gathering data to build a dataset, takes months and other resources, whereas the fun stuff is actually very short and easy to do?

I smell commoditization.

Can you elaborate?