Hacker News
by nostrademons 4786 days ago
I'm a little curious - have you done machine learning work that's made it into a real product?

I tried doing it a bit with my last project. My results were basically terrible. It turns out that getting useful results out of heterogeneous, vague, fuzzily-specified real-world data is really hard.

I'm tight with a few folks in Search Quality whose job is that sort of data-scientist machine-learning work, and they're really good at it. Y'know what 90% of their daily time is spent on? Compiling golden sets. Labeling training data. Running MapReduces to collect basic statistics about their data set. Running MapReduces to identify representative members of their data set, and outliers that should be excluded. Shoving data into R to visualize it. Futzing with numeric coefficients. Building webpages and tools so they can visualize the data and results, futz with the numbers online, and get feedback in real time. Collecting test sets and running your algorithm against them, and then trying to figure out why your losses are losing.
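To make the "collect basic statistics" and "find outliers to exclude" steps concrete, here is a minimal single-machine sketch in Python. It only mimics the map/reduce shape; it is not the tooling that team uses, and the function names, the z-score rule, and the threshold are all assumptions chosen for illustration.

```python
from functools import reduce
from math import sqrt


def map_record(x):
    # Map step: emit partial aggregates (count, sum, sum of squares) for one value.
    return (1, x, x * x)


def combine(a, b):
    # Reduce step: partial aggregates add component-wise, so shards can be merged in any order.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def summarize(values):
    # Returns (count, mean, population standard deviation).
    n, total, total_sq = reduce(combine, map(map_record, values), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("cannot summarize an empty data set")
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # clamp tiny negative rounding error
    return n, mean, sqrt(variance)


def split_outliers(values, z_threshold=2.0):
    # Hypothetical exclusion rule: drop values more than z_threshold standard deviations from the mean.
    _, mean, std = summarize(values)
    if std == 0:
        return list(values), []
    kept = [v for v in values if abs(v - mean) / std <= z_threshold]
    dropped = [v for v in values if abs(v - mean) / std > z_threshold]
    return kept, dropped


values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
kept, dropped = split_outliers(values)
print(summarize(values), dropped)
```

With this toy data the 100 is dropped at a threshold of 2.0, but it would survive a threshold of 3.0, because the outlier itself inflates the standard deviation and its z-score lands just under 3. That kind of surprise is a small taste of why the futzing takes so long.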

Machine learning from a practitioner's POV is not at all like the textbooks and theoretical papers suggest. I'd estimate that less than 10% of one's time goes into the "fun" part of machine learning - brainstorming new signals and writing the code to extract them and feed them into your classifier - and 90% is on the kind of grunt work that hard science grad students do all the time. You get paid well for it, but that's because a lot of the work is really boring and time-consuming. I suspect that I get a far more frequent rush of accomplishment as a mostly-UI guy than the data scientists in my department get.
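For a sense of how small the "fun" part can be next to everything around it, here is a toy Python sketch of the loop: a couple of signals, a hand-weighted classifier, and a run against a golden set that collects the losses. The task (flagging spam-like text), the signals, the weights, and the labeled examples are all made up for illustration.

```python
def signal_exclamations(text):
    # Hypothetical signal: number of exclamation marks.
    return text.count("!")


def signal_upper_ratio(text):
    # Hypothetical signal: fraction of letters that are uppercase.
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


SIGNALS = [signal_exclamations, signal_upper_ratio]
WEIGHTS = [0.5, 2.0]  # hand-tuned coefficients, the part that gets futzed with
THRESHOLD = 1.0


def score(text):
    # Weighted sum of all signals.
    return sum(w * s(text) for w, s in zip(WEIGHTS, SIGNALS))


def classify(text):
    return score(text) >= THRESHOLD


def evaluate(golden_set):
    # Run the classifier over hand-labeled examples and keep the ones it gets wrong.
    losses = [(text, label) for text, label in golden_set if classify(text) != label]
    accuracy = (len(golden_set) - len(losses)) / len(golden_set)
    return accuracy, losses


# Made-up golden set: (text, is_spam) pairs labeled by hand.
GOLDEN_SET = [
    ("BUY NOW!!!", True),
    ("hello there", False),
    ("Meeting at 3pm", False),
    ("NASA launch update", False),
    ("Congrats!! You won!", False),
]

accuracy, losses = evaluate(GOLDEN_SET)
print(accuracy, losses)
```

Writing the signals took a few lines. Building a golden set that is actually representative, and working out why each loss is losing, is where the other 90% goes.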

It's a tool. It works well in some cases, but it can take a lot of effort to get it to work well.

1 comment

I can only speak for myself here, but I find a lot of what you described as "grunt work" to be fun. Yes, it's true that when you work with real-world data, you don't only work on the mathematical modeling, and you spend a lot of time just setting yourself up to be able to do what is typically called the "fun" part. Personally, I find all of the work involved to be fun, even setting up a data pipeline and figuring out how to distribute a computation. Automating away grunt work and the setup process is fun, too.

Part of what makes data science fun is the machine learning itself. An equally interesting part is that it involves so many other parts of computer science that a typical corporate job would call "too hard" and isolate you from.

At least based on what I've seen, data scientists are respected and this means they get dibs on the most interesting projects. However, the interesting projects themselves involve a lot of detailed work (that's the nature of technology) and if you're that interested, you're going to want to do it yourself, at least until you really understand the problem (at which point, you'll automate the dull stuff).