by boxy310 2951 days ago
Well - there _is_ a bit of Student's Paradox involved in requirements-gathering for machine learning or analytics, and it's not always apparent whether you're using the right algorithm for the problem or which specific metrics you actually need to optimize. Most of the time when I've gone through User Story collection processes, the end state users describe is vague, not based on hypothesis testing, and not supportable by the data on hand. A big part of this process is discovering what the data _can_ tell you and, if necessary, rewriting the technical requirements entirely to align with the actual business need.

Even when a customer's given me a "clean" dataset, I've had to write 400+ lines of code to do the feature engineering on a relatively straightforward logistic regression. Then there are all the other times when a customer asks me to deploy one type of algorithm, and their business problem is actually solved by an entirely different class of algorithm.

Zayd over at Stanford has a nice blog post [1] describing why machine learning adds several more dimensions of complexity compared to traditional software development. There _is_ a specific set of Data-first skills that is complemented by dev and CS experience, but a fundamental reason ML projects fail is a lack of appreciation for the many different skillsets needed to succeed.

[1] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.ht...

1 comments

Your article eloquently sums up my thoughts on the matter. Data science is like software engineering, but with bugs whose symptoms tend to surface far from their cause. You don't get a clean, orderly traceback; instead you get vague symptoms, where the model behaves "weird" for no obvious reason.

I've found that encapsulating the data preprocessing steps in pure functions makes the data cleaning easy and quick to debug. When it comes to the actual model, there is no substitute for thoroughly understanding the characteristics of your dataset. Finally, when it comes to model selection, a good scoring metric is absolutely necessary; which metric is "good" depends entirely on what you're actually trying to accomplish with the model, so there is little universal advice.
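To make the pure-function point concrete, here is a minimal sketch of what I mean. The record layout, the column names (`id`, `age`, `is_adult`), and the thresholds are all made up for illustration; the only point is that each step takes data in, returns new data, and never mutates its input, so any step can be checked in isolation.

```python
Row = dict  # hypothetical record type: one dict per observation


def drop_missing_age(rows: list[Row]) -> list[Row]:
    """Return a new list without rows whose 'age' is missing."""
    return [row for row in rows if row.get("age") is not None]


def clip_age(rows: list[Row], low: int = 0, high: int = 120) -> list[Row]:
    """Return copies of the rows with 'age' clipped to [low, high]."""
    return [{**row, "age": min(max(row["age"], low), high)} for row in rows]


def add_is_adult(rows: list[Row], threshold: int = 18) -> list[Row]:
    """Return copies of the rows with a derived boolean 'is_adult' feature."""
    return [{**row, "is_adult": row["age"] >= threshold} for row in rows]


def preprocess(rows: list[Row]) -> list[Row]:
    """Compose the steps; each intermediate result can be inspected on its own."""
    return add_is_adult(clip_age(drop_missing_age(rows)))


raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 250},
    {"id": 4, "age": 12},
]
clean = preprocess(raw)  # 'raw' is left exactly as it was
```

Because nothing is modified in place, you can rerun any single step on the same `raw` input and get the same answer, which narrows down where a "weird" value was introduced.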

When it comes to the long iteration cycle, the only bandaid I've been able to find is a solid test suite and thorough code review. This makes it less likely you'll introduce unintended problems. Basically you have to move slowly and deliberately, instead of "move fast and break things."
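As a rough illustration of the kind of test I mean, here is a sketch using the standard-library `unittest` module. The `standardize` function is a hypothetical preprocessing step, not something from the article, and it assumes a non-empty list of numbers. The tests pin down properties that would otherwise fail silently and only show up much later as a strange model.

```python
import math
import unittest


def standardize(values: list[float]) -> list[float]:
    """Hypothetical step: rescale to zero mean and unit (population) variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    if variance == 0:
        return [0.0 for _ in values]  # constant column: avoid dividing by zero
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


class StandardizeTests(unittest.TestCase):
    def test_zero_mean_unit_variance(self):
        out = standardize([1.0, 2.0, 3.0, 4.0])
        self.assertAlmostEqual(sum(out) / len(out), 0.0)
        self.assertAlmostEqual(sum(v * v for v in out) / len(out), 1.0)

    def test_constant_column_does_not_divide_by_zero(self):
        self.assertEqual(standardize([5.0, 5.0, 5.0]), [0.0, 0.0, 0.0])

    def test_input_is_not_mutated(self):
        values = [1.0, 2.0, 3.0]
        standardize(values)
        self.assertEqual(values, [1.0, 2.0, 3.0])
```

Saved in a file, this can be run with `python -m unittest`. None of these checks is clever; they just turn a vague downstream symptom into an immediate, local failure.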

It's far from impossible, but certainly more difficult than "traditional" software engineering. It's like going from debugging interpreter tracebacks generated by a simple toy script to debugging a >40k LOC application written in a dynamic language, which happens to be intermittently segfaulting in production.