Your article eloquently sums up my thoughts on the matter. Data science is like software engineering, but with bugs that tend to surface far from their cause. You don't get a clean, orderly traceback; instead you get vague symptoms, where the model behaves "weird" for no obvious reason.

I've found that encapsulating the data preprocessing steps in pure functions makes the data cleaning quick and easy to debug. When it comes to the actual model, there is no substitute for thoroughly understanding the characteristics of your dataset. Finally, when it comes to model selection, a good scoring metric is absolutely necessary, but the right one depends entirely on what you're actually trying to accomplish with the model, so there is little universal advice.
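To make the pure-function point concrete, here's a rough sketch of what I mean. The field names ("name", "age") and the cleaning rules are made up for illustration, and it assumes every field arrives as a string; the point is only that each step takes data in and returns new data, with no hidden state.

```python
from typing import Optional


def parse_age(raw: str) -> Optional[int]:
    """Return the age as an int, or None if the field is unusable."""
    try:
        age = int(raw.strip())
    except ValueError:
        return None
    # Hypothetical sanity bound; pick whatever fits your data.
    return age if 0 <= age <= 120 else None


def clean_rows(rows: list[dict]) -> list[dict]:
    """Return new, cleaned rows. The input list and its dicts are never mutated."""
    cleaned = []
    for row in rows:
        age = parse_age(row.get("age", ""))
        if age is None:
            continue  # drop rows whose age can't be trusted
        name = row.get("name", "").strip().lower()
        cleaned.append({**row, "age": age, "name": name})
    return cleaned
```

Since nothing here touches globals or modifies its arguments, you can feed a suspicious row straight into `parse_age` or `clean_rows` in a REPL and see exactly what comes out.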

When it comes to the long iteration cycle, the only band-aid I've been able to find is a solid test suite and thorough code review. This makes it less likely that you'll introduce unintended problems. Basically, you have to move slowly and deliberately instead of "move fast and break things."
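As a small example of the kind of test I have in mind, here's a sketch using the standard library's `unittest`. The `min_max_scale` helper is a hypothetical stand-in for whatever transformation your pipeline actually applies; the edge cases (constant column, empty input) are the sort of thing that otherwise only shows up as a "weird" model much later.

```python
import unittest


def min_max_scale(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class MinMaxScaleTest(unittest.TestCase):
    def test_known_values(self):
        self.assertEqual(min_max_scale([3.0, 7.0, 5.0]), [0.0, 1.0, 0.5])

    def test_constant_column(self):
        self.assertEqual(min_max_scale([2.0, 2.0]), [0.0, 0.0])

    def test_empty_input(self):
        self.assertEqual(min_max_scale([]), [])

    def test_input_is_not_mutated(self):
        values = [1.0, 2.0, 3.0]
        min_max_scale(values)
        self.assertEqual(values, [1.0, 2.0, 3.0])
```

Run it with `python -m unittest` from the directory containing the file. Tests like these finish in milliseconds, so they catch the silly mistakes before you pay for a full training run.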

It's far from impossible, but certainly more difficult than "traditional" software engineering. It's like going from debugging interpreter tracebacks generated by a simple toy script to debugging a >40k LOC application, written in a dynamic language, that happens to be segfaulting intermittently in production.