by bobbruno 1320 days ago
While I tend to agree with most of the arguments that DS code is usually of low quality, and that DS are not well-trained in good development practices, I wonder if making them better coders is an attainable goal, or even a proper one. My reasons for questioning that:
- Data science requires a significant stack of knowledge beyond coding - in fact, to be a useful DS in a company, you already have to learn about maths and business domains, keep up with the latest algorithms, know how to manipulate data, present, run experiments and analyse them, know deep statistics, and some other things I am probably forgetting. Adding the SW dev skills on top of that and expecting them to become good developers is a tall order, and only a small percentage of the DS community will achieve it. With the level of demand for ML, I don’t know if this will meet the market’s needs - it’s not that it’s not attainable, I think it’s not scalable;
- People coming from a SW dev background tend to think DS is the same, just done by people who don’t code well. That is not true: code is the final product of software development, while for a DS it is but a tool for reaching the goal of finding a good ML approach. The consequence is that SW dev has a much stronger reason for wanting good-quality, maintainable code than DS does. When researching a solution, many iterations of code written by a DS will be discarded without ever going to production, and I don’t know if the overhead of keeping good tests, structuring the code, making small commits, etc., is justifiable in this scenario - the goal is not to have maintainable code, it is to see if the model+features has potential for solving the problem;
- Evolution and maintenance are also a problem, because the structure that’s good for operations doesn’t help the job of research - it’s not common for a DS to work in a pipeline structure (which seems to be the emerging pattern for MLOps), and forcing them to use that structure on all iterations after the first will cause significant productivity issues, to the point of putting success at risk.

I don’t have a solution for the points above, and I understand that, once a promising approach has been found, the code starts to matter much more, because Ops will require it to be automated and executed in a reliable way. For now, I do the research in a very loose way, not caring about good SW practices. When I find something good, I start refactoring the code to meet the Ops expectations. But I’m a CS major with decades of experience in coding and ML - it’s not reasonable to expect the entire DS community to develop the same skills; it takes too long.

Any ideas out there?