by kmax12 2622 days ago

I see this as a very timely question. As ML has proliferated, so has the number of ways to construct machine learning pipelines. This means that from one project to another the tools/libraries change, the codebases start looking very different, and each project gets its own unique deployment and evaluation process.

In my experience, this causes several problems that hinder the expansion of ML within organizations:

1. When an ML project inevitably changes hands, there is a strong desire from the new people working on it to tear things down and start over.

2. It's hard to leverage and build upon the success of previous projects, e.g. "my teammate built a model that worked really well for predicting X, but I can't use any of that code to predict Y".

3. Data science managers face challenges tracking progress and managing multiple data science projects simultaneously.

4. People and teams new to machine learning have a hard time charting a single path to success.

While coming up with a single way to build machine learning pipelines may never be possible, consolidation in the approaches would go a long way.

We've already seen that happen in the modeling algorithms part of the pipeline with libraries like scikit-learn. It doesn't matter what machine learning algorithm you use; the code will be fit/transform/predict.
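To make the fit/predict point concrete, here is a minimal sketch in plain Python. It is not scikit-learn code: the two toy estimators (`MeanRegressor`, `NearestNeighborRegressor`), the `evaluate` helper, and the data are all made up for illustration. The only thing borrowed is the convention that every model exposes `fit` and `predict`, which is what lets one evaluation function work for both.

```python
from statistics import mean


class MeanRegressor:
    """Toy estimator (hypothetical): always predicts the training-target mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighborRegressor:
    """Toy estimator (hypothetical): predicts the target of the closest training row."""

    def fit(self, X, y):
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict(self, X):
        def closest_target(row):
            # Squared Euclidean distance from this row to every training row.
            dists = [sum((a - b) ** 2 for a, b in zip(row, train_row))
                     for train_row in self.X_]
            return self.y_[dists.index(min(dists))]

        return [closest_target(row) for row in X]


def evaluate(model, X_train, y_train, X_test, y_test):
    """Mean absolute error. Relies only on the shared fit/predict interface."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    return mean(abs(p - t) for p, t in zip(predictions, y_test))


# Made-up data, purely for illustration.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 2.0, 4.0, 6.0]
X_test = [[1.1], [2.9]]
y_test = [2.0, 6.0]

# Swapping the algorithm does not change the surrounding code.
errors = {
    type(model).__name__: evaluate(model, X_train, y_train, X_test, y_test)
    for model in (MeanRegressor(), NearestNeighborRegressor())
}
```

The evaluation code never needs to know which algorithm it is handed, which is the kind of consolidation I'd like to see in the rest of the pipeline.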

Personally, I've noticed this problem of multiple approaches and ad-hoc solutions to be most prevalent in the feature engineering step of the process. That is one of the reasons I work on an open source library called Featuretools (http://github.com/featuretools/featuretools/) that aims to use automation to create more reusable abstractions for feature engineering.

If you look at our demos (https://www.featuretools.com/demos), we cover over a dozen different use cases, but the code you use to construct features stays the same. This means it is easier to reuse previous work and reproduce results in development and production environments.
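As a rough illustration of what "the code stays the same" can look like, here is a plain-Python sketch. To be clear, this is not the Featuretools API: `build_features`, the `AGGREGATIONS` table, and both datasets are hypothetical, and the real library does considerably more. The sketch only shows one feature-construction function being reused unchanged across two unrelated use cases.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical set of reusable aggregation functions, keyed by name.
AGGREGATIONS = {"count": len, "sum": sum, "mean": mean, "max": max}


def build_features(child_rows, parent_key, value_columns, aggregations=AGGREGATIONS):
    """Aggregate child rows into one feature dict per parent id."""
    # parent id -> column name -> list of observed values
    grouped = defaultdict(lambda: defaultdict(list))
    for row in child_rows:
        for column in value_columns:
            grouped[row[parent_key]][column].append(row[column])

    # Apply every aggregation to every column, naming features like "mean(amount)".
    return {
        parent_id: {
            f"{name}({column})": func(values)
            for column, values in columns.items()
            for name, func in aggregations.items()
        }
        for parent_id, columns in grouped.items()
    }


# Use case 1 (made-up data): customer transactions.
transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# Use case 2 (made-up data): machine sensor readings.
sensor_readings = [
    {"machine_id": "a", "temperature": 70.0},
    {"machine_id": "a", "temperature": 74.0},
]

# The same call shape works for both datasets; only the arguments differ.
customer_features = build_features(transactions, "customer_id", ["amount"])
machine_features = build_features(sensor_readings, "machine_id", ["temperature"])
```

Only the table and column names change between the two calls, so work done for one project carries over to the next.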

Ultimately, building successful machine learning pipelines is not about having an approach for the problem you are working on today, but something that generalizes across all the problems you and your team will work on in the future.