| HN Mirror


by brucej 2417 days ago

For our current projects we use our open source Apache Spark framework, Arc (https://arc.tripl.ai/), for feature prep. Then, depending on the type of model, we will either:

- use built-in Spark ML models

- call a model running as a service

- write files for a model to ingest (for a legacy project)

- develop a custom plugin or UDF (for calling via SQL)

We have built-in stages for running Spark ML models in the framework, as well as HTTP and TensorFlow Serving stages for calling services. We recently ran a series of NLP models written in Python and OCaml via the HTTP stage, sending payloads in JSON or whatever other format each service needed. The text extraction via OCR (Tesseract) had been done in a prior Spark stage. This design lets us call these more custom ML models while keeping them part of a larger Spark job, so we can still use SQL and other features when needed. The services were deployed in AWS Fargate to allow for scaling. For other jobs we deploy our Arc jobs using Argo for orchestration. We spin up compute on demand rather than running inside a persistent cluster.
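To make the "call a model running as a service" pattern concrete, here is a rough sketch in plain Python of what such a call amounts to: rows are grouped into batches, each batch is sent as a JSON payload, and the predictions are joined back onto the rows. This is not Arc's implementation or configuration. The endpoint URL, the `text` field, and the `{"instances": ...}` / `{"predictions": ...}` payload shape are all assumptions made up for the example; a real service would define its own format.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List
from urllib import request

# Hypothetical endpoint; the comment above does not name one.
SERVICE_URL = "http://localhost:8080/predict"


def batches(rows: Iterable, size: int) -> Iterator[List]:
    """Group rows into lists of at most `size` items, preserving order."""
    batch: List = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_json(url: str, body: bytes) -> bytes:
    """Send one JSON payload over HTTP POST and return the raw response body."""
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return resp.read()


def score(
    rows: Iterable[Dict],
    size: int = 2,
    send: Callable[[str, bytes], bytes] = post_json,
    url: str = SERVICE_URL,
) -> List[Dict]:
    """Call the model service batch by batch and attach a prediction to each row.

    Assumed (invented) contract: the service accepts {"instances": [...]}
    and returns {"predictions": [...]} with one entry per instance.
    """
    scored: List[Dict] = []
    for batch in batches(rows, size):
        payload = json.dumps({"instances": [r["text"] for r in batch]})
        reply = json.loads(send(url, payload.encode("utf-8")))
        predictions = reply["predictions"]
        if len(predictions) != len(batch):
            raise ValueError("service returned a different number of predictions")
        for row, prediction in zip(batch, predictions):
            scored.append({**row, "prediction": prediction})
    return scored
```

The transport is passed in as `send` so the same loop can be pointed at a stub during development and at the real service in a job. In Spark the equivalent logic would typically run per partition rather than over a local list.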

For training we use Jupyter Notebooks where possible. We have a plugin that generates Arc jobs from these notebooks.

For special cases we can add custom plugins or UDFs to extend the framework. For example, I have written similar plugins to run XGBoost models in Spark.

Whilst we try to be prescriptive about the ML stack for data scientists, this approach has allowed flexibility where needed and lets different teams own their part of the job. This is particularly useful in larger teams where development is more federated.