Hacker News new | ask | show | jobs
by atak1 2418 days ago
Background:

We have a team of 6 DS working on parallel projects. We tried the Java service approach. It's great for a one-time model, but very painful to iterate on.

We develop on top of Sagemaker, and since we're a funded company, can somewhat get away with the 40% price increase of an "ML Instances".

We have a mix of R/Python models. For each, we keep a separate repo with a Dockerfile, build file, and src code.

Training:

If the jobs are small, we train them locally, package assets into the container, and deploy. If it's a bigger job, we leverage Sagemaker training jobs and S3 for model storage.

Serving:

We have boilerplate web service layer with an entrypoint that DS fills in with their own code. Yes, this allows almost arbitrary code to be written, but we do force code reviews and enforce standards. Convention over configuration.

We do the feature engineering using Python/R, which when parallelized, has good enough performance (sub 200ms latency on Sagemaker prod). If we need latencies in the 1-10ms range, we'd consider refactoring the feature engineering into a separate layer written in a more performant language. It's always FE that takes the most time.

One learning from tuning Python services is: for max performance, try to push the feature engineering work onto consumers. Have them fully specify the shape of the data in a format that your models expect so you do as little feature engineering in the serving step.