by NegatioN 2418 days ago
We train models as Kubernetes CronJobs, each defined by a minimal per-model properties file that specifies the number of CPUs, GPUs and amount of memory. Each job starts with a given image (e.g. PyTorch or TF) chosen by where in the repository the file is placed, and then runs a user-specified bash file to start the job.
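
To make the properties-file idea concrete, here is a minimal sketch of how such a file could be turned into a CronJob manifest. This is a guess at the shape, not our actual tooling: the keys (`cpus`, `gpus`, `mem`, `schedule`, `script`), the directory-to-image mapping, the registry name and the `nvidia.com/gpu` resource are all assumptions for illustration, and older clusters would need `batch/v1beta1` instead of `batch/v1`.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level repository directory to base image.
IMAGES = {
    "pytorch": "example.registry/pytorch:latest",
    "tf": "example.registry/tensorflow:latest",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def image_for_path(path):
    """Pick the image from the first directory component of the file path."""
    return IMAGES[PurePosixPath(path).parts[0]]


def build_cronjob(name, image, props):
    """Build a CronJob manifest (as a plain dict) from parsed properties."""
    resources = {"cpu": props["cpus"], "memory": props["mem"]}
    gpus = int(props.get("gpus", "0"))
    if gpus > 0:
        # Assumes the NVIDIA device plugin's resource name.
        resources["nvidia.com/gpu"] = str(gpus)
    container = {
        "name": name,
        "image": image,
        "command": ["bash", props["script"]],  # user-specified bash file
        "resources": {"requests": dict(resources), "limits": dict(resources)},
    }
    pod_spec = {"containers": [container], "restartPolicy": "Never"}
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": props["schedule"],
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


example = """
# pytorch/my-model/job.properties (hypothetical)
cpus=4
gpus=1
mem=16Gi
schedule=0 3 * * *
script=train.sh
"""
props = parse_properties(example)
job = build_cronjob(
    "my-model", image_for_path("pytorch/my-model/job.properties"), props
)
```

The resulting dict can be dumped to YAML or JSON and applied to the cluster; keeping the per-model file this small is what lets the image and scheduling defaults live in one place.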

Data scientists have a similar Docker image running in Kubernetes, which includes all of these images as conda environments for experimenting in prod-like environments. Spark is used to fetch data for the most part.

Models report a finished state over Kafka after being persisted to buckets in Google Cloud; they then get mirrored over to a Ceph cluster connected to our serving Kubernetes cluster.

We have an in-house Golang server binding to C++ for serving PyTorch neural nets persisted with the torch.jit API (I can really recommend this for hassle-free model serving). We also have some Java apps for serving normal ALS or Annoy based models.

Our traffic is not as wild as many here, but we're serving around 10M user requests a day.

We also merge the results from several models, joining them with a separate "meta-model" that estimates which model the user has had a preference for recently, so that those models' results get weighted up.
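
As a rough illustration of that merging step, the sketch below blends per-model scores using per-model weights. It assumes the meta-model has already produced one non-negative weight per model for the current user and that scores from different models are on comparable scales; the function name, the model names and the weighted-sum blend are hypothetical, not a description of the real system.

```python
def merge_recommendations(model_scores, model_weights, top_k=10):
    """Blend item scores from several models into one ranked list.

    model_scores:  {model_name: {item_id: score}}
    model_weights: {model_name: weight}, e.g. from a meta-model that
                   estimates which models the user has preferred recently.
    """
    total = sum(model_weights.get(m, 0.0) for m in model_scores)
    if total <= 0:
        raise ValueError("at least one model needs a positive weight")

    merged = {}
    for model, scores in model_scores.items():
        weight = model_weights.get(model, 0.0) / total  # normalise to sum 1
        for item, score in scores.items():
            merged[item] = merged.get(item, 0.0) + weight * score

    # Highest blended score first; ties broken by item id for determinism.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


scores = {
    "als": {"a": 1.0, "b": 0.5},
    "nn": {"b": 1.0, "c": 0.8},
}
weights = {"als": 1.0, "nn": 3.0}  # user has recently favoured the "nn" model
ranking = merge_recommendations(scores, weights)
```

Here item "b" ends up on top because both models suggest it and the more heavily weighted model scores it highest.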

There are probably a lot of details left out here, especially about the serving part, since we have various services in front of the models enriching data and presenting it to the user, but that's the gist of it.