| HN Mirror


by fcolas 2622 days ago

+1, tomasdpinho. Yes to everything, notably queues everywhere, versioning the models, and the problem of mixing sync and async (go for queues).

As a scientist designing risk management systems, I also like to:

. avoid moving the data;

. bring the (ML/stats) code to the data;

. run computations in memory (when possible) to reduce latency (network + disk);

. work on live data instead of copies that drift out-of-date; and

. write software to keep models up to date, because they drift over time too; that is a major, operationally unnoticed, and extremely costly problem.
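
The last item can be illustrated with a small drift check. This is only a minimal sketch, assuming drift is measured with the population stability index (PSI) over equal-width bins built from a reference sample; the function name, the bin count, the `eps` floor, and the 0.25 alert threshold are illustrative assumptions, not part of any system described here.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a live sample (hypothetical helper)."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def shares(sample: Sequence[float]) -> list:
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(sample), eps) for c in counts]

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Illustrative data: the live sample is the reference shifted by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff; tune per model
if population_stability_index(reference, shifted) > ALERT_THRESHOLD:
    print("input distribution has drifted; consider refitting the model")
```

Run on a schedule against live inputs, a check like this makes drift visible operationally instead of leaving it to be discovered after the fact.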

I'm not yet into TensorFlow/MLflow, but I use R, JS, and Postgres, and so rely on open-source ecosystems (and packages) that are:

. as standard as possible;

. well-maintained;

. likely to be supported for a long time; and

. dependent on as few other packages as possible.


tixocloud 2622 days ago

+2 for bringing the (ML/stats) code to the data instead of the other way around


thiggy 2620 days ago

Could you speak to your experience with this particular list item?


tixocloud 2619 days ago

We frequently deal with fairly large volumes of data, so it would not make sense for each data scientist to create a copy in their own environment. Everyone works off a centralized data source, and we provide them with Jupyter/Spark in an internal cloud environment.
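
As a rough illustration of working against the central source rather than a private copy, the sketch below pushes an aggregation into the database so that only the summary leaves it. It assumes an in-memory SQLite database as a stand-in for the real store; the `trades` table, its columns, and the `mean_pnl_by_desk` helper are hypothetical names made up for the example.

```python
import sqlite3

# Stand-in for the shared, centralized data source (assumption: a SQL store).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (desk TEXT, pnl REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("rates", 1.5), ("rates", -0.5), ("fx", 2.0), ("fx", 4.0)],
)


def mean_pnl_by_desk(connection: sqlite3.Connection) -> dict:
    """Aggregate inside the database; only one row per desk is transferred."""
    rows = connection.execute(
        "SELECT desk, AVG(pnl), COUNT(*) FROM trades GROUP BY desk ORDER BY desk"
    ).fetchall()
    return {desk: (avg, n) for desk, avg, n in rows}


summary = mean_pnl_by_desk(conn)
for desk, (avg, n) in summary.items():
    print(desk, avg, n)
```

The alternative, `SELECT * FROM trades` followed by averaging on the client, gives the same numbers but moves every row over the network and leaves a local copy that can go stale.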
