by fcolas 2622 days ago
+1, tomasdpinho. Yes to everything, notably queues everywhere, versioning the models, and the trouble with mixing sync and async (go for queues).

As a scientist designing risk management systems, I also like to:

. avoid moving the data;

. bring the (ML/stats) code to the data;

. do computations in memory (when possible) to reduce latency (network+disk);

. work on live data instead of copies that drift out-of-date; and

. write software to keep models up to date, because they drift over time too, and that's a major, operationally unnoticed, and extremely costly problem.
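For the last item, a drift check can be quite small. Here is a minimal sketch, not anyone's production method: it compares a live sample of one numeric feature against the sample the model was fitted on, using the Population Stability Index. The names `psi` and `needs_refit`, the 10 bins, and the 0.2 threshold (a common rule of thumb) are all assumptions for illustration.

```python
import math
import random


def psi(reference, live, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if it is degenerate.
    width = (hi - lo) / bins or 1.0

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp so live values outside the reference range land in the edge bins.
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_refit(reference, live, threshold=0.2):
    """Hypothetical policy: flag the model for refitting when PSI exceeds the threshold."""
    return psi(reference, live) > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    fitted_on = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    today = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # simulated mean shift
    print(needs_refit(fitted_on, today))
```

Run on a schedule against live data, something like this turns silent drift into an explicit signal; what to do on a positive (refit, alert, roll back) is a separate decision.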

I'm not yet into TensorFlow/MLflow, but I use R, JS, and Postgres, thereby relying on open-source ecosystems (and packages) that are:

. as standard as possible;

. well-maintained;

. expected to be supported for a long time; and

. light on dependencies.

1 comments

+2 for bringing the (ML/stats) code to the data instead of the other way around.
Could you speak to your experience with this particular list item?
We deal with fairly large volumes of data on a frequent basis, so it would not make sense for each data scientist to create a copy within their own environment. Everyone works off a centralized data source and we provide them with Jupyter/Spark in an internal cloud environment.
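To make "code to the data" concrete, here is a toy sketch of the difference. It uses Python's built-in `sqlite3` purely as a stand-in for a central store such as Postgres; the `exposures` table, its columns, and both function names are made up for illustration.

```python
import sqlite3


def make_store():
    """Build a tiny in-memory table standing in for the central data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE exposures (desk TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO exposures VALUES (?, ?)",
        [("rates", 10.0), ("rates", 30.0), ("fx", 5.0), ("fx", 7.0), ("fx", 9.0)],
    )
    return conn


def summary_pull(conn):
    """Data to the code: every row crosses to the client, which then aggregates."""
    totals = {}
    for desk, amount in conn.execute("SELECT desk, amount FROM exposures"):
        n, s = totals.get(desk, (0, 0.0))
        totals[desk] = (n + 1, s + amount)
    return {desk: (n, s / n) for desk, (n, s) in totals.items()}


def summary_pushdown(conn):
    """Code to the data: the store aggregates; only one row per desk comes back."""
    rows = conn.execute(
        "SELECT desk, COUNT(*), AVG(amount) FROM exposures GROUP BY desk"
    )
    return {desk: (n, mean) for desk, n, mean in rows}


if __name__ == "__main__":
    conn = make_store()
    print(summary_pull(conn))
    print(summary_pushdown(conn))
```

Both functions return the same per-desk count and mean, but the second transfers a handful of aggregate rows instead of the whole table. With five rows that is irrelevant; with large volumes it is the whole point.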