by bertil 2620 days ago
The one I’m working with _now_ is very low tech: a daily Python job that processes data from GCP and writes back to GCP, plus a handful of scripts that check everything is reasonable. That’s because we serve internal results, mostly read by humans.

The most impressive one that I’ve seen run live is described here: https://towardsdatascience.com/rendezvous-architecture-for-d...

I’d love to have feedback from more than Jan because I’m planning on encouraging it internally.

The best structure that I’ve seen at scale (at a top 10 company) was:

- a service that hosted all models, written in Python or R, stored as Java objects (built with a public H2O library);

- Data scientists could upload models by drag-and-drop on a bare-bones internal page;

- each model was versioned (data and training not separate) by name, using a basic folders/namespace/default version increment approach;

- all models were run using Kubernetes containers; each model used a standard API call to serve individual inferences;

- models could use other models’ output as input, taking the parent-model inputs as their own in the API;

- to simplify that, most models were encouraged to use a session id or user id as their single entry, and most inputs were gathered from an all-encompassing live storage, connected to that model-serving structure;

- every model had extensive monitoring of the distribution of inputs (especially missing values), of outputs, and of response delay, to make sure they matched expectations from the trained model;

e.g.: if you are training a fraud model, and more than 10% of output in the last hour was positive, warn the DS to check and consider calling security;

e.g.: if more than 5% of “previous page looked at” are empty, there’s probably a pipeline issue;

- there were some issues with feature engineering: I feel like the solution chosen was suboptimal because it created two data pipelines, one for live and one for storage/training.

For that problem, I’d recommend that architecture instead: https://www.datasciencefestival.com/video/dsf-day-4-trainlin...
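The two monitoring rules given as examples in the list above can be sketched in a few lines of Python. This is only an illustration, not what that company ran: the function names and the shape of the inputs are made up here, and only the 10% and 5% thresholds come from the examples.

```python
from typing import List, Optional, Sequence


def positive_rate(outputs: Sequence[bool]) -> float:
    """Share of positive predictions in one window (e.g. the last hour)."""
    return sum(outputs) / len(outputs) if outputs else 0.0


def missing_rate(values: Sequence[Optional[str]]) -> float:
    """Share of empty or missing values for one input feature."""
    if not values:
        return 0.0
    return sum(1 for v in values if v is None or v == "") / len(values)


def check_window(outputs: Sequence[bool],
                 previous_pages: Sequence[Optional[str]],
                 max_positive: float = 0.10,
                 max_missing: float = 0.05) -> List[str]:
    """Return the warnings raised by one monitoring window (hypothetical rules)."""
    warnings = []
    if positive_rate(outputs) > max_positive:
        warnings.append("fraud positives above threshold: warn the DS")
    if missing_rate(previous_pages) > max_missing:
        warnings.append("too many empty 'previous page' values: check the pipeline")
    return warnings
```

In practice the thresholds would come from what the model saw in training rather than being hard-coded.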


What does "model" mean? What kind of data are contained in a machine learning model? Second, how do you decide a model is robust? I'm asking because I'm looking at using ML to more efficiently use some quality assurance tools for a product line. The idea is to develop a model such that products A, B, C can have existing (or supplementary) QA data plugged into a model, and then an appropriate sampling plan can be output.

An intern showed proof of concept of such a model based on one product, and it's fantastic work that could save thousands of dollars, but we're struggling with how to "qualify" it. How do we know we won't get a "garbage in/garbage out" situation?

So you want to figure out how often you need to sample products for QA?

A model is two things: a description of what's in the black box (could be a linear model, a neural network architecture, etc) and some weights which uniquely define "that specific model". Each model will have some known input (eg image, tabular data) and output (eg number, image, list etc).

You need to store both the structure and weights: for example your model is y = mx + c, but you need to know m and c to uniquely define it.
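As a rough sketch of "structure plus weights" for the y = mx + c case, here is one way to save and reload such a model in Python. The class name, the JSON layout and the `"linear"` label are assumptions made for the example, not a standard format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LinearModel:
    """Structure: y = m * x + c. Weights: the values of m and c."""
    m: float
    c: float

    def predict(self, x: float) -> float:
        return self.m * x + self.c

    def to_json(self) -> str:
        # Store a label for the structure together with the weights.
        return json.dumps({"structure": "linear", "weights": asdict(self)})

    @classmethod
    def from_json(cls, payload: str) -> "LinearModel":
        data = json.loads(payload)
        if data["structure"] != "linear":
            raise ValueError("not a linear model")
        return cls(**data["weights"])


model = LinearModel(m=2.0, c=1.0)
restored = LinearModel.from_json(model.to_json())
```

Without the weights you only know it is "a line"; without the structure the two numbers mean nothing.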

To answer your second question, robustness comes down to a smart test strategy: train on representative data, validate during training on a second data set, and test on hold-out data that the model has never seen.
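A minimal sketch of that three-way split in plain Python, assuming the rows are independent and can be shuffled (which is not true for time-ordered QA data, for instance). The function name and the 70/15/15 shares are arbitrary choices for the example.

```python
import random


def three_way_split(rows, val_share=0.15, test_share=0.15, seed=0):
    """Shuffle once, then cut into train / validation / hold-out test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    n_test = round(len(rows) * test_share)
    n_val = round(len(rows) * val_share)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
```

The hold-out set should only be used once, at the very end; if you tune against it, it stops being unseen data.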

Unfortunately it's quite hard to prove model robustness (in the case of deep learning anyway); you have to assume that you trained on enough realistic data.

If you really have no idea about robustness, then you should probably do a kind of soft-launch. Run your model in production alongside what you currently use, and see whether the output makes sense.

You could try, for example, sampling with your current strategy as well as the schedule defined by your ML model (so you lose nothing but a bit of time if the ML system is crap). Then compare the two datasets and see whether the ML model is at least performing the same as your current method.
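A rough sketch of that side-by-side comparison, assuming you log which unit ids each plan inspected and which units turned out to be defective under either inspection. All names here are hypothetical.

```python
def defects_caught(sampled_ids, defective_ids):
    """Number of known defective units that a sampling plan inspected."""
    return len(set(sampled_ids) & set(defective_ids))


def compare_plans(current_sample, model_sample, defective_ids):
    """Summarise effort and catches for the current plan and the model's plan."""
    return {
        "current": {
            "inspected": len(set(current_sample)),
            "caught": defects_caught(current_sample, defective_ids),
        },
        "model": {
            "inspected": len(set(model_sample)),
            "caught": defects_caught(model_sample, defective_ids),
        },
    }


summary = compare_plans(
    current_sample=[1, 2, 3, 4],
    model_sample=[2, 5],
    defective_ids=[2, 5, 9],
)
```

Note that defects in units neither plan inspected stay invisible, so this only tells you how the two plans compare with each other, not how many defects exist.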

Surely you can make some naive estimates of robustness though? eg if the model says sample 5% of your product, you then have a bound on the chance that you miss something.
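One naive version of that bound, as a sketch: if units are sampled uniformly at random without replacement, and an inspected defective unit is always detected (both assumptions, and a model-driven plan is usually not uniform), the chance of missing every defect in a batch is hypergeometric. The function name is made up.

```python
from math import comb


def miss_probability(batch_size: int, defects: int, sample_size: int) -> float:
    """Chance that a uniform random sample contains none of the defective units."""
    if not 0 <= sample_size <= batch_size:
        raise ValueError("sample_size must be between 0 and batch_size")
    if not 0 <= defects <= batch_size:
        raise ValueError("defects must be between 0 and batch_size")
    # Ways to pick the sample from good units only, over all ways to pick it.
    return comb(batch_size - defects, sample_size) / comb(batch_size, sample_size)


# Sampling 5 units out of 100 when exactly one unit is defective.
p_miss = miss_probability(batch_size=100, defects=1, sample_size=5)
```

With a single defective unit in 100 and a 5% sample, that comes out to a 95% chance of missing it, which is why the bound is only useful together with an estimate of the defect rate.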

1. What I’m working on at the moment is AB-testing, so no real models there; plenty of simulations and tests though.

2. There are several videos of Jan describing his work, including that one, so I’ll let him give examples of what he means by models: https://www.datasciencefestival.com/video/dsf-day-2-jan-teic...

3. At the big company, it’s an e-commerce website with many products along many dimensions, so models about what aspect of the product customers would be interested in, whether they are likely to commit to purchasing now or just browsing; price sensitivity against other factors. They typically have non-authenticated users, so they have to guess a lot about the users, from time of day, country of connection, type of device used, browsing rhythm — the inferences are not perfect, but they inform how the product is presented, and have a meaningful impact on conversion.

4. In the presentation at Trainline, they are not explicit about what models they have in mind, but it’s also an e-commerce company, so a lot of similar decisions.

One unique problem they had talked about openly before (UK train companies are not really reactive but British people love their festivals, championship matches, protests, horse- and dog-races and drinking during all of the above): they deal with the occasional crowded train, so they are trying to predict if a train is going to be swamped and if the person booking is going to the event in question. In the latter case, they’d rather avoid the loud fans or drunken top-hatted horse-owners.

For all of the above models, the models are trying to predict something that they can have ground-truth about (typically: buying behaviour), often based on data obtained minutes later. That means all are monitoring the model accuracy, typically off-line. In most cases, they are also monitoring the impact of the use of the model: better recommendations should lead to better conversion, but also, say, a higher MRR.