| HN Mirror


	by programnature 3581 days ago
	While it's useful to have this kind of info, IMHO it's still far from 'infrastructure for deep learning'. What about model versioning? What about deployment environments? We need to address the whole lifecycle, not just the 'training' bit. This is a huge and underserved part of the problem because people tend to be satisfied with having one model that's good enough to publish.

5 comments

tlb 3581 days ago

Indeed, deployment is a whole set of interesting issues. We haven't deployed any learned models in production yet at OpenAI, so it's not at the top of our list.

If the data and models were small and training was quick (on the order of compilation time), I'd just keep the training data in git and train the model from scratch every time I run make. But the data is huge, training requires clusters of machines and can take days, so you need a pipeline.

An industrial strength system looks like this: https://code.facebook.com/posts/1072626246134461/introducing...
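
A minimal sketch of the make-style idea above, for the small-data case only: retrain when the data or hyperparameters change, otherwise reuse the cached model. The names `fingerprint`, `train_if_stale`, and `train_fn` are hypothetical, and hashing whole files in memory is an assumption that breaks down at exactly the scale where a real pipeline is needed.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(data_files, config):
    """Hash the training inputs: file paths, file contents, and hyperparameters."""
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in data_files):
        digest.update(path.encode())
        digest.update(Path(path).read_bytes())  # fine for small data only
    digest.update(json.dumps(config, sort_keys=True).encode())
    return digest.hexdigest()


def train_if_stale(data_files, config, cache_dir, train_fn):
    """Run train_fn only when no cached model matches the current inputs.

    train_fn is a hypothetical callable returning the serialized model as bytes.
    Returns (model_path, retrained).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    model_path = cache_dir / (fingerprint(data_files, config) + ".model")
    if model_path.exists():
        return model_path, False  # inputs unchanged: reuse the cached model
    model_path.write_bytes(train_fn(data_files, config))
    return model_path, True
```

This is the same staleness check make performs, except keyed on content hashes rather than file timestamps.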

platypii 3581 days ago

CTO of Algorithmia here. We've spent a lot of time thinking about the issues of deploying deep learning models. There is a whole set of challenges that crop up when trying to scale these kinds of deployments (not least of which is trying to manage GPU memory).

It would be interesting to compare notes since we have deployed a number of models in production, and seem to focus on a related but different set of challenges. kenny at company dot com.

programnature 3581 days ago

Yes, understandable. I encourage viewing this as part of the 'open' mandate.

agibsonccc 3581 days ago

When you're thinking of "deployment" here, wouldn't it make sense to use Google Compute Engine for this?

I'd be curious to see if there's a legit speedup there with the "real tensorflow".

For "on prem" stuff I think "deployment" is going to depend on the actual end use case.

E.g., no one in industry will keep their "training data" in git. They'd have an actual database with other systems surrounding it.

If it's just "run the model locally to view a web page running in a docker container", I wouldn't see the problem here though.

The infra will also be different for training vs. inference. For training you'll want GPUs, but it's not realistic to run GPUs for inference yet.

I'd love someone to comment on https://developer.nvidia.com/gpu-inference-engine though.

There's going to be a lot of non-deep-learning "stuff" involved here.

Much of it will be connected to the use case. E.g., deep learning for log analytics in production will be different from a computer vision pipeline.

Warning: highly biased player in the space.

ymt123 3581 days ago

Have you tried Sacred[1]? It definitely doesn't answer the "infrastructure for deep learning" challenge, but it is helpful for understanding what experiments have been run and where a given model came from (including what version of the code and which parameters produced it).

[1] https://github.com/IDSIA/sacred
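
To make the "where did this model come from" idea concrete, here is a minimal stdlib-only sketch. It is not Sacred's API; `current_commit`, `record_run`, and the JSON-lines log format are hypothetical names and choices made for illustration.

```python
import json
import subprocess
import time


def current_commit():
    """Return the checked-out git commit hash, or None outside a repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing or not inside a repository


def record_run(log_path, params, metrics):
    """Append one JSON line describing how a model was produced."""
    entry = {
        "timestamp": time.time(),
        "commit": current_commit(),  # code version that produced the model
        "params": params,            # hyperparameters used for the run
        "metrics": metrics,          # whatever the run reported
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

Calling `record_run("runs.jsonl", {"lr": 0.01}, {"accuracy": 0.9})` at the end of a training script leaves an append-only history that can be searched later; the values in that call are made up.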

asimuvPR 3581 days ago

So true. I've been doodling some tools to somehow manage all of it. So far I only have git-like approaches to models and chef-like approaches to infrastructure. I hope to somehow bring it all together into a docker-like package that can be deployed without much hassle.

daveguy 3581 days ago

You might want to check out Pachyderm -- that is essentially what they are trying to do (analytics infrastructure support; it isn't specific to machine learning):

http://www.pachyderm.io/

asimuvPR 3581 days ago

I had forgotten about them. Thanks for posting the link.

vonnik 3581 days ago

Fwiw, we're testing a Dockerized distro of DL4J. Runs on DCOS.

https://imgur.com/a/CDTAc

https://imgur.com/a/6jlxi

We'll release in the coming weeks.

kyloon 3581 days ago

In terms of deploying trained models, you can probably get away with using TensorFlow Serving and letting Kubernetes handle the orchestration and scaling part of the job. I do agree that there is certainly a need for a tool that glues all these different bits and pieces together to improve the process of taking a model from development to production.
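
As a rough illustration of the client side, here is a sketch that queries a model through TensorFlow Serving's REST predict endpoint. It assumes a TensorFlow Serving version that ships the REST API (it was added after this thread was written) listening on its default REST port 8501; the host and model name are placeholders, and `build_predict_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, port=8501):
    """Build a POST request for the REST predict endpoint of a served model."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(host, model_name, instances, timeout=5.0):
    """Send the request and return the 'predictions' field of the response."""
    request = build_predict_request(host, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

With Kubernetes in front, `host` would typically be the Service name fronting the serving pods, so scaling replicas up or down does not change the client code.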

turinturambar 3581 days ago

Agreed. A very interesting and thoughtful post, but I think that you are right that OpenAI's primary use cases seem to be (unsurprisingly) academic research and rapid prototyping of new ideas. These emphasize a very different set of problems than, say, deploying something in production or as a service.

Thus, this post seems immensely useful to someone like me (a PhD student, also primarily concerned with exploring new ideas and getting my next conference paper), but I can see how others doing machine learning in-the-wild or in production might see a lot of questions left unanswered. I, for one, work primarily with health care data from hospital EHRs, and I spend a lot more time on data prep pipelines than folks working with, say, MNIST.
