by dandermotj 3432 days ago
I'm really looking forward to seeing the scientific community adopt Docker as a way to distribute reproducible research and coursework.

MIT 6.S094 has a Dockerfile[^1] that contains all the software required for taking part in the class. This is a huge boon for getting stuck into the class and its coursework.

[^1]: http://selfdrivingcars.mit.edu/files/Dockerfile

3 comments

Most of the excitement that I've seen in the HPC scientific world has been around Singularity [1] containers. In particular, the main advantage seems to be that processes keep running as non-privileged users. This lets these containers be integrated with existing HPC clusters much more easily.

[1] http://singularity.lbl.gov/

How is publishing a Dockerfile even remotely reproducible? Almost every Dockerfile is a series of apt-get install, or yum install or pip install commands. How do I know what versions of packages I am downloading or whether they will even be available to download if I build from this Dockerfile, say two months from now?

IMHO, every Dockerfile has left-pad written all over it.

Good question.

Reproducibility is all about the starting point. Computers are physical machines, so if your computation needs high entropy from some random source and the next run does not have enough entropy available, your experiment may fail. But that really is a corner case. A Docker image captures the state of the starting point (installed packages, shell configuration such as bashrc, etc.; the kernel itself comes from the host), and that state can be kept under version control. It is as if someone gave you a copy of a VirtualBox image.

So how do we lock down?

1) When you start with a Dockerfile, specify the version of the packages you are installing

2) When you want to reproduce, you can rebuild an image with that Dockerfile.

3) But most people are just going to use your image, which stays the same whether they pull it now or next year. Building an image != launching a container from an image.
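
As a rough sketch of point 1, here is one way you might check a Dockerfile for pip packages that are not pinned to an exact version. The function name `unpinned_pip_packages` and the example Dockerfile are made up for illustration. It only looks at `pip install` / `pip3 install` inside `RUN` lines, and it does not understand options that take an argument (e.g. `-r requirements.txt`) or apt-get/yum pins, so treat it as a starting point rather than a real linter.

```python
import re


def unpinned_pip_packages(dockerfile_text):
    """Return pip requirements in RUN lines that lack an exact '==' pin."""
    unpinned = []
    # Join backslash-continued lines so each instruction sits on one line.
    joined = re.sub(r"\\\s*\n", " ", dockerfile_text)
    for line in joined.splitlines():
        line = line.strip()
        if not line.upper().startswith("RUN "):
            continue
        # A RUN instruction may chain several shell commands with && or ;
        for command in re.split(r"&&|;", line[4:]):
            tokens = command.split()
            if len(tokens) < 2:
                continue
            if tokens[0] not in ("pip", "pip3") or tokens[1] != "install":
                continue
            for token in tokens[2:]:
                if token.startswith("-"):
                    continue  # skip flags such as --no-cache-dir
                if "==" not in token:
                    unpinned.append(token)
    return unpinned


# Hypothetical Dockerfile: numpy is pinned, scipy and matplotlib are not.
example = """FROM ubuntu:16.04
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install numpy==1.11.3 scipy \\
    matplotlib
"""

print(unpinned_pip_packages(example))  # ['scipy', 'matplotlib']
```

Anything this reports is a package whose version could silently change between two builds of the same Dockerfile, which is exactly the concern raised in the parent comment.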

Currently a lot of research is computed using Condor to schedule jobs, and yes, those jobs span multiple machines, much like a Jenkins master/slave setup. It's been a go-to for a lot of HPC research.

There have been some individual efforts to integrate Docker with Condor (after all, both are just processes running on some host machine).