by vietjtnguyen 3202 days ago
I don't really work in this domain, so maybe I'm missing something. If the goal is essentially to get the bare minimum needed to run a program into a Docker image, why not develop your program in your desired environment and then use something like CDE [1] to copy (or obtain a list of) all the files touched in the desired invocation of the program? That copy or list can then be put into a tarball and imported with "docker import". Philip Guo even writes about this possible use [2].

Here's a silly example:

  cde python -c "import numpy as np; print(np.random.randn(3, 3).tolist())"
  pushd cde-package/cde-root/; tar cavf ../../cde-image.tar *; popd
  docker import cde-image.tar $USER:python-randn33
  docker run $USER:python-randn33 python -c "import numpy as np; print(np.random.randn(3, 3).tolist())"
  docker run -t -i $USER:python-randn33 python
If you look at the resulting "cde-image.tar" you'll find it to be quite bare. Mine had only 387 entries (files and folders).

[1]: http://www.pgbovine.net/cde.html

[2]: http://pgbovine.net/automatically-create-docker-images.htm

3 comments

Probably because syscall interception is not sufficient to create robust Linux program images. It will be an awkward moment if a stat, open, etc. that the program attempts in production fails because that code path was never exercised during development or bundling. For the image to work properly, you'd have to execute every possible code path in the CDE bundling step.
So it becomes a matter of whether you can achieve good enough coverage of your execution paths to account for all possible filesystem touches? Further invocations of "cde" against the same "cde-package" folder will actually append to the "cde-root" file system copy, so if you could manage to canvass your program's execution paths, the resulting file tree copy should be sufficient?
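
To make the coverage question concrete, here is a rough Python sketch. It assumes you already have a list of touched paths per traced run; the function names and file paths are made up for illustration and are not part of CDE. The bundle is just the union of all runs, and anything a later run touches outside that union is a gap.

```python
from typing import Iterable, Set


def merge_touched(runs: Iterable[Iterable[str]]) -> Set[str]:
    """Union of every path touched across all traced runs."""
    bundled: Set[str] = set()
    for touched in runs:
        bundled.update(touched)
    return bundled


def missing_from_bundle(bundled: Set[str], later_run: Iterable[str]) -> Set[str]:
    """Paths a later run needs that no traced run ever touched."""
    return set(later_run) - bundled


# Hypothetical file lists from two traced invocations.
run_a = ["/usr/bin/python", "/usr/lib/libpython2.7.so", "/etc/localtime"]
run_b = ["/usr/bin/python", "/usr/lib/python2.7/json/__init__.py"]
bundle = merge_touched([run_a, run_b])

# A hypothetical production run that hits a code path nobody traced.
prod = ["/usr/bin/python", "/usr/lib/python2.7/csv.py"]
print(sorted(missing_from_bundle(bundle, prod)))
# prints ['/usr/lib/python2.7/csv.py']
```

The bundle is only as complete as the set of runs that fed it, so the check can only report a gap after some run has already hit it.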
You're right, it is a question of coverage of execution paths, but that's a non-trivial problem.

Have a look at the lengths that AFL uses to get even close: http://lcamtuf.coredump.cx/afl/

[tl;dr it instruments execution while using a genetic algorithm to mutate inputs, optimising for code coverage]

Statically determining dependencies is a lot easier and a lot more reliable! Particularly as you only need the base image once, and any extras on top are another layer on the Docker FS.

I'm also a fan of minimal images. CDE is an interesting solution, but for dynamic languages like Python, packing everything into a virtualenv and shipping that is a reasonable approach. To automatically grab linked libraries you can use something like smith [1].

[1]: https://github.com/oracle/smith
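
For a feel of what "grabbing linked libraries" can look like, here is a minimal Python sketch that pulls resolved library paths out of `ldd` output. This is not a description of how smith works; the helper names are made up, and the sample text is an invented example of the usual glibc `ldd` output format.

```python
import re
import subprocess
from typing import List

# "libc.so.6 => /lib/.../libc.so.6 (0x...)": name resolved to a path.
_RESOLVED = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")
# "/lib64/ld-linux-x86-64.so.2 (0x...)": absolute path with no arrow.
_ABSOLUTE = re.compile(r"^\s*(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd(output: str) -> List[str]:
    """Return the on-disk library paths named in ldd-style output.

    Lines without a filesystem path (the vdso, "not found") are skipped.
    """
    paths = []
    for line in output.splitlines():
        match = _RESOLVED.match(line) or _ABSOLUTE.match(line)
        if match:
            # The path is always the last captured group in either pattern.
            paths.append(match.group(match.lastindex))
    return paths


def linked_libraries(binary: str) -> List[str]:
    """Run ldd on a binary and parse the result (requires ldd on PATH)."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return parse_ldd(result.stdout)


# Invented sample in the usual glibc ldd format.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd1a5f2000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3c2a000000)
\tlibmissing.so.1 => not found
\t/lib64/ld-linux-x86-64.so.2 (0x00007f3c2a400000)
"""
print(parse_ldd(SAMPLE))
# prints ['/lib/x86_64-linux-gnu/libc.so.6', '/lib64/ld-linux-x86-64.so.2']
```

Note this only sees libraries recorded in the ELF headers. Anything loaded at runtime with dlopen, or imported by a Python extension module, will not show up.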

I imagine that if you created a product that could run my Python code without building a full container or image, you might call it "serverless."
Or a Unikernel