Hacker News new | ask | show | jobs
by neonfreon 4076 days ago
I don't think there is any way to prove that this found all the required files. The more paths through the code, each with its own potential file accesses that can't be predicted with out run time information, the more likely one will be missed in this optimization stage.
3 comments

Variation on the halting problem? Given infinite running time and arbitrary input, can you prove that a program will never access file X.
Agreed that this seems equivalent to me. Assume instead of the instruction "access file X", the instruction is HALT.
There's no general algorithm, but you could probably prove it, if you tried really hard for your given example. :D
I'm also interested if it's possible to say so with certainty.
You have runtime information though. It's true that this method will not find all things (dlopened files in strange codepaths), but just like we have tools that can verify 100% code coverage in tests, you could fuzz inputs until you find that you've hit every single branch of the executable's instructions and record all dependencies as you go.

You could argue that that can still be fooled by, e.g., making the software dlopen the argument given to it at which point that codepath would have different dependencies each time it was hit, but that argument quickly devolves. That same argument says that when I run `ls /tmp/file` that makes `/tmp/file` a dependency of ls and thus I must include every file in the image else it will have different behavior.

I think intelligent fuzzing + high branch coverage can prove that you have found all required files.

I don't think you can ever prove that you've found the required files for an arbitrary binary. (I especially have a hard time believing that such a proof would involve fuzzing, which is random.) However, it seems reasonable that you would be able to achieve a high enough level of confidence that this technique would be viable.
You cannot. It reduces to the halting problem, relatively trivially:

    <arbitrary code that cannot open foo.txt>
    do something with foo.txt
This will use foo.txt iff said code halts.

You can, however, prove that you've found a superset of the required files for an arbitrary binary. Or prove that you've found the required files for some, but not all, arbitrary binaries.

You're wrong.

You cannot say you haven't found all the dependencies, but you can say you have found all the dependencies (given the constraints I placed above).

The halting problem only says that you cannot prove that a given program will halt.

However, you can prove a specific program halts if, in fact, that program halts.

The original question was not "can prove that I can find the dependencies for an arbitrary binary", but "can you prove that all the dependencies were found for a single specific binary".

For some program that has an infinite loop you can say "I don't know if I've found everything", but if you have shown that you have hit every code branch, as I said above, then clearly this program both halts and has had all dependencies found, excepting different behavior for user input within those already explored branches.

I was thinking the same thing during the article, and the author says as much in the 'Last Thought', and doesn't recommend using this for production purposes. The footnotes say that this was more of an exercise in the syscall than in docker.