| > AMA! I have a few questions. Hermit's README says: > Instead, in order to provide complete determinism, the user should provide a fixed file system base image (e.g., with Docker) How are you sanitizing the result of stat(), for example? I'm guessing you're already aware of this, but fixed file system images are not sufficient to guarantee deterministic behavior by filesystems. Specifically, inode numbers are not guaranteed to be assigned in any order. So how do you return deterministic inode numbers in stat() calls? I'm sure you're also aware of this, but other filesystem results are also not guaranteed to be deterministic. For example, the `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens from the user-space point of view except time passing (this happens in ZFS, for example, due to delayed allocation). statvfs() is another problematic call which would have to be heavily sanitized. As another example, there are filesystems that generate a random seed for every directory that is created, to prevent applications from exploiting hash table weaknesses and causing a denial-of-service when creating many directory entries that hash to the same value. This random seed can have the side effect of readdir() calls returning directory entries in a different order based on the seed, even if the directory was created in the exact same way (with identical directory entries, created in the same order). It seems like this would require reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry. Similarly, telldir() / seekdir() can also be problematic because the cookie value returned by the filesystem in telldir() is opaque and would be different for different directory seed values. This seems like it would require Hermit to maintain a full in-memory view of all the directory entries of a directory that is opened and is being read, which also means that this in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process and calls that modify the directory by another process (Edit: I guess Hermit can synchronize this view easily since it can assume it has full control of all processes inside the container). It also seems like Hermit would also need to re-read the entire directory when rewinddir() is called, to avoid introducing non-previously-existing concurrency bugs like described in the above paragraph (Edit: again, this is probably not necessary since Hermit can maintain a synchronized view of a directory on its own). Reading the directory also seems to imply that you'd have to assign deterministic inode numbers for all directory entries, which also seems non-trivial. Do you think this is an accurate assessment of the situation? How does Hermit handle all of this? Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this? Last question: how do you sanitize the RDTSC CPU instruction? |
> How are you sanitizing the result of stat(), for example?
Ah, that part's the same as it was described in the ASPLOS'20 paper (https://dl.acm.org/doi/10.1145/3373376.3378519). Briefly, we present a somewhat sanitized version of the file system. Like if you do `hermit run -- ls -l`, you'll see that you are root, and everything owned by you is owned by root (and everything else is owned by nfsnobody currently). The file mod/access times are all set to whatever you provide for `--epoch`, because we think you mainly care about the file system contents, not the mod times. (Mod times do CHANGE as execution progresses though, otherwise programs like `make` become confused.)
> fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.
Indeed! Hence we present only the sanitized view of the filesystem, so that we can treat each filesystem as its abstract contents (files as bitstrings, and directories as sorted lists of files). For inodes, we translate to and from virtual inodes, rather than reveal the real inodes of the file system.
If there's a flaw in this abstraction of the file system... we'd love to find it. Mainly we've been using it with zfs and Meta's "edenfs" virtual file system so far.
> `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens
Yes, now you feel our pain. We've been plugging these kinds of things, and there are surely some we missed. We report boring constant values where we can get away with it (for st_blksize). Irreproducible "counter example" programs are a welcome form of contribution ;-)!
> As another example, there are filesystems that generate a random seed
Ah, that's an interesting one. Since we determinize (only) userspace, any random bytes that get into guest memory are deterministic (getrandom, /dev/random, etc), but internal nondeterminism in the filesystem we won't see, and instead we'll rely on sanitizing the way the filesystem appears to userspace. But if it just affects order, we should be ok, because we sort before returning from the getdents syscall.
> reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.
Yes, unfortunately. That's not great for performance, and we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.
> telldir() / seekdir()
Well these (and readdir) are not syscalls. So if we've handled getdents correctly we should be ok here. But this is a good suggestion for tests we need to add that stress these!
> in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process
Yes, we have a "non-interferene" assumption that either you're running with a container-private file system (e.g. inside docker) or none of your directories are concurrently messed with by other processes outside the container.
> assign deterministic inode numbers for all directory entries, which also seems non-trivial.
Yep, that's what we do!
> Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?
We require a certain level of good behavior by the program. To be practical, we aim to work for realistic programs but not necessarily adversarial ones. One way that you can break our sandboxing, for example, is to run CPUID, learn that the processor does not support instruction X, but then execute X anyway. For example, we can trap RDTSC in userspace, but not RDRAND.
If someone wants to use Hermit for, say, reverse engineering malware, then we need a Reverie backend that is hardened by emulating/rewriting the instruction stream carefully to protect against unsupported instructions or undefined conditions.
> Last question: how do you sanitize the RDTSC CPU instruction?
That one we can trap in userspace and then we return deterministic virtual time. Currently that time is a linear combination of the branches and system calls executed by all threads up to the current moment in time. For example, if you do RDTSC in a loop, you will see time advancing.
But using rdtsc to recreate fine-grained timers and implement side-channel attacks is impossible under Hermit. (Side channel attacks in general are only possible if first breaking the sandboxing in some way, and we don't know of a specific attack that will do the trick yet.)