Hacker News new | ask | show | jobs
by jasonwhite 1299 days ago
TL;DR: This is a Rust project that forces deterministic execution of arbitrary programs and acts like a reproducible container. That is, it hermetically isolates the program from sources of non-determinism such as time, thread interleavings, random number generation, etc. Guaranteed determinism is a powerful tool and it serves as a basis for a number of applications, including concurrency stress testing, record/replay, reproducible builds, automatic diagnosis of concurrency bugs, and more.

I've been on the team working on this project over the past ~2 years. AMA!

Here is the GitHub repository: https://github.com/facebookexperimental/hermit

11 comments

Nice. Some questions:

- How does this compare with rr?

- Out of curiosity, have you ever looked at libTAS? (I realize it has a very different intended use case.)

- Have you had an issues with behavior differences between CPUs? I know there is architecture-level undefined behavior that can differ between CPUs; on the other hand, it sounds like you’re primarily interested in running well-behaved executables that wouldn’t intentionally invoke such behavior.

- We've taken a lot of inspiration from RR and it is very good at what it does (record/replay + randomizing thread schedules). This project diverges a bit from RR in that we focus more on the concurrency testing application of determinism. I wrote the toy record/replay system that sits on top of the determinism layer (so that we don't have to record as much stuff), but it's a long way from being complete.

- I haven't seen libTAS until now, so I can't comment on it much. At first glance, it does look similar!

- See @rrnewton's reply on this one.

(1) rr [formerly Mozilla rr]

We're big fans of rr!

Hermit is different in that creating a deterministic OS semantics is different than recording whatever nondeterministic behavior occurs under normal Linux. BUT, there's a lot of overlap. And indeed `hermit record` is straight up RnR (record & replay).

But hermit for RnR but is not nearly as developed as rr. We integrate with gdb/lldb as an (RSP) debugger backend, just like rr. Any failing execution you can create with hermit, you can attach a debugger. But our support is very preliminary, and you'll probably find rough edges. Also, we don't support backwards stepping yet (except by running again).

If we invest more in using Hermit as a debugger (rather than for finding and analyzing concurrency bugs), then there should be some advantages over traditional RnR. These would relate to the fact that deterministically executing is different than recording. For example, process and thread IDs, and memory addresses all stay the same across multiple runs of the program, even as you begin adding printfs and modifying the program to fix the bug. With traditional RnR, you can play the same recording as many times as you like, but as soon as you take a second recording all bets are off wrt what is the same or different compared to the prior recording. (That includes losing the "mental state" of things like tids & memory addresses, which is a good point Robert O Callahan makes about the benefits of RnR when accessing the same recording multiple times.)

(2) libTAS - no we haven't! Checking it out now.

(3) Yes, definitely issues with CPU portability.

In general, we are interested in not just determinism on the same machine, but portability between machines in our fleet. As with any tech company that uses the cloud, at Meta people are usually trying to debug an issue on a different machine than where the problem occurred. I.e. taking a crash from a production or CI machine to a local dev machine.

The way we do this is that we mostly report a fairly old CPU to the guest, which disables certain features IF the guest is well behaved.

With the current processor tech, I don't think there's any way we can stop an adversarial program, which, for example, would execute CPUID, find that RDRAND is not supported on the processor, but then execute RDRAND anyway. We could build a much more invasive binary-instrumentation based emulator that would be able to enforce these kinds of rules at the instruction granularity, but it would have higher overhead, especially startup overhead. The nice thing about Reverie though is that we (or others) can add different instrumentation backends while keeping the same programming instrumentation API. So we could have a "hardened" backend that was more about sandboxing and reverse-engineering adversarial software, making a different tradeoff with respect to performance overhead.

Very cool to see the stuff you e been working on become public!
Note that this is a follow-on project from the earlier Dettrace system, which was applied mainly to reproducible builds (as in the academic paper, https://dl.acm.org/doi/10.1145/3373376.3378519, and presented to the Debian Reproducible Builds summit):

- https://github.com/dettrace/dettrace

And one cool part of it is this Rust program instrumentation layer:

- https://github.com/facebookexperimental/reverie

It's good for building OS-emulator style projects or tracing tools.

Sorry, just an additional question as this project is so interesting:

Does Hermit support running different processes in parallel in certain situations?

It seems like this should be possible to do as long as they are appropriately isolated from each other.

Specifically, if two processes don't share any writable memory mappings then I'm thinking it might be possible for Hermit to exploit the underlying existing isolation between them so that they could run code in-between system calls in parallel (while still making sure that the actual system calls happen in a deterministic order).

Perhaps it would even be possible for the two processes to execute system calls in parallel if they are unrelated and they do not affect deterministic execution (such as if they are doing I/O to different files or directories, or one process doing I/O while another is doing something completely unrelated like getting the time, etc).

Although I guess this latter point (of executing syscalls in parallel) would require doing rewind/replay because it wouldn't be possible to know in advance whether the system calls are going to be related or not, as the two processes might not do the syscalls at exactly the same time.

Is Hermit doing this (executing code in-between syscalls in parallel, at least)? Or do you think it would be possible to do it?

This could significantly increase the usefulness of Hermit for distros that want to do deterministic builds of packages, as most compilation could happen in parallel / i.e. use several cores at the same time!

I'm thinking about huge package builds such as Chromium, Firefox, the Linux kernel, etc. which would perhaps take days to build if they were completely serialized into a single CPU core.

[Caught by the no-procrastination feature by accident.]

Our earlier (dettrace) prototype allowed syscall-free regions in separate processes to run physically in parallel. Hermit actually hasn't added any process parallelism yet, but it's designed to actually go further than dettrace in this respect.

Specifically, Hermit is architected so that the thread-local syscall handler "checks out" resources from the central scheduler. Resources include things like paths on the file system, contents of files, shared memory, and permission to perform external side effects. Right now, all requests wait for the scheduler to background all other threads and make the current thread the only runnable. But the idea is: in the future we will keep the semantical identical log of linear "commits" (linearization), but will simply background the current thread while it uses the resources they checked out, move forward to the next scheduler iteration, and only block the next runnable thread until its requested resources are freed, not until ALL other threads have finished their timeslice and gone back to waiting on the scheduler.

Wow, that's so cool!

That's actually much better than what I was thinking and if I understand you correctly, that should allow you to exploit practically all possible parallelism opportunities (given the constraints).

I'm guessing that your plan would require unmapping shared memory pages so that Hermit can catch memory accesses and request permission from the scheduler (to access the memory pages) when these memory accesses happen. But I think it should be possible to do it, yes (with some overhead if two threads/processes access shared memory pages simultaneously very often, I guess).

Awesome work, thank you!

> thread interleavings

I was going to ask if it could do the flip side - instead of stabilizing the scheduler, make it less predictable.

AFAICT, it can! Awesome, looking forward to giving it a try.

   hermit run --chaos --seed-from=SystemRandom ./target/debug/hello_race;
Yes indeed.

That concurrency testing capability is a pretty well-studied area and we implement a couple existing algorithms. The first is our adaptation of the PCT algorithm (ASPLOS'10 https://www.microsoft.com/en-us/research/wp-content/uploads/...). That's what you get by default with `--chaos`.

But we also have variations on straight up randomized scheduler (random thread selection at each time step).

rr chaos mode has its own take on this: https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mo...

This study compares a few approaches - http://www.doc.ic.ac.uk/~afd/homepages/papers/pdfs/2016/TOPC....

Thanks for open-sourcing this! Roughly, what's the performance overhead from running code under hermit? I'm wondering if this could be used for doing benchmarking with less variance on non-deterministic platforms such as the JVM (I assume hermit is "deterministic enough" that the JIT and GC threads of the JVM will run the same code on every execution?)
Alas the performance overhead in realtime is not great yet. It still uses ptrace currently, which often results in a multiple-X slowdown (but at least it doesn't "subscribe" to every syscall like strace does, because some are naturally deterministic). Reverie's whole design is to make it support swappable backends, and this ptrace backend is just the reference implementation. The `experimental/reverie-sabre` directory in the Reverie repo contains our high performance backend, but it's still work-in-progress. It uses binary instrumentation and in our early experiments is 10X faster than our current backend in the worst case (i.e. strace is >10X faster when rewritten with reverie-sabre and run on a program that does nothing but syscalls).

But to the second part of your question about deterministic benchmarking, that is really a separate question. Hermit defines a deterministic notion of virtual time, which is based on the branches retired and system calls executed by all threads. When you run hermit with `--summary`, it reports a total "Elasped virtual global time", which is completely deterministic:

$ hermit run --summary /bin/date

...

Elapsed virtual global (cpu) time: 5_039_700ns

Therefore, any program that runs under hermit can get this deterministic notion of performance. We figured that could be useful for setting performance regression tests with very small regression margins (<1%), which you can't do on normal noisy systems. Compilers are one place I've worked where we wanted smaller performance regression alarms (for generated code) than we could achieve in practice. We haven't actually explored this application yet though. There's a whole small field of people studying performance modeling and prediction, and if one wanted to try this deterministic benchmarking approach, they might want take some of that knowledge and build a more accurate (correlated with wall time) performance model, more realistic than Hermit's current virtual time that is.

Does running /bin/date under hermit always return the same time? Or does it just follow the same codepath to retrieve the actual time?
Well, the starting datetime at the beginning of execution in the container is whatever you set it to:

$ hermit run --epoch=2022-01-01T00:00:00Z /bin/date

Fri Dec 31 16:00:00 PST 2021

We, somewhat eccentrically, put it in last millennium by default. It used to default to the original Unix epoch back in 12/31/1969, but that was causing some software to be very unhappy ;-).

The reproducibility guarantee is that the behavior of the program is a deterministic function of its initial configuration. The epoch setting is one aspect of that initial configuration (as are file system inputs, RNG seeds, etc).

> We, somewhat eccentrically, put it in last millennium by default.

Hah, that is still going to make some TLS software and their certificate tests very unhappy! Does it show that I've ran into a similar issue before? ;)

But of course, it's trivial to fix with the --epoch parameter :)

This is exactly the type of thing I've been wanting for testing my chess engine. Parallelism is based on emergent pseudorandom effects of things like interleaving causing searcher threads to mostly end up in non-overlapping parts of the search tree.

One question: How do you avoid the program being affected by things like overall system load and memory pressure?

Since CPU is a "compressible" resource, system load doesn't affect the determinism. It'll make it slower of course. Since memory is a non-compressible resource, things can start getting killed by the OOM-killer and there's nothing we can do about it. There are also certainly things like external network communication and file system access that are non-deterministic that must be handled at a higher level (e.g., with a reproducible FS image or by recording & replaying network traffic).
The idea of CPU being compressible is very insightful, thanks.

I'm curious about what happens with time control though. Engines can be given a time constraint and will use various heuristics to allocate that time. How does Hermit intercept gettimeofday()?

Well, gettimeofday is a syscall, and we do intercept it (along with clock_gettime, clock_gettres, time, nanosleep, and the rdtsc instruction, even though that last one is not a syscall). When we intercept it, we report virtual time back to the guest. We make sure that virtual time is deterministic, across all threads in the container, irrespective of what the wall clock time is on the host machine.

So for instance, if there are multiple threads in a chess engine, and they are racing to write search results to a data structure, these threads will interleave in a reproducible order under Hermit, and the races will resolve consistently.

But the downside is that Hermit does sequentialize execution onto one core. So in the current version, a multithreaded program doesn't get actual wall-clock speedup from its parallelism. (The earlier dettrace did allow some limited guest parallelism, and we plan to bring that back.) For this reason, Hermit's good for consistent testing multithreaded software, but you wouldn't want to run parallel software under it outside of testing.

> AMA!

Eager to try it but encountering the build error here - https://github.com/facebookexperimental/hermit/issues/11

Do you have a reference build log / environment you can share? Last known good commit sha and/or output from "rustup show"?

We're working on it! Should be fixed soon.
Reverted a badly-timed breaking change that came through the sync system. Will fix it properly shortly (and add a Dockerfile and release tag). But for now you may have better luck on the main branch after that reversion, which yielded 6cb5575ffd287289769144ec82e2900cbf6cd1ad.

Let's discuss further on that issue #11.

This looks cool.

Personally I'd also be interested in this for academic work -- anything which makes it easier to be sure an experiment can be reproduced later (a week, year, or decade later, in increasing order of difficulty), is good to have.

Yes, I'm very interested in that as well. I've been involved with the ACM Artifact Evaluation process which has been going on in several conferences for a while.

https://www.acm.org/publications/policies/artifact-review-ba...

But it's been pretty frustrating. As an author, my PLDI 2014 artifact stopped working less than 5 years later (Docker image binary incompatibility). And when I was co-chair of an Artifact Evaluation Committee in 2017, there was not great reproducibilty of the artifacts that were submitted either.

If you package a VM (freezing the Linux kernel), and are pretty sure that VM will run in 10 years, PLUS you determinize the execution itself... that should allow durable bitwise reproducibility. Maybe Hermit could be one ingredient of that.

For scientific reproducibility, there is a lot of other tooling to build too, and I know some folks have been working in that area:

https://ctuning.org/

> AMA!

I have a few questions. Hermit's README says:

> Instead, in order to provide complete determinism, the user should provide a fixed file system base image (e.g., with Docker)

How are you sanitizing the result of stat(), for example?

I'm guessing you're already aware of this, but fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.

Specifically, inode numbers are not guaranteed to be assigned in any order. So how do you return deterministic inode numbers in stat() calls?

I'm sure you're also aware of this, but other filesystem results are also not guaranteed to be deterministic. For example, the `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens from the user-space point of view except time passing (this happens in ZFS, for example, due to delayed allocation).

statvfs() is another problematic call which would have to be heavily sanitized.

As another example, there are filesystems that generate a random seed for every directory that is created, to prevent applications from exploiting hash table weaknesses and causing a denial-of-service when creating many directory entries that hash to the same value.

This random seed can have the side effect of readdir() calls returning directory entries in a different order based on the seed, even if the directory was created in the exact same way (with identical directory entries, created in the same order).

It seems like this would require reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.

Similarly, telldir() / seekdir() can also be problematic because the cookie value returned by the filesystem in telldir() is opaque and would be different for different directory seed values.

This seems like it would require Hermit to maintain a full in-memory view of all the directory entries of a directory that is opened and is being read, which also means that this in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process and calls that modify the directory by another process (Edit: I guess Hermit can synchronize this view easily since it can assume it has full control of all processes inside the container).

It also seems like Hermit would also need to re-read the entire directory when rewinddir() is called, to avoid introducing non-previously-existing concurrency bugs like described in the above paragraph (Edit: again, this is probably not necessary since Hermit can maintain a synchronized view of a directory on its own).

Reading the directory also seems to imply that you'd have to assign deterministic inode numbers for all directory entries, which also seems non-trivial.

Do you think this is an accurate assessment of the situation? How does Hermit handle all of this?

Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?

Last question: how do you sanitize the RDTSC CPU instruction?

Thanks for all the questions! Whew, here goes.

> How are you sanitizing the result of stat(), for example?

Ah, that part's the same as it was described in the ASPLOS'20 paper (https://dl.acm.org/doi/10.1145/3373376.3378519). Briefly, we present a somewhat sanitized version of the file system. Like if you do `hermit run -- ls -l`, you'll see that you are root, and everything owned by you is owned by root (and everything else is owned by nfsnobody currently). The file mod/access times are all set to whatever you provide for `--epoch`, because we think you mainly care about the file system contents, not the mod times. (Mod times do CHANGE as execution progresses though, otherwise programs like `make` become confused.)

> fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.

Indeed! Hence we present only the sanitized view of the filesystem, so that we can treat each filesystem as its abstract contents (files as bitstrings, and directories as sorted lists of files). For inodes, we translate to and from virtual inodes, rather than reveal the real inodes of the file system.

If there's a flaw in this abstraction of the file system... we'd love to find it. Mainly we've been using it with zfs and Meta's "edenfs" virtual file system so far.

> `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens

Yes, now you feel our pain. We've been plugging these kinds of things, and there are surely some we missed. We report boring constant values where we can get away with it (for st_blksize). Irreproducible "counter example" programs are a welcome form of contribution ;-)!

> As another example, there are filesystems that generate a random seed

Ah, that's an interesting one. Since we determinize (only) userspace, any random bytes that get into guest memory are deterministic (getrandom, /dev/random, etc), but internal nondeterminism in the filesystem we won't see, and instead we'll rely on sanitizing the way the filesystem appears to userspace. But if it just affects order, we should be ok, because we sort before returning from the getdents syscall.

> reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.

Yes, unfortunately. That's not great for performance, and we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.

> telldir() / seekdir()

Well these (and readdir) are not syscalls. So if we've handled getdents correctly we should be ok here. But this is a good suggestion for tests we need to add that stress these!

> in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process

Yes, we have a "non-interferene" assumption that either you're running with a container-private file system (e.g. inside docker) or none of your directories are concurrently messed with by other processes outside the container.

> assign deterministic inode numbers for all directory entries, which also seems non-trivial.

Yep, that's what we do!

> Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?

We require a certain level of good behavior by the program. To be practical, we aim to work for realistic programs but not necessarily adversarial ones. One way that you can break our sandboxing, for example, is to run CPUID, learn that the processor does not support instruction X, but then execute X anyway. For example, we can trap RDTSC in userspace, but not RDRAND.

If someone wants to use Hermit for, say, reverse engineering malware, then we need a Reverie backend that is hardened by emulating/rewriting the instruction stream carefully to protect against unsupported instructions or undefined conditions.

> Last question: how do you sanitize the RDTSC CPU instruction?

That one we can trap in userspace and then we return deterministic virtual time. Currently that time is a linear combination of the branches and system calls executed by all threads up to the current moment in time. For example, if you do RDTSC in a loop, you will see time advancing.

But using rdtsc to recreate fine-grained timers and implement side-channel attacks is impossible under Hermit. (Side channel attacks in general are only possible if first breaking the sandboxing in some way, and we don't know of a specific attack that will do the trick yet.)

That makes a lot of sense, thanks for all the answers!

It's definitely a very interesting project and I think that once more programs can work without changes, it is something that could perhaps be used for building packages by distros that have displayed a special interest in (and can benefit significantly from) reproducible builds -- like Debian for instance, but especially NixOS due to the existing work on content-addressed paths (explained here: https://www.tweag.io/blog/2020-09-10-nix-cas/ ).

The serialization of threads (and I assume processes?) would probably be an issue for large packages, but at least it would be possible to build many packages in parallel once the base dependencies have been built.

I had thought about starting such a project myself (for reproducible builds), but of course, the amount of work is very significant and for me it was not worth the benefits :)

> we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.

Actually, I just thought that if you decide to go down this path because of this specific pathological issue -- the cache that you're mentioning could simply be a file that Hermit maintains to represent a sorted view of a directory (and which could be deleted after Hermit finishes running if you so desire, although you could keep it in-between runs with some care!).

So you'd only have to pay the cost of reading the entire directory once per directory that is read -- specifically, on the first time that some program reads it.

You could even pre-generate such cached views for all directories in the container once Hermit starts, if you'd prefer to pay this cost at start-up (although I don't see why).

All subsequent operations relating to a directory could use the corresponding file as a cached and sorted view of it (and such operations that modify the directory would have to keep the file in sync, of course).

So even if a program reads a directory, then closes it, and some other program opens it and reads it, this second read would use the cache file instead, so it would not have to read the whole directory at once.

This would be even better for directories that are created by programs running inside Hermit, since the cache file would be created along with the directory when it is initially created, so there would be no need to read the entire directory at once at any time.

It would also mean that Hermit wouldn't have to potentially use large amounts of memory just because of this cache, or "forget" entries from the cache while it is running (to limit memory usage), since it would be stored persistently on-disk (and would still benefit from in-memory caching from the underlying filesystem / kernel).

Just an idea.

Yeah, it's interesting to think about persisting the state we would need to make the file system more sympatico with Hermit. If we were willing to have a daemon.... Meta develops this "watchman" tool that our build infrastructure uses. I think for existing file systems we could imagine a daemon that watches the directory and caches what we need.

But if we could dream of new file systems, then I want one that is basically btrfs but with O(1) file-level parallel-hashes (rather than block level). Heck maybe it could even enforce a sorted order on directories for us too. The hashes would be very useful in record-and-replay scenarios, where you could know immediately whether the input files are in your content-addressible blob storage without having to hash them every time.

(We have some hash support in Meta's EdenFS FUSE file system https://github.com/facebook/sapling)

P.S. about Reproducible Builds -- we pitched this to the community in 2019 (at the Reproducible Builds Summit) and with that ASPLOS'20 paper, but we would be eager to see anyone wanting to pick it up and take it further.

Are there any limitations regarding hermit running programs which emit code at runtime?
There shouldn't be! We don't do any sort of binary rewriting, which is the typical problem with JITed code.
Would something like this work for embedded code, like ESP32?
Well, probably not on device ;-).

The underlying Reverie instrumentation layer works on ARM, but Hermit isn't ported yet, and we haven't touched RISC-V yet at all. (Contributions welcome!)

One thing we haven't tried yet is just putting a whole emulator (qemu etc) underneath Hermit. That would address any sources of irreproducibility that the emulator lets through from the host (threads, RNG, etc).