by bazizbaziz 2184 days ago
This seems like a weird benchmark: reading from /dev/urandom and gzipping random data is not something most folks will want to do. It even appears that /dev/urandom speeds differ greatly across architectures [0], and /dev/random is fundamentally slow because of the entropy pool [1] (but I guess this is why the author uses /dev/urandom).

It would be better to measure something closer to what Docker users will actually do, like the build time of a common container and/or the latency of HTTP requests to native and emulated versions of the same container.

One reason to feel positive about the virtualization issues is that Rosetta 2 provides x86->ARM translation for JITs, which an ARM-based QEMU could perhaps integrate into its own binary translation [2].

[0] https://ianix.com/pub/comparing-dev-random-speed-linux-bsd.h...

[1] https://superuser.com/questions/359599/why-is-my-dev-random-...

[2] https://developer.apple.com/videos/play/wwdc2020/10686/


Author here.

I'm glad somebody said something! Yes, the gzip perf test is pretty silly, but it illustrates a significant difference. /dev/urandom throughput on this setup was about 100 MB/s, so it wasn't a bottleneck for this test - the bottleneck was gzip.
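
One way to check the "urandom is not the bottleneck" claim on your own machine is to time the random source alone and then the same source piped through gzip. This is only a rough sketch: `random_bytes` is a hypothetical helper name, and the block size mirrors the `dd` invocation used elsewhere in this thread.

```shell
# Hypothetical helper: write COUNT blocks of 4 KiB from /dev/urandom to stdout.
# dd's own statistics are silenced so that only the data reaches the pipe.
random_bytes() {
    dd if=/dev/urandom bs=4k count="$1" 2>/dev/null
}

# Usage (bash, where `time` can time a whole pipeline):
#   time random_bytes 10240 > /dev/null          # source alone
#   time random_bytes 10240 | gzip > /dev/null   # source plus gzip
```

If the first timing is much shorter than the second, the random source is not what limits the pipeline.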

Feel free to come up with a performance test yourself! I personally want to know what an HTTP test would look like. You can run an ARM image by running:

    docker run -it arm64v8/ubuntu
Unfortunately, Rosetta 2 is not going to help here. Rosetta 2 translates x86 -> ARM, but only for Mac binaries. It does not translate Linux binaries, and cannot reach inside a Docker image.
Was your emulation done with the qemu user space emulator[1] (the syscall translation layer) or the qemu system emulator[2] (the VM)? If it was qemu-system, you might get better numbers with qemu-user-static, which does binary translation similar to Rosetta 2 rather than being a full system emulator with all its overhead.

You can probably use qemu-user-static to translate x86-64-only binaries in a Linux container on an ARM machine, too, but I have never tried.

[1]: https://www.qemu.org/docs/master/user/main.html

[2]: https://www.qemu.org/docs/master/system/index.html

I ran this on a Linux laptop - it looks like it's running qemu-user-static:

    root        9934  103  0.0 125444  6664 pts/0    Rl+  12:25   0:12 /usr/bin/qemu-aarch64-static /usr/bin/gzip
So it might be that Docker already runs a native x86_64 Linux, then uses qemu-static binary translation.
That's strange; in my experience it shouldn't have a 6x slowdown. It might be due to several factors, but here's your test running on my system without Docker:

Ryzen 3900X (host machine)

    $ dd if=/dev/urandom bs=4k count=10k | gzip >/dev/null
    10240+0 records in
    10240+0 records out
    41943040 bytes (42 MB, 40 MiB) copied, 1.02284 s, 41.0 MB/s
qemu-aarch64-static

    $ dd if=/dev/urandom bs=4k count=10k | proot -R /tmp/aarch64-alpine -q qemu-aarch64-static sh -c 'gzip >/dev/null'
    10240+0 records in
    10240+0 records out
    41943040 bytes (42 MB, 40 MiB) copied, 3.33964 s, 12.6 MB/s
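
For reference, the ratio between the two runs above can be worked out from the elapsed times `dd` reports. A small sketch (`slowdown` is a hypothetical helper name; the two numbers are copied from the output above):

```shell
# Hypothetical helper: ratio of emulated to native elapsed time.
slowdown() {
    awk -v native="$1" -v emulated="$2" 'BEGIN { printf "%.1f\n", emulated / native }'
}

slowdown 1.02284 3.33964   # prints 3.3
```

So on this machine the emulated run is roughly 3.3x slower, not 6x.
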
From the article:

> Emulators can run a different architecture between the host and the guest, but simulate the guest operating system at about 5x-10x slowdown.

I think this is a misleading statement because it implies that there is a constant performance overhead associated with CPU emulation. In reality, the performance relies heavily on the workload, more so with JIT-ed emulators.

Regarding this specific benchmark, I think there are two main factors contributing to the poor performance. The first is that the benchmark completes in a short period of time. With JITs, performance tends to improve for long-running processes, because JITs can cache translation results and amortize the translation overhead. The second is that your benchmark is especially heavy on I/O, meaning that it spends a lot of time translating syscalls instead of running native instructions.
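
A rough way to probe the first factor is to repeat the same pipeline at several input sizes and see whether the reported throughput changes as the run gets longer. This is a sketch only: `sweep_sizes` is a hypothetical helper name, and it would need to be run inside the emulated environment to say anything about translation overhead.

```shell
# Hypothetical helper: for each block count given, run the dd | gzip pipeline
# and print the last line of dd's statistics (its throughput summary).
sweep_sizes() {
    for count in "$@"; do
        printf '%s blocks: ' "$count"
        # dd's statistics go to stderr; route them to tail and keep the last line.
        { dd if=/dev/urandom bs=4k count="$count" | gzip > /dev/null; } 2>&1 | tail -n 1
    done
}

# Usage:
#   sweep_sizes 1024 10240 102400
```

If throughput climbs noticeably with larger counts under emulation but stays flat natively, that would be consistent with translation cost being amortized over a longer run.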

I'd also like to add that CPU emulators sans syscall translation should work for any binaries, even those targeted for Linux. It would require a copy of the Linux kernel, but Docker won't work without it anyways.

So I'm not familiar with how Darwin does things, but on most FOSS Unixes it's easy to use qemu to run one arch on another, either as a full system or with just user mode emulation (which, when wired up correctly, lets you seamlessly execute e.g. ARM binaries on an x86 system). I would expect it to be easy enough to either set up user mode translation, or just swap Docker's backing hypervisor with an x86 VM. Or, worst case, just run qemu-system-x86_64 on your ARM Mac, run Linux inside that VM, and run Docker on that Linux; SSH in and it should be mostly transparent.
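
On Linux, the "wired up correctly" part is usually done through the kernel's binfmt_misc interface, which maps foreign binaries to an interpreter such as qemu-aarch64-static. A minimal sketch for checking what is registered on a given host follows. `list_qemu_handlers` is a hypothetical helper name, and the `qemu-*` entry naming is an assumption based on how qemu-user-static packages commonly register themselves.

```shell
# Hypothetical helper: list binfmt_misc entries whose names start with "qemu-".
# Assumes binfmt_misc is mounted at its usual location; prints a notice otherwise.
list_qemu_handlers() {
    dir=/proc/sys/fs/binfmt_misc
    found=0
    for entry in "$dir"/qemu-*; do
        [ -e "$entry" ] || continue   # glob did not match anything
        found=1
        basename "$entry"
    done
    [ "$found" -eq 1 ] || echo "no qemu-* handlers registered"
}

# Usage:
#   list_qemu_handlers
```

If an entry for the target architecture shows up, running a binary for that architecture should go through the user mode emulator transparently.
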
One benchmark would be to track down a Python/JS/etc. based "hello world" demo container. Base one version on Intel and the other on ARM, and measure each version's container build time and request latency after it is set up.

If changing the base image is all that's needed and both Dockerfiles otherwise assume Ubuntu, this should not take too long.
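
A sketch of what that could look like, with the base image as the only thing that varies. Everything here is illustrative: `hello_dockerfile` is a hypothetical helper, the "hello world" server is just Python's built-in `http.server`, and `amd64/ubuntu` as the Intel counterpart of `arm64v8/ubuntu` is an assumption.

```shell
# Hypothetical helper: print a minimal Dockerfile for the given base image.
hello_dockerfile() {
    cat <<EOF
FROM $1
RUN apt-get update && apt-get install -y python3
EXPOSE 8000
CMD ["python3", "-m", "http.server", "8000"]
EOF
}

# Usage (requires Docker; repeat with amd64/ubuntu for the other variant):
#   hello_dockerfile arm64v8/ubuntu > Dockerfile.arm
#   time docker build -f Dockerfile.arm -t hello-arm .
#   docker run -d -p 8000:8000 hello-arm
#   curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8000/
```

Build time comes from `time docker build`, and `curl -w '%{time_total}'` gives a per-request latency once the container is up.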