HN Mirror


	by gabcoh 1264 days ago
	The comparison with QEMU is with KVM disabled, right? Assuming this is true, how does it compare with KVM enabled?

2 comments

monocasa 1264 days ago

I think this is a user mode emulator, so qemu with kvm isn't a great comparison.


jart 1264 days ago

Blink is primarily a user mode emulator, but it does support real mode BIOS programs. It can even bootstrap Cosmopolitan Libc bare metal programs into long mode. Here's a video of Blink doing just that. https://storage.googleapis.com/justine/sectorlisp2/sectorlis...


gabcoh 1264 days ago

Is this true? Why can’t qemu use kvm for user mode emulation?


fathyb 1264 days ago

KVM requires additional privileges. A Linux container would need privileged rights and access to /dev/kvm to run QEMU with KVM for example, whereas any container should be able to run it in user-mode.


monocasa 1264 days ago

That's not really an issue, as there's a lot of infrastructure around optionally giving device file access to containers. That's why SECCOMP_IOCTL_NOTIF_ADDFD exists.


monocasa 1264 days ago

Nobody's really set it up to do that, as it's easier to use Linux's sandboxing features if you're looking to run user code of the same CPU ISA. gVisor has an (experimental, last time I checked) backend that uses KVM to run user mode code, but there you have the win of the sandboxing code being written in a memory-safe language and giving you a real privilege boundary, as opposed to the sieve that qemu-user is. In just about every other instance, simply running code natively in regular user space (even if sandboxed with seccomp or a ptrace jail) achieves the underlying goals better.


jart 1264 days ago

It depends on whether you're more afraid of language bugs or hardware bugs. One potentially nice thing about having a tool like Blink that can fully virtualize the memory of existing programs, is it's sort of like an extreme version of ASLR. In order to virtualize a fixed address space, you have to break apart memory into pieces and shuffle them around into things like radix tries, and that might provide enough obfuscation of the actual memory to protect you from someone rowhammering your system. I don't know if it's true but it'd be fun to test.
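To make the "radix trie" idea concrete, here is a hypothetical sketch (not Blink's actual code) of a 4-level trie, 9 bits per level with 4 KiB leaves, that maps a 48-bit guest address to host memory allocated lazily on first touch. Guest pages that are adjacent end up wherever `calloc` happens to place them on the host, which is the shuffling jart describes; whether that actually helps against rowhammer is, as jart says, untested. The names `Node` and `translate` are made up for illustration, and freeing is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

#define LEVELS 4
#define BITS_PER_LEVEL 9
#define FANOUT (1u << BITS_PER_LEVEL)
#define GUEST_PAGE_SHIFT 12
#define GUEST_PAGE_SIZE (1u << GUEST_PAGE_SHIFT)

/* One interior node: 512 child pointers (4 KiB on a 64-bit host). */
typedef struct Node { void *slot[FANOUT]; } Node;

/* Returns a host pointer for guest address `va`, allocating missing
 * interior nodes and a zero-filled 4 KiB leaf page on first touch.
 * Returns NULL if an allocation fails. */
uint8_t *translate(Node *root, uint64_t va) {
    Node *node = root;
    /* Walk the three interior levels: bits 47..39, 38..30, 29..21. */
    for (int level = LEVELS - 1; level > 0; level--) {
        unsigned idx = (va >> (GUEST_PAGE_SHIFT + level * BITS_PER_LEVEL)) & (FANOUT - 1);
        if (!node->slot[idx] && !(node->slot[idx] = calloc(1, sizeof(Node))))
            return NULL;
        node = node->slot[idx];
    }
    /* Last level (bits 20..12) points at the leaf page itself. */
    unsigned idx = (va >> GUEST_PAGE_SHIFT) & (FANOUT - 1);
    if (!node->slot[idx] && !(node->slot[idx] = calloc(1, GUEST_PAGE_SIZE)))
        return NULL;
    return (uint8_t *)node->slot[idx] + (va & (GUEST_PAGE_SIZE - 1));
}
```

Every guest load or store then costs a few dependent pointer loads, which is part of why full memory virtualization is slower than running natively.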


fwsgonzo 1264 days ago

KVM allows you to run guests directly on the CPU and has native performance.


monocasa 1264 days ago

Well, not quite 'native'. TLB refills are 4x to 5x as expensive, and anything that needs a context switch tends to be at a minimum twice as expensive, and it's common to balloon even farther from there.
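One way to see why TLB refills get more expensive under KVM: with nested paging (Intel EPT / AMD NPT), every guest page-table entry sits at a guest-physical address that itself has to be translated through the host tables. Ignoring the paging-structure caches real CPUs use, a refill with g guest levels and h host levels costs g(h+1)+h = (g+1)(h+1)-1 memory references, versus g natively. A small C sketch of that worst-case count (function names are hypothetical):

```c
/* Worst-case memory references for one TLB refill, ignoring the
 * paging-structure / walk caches that shorten walks on real hardware. */

/* Native: one reference per page-table level. */
unsigned native_walk_refs(unsigned guest_levels) {
    return guest_levels;
}

/* Nested: each of the g guest entries needs an h-level host walk plus the
 * entry read itself, and the final guest-physical data address needs one
 * more h-level host walk. Equals (g+1)(h+1) - 1. */
unsigned nested_walk_refs(unsigned guest_levels, unsigned host_levels) {
    return guest_levels * (host_levels + 1) + host_levels;
}
```

With 4-level tables on both sides that is 24 references against 4, a 6x worst case; walk caches bring measured overheads below this bound. Mapping with 2 MiB pages removes a level on each side that uses them, e.g. `nested_walk_refs(3, 3)` is 15.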


fwsgonzo 1264 days ago

I guess that's mostly if you are running a full operating system inside it, generally in QEMU. It doesn't have to be; it could just be a program. Tiny programs running in KVM can use big pages and never cause or require any page table changes.

For simple workloads it can even be faster than native unless you dynamically load something that uses bigger pages for your native program, eg. https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-Fo...


monocasa 1264 days ago

It's harder to force huge pages on a guest than it is to just use them in regular user space where you can simply mmap them in.

And none of that accounts for the increased context switch time.
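For reference, "simply mmap them in" looks roughly like this in native user space on Linux. A minimal sketch, assuming the hypothetical helper name `alloc_huge`: `MAP_HUGETLB` only succeeds if huge pages have been reserved (e.g. via `vm.nr_hugepages`), so it falls back to an ordinary anonymous mapping plus `madvise(MADV_HUGEPAGE)` to request transparent huge pages.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate `len` bytes (ideally a multiple of 2 MiB)
 * backed by huge pages where the system allows it. Returns NULL on failure. */
void *alloc_huge(size_t len) {
    /* Explicit hugetlbfs pages: fails unless pages were reserved up front. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    /* Fallback: normal anonymous memory, then ask for transparent huge pages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    (void)madvise(p, len, MADV_HUGEPAGE); /* advisory; ignored if THP is off */
    return p;
}
```

In the fallback path the mapping is not guaranteed to start on a 2 MiB boundary, so only the aligned interior can be backed by huge pages, the same head/tail effect discussed in the next reply.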


fwsgonzo 1264 days ago

The guest is not in control. Sure, there are a few pages at the beginning of each section that have to be 4K until you reach the first 2MB multiple.

What context switch time? It takes 5 microseconds to enter and leave the guest. The rest is just "workload".

The point is: KVM is native speed if you never have to leave. I don't need to prove this for anyone to understand it has to be true.
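The "few 4K pages until the first 2MB multiple" point is plain alignment arithmetic. A hypothetical C sketch (the `PageSplit` type and `split_range` name are made up) that splits a 4 KiB-aligned range into a 4 KiB head, a run of 2 MiB pages, and a 4 KiB tail:

```c
#include <stdint.h>

#define SMALL_PAGE (4096ull)
#define HUGE_PAGE (2ull << 20) /* 2 MiB */

typedef struct {
    uint64_t head_4k; /* 4 KiB pages before the first 2 MiB boundary */
    uint64_t body_2m; /* whole 2 MiB pages in the aligned middle */
    uint64_t tail_4k; /* 4 KiB pages after the last 2 MiB boundary */
} PageSplit;

/* Assumes start and end are 4 KiB aligned and start <= end. */
PageSplit split_range(uint64_t start, uint64_t end) {
    PageSplit s = {0, 0, 0};
    uint64_t hstart = (start + HUGE_PAGE - 1) & ~(HUGE_PAGE - 1); /* round up */
    uint64_t hend = end & ~(HUGE_PAGE - 1);                       /* round down */
    if (hstart >= hend) { /* no whole huge page fits: all small pages */
        s.head_4k = (end - start) / SMALL_PAGE;
        return s;
    }
    s.head_4k = (hstart - start) / SMALL_PAGE;
    s.body_2m = (hend - hstart) / HUGE_PAGE;
    s.tail_4k = (end - hend) / SMALL_PAGE;
    return s;
}
```

For an illustrative section spanning 0x401000 to 0x1000000, this gives 511 small pages, then five 2 MiB pages, and no tail.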


monocasa 1264 days ago

> The guest is not in control

The guest has its own page tables above the nested guest phys->host phys tables.

> What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".

And then the kernel doesn't know what to do with nearly every guest exit on KVM, so you trap out to host user space, which then probably can't do much without the host kernel, so you transition back to kernel space to actually perform whatever I/O is needed, then back to host user, then back to host kernel to restart the guest, then from host kernel back to the guest. So that's six context switches total on a good day: guest->host_kern->host_user->host_kern->host_user->host_kern->guest.
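The count of six can be checked by writing the path out as a sequence of execution modes and counting the switches. An illustrative C sketch with made-up names; for comparison it also includes an exit that KVM can finish inside the host kernel, which needs only two switches:

```c
#include <stddef.h>

typedef enum { GUEST, HOST_KERNEL, HOST_USER } Mode;

/* Counts how many times execution changes mode along `path`. */
size_t count_switches(const Mode *path, size_t n) {
    size_t switches = 0;
    for (size_t i = 1; i < n; i++)
        if (path[i] != path[i - 1])
            switches++;
    return switches;
}

/* Exit handed to the userspace VMM, which needs the host kernel for I/O:
 * guest -> kernel (exit) -> VMM -> kernel (I/O syscall) -> VMM
 * -> kernel (VMM asks to re-enter) -> guest */
static const Mode userspace_exit[] = {
    GUEST, HOST_KERNEL, HOST_USER, HOST_KERNEL, HOST_USER, HOST_KERNEL, GUEST
};

/* Exit that KVM handles entirely inside the host kernel. */
static const Mode in_kernel_exit[] = { GUEST, HOST_KERNEL, GUEST };
```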


jart 1264 days ago

I usually measure the functions I write in picoseconds per byte, so 5 microseconds is an eternity.
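To put that in the same units: 1 microsecond is 1,000,000 picoseconds, so a 5 microsecond guest round trip costs 5,000,000 ps. A tiny sketch of the break-even arithmetic; the throughput figures are hypothetical, chosen only for illustration:

```c
#include <stdint.h>

/* Bytes a routine running at `ps_per_byte` could process in the time
 * one guest enter/exit of `exit_cost_us` microseconds takes. */
uint64_t bytes_per_exit(uint64_t exit_cost_us, uint64_t ps_per_byte) {
    return exit_cost_us * 1000000u / ps_per_byte; /* 1 us = 1e6 ps */
}
```

At a hypothetical 100 ps/byte, one 5 microsecond exit is worth 50,000 bytes of work; at 1,000 ps/byte it is worth 5,000 bytes.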
