Blink is primarily a user mode emulator, but it does support real mode BIOS programs. It can even bootstrap Cosmopolitan Libc bare metal programs into long mode. Here's a video of Blink doing just that. https://storage.googleapis.com/justine/sectorlisp2/sectorlis...
KVM requires additional privileges. For example, a Linux container would need extra privileges and access to /dev/kvm to run QEMU with KVM, whereas any container should be able to run it in user mode.
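A quick way to see which situation a given container is in is to try opening the device. This is only a minimal sketch: the function name `kvm_usable` is made up for illustration, and a successful open only shows that the device node is reachable, not that every KVM ioctl will be permitted.

```python
import os

def kvm_usable(path="/dev/kvm"):
    """Return True if this process can open the KVM device node read/write."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        # Missing device node, no permission, or no KVM support in the kernel.
        return False
    os.close(fd)
    return True

if __name__ == "__main__":
    if kvm_usable():
        print("/dev/kvm is accessible; hardware virtualization is an option")
    else:
        print("no /dev/kvm access; fall back to user-mode emulation")
```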
That's not really an issue, as there's a lot of infrastructure around optionally giving device file access to containers. That's why SECCOMP_IOCTL_NOTIF_ADDFD exists.
Nobody's really set it up to do that, since it's easier to use Linux's sandboxing features if you're looking to run user code of the same CPU ISA. gVisor has a backend (experimental, last time I checked) that uses KVM to run user-mode code, but there you get the win of the sandboxing code being written in a memory-safe language, which gives you a real privilege boundary as opposed to the sieve that qemu-user is. In just about every other instance, running code natively in regular user space (even if sandboxed with seccomp or a ptrace jail) achieves the underlying goals better.
It depends on whether you're more afraid of language bugs or hardware bugs. One potentially nice thing about a tool like Blink, which can fully virtualize the memory of existing programs, is that it's sort of like an extreme version of ASLR. In order to virtualize a fixed address space, you have to break memory apart into pieces and shuffle them around into things like radix tries, and that might obfuscate the actual memory layout enough to protect you from someone rowhammering your system. I don't know if it's true, but it'd be fun to test.
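To make the radix trie idea concrete, here is a toy sketch. It is not Blink's actual implementation: the class name `GuestMemory`, the 4-level, 9-bits-per-level layout, and the random frame ids standing in for scattered host placement are all assumptions for illustration.

```python
import random

PAGE_BITS = 12                 # 4 KiB pages (assumed)
PAGE_SIZE = 1 << PAGE_BITS
LEVEL_BITS = 9                 # 512 entries per trie node (assumed)
LEVELS = 4

class GuestMemory:
    """Toy software page table: guest virtual page -> randomly placed host frame."""

    def __init__(self, seed=None):
        self.root = {}         # nested dicts stand in for trie nodes
        self.frames = {}       # host frame id -> page contents
        self.rng = random.Random(seed)

    def _indices(self, vaddr):
        vpn = vaddr >> PAGE_BITS
        mask = (1 << LEVEL_BITS) - 1
        return [(vpn >> (LEVEL_BITS * i)) & mask for i in reversed(range(LEVELS))]

    def _frame(self, vaddr, create):
        *upper, leaf = self._indices(vaddr)
        node = self.root
        for i in upper:
            if i not in node:
                if not create:
                    raise KeyError(hex(vaddr))
                node[i] = {}
            node = node[i]
        if leaf not in node:
            if not create:
                raise KeyError(hex(vaddr))
            # Pick an unrelated "host" location for this guest page.
            frame_id = self.rng.getrandbits(48)
            while frame_id in self.frames:
                frame_id = self.rng.getrandbits(48)
            self.frames[frame_id] = bytearray(PAGE_SIZE)
            node[leaf] = frame_id
        return self.frames[node[leaf]]

    def write(self, vaddr, data):
        for off, byte in enumerate(data):
            addr = vaddr + off
            self._frame(addr, create=True)[addr & (PAGE_SIZE - 1)] = byte

    def read(self, vaddr, n):
        return bytes(
            self._frame(vaddr + i, create=False)[(vaddr + i) & (PAGE_SIZE - 1)]
            for i in range(n)
        )
```

Two guest pages that are adjacent in the program's address space end up in unrelated frames, which is the "shuffling" being described. Whether that actually helps against rowhammer is exactly the open question.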
Well, not quite 'native'. TLB refills are 4x to 5x as expensive, and anything that needs a context switch tends to be at a minimum twice as expensive, and it's common to balloon even farther from there.
I guess that's mostly if you are running a full operating system inside it, generally in QEMU. It doesn't have to be; it could just be a program. Tiny programs running in KVM can use big pages and never cause or require any page table changes.
The guest has its own page tables above the nested guest phys->host phys tables.
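The cost of stacking guest page tables on top of the nested tables can be counted with a little arithmetic. This is a worst-case sketch that assumes no paging-structure caches or nested TLB hits, so real hardware usually does better; the function name `nested_walk_refs` is made up for illustration.

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for one TLB refill under nested paging.

    Each guest page table level holds a guest-physical pointer, which needs
    a full host walk (host_levels refs) plus one ref to read the guest entry.
    The final guest-physical address then needs one more host walk.
    """
    return guest_levels * (host_levels + 1) + host_levels

if __name__ == "__main__":
    native = 4                        # plain 4-level walk, no virtualization
    nested = nested_walk_refs(4, 4)   # 4 KiB pages on both sides
    big = nested_walk_refs(3, 3)      # 2 MiB pages on both sides drop a level each
    print(native, nested, big)
```

With 4-level tables on both sides that is 24 references against 4 natively, and using big pages on both sides brings it down to 15, which is one reason a tiny guest with large pages hurts less.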
> What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".
And then the kernel doesn't know what to do with nearly every guest exit on KVM, so you trap out to host user space. Host user space probably can't do much without the host kernel, so you transition back to kernel space to actually perform whatever IO is needed, then back to host user, then back to host kernel to restart the guest, then from host kernel back to the guest. So that's six total context switches on a good day: guest->host_kern->host_user->host_kern->host_user->host_kern->guest.
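Counting the hops in that path mechanically, as a small sketch (the helper names are made up for illustration, and no timings are assumed):

```python
PATH = "guest->host_kern->host_user->host_kern->host_user->host_kern->guest"

def transitions(path):
    """Split an a->b->c path into its individual (from, to) hops."""
    stops = path.split("->")
    return list(zip(stops, stops[1:]))

def summarize(path):
    hops = transitions(path)
    # Hops that cross the guest boundary (VM exit or VM entry).
    guest_hops = [h for h in hops if "guest" in h]
    return {
        "total": len(hops),
        "guest_boundary": len(guest_hops),
        "host_user_kernel": len(hops) - len(guest_hops),
    }

if __name__ == "__main__":
    print(summarize(PATH))
```

Only two of the six hops are the guest entry and exit themselves; the other four are ordinary user/kernel crossings on the host side.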