by fwsgonzo 1264 days ago
The guest is not in control. Sure, there are a few pages at the beginning of each section that have to be 4k until you reach the first 2MB multiple.
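To put a number on "a few pages", here is a minimal sketch of the alignment arithmetic, assuming 4 KiB small pages and 2 MiB huge pages. The function name and the example addresses are made up for illustration and are not taken from any real setup.

```python
PAGE_4K = 4 * 1024          # small page size in bytes
PAGE_2M = 2 * 1024 * 1024   # huge page size in bytes


def small_pages_before_hugepage(start: int) -> int:
    """Count the 4 KiB pages between `start` and the next 2 MiB boundary.

    `start` is a hypothetical section start address; it must already be
    4 KiB aligned. Returns 0 when `start` sits exactly on a 2 MiB boundary.
    """
    if start < 0 or start % PAGE_4K != 0:
        raise ValueError("start must be a non-negative, 4 KiB aligned address")
    # Round up to the next multiple of 2 MiB.
    next_boundary = (start + PAGE_2M - 1) // PAGE_2M * PAGE_2M
    return (next_boundary - start) // PAGE_4K


if __name__ == "__main__":
    print(small_pages_before_hugepage(0x200000))  # already 2 MiB aligned
    print(small_pages_before_hugepage(0x201000))  # one page past a boundary
```

In the worst case a section starts one small page past a 2 MiB boundary and needs 511 small pages before the first huge page can be used.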

What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".

The point is: KVM is native speed if you never have to leave. I don't need to prove this for anyone to understand it has to be true.

2 comments

> The guest is not in control

The guest has its own page tables above the nested guest phys->host phys tables.

> What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".

And then the kernel doesn't know what to do with nearly every guest exit on KVM, so you trap out to host user space, which probably can't do much without the host kernel, so you transition back to kernel space to actually perform whatever I/O is needed, then back to host user, then back to host kernel to restart the guest, then from host kernel to guest. So six total context switches on a good day: guest->host_kern->host_user->host_kern->host_user->host_kern->guest.
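As a quick sanity check on the count, the path can be written out and its transitions tallied. This is only bookkeeping over the sequence described above; the list and function names are illustrative and nothing here talks to KVM.

```python
# The exit path as described in the comment, one entry per privilege domain.
EXIT_PATH = [
    "guest",
    "host_kern",   # VM exit lands in the host kernel
    "host_user",   # kernel hands the exit to the userspace VMM
    "host_kern",   # VMM asks the kernel to perform the actual I/O
    "host_user",   # I/O result returns to the VMM
    "host_kern",   # VMM asks the kernel to resume the guest
    "guest",
]


def transitions(path):
    """Return the list of (from, to) hops along a path of domains."""
    return list(zip(path, path[1:]))


if __name__ == "__main__":
    hops = transitions(EXIT_PATH)
    for src, dst in hops:
        print(f"{src} -> {dst}")
    print(len(hops), "transitions")
```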

Right, that's very true! It's clear that you know what you're talking about when it comes to KVM and maybe even the internal structure in Linux. However, I/O can be avoided. Imagine a guest that needs no I/O, doesn't have any interrupts enabled, and simply runs a workload straight on the CPU (given that it has all the bits it needs). That is what I have made for $COMPANY, which is in production, and serves a ... purpose. I can't really elaborate more than I already have. But you get the gist of it. It works great. It does the job, and it sandboxes a piece of code at native speed. Lots of ifs and buts and memory sharing and tricks to get it to be fast and low latency. No need for JIT, which is a security and complexity nightmare.

The topic of this thread is about Blink, which happens to be a userspace emulator. Hence my comment.

I usually measure the functions I write in picoseconds per byte, so 5 microseconds is an eternity.
10 ps/byte is equivalent to 100 GB/sec; unless you routinely write functions that are in the tens of GB/sec range, you probably mean nanoseconds?
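The conversion behind that figure, as a small sketch (decimal GB, i.e. 1 GB = 10^9 bytes; the function name is made up for illustration):

```python
def ps_per_byte_to_gb_per_s(ps_per_byte: float) -> float:
    """Convert a per-byte cost in picoseconds to throughput in GB/s.

    There are 1e12 picoseconds in a second, so bytes per second is
    1e12 / ps_per_byte; dividing by 1e9 gives decimal gigabytes per second.
    """
    if ps_per_byte <= 0:
        raise ValueError("ps_per_byte must be positive")
    bytes_per_second = 1e12 / ps_per_byte
    return bytes_per_second / 1e9


if __name__ == "__main__":
    print(ps_per_byte_to_gb_per_s(10))    # the 10 ps/byte figure above
    print(ps_per_byte_to_gb_per_s(1000))  # 1 ns/byte, for comparison
```

For comparison, 1 ns/byte works out to 1 GB/sec, which is why picosecond-per-byte figures imply very high throughput.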
I work on a C library. Some of the functions I've written, like memmove(), take about 7 picoseconds per byte for sizes that are within the L1 cache, thanks to enhanced rep movsb.
That's a very special case though since it's hardware optimized to work up to a cache line at a time, and not at all related to the syscall cost that was mentioned in the parent comment.
The 5us was the setup time in order to be able to enter the sandbox. A system call is around 1us, but rarely used. So, in general the overhead of using the sandbox is around 5us, as everything else is pure workload.
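How much that fixed cost matters depends on how long the workload runs. A rough sketch using the 5us setup and 1us system call figures from this comment; the workload durations and the function name are hypothetical, chosen only to show the shape of the trade-off.

```python
def sandbox_overhead_fraction(workload_us: float,
                              setup_us: float = 5.0,
                              syscalls: int = 0,
                              syscall_us: float = 1.0) -> float:
    """Fraction of total time spent on sandbox overhead.

    setup_us and syscall_us default to the figures quoted in the comment;
    workload_us and syscalls are whatever the caller assumes.
    """
    if workload_us < 0 or setup_us < 0 or syscalls < 0 or syscall_us < 0:
        raise ValueError("all inputs must be non-negative")
    overhead = setup_us + syscalls * syscall_us
    total = overhead + workload_us
    if total == 0:
        raise ValueError("total time is zero")
    return overhead / total


if __name__ == "__main__":
    # Hypothetical workload lengths in microseconds.
    for workload in (5.0, 95.0, 995.0):
        frac = sandbox_overhead_fraction(workload)
        print(f"{workload:7.1f} us workload: {frac:.1%} overhead")
```

With these numbers the overhead dominates for workloads of a few microseconds and drops to about half a percent once the workload runs for roughly a millisecond.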