by anatoly 1649 days ago
I'd love to understand virtualization/emulation better, at a technical level. If I understand correctly, on modern CPUs it's all done with the help of specialized support in the instruction set, but I've never learned in depth:

   - how that works
   - how it differs between Intel/AMD/ARM
   - whether there are competing approaches
   - which software projects are the main "players" in the field
   - which of them are open-source with humanly understandable source code
Is there a summary/tutorial about all this stuff someone could recommend (or write)?
4 comments

For 1 and 2 you can go straight to the sources:

Volume 3 of the Intel 64 and IA-32 Architectures Software Developer’s Manuals, Chapter 22: Introduction to Virtual Machine Extensions:

https://www.intel.com/content/www/us/en/develop/download/int...

Volume 2 of the AMD64 Architecture Programmer's Manual, Chapter 15: Secure Virtual Machine:

https://www.amd.com/system/files/TechDocs/24593.pdf

Well, yes, you can do that, but Intel's Software Developer's Manuals are terribly written. They are good as references for details, but only after you understand the grand picture. Learning something new from them is quite a torturous process.
Maybe it's that brain worm that I got from staring at too much assembly, but I actually find that part of the Intel manuals to be quite approachable.

Even to get a general overview, the introduction section I linked seems fine?

Here are the really simple explanations.

Emulation is pretty much literally just mapping instructions between processors. So there may be an instruction in my custom chipset called "ADD4", which adds 4 inputs. I would emulate ADD4 RAX, 0x1234, 0x2345, 0x3456 by running

ADD RAX, 0x1234; ADD RAX, 0x2345; ADD RAX, 0x3456;

It gets a bit more complicated with architecture differences like memory configurations. But that's all emulation is.
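
To make the ADD4 mapping concrete, here is a minimal Python sketch of a decode-and-dispatch emulator. It assumes a toy guest whose only instruction is the hypothetical ADD4 from above; the tuple encoding, `host_add`, and `emulate` are names invented for this illustration, not part of any real emulator.

```python
# Toy guest ISA: the only instruction is the hypothetical ADD4.
# Each guest instruction is a tuple: (mnemonic, dest register, imm1, imm2, imm3).

MASK64 = (1 << 64) - 1  # keep results in 64-bit range, like a real RAX


def host_add(regs, dst, value):
    """Stand-in for the host's two-operand ADD: dst = (dst + value) mod 2**64."""
    regs[dst] = (regs[dst] + value) & MASK64


def emulate(program, regs):
    """Decode each guest instruction and replay it as host operations."""
    for instr in program:
        op = instr[0]
        if op == "ADD4":
            _, dst, a, b, c = instr
            # One guest ADD4 becomes three host ADDs.
            for imm in (a, b, c):
                host_add(regs, dst, imm)
        else:
            raise ValueError(f"unknown guest instruction: {op}")
    return regs


regs = emulate([("ADD4", "RAX", 0x1234, 0x2345, 0x3456)], {"RAX": 0})
print(hex(regs["RAX"]))  # prints 0x69cf
```

A real emulator would decode bytes rather than tuples and would also have to model flags, memory, and timing, which is where most of the complexity lives.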

When you're virtualizing, you pretty much just need to manage hardware. The hypervisor does this for you by managing which resources go where. You could virtualize the guest by just running it like a program and handling everything in software. But that's really painful and tedious, so you rely on the CPU to support it. Each chip has its differences, but it's effectively just like a syscall. On Intel, for example, the guest can execute VMCALL to call into the hypervisor, and sensitive operations trigger a VM exit (strictly an event rather than an instruction). And then you have a handler in your vmexit table, which is exactly like a syscall table. So if(exitreason == CPUID_EXIT) cpuid_handler();
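
The "vmexit table is like a syscall table" idea can be sketched in Python as a dispatch dictionary. This is a simulation under stated assumptions: the exits come from a scripted list instead of real hardware, and `ExitReason`, `handle_cpuid`, `handle_hlt`, `run_guest`, and the `vcpu` dict are all hypothetical names with placeholder values.

```python
from enum import Enum, auto


class ExitReason(Enum):
    # Placeholder values; real hardware defines its own numeric exit codes.
    CPUID = auto()
    HLT = auto()
    HYPERCALL = auto()


def handle_cpuid(vcpu):
    # Report whatever CPU identity the hypervisor wants the guest to see.
    vcpu["rax"] = 0x1234  # made-up value, for illustration only
    vcpu["rip"] += 2      # skip the 2-byte CPUID instruction
    return True           # resume the guest


def handle_hlt(vcpu):
    return False          # guest halted: leave the run loop


# The "vmexit table": exit reason -> handler, much like a syscall table.
EXIT_HANDLERS = {
    ExitReason.CPUID: handle_cpuid,
    ExitReason.HLT: handle_hlt,
}


def run_guest(vcpu, exits):
    """Simulated run loop; a real one would re-enter the guest on each pass."""
    for reason in exits:
        handler = EXIT_HANDLERS.get(reason)
        if handler is None:
            raise RuntimeError(f"unhandled VM exit: {reason.name}")
        if not handler(vcpu):
            break
    return vcpu


vcpu = run_guest(
    {"rax": 0, "rip": 0x1000},
    [ExitReason.CPUID, ExitReason.CPUID, ExitReason.HLT, ExitReason.CPUID],
)
print(hex(vcpu["rip"]))  # prints 0x1004
```

The final CPUID in the scripted list is never handled, because the HLT handler stops the loop first.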

For a good book you can look up "Hardware and Software Support for Virtualization" https://www.amazon.com/Hardware-Software-Virtualization-Synt... . It's honestly the only good resource I've found on what really makes this work.

Thank you for a good explanation and book recommendation.
I know next to nothing about hardware / instruction set support from the CPU side, but there are several hypervisors that are open source and thus available for study: QEMU itself, Xen, Linux KVM, FreeBSD's bhyve, OpenBSD's vmm. Oh, and there's VirtualBox.

I vaguely recall someone telling me that the qemu source code is fairly readable, but I have not looked at any of these myself, so consider it hearsay.

I don't know about the others, but the FreeBSD developers have a separate mailing list for discussion of virtualization, https://lists.freebsd.org/subscription/freebsd-virtualizatio...

> I vaguely recall someone telling me that the qemu source code is fairly readable [...]

That someone was either pranking you or knows a mysterious part of the qemu codebase that's unlike the rest of the criminally underdocumented qemu soup. Or my standards are too high.

The source is okay, the problem is that there's really no docs/examples for even the basic stuff. Do you want to boot from a virtual block device and PXE from a network drive ... well, maybe these 300 command line arguments in some random order will work. Eventually. Have fun. So reading the source is a big help in that. But it's not ideal. And libvirt/oVirt/etc are helpful, but not for just that quick and dirty "let's try something out with the linux kernel in qemu" thing.
That person might have been sarcastic; I've been known to miss that, especially in written conversations. I also might misremember.
The xhyve fork/port of bhyve for macOS is worth mentioning: https://github.com/machyve/xhyve
a quick bird's eye view summary to get your bearings:

(caveat, I'm taking shortcuts so things may not survive scrutiny but it should be good enough for an overall understanding and getting down to the truth if you so desire)

emulation: software that implements and simulates hardware, mimicking it as the software running inside the emulator expects

this is slow, because you really have to pretend to be the hardware down to the details (simulate clocks, irqs, video output, etc). often you can get away with not being 100% exact and accurate, depending on the emulated software's assumptions, or you don't care if some specific output is incorrect (e.g. a graphical difference or oddity in games), which gives you some nice speedups. 100% accurate emulators can and do exist (see mesen, higan), but you can see the difference in requirements compared with less accurate emulators.

some people quickly figured out that if you are running software compiled for a specific target (e.g. an x86_64 CPU) that is the same target as the host then it kinda doesn't make sense to emulate the CPU... and that you can kinda obviously pass through cpu instructions instead of simulating the very hardware you're running on... this is:

virtualization: instead of emulating, expose a non-emulated device from the host that the software is known or expected to be compatible with, thus saving a lot of cycles.

for the CPUs this is typically done through an instruction that creates an isolated context in which the software will run. reminder: a CPU has multiple "rings" already under which software is run, for security and stability reasons (memory protection, etc...), e.g. on x86 the kernel runs under ring 0 and userland processes are under ring 3, each in a different context, preventing one from accessing the other in the same ring or from moving to a more privileged ring, all checked at the hardware level. the virtualization instructions roughly do the same thing and nest things so that a virtual cpu runs code in a separate ring+context, in a way that the guest OS thinks it's alone. see also type 1 (xen, hyper-v) vs type 2 (qemu, vmware, parallels) hypervisors.

other devices can be implemented in lightweight, efficient ways, and exposed as virtual devices to the virtual machine for the nested OS to pick up. e.g. instead of emulating a full PCI network card complete with pseudophysical hardware matching a real one, one can have a simpler device exposed that merely takes ethernet packets and injects them into the host kernel's network stack. or same for IO: instead of emulating a sata device complete with commands and all, one could just expose a block device and ask the host kernel to read its data from some host file directly. this requires dedicated drivers for these virtual devices in the guest OS though.
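
a rough Python sketch of the "just expose a block device backed by a host file" idea. everything here is an assumption made for illustration: the `VirtualBlockDevice` class, its `read`/`write` request shape, and the 512-byte sector size are hypothetical and do not follow any real virtual device specification. an in-memory `io.BytesIO` stands in for the host file.

```python
import io

SECTOR_SIZE = 512  # assumed sector size for this sketch


class VirtualBlockDevice:
    """Hypothetical paravirtual disk: the guest asks for sectors by number,
    and the host serves them straight from a seekable backing image."""

    def __init__(self, backing):
        self.backing = backing  # any seekable binary file object

    def read(self, sector, count=1):
        self.backing.seek(sector * SECTOR_SIZE)
        data = self.backing.read(count * SECTOR_SIZE)
        # Pad reads that run past the end of the image with zeros.
        return data.ljust(count * SECTOR_SIZE, b"\0")

    def write(self, sector, data):
        if len(data) % SECTOR_SIZE:
            raise ValueError("writes must be whole sectors")
        self.backing.seek(sector * SECTOR_SIZE)
        self.backing.write(data)


image = io.BytesIO(bytes(4 * SECTOR_SIZE))  # 4-sector in-memory "disk image"
disk = VirtualBlockDevice(image)
disk.write(2, b"A" * SECTOR_SIZE)
print(disk.read(2)[:4])  # prints b'AAAA'
```

note there is no SATA command set, no controller registers, nothing to simulate: the guest driver and the host agree on a tiny request format and that's it, which is why the guest needs a dedicated driver.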

virtualization is as good as native because often it's literally native, only with hardware sandboxes and maybe a few added indirect wrappers, so much so that Windows on many PCs and apps and games on Xboxes actually run in VMs under Hyper-V all the time!

so back to the CPU side of things, you can notably see how it is impossible with virtualisation for a CPU to execute anything other than its own instruction set.

now, executing foreign instructions more efficiently than literally interpreting them can be achieved through various means such as binary translation or dynamic recompilation, but it still has to take some guest assumptions into account, e.g. Rosetta 2 (which is neither virtualisation nor emulation) translates x64 code to arm64, but x64 assumes a stricter ordering of memory operations than arm64 provides by default, so there's a special Apple-specific CPU mode that switches a thread to a different memory model, still running arm64 instructions (translated from x64) but in a way that matches the x64 behaviour.
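
a toy Python sketch of the dynamic recompilation idea: translate a guest basic block once, cache the result, and reuse it on every later execution instead of re-interpreting each instruction. the guest ops (`MOV`, `ADD`), `translate_block`, and `TranslationCache` are hypothetical names for this illustration, Python closures stand in for generated host machine code, and nothing here reflects how Rosetta 2 or any real translator is implemented.

```python
def translate_block(block):
    """Turn a guest basic block into one host-side callable."""
    ops = []
    for op, reg, imm in block:
        if op == "ADD":
            def step(regs, reg=reg, imm=imm):
                regs[reg] += imm
        elif op == "MOV":
            def step(regs, reg=reg, imm=imm):
                regs[reg] = imm
        else:
            raise ValueError(f"cannot translate guest op: {op}")
        ops.append(step)

    def compiled(regs):
        for run_step in ops:
            run_step(regs)

    return compiled


class TranslationCache:
    def __init__(self, guest_code):
        self.guest_code = guest_code  # guest address -> list of (op, reg, imm)
        self.cache = {}               # guest address -> translated callable
        self.translations = 0         # how many blocks were actually translated

    def run_block(self, addr, regs):
        compiled = self.cache.get(addr)
        if compiled is None:
            # First visit: pay the translation cost once.
            compiled = translate_block(self.guest_code[addr])
            self.cache[addr] = compiled
            self.translations += 1
        compiled(regs)


guest_code = {0x1000: [("MOV", "r0", 10), ("ADD", "r0", 5)]}
tc = TranslationCache(guest_code)
regs = {"r0": 0}
for _ in range(3):
    tc.run_block(0x1000, regs)
print(regs["r0"], tc.translations)  # prints 15 1
```

the block runs three times but is translated only once, which is where the speedup over plain interpretation comes from in hot loops.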

there are many more gritty details, and things are not always as clear cut as I described (or flat out lied about, the Feynman way) but it should be a good enough model to start with.

A wonderful answer. Thank you!