Why do AMD and Intel insist on making virtualization complex?
OK, so I was reading up on the new virtualization architectures that both AMD and Intel introduced. My first reaction: why the heck did they make it so unnecessarily complex?
Here's the thing: in the x86 architecture, a handful of instructions make virtualization hard, 17 of them to be exact. What I mean by this is that these instructions don't throw an exception when run in user mode (ring 3), but either behave differently than they do in system mode (ring 0) or expose sensitive information.
VMware and other virtualization technologies on x86 address this by using dynamic translation techniques. This isn't a new idea. If you are from the emulation world, then this is just a combination of an emulator and a just-in-time compiler. People writing emulators tend to call this technique "dynamic recompilation." This way you can intercept any instruction you like, and you can still run the code at nearly native speed.
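To make the idea concrete, here is a deliberately tiny sketch of the translation step in Python. It is a toy model, not how VMware does it: "instructions" are plain strings rather than real x86 encodings, the names `SENSITIVE`, `translate_block` and `translation_cache` are made up for illustration, and only three of the sensitive instructions are listed.

```python
# Toy model: instructions are strings, not real x86 machine code.
# Only a few of the sensitive instructions are listed, for illustration.
SENSITIVE = {"sidt", "sgdt", "popf"}

# Translated blocks are cached so each block is rewritten only once,
# which is where the near-native speed comes from.
translation_cache = {}


def translate_block(block):
    """Rewrite one basic block; sensitive instructions become monitor calls."""
    key = tuple(block)
    if key not in translation_cache:
        translation_cache[key] = [
            ("emulate", insn) if insn.split()[0] in SENSITIVE else ("native", insn)
            for insn in block
        ]
    return translation_cache[key]


guest_kernel_block = ["mov eax, 1", "sidt [ebx]", "add eax, 2"]
translated = translate_block(guest_kernel_block)
```

Everything tagged "native" would be copied through untouched, while the one "emulate" entry is where the monitor gets control and can fake the result.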
However, to get the real speed gains, user mode code is run unchanged, natively. This means that if user mode code executes a sensitive instruction that doesn't throw an exception, there is no way to intercept it. Software can leverage this fact to find out information that you would normally want hidden from the guest OS.
A classic example of this is the "redpill" code. What it basically does is call the sidt instruction from user mode. In kernel mode, VMware would intercept this, emulate the desired result and continue with the OS none the wiser. However, when executed in user mode, there is no trap; it just works. In fact, it provides the caller with the real IDT location in use on the physical machine, not the one the guest kernel set up. Aside from the fact that this would be useful information if an attacker ever figured out a way to write to memory in the host, the discrepancy between the two results is a clear-cut sign of virtualization. This is just one of the many instructions that can be used to detect virtualization.
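The decision redpill makes once sidt has stored its result is tiny, and can be sketched in Python. Running sidt itself needs native code, so this sketch only covers the check: it assumes the 6-byte image that 32-bit sidt stores (a 16-bit limit followed by a 32-bit base) and uses redpill's heuristic that a top byte of the base above 0xd0 means a virtual machine. The two sample images are hand-written, illustrative values, and the function names are mine.

```python
import struct


def decode_idtr(image: bytes):
    """Split the 6-byte value stored by 32-bit sidt into (limit, base)."""
    limit, base = struct.unpack("<HI", image)
    return limit, base


def looks_virtualized(image: bytes) -> bool:
    """Redpill-style heuristic: IDT base relocated very high in memory."""
    _, base = decode_idtr(image)
    return (base >> 24) > 0xD0


# Illustrative samples only, not captured from real machines.
native_sample = struct.pack("<HI", 0x07FF, 0x8003F400)
guest_sample = struct.pack("<HI", 0x07FF, 0xFFC18000)
```

With these samples, `looks_virtualized(native_sample)` is false and `looks_virtualized(guest_sample)` is true. The threshold is a heuristic tied to where particular operating systems and monitors happened to place the IDT, so treat it as a demonstration of the leak rather than a reliable detector.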
So what's the solution? According to both Intel and AMD, the solution is a new mode of operation which is more privileged than ring 0. A much simpler approach would be to add a new flag to a system register that, when set, makes all sensitive operations cause an exception in user mode. This would allow the guest kernel code to run unmodified in user mode. No more dynamic translation, making the whole virtualization concept a lot simpler.
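Here is a toy model of what that flag would buy you, again in Python. Nothing here corresponds to real hardware: `ToyCPU`, `ToyMonitor`, `SensitiveTrap` and the `trap_sensitive` flag are all hypothetical names, and sidt stands in for every sensitive instruction.

```python
class SensitiveTrap(Exception):
    """Raised when a sensitive instruction is blocked in user mode."""


class ToyCPU:
    def __init__(self, host_idtr, trap_sensitive=False):
        self.host_idtr = host_idtr
        # The hypothetical new flag in a system register.
        self.trap_sensitive = trap_sensitive

    def sidt(self, ring):
        # With the flag set, user mode (ring 3) faults instead of leaking.
        if ring == 3 and self.trap_sensitive:
            raise SensitiveTrap("sidt")
        return self.host_idtr


class ToyMonitor:
    def __init__(self, cpu, guest_idtr):
        self.cpu = cpu
        self.guest_idtr = guest_idtr

    def run_guest_sidt(self):
        """Run the guest's sidt in ring 3, emulating it if the CPU traps."""
        try:
            return self.cpu.sidt(ring=3)
        except SensitiveTrap:
            return self.guest_idtr


HOST_IDTR, GUEST_IDTR = 0xFFC18000, 0x8003F400  # illustrative values
leaky = ToyMonitor(ToyCPU(HOST_IDTR, trap_sensitive=False), GUEST_IDTR)
fixed = ToyMonitor(ToyCPU(HOST_IDTR, trap_sensitive=True), GUEST_IDTR)
```

In this model `leaky.run_guest_sidt()` hands back the host value, which is today's situation, while `fixed.run_guest_sidt()` hands back the value the guest expects. That is plain trap-and-emulate, with no translation step anywhere.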
If this is done, there are only a few challenges left. The first would be protecting the guest OS code from the guest user code. Since they are both running in user mode, page-level protection doesn't work, because it only distinguishes user pages from supervisor pages. If we are looking at x86 only and not x86-64, then segmentation is a viable solution. (I did hear that certain AMD models bring back segmentation limits.) The rest of the concerns are the usual ones when it comes to virtualization, such as properly emulating devices. For these, the usual strategies of page fault trapping and I/O port trapping should work as well as they have in the past.
I'm not the only one to think of this; the folks at pagetable.com came to a very similar conclusion. The question is, why wouldn't Intel and AMD think of this? They are pretty smart. And even so, if I worked for VMware I'd probably push for a technology like this.