
Resources
People often ask me "How did you learn how to hack?" The answer: by reading. This page is a collection of the blog posts and other articles that I have accumulated over the years of my journey. Enjoy!

An EPYC escape: Case-study of a KVM breakout - 540

Felix Wilhelm - Project Zero (P0) - Posted 4 Years Ago

Kernel-based Virtual Machine (KVM) is the de facto standard hypervisor for Linux-based cloud environments. Outside of Microsoft Azure, almost every major cloud provider runs on KVM. The vulnerability lives in KVM's AMD-specific code and allows a full virtual machine escape.
KVM is an open source type-2 hypervisor (it runs on top of an OS, so it's not bare-metal). KVM is implemented as a handful of kernel modules that expose low-level IOCTL APIs to user space processes, so that a Virtual Machine Manager (VMM) can manage the environment. Popular VMMs include QEMU, LKVM, crosvm and Firecracker.
The design of KVM is quite smart: all of the complex low-level code for providing virtual disk, network or GPU access can be implemented in userspace, apart from a few performance-sensitive operations. Because of this, the attack surface of KVM itself is quite limited! A KVM vulnerability would result in compromise of the entire host, but it would be hard to find.
Nested virtualization means running a hypervisor inside of a virtual machine, creating layers of virtualization. It has become more popular as Virtualization-Based Security has become more prevalent. AMD's virtualization extension is called SVM (Secure Virtual Machine). When a guest executes SVM instructions in a nested environment, the host needs to intercept them and emulate their behavior.
SVM works by adding six new instructions to x86_64, which are enabled when the SVME bit is set in the EFER MSR. The VMRUN instruction is responsible for running a guest VM. It takes the page-aligned physical address of a structure that describes the state and configuration of the VM, called the Virtual Machine Control Block (VMCB).
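As a rough illustration of the two preconditions just mentioned, the sketch below checks the SVME bit of an EFER value and whether a candidate VMCB address is page-aligned. The bit position (12) and the 4 KiB page size follow the AMD64 architecture. The function names are made up for this example and are not taken from KVM, and real hardware performs many more checks than this.

```python
EFER_SVME = 1 << 12   # SVME bit in the EFER MSR (AMD64)
PAGE_SIZE = 4096      # a VMCB must start on a 4 KiB page boundary


def svm_enabled(efer: int) -> bool:
    """Return True if the SVME bit is set in the given EFER value."""
    return bool(efer & EFER_SVME)


def vmrun_preconditions_ok(efer: int, vmcb_pa: int) -> bool:
    """Toy check (hypothetical helper): SVM enabled and VMCB page-aligned."""
    return svm_enabled(efer) and vmcb_pa % PAGE_SIZE == 0
```

For example, vmrun_preconditions_ok(EFER_SVME, 0x2004) is False because the address is not a multiple of 4096.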
The VMCB has two main parts. The State Save Area stores the values of all the guest registers, including segment and control registers. The Control Area describes the configuration of the VM: which features are enabled, which actions are intercepted, and values such as page table addresses. For nested virtualization to work, KVM intercepts execution of the VMRUN instruction and creates its own VMCB based on the VMCB the L1 guest prepared.
Where do bugs live? In complex and difficult to understand code! Nested virtualization seems like the definition of complex to me. Because of this, the author of the article manually reviewed the nested SVM code looking for logic bugs.
When switching between contexts, KVM cannot trust the guest-provided VMCB and needs to validate it. This validation has a classic problem: a double fetch. The check is done on the first fetch, and then the data is read from guest memory a second time when it is used.
If the guest supplies valid data for the first fetch and swaps it before the second, the configuration that actually gets used can be invalid. This is known as a time of check vs. time of use (TOCTOU) bug. So, what can be done with a weird VMCB configuration?
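The double fetch pattern is easier to see in a small simulation. This is a minimal sketch, not KVM code: RacyGuestMemory is a hypothetical stand-in for guest memory that another vCPU flips between reads, and the "intercept_vmrun must be set" rule is an assumed validation rule chosen for illustration.

```python
class RacyGuestMemory:
    """Guest-owned data that changes after a given number of reads,
    standing in for a second guest vCPU modifying it concurrently."""

    def __init__(self, first_value, later_value, flips_after=1):
        self._first = first_value
        self._later = later_value
        self._flips_after = flips_after
        self.reads = 0

    def read(self):
        value = self._first if self.reads < self._flips_after else self._later
        self.reads += 1
        return value


def is_valid(config):
    # Assumed rule for this sketch: the intercept flag must be set.
    return config["intercept_vmrun"]


def load_double_fetch(mem):
    """Vulnerable: validates one read, then uses a second read."""
    if not is_valid(mem.read()):      # fetch 1: time of check
        raise ValueError("rejected")
    return mem.read()                 # fetch 2: time of use, may differ


def load_single_fetch(mem):
    """Safer: copy once into host-owned memory, then check and use the copy."""
    snapshot = dict(mem.read())
    if not is_valid(snapshot):
        raise ValueError("rejected")
    return snapshot
```

With a memory object that flips from {"intercept_vmrun": True} to {"intercept_vmrun": False} after one read, load_double_fetch returns the invalid configuration even though validation passed, while load_single_fetch returns the validated copy.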
The SVM VMCB configuration contains a bit that enables or disables interception of the VMRUN instruction. When this bit is raced, crazy things can happen! Normally, when an L2 guest executes VMRUN, the function nested_svm_exit_handled returns NESTED_EXIT_DONE, which causes an exit from L2 to L1 so that L1 can handle it.
However, when the intercept bit in svm->nested.ctl is set to 0, the exit is handled by KVM itself on behalf of the L2 VM. This results in a second call to nested_svm_vmrun, which the code was NOT written to handle. As a result, the saved L1 context is overwritten with L2 guest data. This becomes a security issue because the Model Specific Registers (MSRs), which hold many permission bits, become controllable.
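The sequence described above can be sketched as a toy state machine. Everything here is a simplification for illustration: ToyVcpu, its fields and the string values are hypothetical, and only the names NESTED_EXIT_DONE and NESTED_EXIT_HOST echo constants used in KVM. The point is just to show how a save slot written on the assumption "we are coming from L1" gets clobbered when the same path runs while L2 is active.

```python
from dataclasses import dataclass
from typing import Optional

NESTED_EXIT_HOST = "host"   # placeholder value: exit handled by the host itself
NESTED_EXIT_DONE = "done"   # placeholder value: exit reflected to L1


@dataclass
class ToyVcpu:
    context: str = "L1"            # whose state is currently loaded
    hsave: Optional[str] = None    # saved L1 state while a nested guest runs
    intercept_vmrun: bool = True   # host's cached copy of the intercept bit

    def exit_handled(self) -> str:
        return NESTED_EXIT_DONE if self.intercept_vmrun else NESTED_EXIT_HOST

    def nested_vmrun(self) -> None:
        # Written on the assumption that the caller is always L1.
        self.hsave = self.context
        self.context = "L2"

    def guest_executes_vmrun(self) -> None:
        if self.context == "L1":
            self.nested_vmrun()
        elif self.exit_handled() == NESTED_EXIT_DONE:
            # Expected path: exit from L2 back to L1, which handles VMRUN.
            self.context = self.hsave
            self.hsave = None
        else:
            # Raced path: the host handles VMRUN while L2 is active,
            # so the saved L1 state is overwritten with L2 state.
            self.nested_vmrun()
```

Running guest_executes_vmrun() twice on ToyVcpu(intercept_vmrun=False) leaves hsave holding "L2" instead of "L1", which is the overwrite the article describes, in miniature.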
Eventually, this edge case becomes exploitable because changing the SVME bit of the EFER MSR triggers dynamic allocations and frees. When SVME is disabled from L3, the memory backing this state in L1 is freed while it is still in use. This freed memory can then be overwritten with particular 0s and 1s to give the guest access to the host MSRs.
What does having access to the MSRs actually do? The VM_HSAVE_PA MSR stores the physical address of the host save area. If this MSR points to a memory location under the attacker's control, a fake malicious host state can be used to execute our own code. However, this attack was not as straightforward as it sounds.
Even with the ability to control the flow of execution, the author still needed an information leak in order to proceed (ASLR strikes again). After a BUNCH of reading, they landed on Instruction Based Sampling (IBS) as the perfect way to get an infoleak. It samples instructions as they execute and collects a wide range of information about them. This information is logged in MSRs, making it perfect for the infoleak.
Finally, the author created a kernel ROP chain. This payload disables write protection on kernel memory and then has the CPU copy a larger piece of shellcode to somewhere in the kernel. At this point, with code execution in the kernel, it's a matter of fixing things up and you have won!
This VM escape came from a fairly simple TOCTOU bug. Even though the impact was not immediately apparent, by fiddling with the settings the author eventually compromised the machine. The bug came first; working out what could be done with it came second. This goes to show that any bug can be useful in the right hands!

Maxwell Dulin
