The Basics of Intel VT-x Extensions
I tortured myself for a month understanding the Intel Architecture Software Development Manual
Cover Illustration by t0meku
Traditionally, x86 processors lacked built-in virtualization support, leading to significant challenges when implementing efficient Virtual Machine Monitors (VMMs) or hypervisors.
But then Intel introduced Intel VT, a hardware-based solution for virtualization acceleration. With VT-x, Intel aimed to reduce hypervisor complexity by offloading certain virtualization tasks to hardware, allowing for the development of simpler, more reliable VMMs.
This hardware solution also improves separation between different virtual machines and between virtual machines and the VMM, crucial for security and stability in multi-tenant environments.
In this article, I'll be developing a hypervisor using Intel's VT-x virtualization technology, mostly because I got bored and decided to read the Intel Architecture SDM. I must admit I wasn't strong enough to thug it all out, so some parts of this article are also based on Sina Karvandi's tutorial on how to build a Hypervisor from scratch with Intel VT.
Verify CPU support for VT-x
The VMM (Virtual Machine Monitor) startup process is described in detail in Chapter 31.5 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3C: System Programming Guide, Part 3.
But to enable and initialize virtualization technology (VT) on a CPU, we must first verify VT support using the cpuid instruction and specific MSRs. Next, allocate 4KB-aligned non-paged memory for the vmxon region, with the size specified in the IA32_VMX_BASIC register.
Then we initialize the vmxon region by setting its version number, obtained from the lower 4 bytes of IA32_VMX_BASIC (typically 1). Then, configure the cr0 and cr4 registers according to the requirements set by the IA32_VMX_CR0_FIXED0, IA32_VMX_CR0_FIXED1, IA32_VMX_CR4_FIXED0, and IA32_VMX_CR4_FIXED1 registers.
We then need to ensure the IA32_FEATURE_CONTROL register is correctly set, with bits 0 and 2 both set to 1. This can be verified by reading the register and performing a bitwise AND with 5. Finally, execute the vmxon instruction, passing a pointer to the physical address of the vmxon region. Success is indicated by rflags.cf=0, while failure results in rflags.cf=1.
// Check if BIOS has enabled VT
BOOLEAN VmxIsCheckSupportVTBIOS(VOID)
{
    ULONG64 value = __readmsr(IA32_FEATURE_CONTROL);
    return (value & 0x5) == 0x5;
}

// Check if CPU supports VT
BOOLEAN VmxIsCheckSupportVTCPUID(VOID)
{
    int cpuidInfo[4];
    __cpuidex(cpuidInfo, 1, 0);
    // CPUID.1:ECX.VMX[bit 5] = 1 if VT is supported
    return (cpuidInfo[2] & (1 << 5)) != 0;
}

// Check if VT is enabled in CR4
BOOLEAN VmxIsCheckSupportVTCr4(VOID)
{
    ULONG64 cr4 = __readcr4();
    // CR4.VMXE[bit 13] = 1 if VT is enabled
    return (cr4 & (1 << 13)) != 0;
}

VOID CheckVT(VOID)
{
    KIRQL oldIrql;
    PROCESSOR_NUMBER procNumber;
    KeGetCurrentProcessorNumberEx(&procNumber);
    oldIrql = KeRaiseIrqlToDpcLevel();
    if (VmxIsCheckSupportVTCPUID())
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
                   "[INFO]: CPU supports VT (Processor %u:%u)\n",
                   procNumber.Group, procNumber.Number);
    }
    if (VmxIsCheckSupportVTBIOS())
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
                   "[INFO]: BIOS has enabled VT (Processor %u:%u)\n",
                   procNumber.Group, procNumber.Number);
    }
    if (VmxIsCheckSupportVTCr4())
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
                   "[INFO]: VT is enabled in CR4 (Processor %u:%u)\n",
                   procNumber.Group, procNumber.Number);
    }
    KeLowerIrql(oldIrql);
}
VMXON Execution
Next we're gonna need to start with VMXON Execution, which transitions the logical processor from normal operation into VMX root operation. We first need to allocate a 4KB-aligned, non-paged memory block for the VMXON region. This alignment is crucial for the proper functioning of virtualization features, and non-paged memory ensures that the VMXON region is always accessible and not swapped out to disk.
To allocate this memory, you typically use a function like MmAllocateContiguousMemorySpecifyCache. This function allows you to specify the size of the allocation (which should be PAGE_SIZE, or 4KB), as well as the physical address range within which the allocation should occur.
PHYSICAL_ADDRESS lowPhys, highPhys, boundary;
lowPhys.QuadPart = 0;
highPhys.QuadPart = -1;
boundary.QuadPart = 0; // no boundary restriction
pVcpu->VmxOnRegion = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE, lowPhys, highPhys, boundary, MmCached);
if (!pVcpu->VmxOnRegion)
{
    // Handle allocation failure
    return STATUS_INSUFFICIENT_RESOURCES;
}
pVcpu->VmxOnRegionPhys = MmGetPhysicalAddress(pVcpu->VmxOnRegion);
We can set the lower bound of the physical address to 0 and the upper bound to -1 (which effectively means the highest possible address), allowing the system to allocate the memory anywhere in physical memory that meets our requirements.
Once the memory is allocated, it needs to be properly initialized. This process is described in detail in section 25.11.5 of the Intel Software Developer's Manual. The VMXON region is a block of memory the processor uses to support VMX operation.
After allocating the memory, you need to use a function like RtlZeroMemory to clear the entire region. This step is crucial to prevent any potential issues caused by leftover data in the memory pages. All bits in this region should be set to zero, with one important exception: the first four bytes of the VMXON region.
The first four bytes of the VMXON region must be filled with the VMX revision identifier. This identifier is stored in the IA32_VMX_BASIC Model Specific Register (MSR). You can read this MSR and use its lower 32 bits as the revision identifier.
ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
*(PULONG)pVmxonRegion = (ULONG)vmxBasic;
It's worth noting that the lower 4 bytes of this register typically contain the value 1. However, this may change in future CPU versions, so it's always best to read the actual value from the MSR rather than hardcoding it.
The VMXON region has several important requirements that must be met:
It must be 4KB aligned, which we ensured during the memory allocation step.
It should not set any bits beyond the processor's physical address width. This is typically not an issue if you're using the allocation method described above.
The first 4 bytes should contain the VMCS revision identifier, which we set using the code above.
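Strictly speaking, the revision identifier occupies bits 30:0 of IA32_VMX_BASIC, and bit 31 of the value written to the VMXON region must stay clear. As a user-mode sketch of just that masking (the helper name is mine, not a kernel API):

```c
#include <stdint.h>
#include <assert.h>

/* Extract the VMX revision identifier from a raw IA32_VMX_BASIC value.
   Bits 30:0 hold the identifier; masking with 0x7FFFFFFF keeps bit 31
   clear, as required for the first dword of the VMXON region. */
static uint32_t VmxRevisionId(uint64_t vmxBasic)
{
    return (uint32_t)(vmxBasic & 0x7FFFFFFF);
}
```

On typical hardware the result is simply 1, matching the note above about not hardcoding it.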
Before executing VMXON, we must configure the CR0 and CR4 control registers according to the requirements specified by four MSRs: IA32_VMX_CR0_FIXED0, IA32_VMX_CR0_FIXED1, IA32_VMX_CR4_FIXED0, and IA32_VMX_CR4_FIXED1.
For CR0, any bit that is set to 1 in IA32_VMX_CR0_FIXED0 must be set to 1, and any bit that is set to 0 in IA32_VMX_CR0_FIXED1 must be set to 0. The same rule applies to CR4 with IA32_VMX_CR4_FIXED0 and IA32_VMX_CR4_FIXED1. To implement these rules, read the current values of CR0 and CR4, read the values of the fixed MSRs, and then adjust CR0 and CR4 accordingly.
ULONG64 cr0 = __readcr0();
ULONG64 cr4 = __readcr4();
ULONG64 cr0_fixed0 = __readmsr(IA32_VMX_CR0_FIXED0);
ULONG64 cr0_fixed1 = __readmsr(IA32_VMX_CR0_FIXED1);
ULONG64 cr4_fixed0 = __readmsr(IA32_VMX_CR4_FIXED0);
ULONG64 cr4_fixed1 = __readmsr(IA32_VMX_CR4_FIXED1);
cr0 = (cr0 | cr0_fixed0) & cr0_fixed1;
cr4 = (cr4 | cr4_fixed0) & cr4_fixed1;
__writecr0(cr0);
__writecr4(cr4);
This code ensures that all required bits are set in CR0 and CR4, and all bits that must be clear are cleared, while preserving the values of other bits that don't need to change. After completing all those steps, we are finally ready to execute the VMXON instruction.
int error = __vmx_on(&pVcpu->VmxOnRegionPhys.QuadPart);
The __vmx_on function is a compiler intrinsic (provided by MSVC for VT-x) that takes a pointer to the physical address of the VMXON region we prepared earlier; a return value of 0 indicates success.
It's important to note that there are additional checks and requirements specified in the Intel manual, particularly in section 26.3.1.1.
The IA32_FEATURE_CONTROL MSR must be properly configured to allow VMXON to execute. Specifically, bit 0 (the lock bit) must be set, and either bit 1 (enabling VMXON inside SMX operation) or bit 2 (enabling VMXON outside SMX operation) must be set, depending on your specific use case.
The CR0 bits corresponding to Protected Mode (bit 0) and Paging (bit 31) must be set to 1, and bit 5 of CR4 (Physical Address Extension) must be set to 1 for VMXON to succeed. CR4.VMXE (bit 13) must also be set, otherwise VMXON raises #UD.
If you're operating in 64-bit mode, you need to ensure that the IA32_EFER.LMA bit is set to 1, indicating that the processor is in IA-32e mode.
These additional checks and requirements ensure that the processor is in the correct state to begin VMX operation. Failing to meet any of these requirements will cause the VMXON instruction to fail, potentially with a #GP (General Protection) exception.
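The IA32_FEATURE_CONTROL requirement reduces to pure bit tests, so it can be sketched and sanity-checked in user mode; the macro and function names here are mine, not a kernel API:

```c
#include <stdint.h>
#include <assert.h>

#define FC_LOCK          (1ULL << 0) /* lock bit */
#define FC_VMXON_IN_SMX  (1ULL << 1) /* VMXON allowed inside SMX */
#define FC_VMXON_OUT_SMX (1ULL << 2) /* VMXON allowed outside SMX */

/* Returns 1 if a raw IA32_FEATURE_CONTROL value permits VMXON outside of
   SMX operation: the lock bit and the outside-SMX enable must both be set. */
static int FeatureControlAllowsVmxon(uint64_t fc)
{
    return (fc & (FC_LOCK | FC_VMXON_OUT_SMX)) == (FC_LOCK | FC_VMXON_OUT_SMX);
}
```

This is the same `(value & 0x5) == 0x5` test from VmxIsCheckSupportVTBIOS, spelled out bit by bit.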
Allocating VMCS Memory and VMCLEAR Execution
After entering VMX operation mode, we then need to set up the Virtual Machine Control Structure (VMCS). The VMCS is a data structure in memory that controls the behavior of a virtual machine in Intel VT-x. It stores the guest state, host state, and control information for a virtual machine.
This process bears similarities to the allocation of the VMXON region, but it comes with its own set of specific requirements. The memory allocated for the VMCS must be 4KB aligned and non-paged, ensuring that it remains accessible at all times and isn't swapped out to disk. This alignment and memory type are crucial for the proper functioning of the virtualization features.
Once the memory is allocated, it needs to be initialized. The first 4 bytes of this newly allocated memory block play a special role: they must be filled with the lower 4 bytes of the IA32_VMX_BASIC Model Specific Register (MSR). This initialization step is critical as it sets up the VMCS with the correct version identifier, ensuring compatibility with the processor's VMX implementation.
The VMCS is a complex data structure that will eventually store a wide array of information about the virtual machine. This includes various registers, control areas, and even the vmexit control area. However, at this stage, we're only setting up the basic structure. The detailed configuration of these areas will come later, through the use of the VMWRITE instruction.
PHYSICAL_ADDRESS lowPhys, highPhys, boundary;
lowPhys.QuadPart = 0;
highPhys.QuadPart = -1;
boundary.QuadPart = 0; // no boundary restriction
pVcpu->VmcsRegion = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE, lowPhys, highPhys, boundary, MmCached);
if (!pVcpu->VmcsRegion)
{
    // Handle allocation failure
    return STATUS_INSUFFICIENT_RESOURCES;
}
RtlZeroMemory(pVcpu->VmcsRegion, PAGE_SIZE);
pVcpu->VmcsRegionPhys = MmGetPhysicalAddress(pVcpu->VmcsRegion);
// Initialize VMCS with revision identifier
ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
*(PULONG)pVcpu->VmcsRegion = (ULONG)vmxBasic;
After allocating and performing the basic initialization of the VMCS memory, the next step in our process is to use the VMCLEAR instruction. VMCLEAR serves several important purposes in the setup of our virtual machine environment. First, it initializes the VMCS, setting its launch state to "clear". This is a necessary step before the VMCS can be used for a virtual machine. Second, VMCLEAR invalidates any cached VMCS data that the processor might be holding from previous uses of this VMCS, ensuring we start with a clean slate. Finally, VMCLEAR ensures that all VMCS data is written to the VMCS region in memory, maintaining data consistency.
The VMCLEAR instruction is relatively simple to use. It takes a pointer to the physical address of the VMCS as its operand.
int error = __vmx_vmclear(&pVcpu->VmcsRegionPhys.QuadPart);
if (error)
{
    // Handle VMCLEAR failure
    return STATUS_UNSUCCESSFUL;
}
It's crucial to check the return value of VMCLEAR. If it fails, something has gone wrong in our VMX setup, and we should not proceed with further VMX operations using this VMCS. Only then do we make this VMCS the current VMCS using the VMPTRLD instruction:
error = __vmx_vmptrld(&pVcpu->VmcsRegionPhys.QuadPart);
if (error)
{
    // Handle VMPTRLD failure
    return STATUS_UNSUCCESSFUL;
}
This step is crucial because it makes the VMCS active and current, allowing us to use VMREAD and VMWRITE instructions to configure it in the subsequent steps.
Setup VMCS
As the VMCS acts as the interface between the hypervisor and VM, controlling how the virtual environment operates, we need to configure VMCS fields to determine the guest's perceived hardware state and set rules for VM exits and entries.
Chapter 24.3 of the Intel white paper describes the fields in the VMCS control area in detail. As mentioned in the previous article, the vmcs fields that need to be set are:
Guest state fields: when a vm-exit occurs, the processor state (registers, etc.) is stored in this area. When entering the virtual machine (turning on VT), the processor state in the guest is determined by the values of the corresponding fields in this area.
Host state fields: when a vm-exit event occurs in the virtual machine, the host takes over the CPU. The CPU returns from the guest to the host, the values in this area are loaded into the corresponding registers, and execution continues from the rip set here.
VM-execution control fields
VM-exit control fields
VM-entry control fields
In addition to these five areas, the VMCS also has a read-only VM-exit information area, which stores details about the most recent VM exit, including the error number when a vmx instruction fails.
Now there are four important fields that need to be obtained when filling the guest and host areas: the rip and rsp used when entering the guest, and the rip and rsp used after returning from the guest to the host. Here we want the system to continue operating normally after entering the guest, that is, to keep running from where it left off, so we must obtain, through a function, the return address and the rsp of the calling function. As for the host rip after returning to the host area: since a return from the virtual machine is always a vmexit event that needs to be processed, the host rip must be set to the vmexit handler. The host rsp needs a freshly allocated memory area for the vmexit handler's stack; if the handler reused the stack the guest was using before the exit, the stack contents would be destroyed, resulting in unpredictable behavior.
Before initializing the vmcs area, we must first obtain the guest's rip and rsp. Since we want the virtual machine to continue executing the code that was running before we entered the guest, we need the return address of the VmxInit function and the caller's rsp saved on the stack. After entering the guest, execution starts directly from VmxInit's return address, with the stack set to the caller's stack. Here we use the compiler intrinsic _AddressOfReturnAddress, which returns a pointer to the current function's return address on the stack. Through this pointer we get the rip to use; the caller's rsp is the address 8 bytes above the slot holding the return address (retAddr + 1). The code to obtain the guest rip and rsp is as follows:
PULONG64 retAddr = (PULONG64)_AddressOfReturnAddress();
ULONG64 guestEsp = (ULONG64)(retAddr + 1);
ULONG64 guestEip = *retAddr;
Therefore, the general framework of the VmxInit function is as follows. hostEip passes in the address of the vmexit handler; after a vmexit event occurs, execution jumps to this handler for processing.
int VmxInit(ULONG64 hostEip)
{
    PVMXCPUPCB pVcpu = VmxGetCurrentCPUPCB();
    pVcpu->cpuNumber = KeGetCurrentProcessorNumberEx(NULL);
    PULONG64 retAddr = (PULONG64)_AddressOfReturnAddress();
    ULONG64 guestEsp = (ULONG64)(retAddr + 1);
    ULONG64 guestEip = *retAddr;
    int error = VmxInitVmOn();
    if (error)
    {
        DbgPrintEx(77, 0, "[db]:vmon initialization failed error = %d, cpunumber %d\r\n", error, pVcpu->cpuNumber);
        return error;
    }
    error = VmxInitVmcs(guestEip, guestEsp, hostEip);
    if (error)
    {
        DbgPrintEx(77, 0, "[db]:vmcs initialization failed error = %d, cpunumber %d\r\n", error, pVcpu->cpuNumber);
        VmxDestory();
        return error;
    }
    return 0;
}
Then we set up the VMCS fields through VmxInitVmcs. Similar to setting up the vmon region, we first allocate a memory area and then fill in the ID from IA32_VMX_BASIC. After filling in the basic ID, initialize the memory through vmclear and select the vmcs area through vmptrld. These two steps correspond to pulling the power and selecting the machine mentioned in the previous article. After that, the most complex part, the vmcs fields, is filled in. Here, a function is encapsulated to initialize each group of vmcs fields.
// VmxInitVmcs function
int VmxInitVmcs(ULONG64 GuestEip, ULONG64 GuestEsp, ULONG64 hostEip)
{
    PVMXCPUPCB pVcpu = VmxGetCurrentCPUPCB();
    PHYSICAL_ADDRESS lowphys, heiPhy;
    lowphys.QuadPart = 0;
    heiPhy.QuadPart = -1;
    pVcpu->VmxcsAddr = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE, lowphys, heiPhy, lowphys, MmCached);
    if (!pVcpu->VmxcsAddr)
    {
        return -1; // Memory allocation failed
    }
    memset(pVcpu->VmxcsAddr, 0, PAGE_SIZE);
    pVcpu->VmxcsAddrPhys = MmGetPhysicalAddress(pVcpu->VmxcsAddr);
    pVcpu->VmxHostStackTop = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE * 36, lowphys, heiPhy, lowphys, MmCached);
    if (!pVcpu->VmxHostStackTop)
    {
        return -1; // Memory allocation failed
    }
    memset(pVcpu->VmxHostStackTop, 0, PAGE_SIZE * 36);
    pVcpu->VmxHostStackBase = (ULONG64)pVcpu->VmxHostStackTop + PAGE_SIZE * 36 - 0x200;
    // Fill in ID
    ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
    *(PULONG)pVcpu->VmxcsAddr = (ULONG)vmxBasic;
    // Load VMCS
    __vmx_vmclear(&pVcpu->VmxcsAddrPhys.QuadPart);
    __vmx_vmptrld(&pVcpu->VmxcsAddrPhys.QuadPart);
    VmxInitGuest(GuestEip, GuestEsp);
    VmxInitHost(hostEip);
    return 0;
}
For the guest-related fields, GuestEip and GuestEsp need to be passed in to determine where the guest starts running after vm-entry. All other fields are filled in according to the current state. The first thing to fill in is the base, limit, attribute, and selector of each segment register in the GDT table.
Looking at the field encodings, you'll find that the IDs of these fields are contiguous, with values differing by 2, and the method of extracting base, limit, attribute, and selector is very similar for each segment register. Therefore we can encapsulate the filling of segment-register attributes into a function, here called fillGdtDataItem. The specific details of the bit-slicing are not repeated here; it is recommended to read the bit manipulation in the code carefully.
void fillGdtDataItem(int index, short selector)
{
    GdtTable gdtTable = {0};
    AsmGetGdtTable(&gdtTable);
    selector &= 0xFFF8;
    ULONG limit = __segmentlimit(selector);
    PULONG item = (PULONG)(gdtTable.Base + selector);
    LARGE_INTEGER itemBase = {0};
    itemBase.LowPart = (*item & 0xFFFF0000) >> 16;
    item++;
    itemBase.LowPart |= (*item & 0xFF000000) | ((*item & 0xFF) << 16);
    // Set attributes
    ULONG attr = (*item & 0x00F0FF00) >> 8;
    if (selector == 0)
    {
        attr |= 1 << 16; // mark null selector as unusable
    }
    __vmx_vmwrite(GUEST_ES_BASE + index * 2, itemBase.QuadPart);
    __vmx_vmwrite(GUEST_ES_LIMIT + index * 2, limit);
    __vmx_vmwrite(GUEST_ES_AR_BYTES + index * 2, attr);
    __vmx_vmwrite(GUEST_ES_SELECTOR + index * 2, selector);
}
The GDT entry for the tr register cannot be filled in like the other registers, because in 64-bit mode the tr descriptor is 128 bits wide. Therefore, it needs to be set separately. The format of the 64-bit tr descriptor is explained in Chapter 3.2.1 of the Intel white paper.
The idea is the same as the setting idea of other GDT table items, which is to take out the corresponding bits and fill them into the vmcs area.
GdtTable gdtTable = { 0 };
AsmGetGdtTable(&gdtTable);
ULONG trSelector = AsmReadTR();
trSelector &= 0xFFF8;
ULONG trlimit = __segmentlimit(trSelector);
LARGE_INTEGER trBase = {0};
PULONG trItem = (PULONG)(gdtTable.Base + trSelector);
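The fragment above stops at locating the descriptor; the remaining bit-slicing is pure arithmetic, so here is a user-mode sketch (the function name is mine) that reassembles the 64-bit TSS base from a copy of the four descriptor dwords, following the layout just described:

```c
#include <stdint.h>
#include <assert.h>

/* Reassemble the 64-bit TSS base from a 16-byte GDT entry viewed as four
   32-bit words: base 15:0 sits in word0[31:16], base 23:16 in word1[7:0],
   base 31:24 in word1[31:24], and base 63:32 in word2. */
static uint64_t TssBaseFromDescriptor(const uint32_t item[4])
{
    uint64_t base = ((item[0] >> 16) & 0xFFFF)
                  | ((item[1] & 0xFF) << 16)
                  | (item[1] & 0xFF000000u);
    base |= (uint64_t)item[2] << 32;
    return base;
}
```

With a synthetic descriptor encoding base 0xFFFFF80012345678, the function recovers exactly that address, which is the value that ends up in HOST_TR_BASE / GUEST_TR_BASE.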
Next comes the filling of some other special registers; I won't go into detail here. Note that the special properties of some of these registers can be used for virtual machine detection: after performing certain operations, the results on the host and in the guest can differ, revealing the presence of VT.
For example, with the later msr settings: if you try to read an msr outside the valid range, a real machine throws an error, but a virtual machine that doesn't specifically handle this will produce unpredictable results. Although Intel intends that a guest cannot detect it is a guest, there are still many ways to perform such detection.
__vmx_vmwrite(GUEST_CR0, __readcr0());
__vmx_vmwrite(GUEST_CR4, __readcr4());
__vmx_vmwrite(GUEST_CR3, __readcr3());
__vmx_vmwrite(GUEST_DR7, __readdr(7));
__vmx_vmwrite(GUEST_RFLAGS, __readeflags());
__vmx_vmwrite(GUEST_RSP, GuestEsp);
__vmx_vmwrite(GUEST_RIP, GuestEip);
__vmx_vmwrite(VMCS_LINK_POINTER, -1LL);
__vmx_vmwrite(GUEST_IA32_DEBUGCTL, __readmsr(IA32_MSR_DEBUGCTL));
__vmx_vmwrite(GUEST_IA32_PAT, __readmsr(IA32_MSR_PAT));
__vmx_vmwrite(GUEST_IA32_EFER, __readmsr(IA32_MSR_EFER));
__vmx_vmwrite(GUEST_FS_BASE, __readmsr(IA32_FS_BASE));
__vmx_vmwrite(GUEST_GS_BASE, __readmsr(IA32_GS_BASE));
__vmx_vmwrite(GUEST_SYSENTER_CS, __readmsr(0x174));
__vmx_vmwrite(GUEST_SYSENTER_ESP, __readmsr(0x175));
__vmx_vmwrite(GUEST_SYSENTER_EIP, __readmsr(0x176));
For the host area, the content filled in is similar to the guest area. Note that the GDT entries for the host do not need all attributes filled in, only the selectors. Another point to note: the host's rsp must use a separately allocated block of memory. If you reuse the rsp from when the guest exited, the guest's stack will be destroyed, resulting in unpredictable behavior. The code for filling the host area is as follows:
void VmxInitHost(ULONG64 HostEip)
{
    GdtTable gdtTable = { 0 };
    AsmGetGdtTable(&gdtTable);
    PVMXCPUPCB pVcpu = VmxGetCurrentCPUPCB();
    ULONG trSelector = AsmReadTR();
    trSelector &= 0xFFF8;
    LARGE_INTEGER trBase = { 0 };
    PULONG trItem = (PULONG)(gdtTable.Base + trSelector);
    // Read TR base: bits 15:0 from dword 0, bits 23:16 and 31:24 from dword 1, bits 63:32 from dword 2
    trBase.LowPart = ((trItem[0] >> 16) & 0xFFFF) | ((trItem[1] & 0xFF) << 16) | (trItem[1] & 0xFF000000);
    trBase.HighPart = trItem[2];
    // Set TR
    __vmx_vmwrite(HOST_TR_BASE, trBase.QuadPart);
    __vmx_vmwrite(HOST_TR_SELECTOR, trSelector);
    // Set segment selectors
    __vmx_vmwrite(HOST_ES_SELECTOR, AsmReadES() & 0xfff8);
    __vmx_vmwrite(HOST_CS_SELECTOR, AsmReadCS() & 0xfff8);
    __vmx_vmwrite(HOST_SS_SELECTOR, AsmReadSS() & 0xfff8);
    __vmx_vmwrite(HOST_DS_SELECTOR, AsmReadDS() & 0xfff8);
    __vmx_vmwrite(HOST_FS_SELECTOR, AsmReadFS() & 0xfff8);
    __vmx_vmwrite(HOST_GS_SELECTOR, AsmReadGS() & 0xfff8);
    // Set control registers
    __vmx_vmwrite(HOST_CR0, __readcr0());
    __vmx_vmwrite(HOST_CR4, __readcr4());
    __vmx_vmwrite(HOST_CR3, __readcr3());
    // Set RSP and RIP
    __vmx_vmwrite(HOST_RSP, (ULONG64)pVcpu->VmxHostStackBase);
    __vmx_vmwrite(HOST_RIP, HostEip);
    // Set MSRs
    __vmx_vmwrite(HOST_IA32_PAT, __readmsr(IA32_MSR_PAT));
    __vmx_vmwrite(HOST_IA32_EFER, __readmsr(IA32_MSR_EFER));
    __vmx_vmwrite(HOST_FS_BASE, __readmsr(IA32_FS_BASE));
    __vmx_vmwrite(HOST_GS_BASE, __readmsr(IA32_GS_KERNEL_BASE));
    __vmx_vmwrite(HOST_IA32_SYSENTER_CS, __readmsr(0x174));
    __vmx_vmwrite(HOST_IA32_SYSENTER_ESP, __readmsr(0x175));
    __vmx_vmwrite(HOST_IA32_SYSENTER_EIP, __readmsr(0x176));
    // Set GDT and IDT
    GdtTable idtTable;
    __sidt(&idtTable);
    __vmx_vmwrite(HOST_GDTR_BASE, gdtTable.Base);
    __vmx_vmwrite(HOST_IDTR_BASE, idtTable.Base);
}
VM-Entry Controls
Chapter 24.8.1 of "Processor Virtualization Technology" explains in detail the filling of the vm-entry control fields and their corresponding attributes. During vm-entry, if the CPU detects that these fields have not been filled correctly, it will throw an error and exit.
The VM_ENTRY_CONTROLS field is 32 bits long, with each bit corresponding to a control function. It controls operations performed by the processor when entering the virtual machine, such as:
Whether to load the dr0~dr7 registers when entering the virtual machine
Whether to enter IA-32e mode on entry
Whether to load the IA32_PERF_GLOBAL_CTRL, IA32_PAT, and IA32_EFER registers, etc.
The specific role of each bit is shown in Table 3-9 in the book. This book was written quite early, and the CPU may have added some other fields. For details, please refer to the relevant chapters in the Intel white paper.
After examining this table, you'll notice that some positions are fixed to 1, and some are fixed to 0. Some of these positions are not yet used and are reserved for future expansion of functions. These bits may no longer be fixed to 1 or 0 in the future, but used to control newly introduced functions. Therefore, we cannot hardcode the fixed bits; we need to calculate the bits fixed to 0 and 1 according to an algorithm and fill them into the VM_ENTRY_CONTROLS field.
To achieve this, we first read the IA32_VMX_BASIC register and check its 55th bit. If it's 1, use the capability MSRs with "TRUE" in their names for all subsequent operations; if it's 0, use the MSRs without "TRUE".
In practice, many modern computers use the register with "TRUE". However, for compatibility, it's necessary to check which group of registers should be used each time.
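That selection is a single bit test on the raw IA32_VMX_BASIC value; a minimal sketch (the helper name is mine):

```c
#include <stdint.h>
#include <assert.h>

/* Bit 55 of IA32_VMX_BASIC set means the IA32_VMX_TRUE_* capability MSRs
   exist and should be consulted instead of the plain variants. */
static int UseTrueCapabilityMsrs(uint64_t vmxBasic)
{
    return (int)((vmxBasic >> 55) & 1);
}
```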
The method of setting fixed bits is described in detail in Section 2.5.5 of the book. The IA32_MSR_VMX_TRUE_ENTRY_CTLS register is a 64-bit register, and the VM_ENTRY_CONTROLS that needs to be set is a 32-bit field.
When a bit in the lower 32 bits of IA32_MSR_VMX_TRUE_ENTRY_CTLS is 1, the corresponding bit in VM_ENTRY_CONTROLS must be 1. When a bit in the upper 32 bits of IA32_MSR_VMX_TRUE_ENTRY_CTLS is 0, the corresponding bit in VM_ENTRY_CONTROLS must be 0.
ULONG64 VmxAdjustControls(ULONG64 value, ULONG64 msr)
{
    LARGE_INTEGER msrValue;
    msrValue.QuadPart = __readmsr(msr);
    value = (msrValue.LowPart | value) & msrValue.HighPart;
    return value;
}
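To see the rule in action without touching a real MSR, here is a pure variant of the same formula applied to a synthetic capability value (the names and the sample value are mine): the low dword forces bits on, the high dword masks bits off.

```c
#include <stdint.h>
#include <assert.h>

/* Same fixed-bit formula as VmxAdjustControls, but on a raw 64-bit value:
   low 32 bits = bits that must be 1, high 32 bits = bits allowed to be 1. */
static uint32_t AdjustControlValue(uint32_t desired, uint64_t capabilityMsr)
{
    uint32_t mustBeOne  = (uint32_t)capabilityMsr;
    uint32_t allowedOne = (uint32_t)(capabilityMsr >> 32);
    return (desired | mustBeOne) & allowedOne;
}
```

With must-be-1 = bit 1 and bit 2 disallowed, requesting 0x204 yields 0x202: bit 1 is forced on, bit 2 is stripped, and bit 9 is preserved.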
When first building the framework, there's no need to set the other bits; only bit 9 needs to be set, to enter IA-32e mode. The rest can be left unset at the beginning. However, this doesn't mean those bits are unimportant. For example, bit 2 specifies whether to load the dr registers when entering the virtual machine; reasonable use of this function may enable some special debugging functionality.
void ConfigureVmEntryControls()
{
    ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
    ULONG64 msr = ((vmxBasic >> 55) & 1) ? IA32_MSR_VMX_TRUE_ENTRY_CTLS : IA32_MSR_VMX_ENTRY_CTLS;
    ULONG64 entryControls = VmxAdjustControls(0x200, msr); // 0x200 = bit 9, IA-32e mode guest
    __vmx_vmwrite(VM_ENTRY_CONTROLS, entryControls);
}
Chapter 3.6.2 of "Processor Virtualization Technology" describes the MSR-load fields, which control whether msr registers are loaded when entering the virtual machine. We don't load msrs on entry: VM exits and entries are frequent, and loading msrs every time would hurt performance. If we want to intercept or hook msrs, there are other methods, so these fields can be filled with 0.
Then we come to the VM_ENTRY_INTR_INFO_FIELD, which is described in Chapter 3.6.3.1. Its general role is that, after filling this field according to certain rules, the corresponding interrupt or exception will be injected on entry to the virtual machine. We don't need this function for now: if the highest bit is set to 0, the field is considered invalid, so it can be filled with 0 directly and handled later when we need it.
void InitializeVmEntrySettings()
{
    ConfigureVmEntryControls();
    __vmx_vmwrite(VM_ENTRY_MSR_LOAD_COUNT, 0);
    __vmx_vmwrite(VM_ENTRY_INTR_INFO_FIELD, 0);
}
VM-Exit Controls
The vm-exit control field is very similar to the vm-entry field. It specifies the operations performed when exiting the virtual machine, and the operations of vm-entry and vm-exit mirror each other: what vm-exit saves (such as msr registers), vm-entry can load.
Section 24.7.1, "VM-Exit Controls", describes the filling rules of the vm-exit fields, which largely correspond to the vm-entry rules. Two points to note:
The 15th bit (acknowledge interrupt on exit) specifies whether to read and save the interrupt vector number when exiting due to an external interrupt. Either 0 or 1 works, but filling it with 1 keeps this information available for future use without affecting performance.
The 22nd bit (save VMX-preemption timer value) relates to a timer-like facility. However, many CPUs do not support it, so for compatibility it's recommended not to use this function.
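No exit-controls code was shown for this step, so here is a hedged sketch of how the desired bits could be composed before the fixed-bit adjustment; the names and constants are my own (bit 9, host address-space size, keeps the host in 64-bit mode; bit 15 is acknowledge interrupt on exit):

```c
#include <stdint.h>
#include <assert.h>

#define EXIT_CTL_HOST_ADDR_SPACE_SIZE (1u << 9)  /* host runs in 64-bit mode */
#define EXIT_CTL_ACK_INTR_ON_EXIT     (1u << 15) /* save vector on ext-int exit */

/* Build a VM-exit controls value: OR in the desired bits, then apply the
   capability MSR's fixed-bit rule (low dword must-be-1, high dword allowed-1). */
static uint32_t BuildExitControls(uint64_t capabilityMsr)
{
    uint32_t desired = EXIT_CTL_HOST_ADDR_SPACE_SIZE | EXIT_CTL_ACK_INTR_ON_EXIT;
    return (desired | (uint32_t)capabilityMsr) & (uint32_t)(capabilityMsr >> 32);
}
```

The result would then be written with __vmx_vmwrite to the VM-exit controls field, mirroring ConfigureVmEntryControls.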
Then we get to the VM-execution control fields, which are used to set which events to intercept and which not to intercept.
void InitializeVmExecutionControls()
{
    ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
    ULONG64 pinBasedMsr = ((vmxBasic >> 55) & 1) ? IA32_MSR_VMX_TRUE_PINBASED_CTLS : IA32_MSR_VMX_PINBASED_CTLS;
    ULONG64 procBasedMsr = ((vmxBasic >> 55) & 1) ? IA32_MSR_VMX_TRUE_PROCBASED_CTLS : IA32_MSR_VMX_PROCBASED_CTLS;
    ULONG64 pinBasedControls = VmxAdjustControls(0, pinBasedMsr);
    ULONG64 procBasedControls = VmxAdjustControls(0, procBasedMsr);
    __vmx_vmwrite(PIN_BASED_VM_EXEC_CONTROL, pinBasedControls);
    __vmx_vmwrite(CPU_BASED_VM_EXEC_CONTROL, procBasedControls);
}
As set up earlier in the VMCS, after a VM exit event occurs and control returns to the host, RIP points at the VM-exit handler. This function must save all registers at the beginning and restore them before returning to the virtual machine; otherwise, if register contents differ before the exit and after the return, the guest will behave unpredictably. Therefore, this function must be a naked function written in assembly.
VmExitHandlerAsm PROC
    push r15
    push r14
    push r13
    push r12
    push r11
    push r10
    push r9
    push r8
    push rdi
    push rsi
    push rbp
    push rsp
    push rbx
    push rdx
    push rcx
    push rax
    mov rcx, rsp
    sub rsp, 0100h
    call VmxExitHandler
    add rsp, 0100h
    pop rax
    pop rcx
    pop rdx
    pop rbx
    pop rsp
    pop rbp
    pop rsi
    pop rdi
    pop r8
    pop r9
    pop r10
    pop r11
    pop r12
    pop r13
    pop r14
    pop r15
    vmresume
    ret
VmExitHandlerAsm ENDP
The process is as follows:
Save all registers
Call a C function (VmxExitHandler) for detailed event handling
Restore all registers
Resume VM execution using the vmresume instruction
Then we need to figure out the causes of vm-exit. The manual lists the instructions that unconditionally cause vmexit events: in the virtual machine, executing any VMX instruction other than VMFUNC unconditionally causes a VMEXIT. Additionally, the CPUID, GETSEC, INVD, and XSETBV instructions also unconditionally cause VMEXIT events.
In 24.9.1 Basic VM-Exit Information, you can find the corresponding vmexit information fields in the read-only area of the VMCS. These include the exit reason, the length of the instruction that caused the exit, and the instruction information. The VM-instruction error field among the read-only fields is set when a VMX instruction fails.
The exit reason field composition is described in 3.10.1.1. Bits 0-15 hold the basic exit reason; the other bits carry additional indicators. We should mask off those other bits and dispatch only on bits 0-15.
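For instance, the basic reason and the VM-entry-failure flag (bit 31) can be pulled out with small helpers — these two functions are my own sketch, with field positions following the SDM's exit-reason layout:

```c
typedef unsigned int ULONG;
typedef unsigned long long ULONG64;

// Bits 0-15: basic exit reason.
ULONG VmxBasicExitReason(ULONG64 exitReason)
{
    return (ULONG)(exitReason & 0xFFFF);
}

// Bit 31: set when the exit was caused by a VM-entry failure rather
// than a normal guest event.
int VmxIsEntryFailure(ULONG64 exitReason)
{
    return (int)((exitReason >> 31) & 1);
}
```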
The vmexit event handling function framework includes:
Getting the instruction length, instruction information, and the guest RIP and RSP
Getting the exit reason code
Handling the event accordingly
Incrementing rip by the instruction length
Writing back rip and rsp and returning to continue execution at the next instruction
Since we don't plan to implement VT nesting, we need to return an error for VMX instructions executed in the guest. For VMX instructions, success is indicated by CF and ZF both being 0; on failure at least one of them is set (CF = 1 for VMfailInvalid, ZF = 1 for VMfailValid). To make the guest believe it cannot enter the VT environment, we set CF and ZF to 1 in the guest's RFLAGS.
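In practice this means reading GUEST_RFLAGS, setting the flag bits, and writing it back. The bit manipulation itself (CF is bit 0, ZF is bit 6) can be sketched as a pure helper; the surrounding vmread/vmwrite is kernel-only and shown only in comments:

```c
typedef unsigned long long ULONG64;

#define RFLAGS_CF (1ULL << 0)  // carry flag
#define RFLAGS_ZF (1ULL << 6)  // zero flag

// Returns rflags with CF and ZF set, signalling VMX-instruction
// failure to the guest.
ULONG64 VmxFailRflags(ULONG64 rflags)
{
    return rflags | RFLAGS_CF | RFLAGS_ZF;
}

// In the exit handler (kernel-only sketch):
// ULONG64 rflags = 0;
// __vmx_vmread(GUEST_RFLAGS, &rflags);
// __vmx_vmwrite(GUEST_RFLAGS, VmxFailRflags(rflags));
```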
// VM Exit Handler
EXTERN_C VOID VmxExitHandler(PGuestContext context)
{
    ULONG64 reason = 0;
    ULONG64 instLen = 0;
    ULONG64 instinfo = 0;
    ULONG64 mrip = 0;
    ULONG64 mrsp = 0;
    __vmx_vmread(VM_EXIT_REASON, &reason);
    __vmx_vmread(VM_EXIT_INSTRUCTION_LEN, &instLen);
    __vmx_vmread(VMX_INSTRUCTION_INFO, &instinfo);
    __vmx_vmread(GUEST_RIP, &mrip);
    __vmx_vmread(GUEST_RSP, &mrsp);
    // Only bits 0-15 hold the basic exit reason; mask off the rest.
    reason = reason & 0xFFFF;
    switch (reason)
    {
    case EXIT_REASON_CPUID:
    case EXIT_REASON_GETSEC:
    case EXIT_REASON_INVD:
    case EXIT_REASON_VMCALL:
    case EXIT_REASON_VMCLEAR:
    case EXIT_REASON_VMLAUNCH:
    case EXIT_REASON_VMPTRLD:
    case EXIT_REASON_VMPTRST:
    case EXIT_REASON_VMREAD:
    case EXIT_REASON_VMRESUME:
    case EXIT_REASON_VMWRITE:
    case EXIT_REASON_VMXOFF:
    case EXIT_REASON_VMXON:
    case EXIT_REASON_MSR_READ:
    case EXIT_REASON_MSR_WRITE:
    case EXIT_REASON_XSETBV:
        // Handle these events
        break;
    }
    // Skip the instruction that caused the exit and resume the guest.
    __vmx_vmwrite(GUEST_RIP, mrip + instLen);
    __vmx_vmwrite(GUEST_RSP, mrsp);
}
Next, cpuid will definitely cause vm-exit events. If there's no need to handle specific behaviors of CPUID, you can simply execute cpuid in the handling function and return the result to the guest. The handler runs in the host environment, so executing cpuid there will not cause another vm-exit.
// Handle CPUID instruction
VOID VmxHandlerCpuid(PGuestContext context)
{
    if (context->mRax == 0x8888)
    {
        // Magic leaf used to verify that cpuid interception works.
        context->mRax = 0x11111111;
        context->mRbx = 0x22222222;
        context->mRcx = 0x33333333;
        context->mRdx = 0x44444444;
    }
    else
    {
        int cpuids[4] = {0};
        // Executed in the host, so this will not recursively vm-exit.
        __cpuidex(cpuids, (int)context->mRax, (int)context->mRcx);
        // Cast to unsigned to avoid sign-extending into the high 32 bits.
        context->mRax = (unsigned int)cpuids[0];
        context->mRbx = (unsigned int)cpuids[1];
        context->mRcx = (unsigned int)cpuids[2];
        context->mRdx = (unsigned int)cpuids[3];
    }
}
To verify the interception of the cpuid instruction, we can use special values: if rax is 0x8888, we set rax, rbx, rcx, and rdx to recognizable constants.
Next we need to handle the vm-exit events caused by getsec, invd, and xsetbv. The getsec instruction is generally not executed unless SGX is enabled; since we don't need SGX, we can leave it unhandled for now.
// Handle XSETBV instruction
case EXIT_REASON_XSETBV:
{
    // eax and edx each hold 32 bits of the new extended control register value.
    ULONG64 value = MAKE_REG(context->mRax, context->mRdx);
    _xsetbv((unsigned int)context->mRcx, value);
}
break;
For invd, simply execute an invd instruction in the host environment and return. XSETBV is similar: call xsetbv according to the corresponding rules and return. Note that this instruction is 32-bit compatible; you need to combine eax and edx into the 64-bit second parameter.
Then we need to figure out the communication between guest and host, and how to close VT. Any event that produces a vmexit can be used for communication between the inside and outside of the virtual machine, and we use this feature to implement shutting down VT. We stipulate that when a vm-exit is caused by the vmcall instruction and the current rax holds the magic value 0xABCD, we exit the VT environment.
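The dispatch side of this agreement can be sketched as follows. The VmxHandlerVmcall name is this sketch's assumption, GuestContext is abbreviated to the one field used here, and the actual teardown steps appear only in comments because they are kernel-only:

```c
typedef unsigned long long ULONG64;

// Abbreviated guest context: only rax is consulted in this sketch.
typedef struct _GuestContext { ULONG64 mRax; } GuestContext, *PGuestContext;

#define VMCALL_EXIT_VT 0xABCD  // magic value agreed between guest and host

// Returns 1 when the guest asked us to tear down VT, 0 otherwise.
int VmxHandlerVmcall(PGuestContext context)
{
    if (context->mRax == VMCALL_EXIT_VT)
    {
        // In the real handler we would then:
        //   1. execute __vmx_off() to leave VMX operation,
        //   2. clear CR4.VMXE,
        //   3. restore the guest's rsp/rip from the VMCS in assembly and
        //      jump to the instruction following vmcall.
        return 1;
    }
    // Unknown vmcall: report failure to the guest (e.g. set CF/ZF).
    return 0;
}
```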
After closing VT, we still need to jump back to the instruction after vmcall to continue execution, directly restoring rsp and rip in assembly.
Conditional Virtual Machine Exit Events
There are certain control fields that cause vmexit events when executing certain instructions or accessing certain registers, so we need to configure these conditions and write handlers for the resulting vmexits.
If bit 28 ("use MSR bitmaps") of the primary processor-based VM-execution controls is 1, the MSR bitmap mechanism is enabled. You then provide the physical address of an MSR bitmap area in the MSR_BITMAP field; after filling it according to certain rules, reads and writes of the corresponding MSRs produce conditional vm-exit events.
// Set MSR bitmap
BOOLEAN VmxSetReadMsrBitMap(PUCHAR msrBitMap, ULONG64 msrAddrIndex, BOOLEAN isEnable)
{
    // High MSRs (0xC0000000-0xC0001FFF) use the second 1KB read section.
    if (msrAddrIndex >= 0xC0000000)
    {
        msrBitMap += 1024;
        msrAddrIndex -= 0xC0000000;
    }
    // One bit per MSR: locate the byte, then the bit within it.
    msrBitMap += msrAddrIndex / 8;
    ULONG64 setBit = msrAddrIndex % 8;
    if (isEnable)
    {
        *msrBitMap |= 1 << setBit;
    }
    else
    {
        *msrBitMap &= ~(1 << setBit);
    }
    return TRUE;
}
The MSR bitmap area is 4KB in size, divided into four 1KB sections controlling read and write access to the low (0-0x1FFF) and high (0xC0000000-0xC0001FFF) MSR ranges. Setting it is relatively simple: set the bit of the MSR you want to intercept to 1. For example, you can implement an SSDT hook by intercepting MSR C0000082h (IA32_LSTAR); however, this method has poor compatibility.
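As a sanity check of that layout, here is a standalone re-implementation of the same indexing rules (assuming, as the function above does, that the read-low bits start at offset 0 and the read-high bits at offset 1024):

```c
typedef unsigned long long ULONG64;

// Compute which byte and bit of the read bitmap govern a given MSR,
// using the same rules as VmxSetReadMsrBitMap: one bit per MSR,
// high-range MSRs offset into the second 1KB section.
void MsrBitmapPosition(ULONG64 msr, ULONG64 *byteOffset, ULONG64 *bit)
{
    ULONG64 base = 0;
    if (msr >= 0xC0000000ULL)
    {
        base = 1024;
        msr -= 0xC0000000ULL;
    }
    *byteOffset = base + msr / 8;
    *bit = msr % 8;
}
```

So intercepting reads of C0000082h means setting bit 2 of the byte at offset 1040 in the bitmap page.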
Then, to make VT support Windows 10, the RDTSCP instruction needs to be handled. If it is not, executing rdtscp in the guest raises a #UD exception and crashes the system.
// Handle RDTSCP instruction
case EXIT_REASON_RDTSCP:
{
    unsigned int aux = 0;
    LARGE_INTEGER tsc = {0};
    tsc.QuadPart = __rdtscp(&aux);  // also returns IA32_TSC_AUX in aux
    context->mRax = tsc.LowPart;
    context->mRdx = tsc.HighPart;
    context->mRcx = aux;
}
break;
We need to handle every instruction that would otherwise cause a #UD exception. The Secondary Processor-Based VM-Execution Controls field needs to be set to enable interception of these instructions. The rdtscp instruction itself is an upgraded version of RDTSC, used on newer processors to obtain the CPU time-stamp counter together with the processor ID.
For the INVPCID instruction, we decode the operand information saved during the vm-exit event and call the _invpcid intrinsic accordingly.
VOID VmxExitInvpcidHandler(PGuestContext context)
{
    ULONG64 mrsp = 0;
    ULONG64 instinfo = 0;
    ULONG64 qualification = 0;
    __vmx_vmread(VMX_INSTRUCTION_INFO, &instinfo);    // operand encoding details
    __vmx_vmread(EXIT_QUALIFICATION, &qualification); // displacement of the memory operand
    __vmx_vmread(GUEST_RSP, &mrsp);
    PINVPCID pinfo = (PINVPCID)&instinfo;
    ULONG64 base = 0;
    ULONG64 index = 0;
    // The scale field encodes a factor of 2^scale, so a value of 0 means 1.
    ULONG64 scale = 1ULL << pinfo->scale;
    // The register operand holds the invalidation type.
    ULONG64 regopt = ((PULONG64)context)[pinfo->regOpt];
    if (!pinfo->baseInvaild)
    {
        // Register encoding 4 is rsp, which is not in the saved context;
        // read it from the VMCS instead.
        base = (pinfo->base == 4) ? mrsp : ((PULONG64)context)[pinfo->base];
    }
    if (!pinfo->indexInvaild)
    {
        index = (pinfo->index == 4) ? mrsp : ((PULONG64)context)[pinfo->index];
    }
    // Effective address of the 128-bit INVPCID descriptor, truncated to
    // the instruction's address size (0 = 16-bit, 1 = 32-bit, else 64-bit).
    ULONG64 addr = base + index * scale + qualification;
    if (pinfo->addrssSize == 0)
    {
        addr &= 0xFFFF;
    }
    else if (pinfo->addrssSize == 1)
    {
        addr &= 0xFFFFFFFF;
    }
    _invpcid((unsigned int)regopt, (void *)addr);
}
The XSAVES instruction also needs to be considered; it is used for saving processor extended states. The behavior of XSAVES is determined by the "enable XSAVES/XRSTORS" VM-execution control bit. If this control is not set (i.e., is 0), XSAVES causes an invalid-opcode exception (#UD), potentially crashing the system.
When the control is 1, the behavior depends on the XSS-exiting bitmap: XSAVES causes a VM exit if any bit is set in the logical AND of EDX:EAX, the IA32_XSS MSR, and the XSS-exiting bitmap. Otherwise, it operates normally.
To support XSAVES without unnecessary performance overhead, we enable it in the VM-execution controls but avoid setting the XSS-exiting bitmap. This prevents the #UD exception and allows XSAVES to operate normally without causing VM exits.
void EnableXsavesSupport(void)
{
    ULONG64 secondaryControls = 0;
    // Read the current secondary processor-based VM-execution controls.
    __vmx_vmread(SECONDARY_VM_EXEC_CONTROL, &secondaryControls);
    // Set the "enable XSAVES/XRSTORS" bit.
    secondaryControls |= SECONDARY_EXEC_XSAVES;
    // Write back the updated controls.
    __vmx_vmwrite(SECONDARY_VM_EXEC_CONTROL, secondaryControls);
    // Leave the XSS-exiting bitmap clear so XSAVES never causes a VM exit.
    __vmx_vmwrite(XSS_EXITING_BITMAP, 0);
}
This function does two key things:
It enables XSAVES support by setting the SECONDARY_EXEC_XSAVES bit in the secondary processor-based VM-execution controls.
It ensures the XSS-exiting bitmap is cleared, preventing unnecessary VM exits when XSAVES is executed.
By implementing this support, we allow the guest OS (such as the Windows kernel) to use the XSAVES instruction without causing VM exits or exceptions. This maintains both functionality and performance in our virtualized environment.
It's worth noting that this approach differs from how we handle instructions like RDTSCP or INVPCID, where we intentionally cause VM exits to emulate or monitor the instruction's behavior. For XSAVES, our goal is to let it execute normally within the guest, intervening as little as possible to maintain optimal performance.
Conclusion
This article has covered the essential steps and concepts needed to implement a basic hypervisor using Intel VT-x virtualization technology. We've explored the process of verifying CPU support, initializing the virtual machine environment, configuring the Virtual Machine Control Structure (VMCS), and handling VM entry and exit events. The implementation of a simple hypervisor as described here provides a foundation for understanding the core mechanics of hardware-assisted virtualization.
There are many additional aspects that need to be addressed to create a robust and feature-complete hypervisor. These include implementing memory virtualization through Extended Page Tables (EPT), virtualizing I/O devices, handling interrupts and exceptions in the guest environment, and implementing nested virtualization support. Additionally, performance optimization, security hardening, and support for multiple guest operating systems are crucial considerations for a production-ready hypervisor.