Mitigating DMA Attacks Through Redirected Address Tables
My Thunderbolt dock should be innocent until proven guilty

Cover Illustration by buruberrii_
In a previous blog post, I talked briefly about the many methods anticheat systems like Vanguard use to protect against different types of cheating attempts. One of the methods discussed was the detection of malicious PCIe devices. This method has been in the news lately as of writing due to the game “Delta Force”, which has caught flak over the overly aggressive detection of DMA and USB devices by its anticheat, Tencent’s AntiCheat Expert (ACE). But what is DMA, and why are live service games so against it?
The detection of malicious PCIe devices is crucial due to the advent of hardware-based Direct Memory Access (DMA) cheats. One of the earliest examples of a DMA device is the Game Genie, which physically sat between the game cartridge and the console, enabling it to alter data at the memory level. Nowadays, PCIe-based DMA devices can be bought online, like this LynxDMA + KMBox set for $160 with free express shipping.

As many anticheats used to monitor only for malicious processes inside the kernel, DMA devices allow the cheating to happen outside of the main PC entirely. PCIe cards like the LynxDMA interface directly with the system's memory bus, enabling direct reads and writes of the computer's memory and efficient data movement between memory and peripherals.
Many users also add a keyboard and mouse emulator (KMBox), a programmable USB development board that functions as a keyboard and mouse controller. It lets users run scripts directly on the device's CPU, giving ordinary keyboards and mice programmable macro functions without additional drivers or DLL injections, and ensuring that those scripts operate independently of the host PC.
Current Detection Methods
Refreshing from our previous escapades, every PCI device has a set of registers commonly referred to as the PCI configuration space. In modern PCI-e devices, an extended configuration space is implemented, which is mapped into the main memory, allowing the system to read/write to the registers. The configuration space consists of a standard header containing information such as the DeviceID, VendorID, Status, and other details.
This configuration space allows querying important information from PCI devices within the device tree using the IRP_MN_READ_CONFIG code, which reads from a PCI device's configuration space.
NTSTATUS
ValidatePciDevices()
{
    NTSTATUS status = STATUS_UNSUCCESSFUL;

    status = EnumeratePciDeviceObjects(PciDeviceQueryCallback, NULL);
    if (!NT_SUCCESS(status))
        DEBUG_ERROR("EnumeratePciDeviceObjects failed with status %x", status);

    return status;
}
Windows splits DEVICE_OBJECTs into two categories: the Physical Device Object (PDO) and the Functional Device Object (FDO). A PDO represents a device connected to a physical bus and has an associated DEVICE_NODE. In contrast, an FDO represents the device’s functionality and defines how the system interacts with that device in the driver stack. A bus driver can own many PDOs (one for each child device it enumerates), while each device stack has only one FDO. To access each PCI device on the system, the anti-cheat can enumerate all device objects belonging to the PCI bus driver, pci.sys.
We first retrieve the driver object associated with the PCI driver (pci.sys). We then enumerate all device objects managed by this driver, storing them in an array. For each device object, we check whether it is a valid Physical Device Object (PDO) by calling the IsDeviceObjectValidPdo function. If it is, the callback routine (PciDeviceQueryCallback) is invoked.
NTSTATUS
EnumeratePciDeviceObjects(_In_ PCI_DEVICE_CALLBACK CallbackRoutine,
                          _In_opt_ PVOID Context)
{
    NTSTATUS status = STATUS_UNSUCCESSFUL;
    UNICODE_STRING pci = RTL_CONSTANT_STRING(L"\\Driver\\pci");
    PDRIVER_OBJECT pci_driver_object = NULL;
    PDEVICE_OBJECT* pci_device_objects = NULL;
    PDEVICE_OBJECT current_device = NULL;
    UINT32 pci_device_objects_count = 0;

    status = GetDriverObjectByDriverName(&pci, &pci_driver_object);
    if (!NT_SUCCESS(status)) {
        DEBUG_ERROR("GetDriverObjectByDriverName failed with status %x",
                    status);
        return status;
    }

    status = EnumerateDriverObjectDeviceObjects(
        pci_driver_object, &pci_device_objects, &pci_device_objects_count);
    if (!NT_SUCCESS(status)) {
        DEBUG_ERROR("EnumerateDriverObjectDeviceObjects failed with status %x",
                    status);
        return status;
    }

    for (UINT32 index = 0; index < pci_device_objects_count; index++) {
        current_device = pci_device_objects[index];

        /* make sure we have a valid PDO */
        if (!IsDeviceObjectValidPdo(current_device)) {
            ObDereferenceObject(current_device);
            continue;
        }

        status = CallbackRoutine(current_device, Context);
        if (!NT_SUCCESS(status))
            DEBUG_ERROR(
                "EnumeratePciDeviceObjects CallbackRoutine failed with status %x",
                status);

        ObDereferenceObject(current_device);
    }

    if (pci_device_objects)
        ExFreePoolWithTag(pci_device_objects, POOL_TAG_HW);

    return status;
}
Then, for each PDO, we read the device's configuration space starting from PCI_VENDOR_ID_OFFSET and store the data in a PCI_COMMON_HEADER structure, using an IRP with the IRP_MN_READ_CONFIG minor function code.
STATIC
NTSTATUS
PciDeviceQueryCallback(_In_ PDEVICE_OBJECT DeviceObject, _In_opt_ PVOID Context)
{
    UNREFERENCED_PARAMETER(Context);

    NTSTATUS status = STATUS_UNSUCCESSFUL;
    PCI_COMMON_HEADER header = {0};

    status = QueryPciDeviceConfigurationSpace(
        DeviceObject, PCI_VENDOR_ID_OFFSET, &header, sizeof(PCI_COMMON_HEADER));
    if (!NT_SUCCESS(status)) {
        DEBUG_ERROR("QueryPciDeviceConfigurationSpace failed with status %x",
                    status);
        return status;
    }

    if (IsPciConfigurationSpaceFlagged(&header)) {
        DEBUG_VERBOSE("Flagged DeviceID found. Device: %llx, DeviceId: %lx",
                      (UINT64)DeviceObject,
                      header.DeviceID);
        ReportBlacklistedPcieDevice(DeviceObject, &header);
    }
    else {
        DEBUG_VERBOSE("Device: %llx, DeviceID: %lx, VendorID: %lx",
                      DeviceObject,
                      header.DeviceID,
                      header.VendorID);
    }

    return status;
}
To do so, we build an IRP (I/O Request Packet) that reads the configuration space of the PCI device, send it down to the device, wait for it to complete, and return the status of the operation.
STATIC
NTSTATUS
QueryPciDeviceConfigurationSpace(_In_ PDEVICE_OBJECT DeviceObject,
                                 _In_ UINT32 Offset,
                                 _Out_opt_ PVOID Buffer,
                                 _In_ UINT32 BufferLength)
{
    NTSTATUS status = STATUS_UNSUCCESSFUL;
    KEVENT event = {0};
    IO_STATUS_BLOCK io = {0};
    PIRP irp = NULL;
    PIO_STACK_LOCATION packet = NULL;

    if (BufferLength == 0)
        return STATUS_BUFFER_TOO_SMALL;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    /*
     * IO manager will free this IRP when the request is completed
     */
    irp = IoBuildSynchronousFsdRequest(
        IRP_MJ_PNP, DeviceObject, NULL, 0, NULL, &event, &io);
    if (!irp) {
        DEBUG_ERROR("IoBuildSynchronousFsdRequest failed with no status.");
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    packet = IoGetNextIrpStackLocation(irp);
    packet->MinorFunction = IRP_MN_READ_CONFIG;
    packet->Parameters.ReadWriteConfig.WhichSpace = PCI_WHICHSPACE_CONFIG;
    packet->Parameters.ReadWriteConfig.Offset = Offset;
    packet->Parameters.ReadWriteConfig.Buffer = Buffer;
    packet->Parameters.ReadWriteConfig.Length = BufferLength;

    status = IoCallDriver(DeviceObject, irp);
    if (status == STATUS_PENDING) {
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = io.Status;
    }

    if (!NT_SUCCESS(status))
        DEBUG_ERROR("Failed to read configuration space with status %x",
                    status);

    return status;
}
Once the configuration space is read, we check whether the device ID is among the flagged IDs; if it matches, the device is reported as blacklisted.
BOOLEAN
IsPciConfigurationSpaceFlagged(_In_ PPCI_COMMON_HEADER Configuration)
{
    for (UINT32 index = 0; index < FLAGGED_DEVICE_ID_COUNT; index++) {
        if (Configuration->DeviceID == FLAGGED_DEVICE_IDS[index])
            return TRUE;
    }

    return FALSE;
}
Now if you think hard about it, you might realize that a lot of peripherals use PCIe: network cards, Thunderbolt devices, and many more. You would also be correct in guessing that blacklisting certain PCIe configuration space signatures has led to many side effects, including disconnecting PCIe-based network cards, PCIe-based DACs, and even Thunderbolt docks. This is predictable, because PCIe DMA cheating devices sometimes share PCIe controllers with legitimate hardware. This is primarily why I hate anticheat systems: they presume guilt at all times and treat everyone as a threat actor.
VT-d Support for DMA Remapping
Previously, we talked about the basics of Intel’s VT-x extensions and how to use them to make a very basic virtual machine. While VT-x is the hardware feature to enable CPU virtualization, VT-d handles I/O virtualization, allowing direct assignment of physical devices (like GPUs or network cards) to virtual machines while maintaining isolation between them.
One of the main features of VT-d is its support for DMA remapping. At its core, DMA remapping introduces an I/O Memory Management Unit (IOMMU) that intercepts all DMA requests before they reach system memory, enforcing strict access controls and performing necessary address translations through dedicated I/O page tables.
In traditional virtualization without DMA remapping, direct device assignment to virtual machines poses significant security risks. When a device initiates DMA operations, it uses physical addresses, and without remapping capabilities, these devices could potentially access any physical memory location in the system. This unrestricted access creates a critical vulnerability where a compromised device or malicious driver in one VM could read or write to memory regions belonging to other VMs or the hypervisor itself, completely bypassing the CPU's memory protection mechanisms.

DMA remapping solves this fundamental security challenge by introducing a hardware-enforced isolation layer. The IOMMU maintains separate I/O page tables for each device or group of devices, similar to how the CPU's MMU uses page tables for process isolation. When a device initiates a DMA operation, the IOMMU performs address translation using these I/O page tables, converting the addresses used by the device (guest physical addresses in virtualization contexts) into actual system physical addresses. This translation process ensures that devices can only access memory regions explicitly mapped in their assigned I/O page tables.
Modern virtualization platforms leverage DMA remapping to implement sophisticated I/O virtualization features. For instance, in SR-IOV (Single Root I/O Virtualization) configurations, a single physical device can present multiple Virtual Functions (VFs), each assigned to different VMs. DMA remapping ensures that each VF can only access memory regions allocated to its respective VM, preventing cross-VM memory access violations. The hypervisor programs the IOMMU with separate I/O page tables for each VF, establishing strict memory boundaries that are enforced in hardware.
There are also non-VT-d mitigations like Windows Kernel DMA Protection that operate within the confines of the operating system's security model; these software-based solutions typically rely on driver frameworks and kernel-mode components to validate and control DMA operations. While they can provide adequate protection under normal circumstances, they inherently trust the operating system's integrity and depend on proper driver behavior. The protection mechanisms must be implemented within each driver and validated by the operating system, introducing multiple potential points of failure.
The performance characteristics of hardware-based DMA protection significantly differ from software solutions. VT-d includes dedicated Translation Lookaside Buffers (TLBs) for caching frequently used address translations, minimizing the performance impact of protection checks. The hardware implementation allows DMA operations to proceed at near-native speed once translations are cached. Software-based solutions, conversely, must intercept and validate DMA operations through driver code execution, potentially introducing variable latency and CPU overhead.
Implementation
The main idea for this implementation comes from this blog post by tandasat and this PoC by cutecatsandvirtualmachines. We begin with the initialization of DMA remapping structures, which are discovered through ACPI tables, specifically the DMAR (DMA Remapping) table on Intel systems. The ProcessDmarTable function is responsible for parsing the DMAR ACPI table.
EFI_STATUS ProcessDmarTable(
    IN EFI_ACPI_DMAR_HEADER* DmarTable,
    IN OUT DMAR_UNIT_INFORMATION* DmarUnits,
    IN UINT64 MaxDmarUnitCount,
    OUT UINT64* DetectedUnitCount)
The function begins with a crucial security check using MmIsAddressValid to verify that the DMAR table pointer resides in valid memory space. This is essential because ACPI tables could potentially be tampered with, and accessing invalid memory could lead to system crashes or security vulnerabilities.
{
    if (!MmIsAddressValid(DmarTable)) {
        DbgMsg("[VT-d] DMAR table ptr is invalid: %p", DmarTable);
        return STATUS_NOT_MAPPED_DATA;
    }
The DMAR table traversal is implemented through pointer arithmetic and structure casting. Here, Add2Ptr calculates the end address of the DMAR table using the table's length field. The DmarTable + 1 operation skips past the main header to the first remapping structure. This arithmetic is safe because the DMAR table's length was validated by the ACPI subsystem during boot.
endOfDmar = (UINT64)Add2Ptr(DmarTable, DmarTable->Header.Length);
dmarHeader = (EFI_ACPI_DMAR_STRUCTURE_HEADER*)(DmarTable + 1);
The main processing loop identifies DMA Remapping Hardware Unit Definition (DRHD) structures by checking the type field. Each DRHD structure describes a remapping hardware unit capable of DMA transaction remapping. The type EFI_ACPI_DMAR_TYPE_DRHD specifically indicates a hardware definition structure, as opposed to other DMAR structure types like Reserved Memory Regions (RMRR) or Root Port ATS Capability (ATSR).
if (dmarHeader->Type == EFI_ACPI_DMAR_TYPE_DRHD)
For each discovered unit, the function reads two critical capability registers. The Capability Register (CAP_REG) describes fundamental features like supported address widths, caching requirements, and first-level translation support. The Extended Capability Register (ECAP_REG) describes advanced features like interrupt remapping, queued invalidation support, and page-walk coherency capabilities. These are read using memory-mapped I/O operations, as the registers live in the remapping unit's register space at the base address reported by the DRHD structure.
DmarUnits[discoveredUnitCount].Capability.Uint64 =
CPU::MmIoRead<DWORD64>(DmarUnits[discoveredUnitCount].RegisterBasePa + R_CAP_REG);
DmarUnits[discoveredUnitCount].ExtendedCapability.Uint64 =
CPU::MmIoRead<DWORD64>(DmarUnits[discoveredUnitCount].RegisterBasePa + R_ECAP_REG);
The function implements bounded discovery through MaxDmarUnitCount. This prevents buffer overflows in the DmarUnits array while allowing for systems with multiple remapping units. The zero initialization of DmarUnits using RtlZeroMemory ensures that any unused entries remain in a known state. We then zero out the register base address after reading to prevent potential reuse of the address by malicious code that might later access the DMAR table in memory.
if (discoveredUnitCount < MaxDmarUnitCount)
{
    EFI_ACPI_DMAR_DRHD_HEADER* dmarUnit;

    dmarUnit = (EFI_ACPI_DMAR_DRHD_HEADER*)dmarHeader;
    DmarUnits[discoveredUnitCount].RegisterBasePa = dmarUnit->RegisterBaseAddress;
    DmarUnits[discoveredUnitCount].Capability.Uint64 =
        CPU::MmIoRead<DWORD64>(DmarUnits[discoveredUnitCount].RegisterBasePa + R_CAP_REG);
    DmarUnits[discoveredUnitCount].ExtendedCapability.Uint64 =
        CPU::MmIoRead<DWORD64>(DmarUnits[discoveredUnitCount].RegisterBasePa + R_ECAP_REG);
    dmarUnit->RegisterBaseAddress = 0;
}
The function's error handling covers the remaining failure cases (alongside the invalid DMAR table pointer check at the start), each returning a specific status code that allows the caller to respond appropriately.

STATUS_UNSUCCESSFUL: no remapping units found

if (discoveredUnitCount == 0) {
    DbgMsg("[VT-d] No DMA remapping hardware unit found");
    return STATUS_UNSUCCESSFUL;
}

STATUS_RESOURCE_NOT_OWNED: too many units for the DmarUnits array

if (discoveredUnitCount > MaxDmarUnitCount) {
    DbgMsg("[VT-d] Too many DMA remapping hardware units found (%llu)",
           discoveredUnitCount);
    return STATUS_RESOURCE_NOT_OWNED;
}
The implementation establishes a four-level page table hierarchy for DMA address translation, mirroring the structure used by modern x86-64 CPU memory management. This hierarchy is initialized through the BuildPassthroughTranslations function, which creates an identity mapping for all PCI devices up to 512GB of physical memory. The function meticulously constructs root tables, context tables, and second-level page tables.
VOID BuildPassthroughTranslations(OUT DMAR_TRANSLATIONS* Translations)
{
    VTD_ROOT_ENTRY defaultRootValue;
    VTD_CONTEXT_ENTRY defaultContextValue;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pdpt;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pd;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pml4e;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pdpte;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pde;
The implementation supports granular memory protection through page splitting mechanisms. When fine-grained control is needed over a specific 4KB page within a 2MB large page, the Split2MbPage function dynamically splits the large page into 512 individual 4KB pages. This operation is crucial for implementing precise memory protection policies.
VTD_SECOND_LEVEL_PAGING_ENTRY* Split2MbPage(IN OUT VTD_SECOND_LEVEL_PAGING_ENTRY* PageDirectoryEntry)
{
    VTD_SECOND_LEVEL_PAGING_ENTRY* pageTable;
    UINT64 baseAddress;
    // inherit the permissions of the original 2MB mapping
    BOOLEAN readable = (BOOLEAN)PageDirectoryEntry->Bits.Read;
    BOOLEAN writable = (BOOLEAN)PageDirectoryEntry->Bits.Write;

    pageTable = (VTD_SECOND_LEVEL_PAGING_ENTRY*)cpp::kMalloc(PAGE_SIZE);
    if (!pageTable)
        return NULL;
    baseAddress = ((UINT64)PageDirectoryEntry->Bits.AddressLo << 12) |
                  ((UINT64)PageDirectoryEntry->Bits.AddressHi << 32);
    for (UINT64 ptIndex = 0; ptIndex < 512; ++ptIndex) {
        pageTable[ptIndex].Uint64 = baseAddress;
        pageTable[ptIndex].Bits.Read = readable;
        pageTable[ptIndex].Bits.Write = writable;
        baseAddress += PAGE_SIZE;
    }
    return pageTable;
}
The function constructs a page table hierarchy consisting of:
Root Table: The highest level table containing 256 entries, one for each PCI bus number
Context Table: Referenced by root entries, containing device and function specific translations
Second-level page tables: PML4, PDPT (Page Directory Pointer Table), PD (Page Directory), and PT (Page Table) forming a four-level address translation structure
The root table initialization is particularly important:
defaultRootValue.Uint128.Uint64Hi = defaultRootValue.Uint128.Uint64Lo = 0;
UINT64 contextTable = (UINT64)Memory::VirtToPhy(Translations->ContextTable);
defaultRootValue.Bits.ContextTablePointerLo = (UINT32)(contextTable >> 12);
defaultRootValue.Bits.ContextTablePointerHi = (UINT32)(contextTable >> 32);
defaultRootValue.Bits.Present = TRUE;
Each root entry is a 128-bit structure that points to a context table. The physical address of the context table is split into high and low components because the hardware expects the address to be page-aligned (hence the right shift by 12). The Present bit indicates that the entry is valid and can be used by the hardware.
The context table setup demonstrates the configuration for 48-bit addressing. The AddressWidth field set to BIT1 (010b) specifically configures for 48-bit guest addresses, requiring four-level page tables. The DomainIdentifier provides isolation between different sets of remapping tables.
defaultContextValue.Bits.DomainIdentifier = 2;
defaultContextValue.Bits.AddressWidth = BIT1; // 010b: 48-bit AGAW (4-level page table)
defaultContextValue.Bits.SecondLevelPageTranslationPointerLo = (UINT32)(Pml4 >> 12);
defaultContextValue.Bits.SecondLevelPageTranslationPointerHi = (UINT32)(Pml4 >> 32);
defaultContextValue.Bits.Present = TRUE;
The second-level page tables implement the actual memory mapping.
destinationPa = 0;
pml4Index = 0;
pdpt = Translations->SlPdpt[pml4Index];
pml4e = &Translations->SlPml4[pml4Index];
pml4e->Uint64 = (UINT64)Memory::VirtToPhy(pdpt);
pml4e->Bits.Read = TRUE;
pml4e->Bits.Write = TRUE;
The code uses 2MB large pages at the PD level to reduce the number of page tables needed.
pde = &pd[pdIndex];
pde->Uint64 = destinationPa;
pde->Bits.Read = TRUE;
pde->Bits.Write = TRUE;
pde->Bits.PageSize = TRUE;
destinationPa += SIZE_2MB;
The PageSize bit set to TRUE indicates a 2MB page rather than a reference to a page table of 4KB pages. This optimization significantly reduces memory overhead while still providing sufficient granularity for most DMA operations.
Cache coherency is maintained through explicit writeback operations. This is crucial because the IOMMU hardware reads these tables directly from memory, and any cached modifications must be written back to ensure the hardware sees the updated values.
CPU::WriteBackDataCacheRange(Translations, sizeof(*Translations));
The identity mapping is created by setting the destination physical address equal to the input address (destinationPa variable), effectively making DMA operations initially transparent to devices while still going through the remapping hardware. This allows for later modification of the mappings to implement protection or isolation without requiring device driver changes.
The entire structure supports mapping up to 512GB of physical memory (one PML4 entry × 512 PDPT entries × 512 PD entries × 2MB per page), which is sufficient for most systems while keeping the page table structure manageable. The use of large pages significantly reduces the memory overhead and translation latency compared to using 4KB pages throughout the hierarchy.
ChangePermissionOfPageForAllDevices then orchestrates fine-grained DMA access control by manipulating the hardware's page table entries. The function's sophisticated implementation allows for atomic permission modifications while maintaining system stability and security guarantees. At its core, it operates on the Intel VT-d second-level page table structure, which provides hardware-enforced DMA access control at a 4KB page granularity.
EFI_STATUS ChangePermissionOfPageForAllDevices(
    IN OUT DMAR_TRANSLATIONS* Translations,
    IN UINT64 Address,
    IN BOOLEAN AllowReadWrite,
    OUT VTD_SECOND_LEVEL_PAGING_ENTRY** AllocatedPageTable)
{
    PHYSICAL_ADDRESS pa = { 0 };
    EFI_STATUS status;
    ADDRESS_TRANSLATION_HELPER helper;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pde;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pt;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pte;
At the core of VT-d's permission management system lies the translation helper structure, which provides a precise mechanism for breaking down physical addresses into their constituent page table indices.
The function employs a critical address translation mechanism through the ADDRESS_TRANSLATION_HELPER union structure, which provides a binary-compatible overlay that decomposes a 48-bit physical address into its constituent page table indices. The decomposition splits the address bits into PML4 (bits 47-39), PDPT (bits 38-30), PD (bits 29-21), and PT (bits 20-12) indices, with the remaining bits (11-0) representing the page offset. This decomposition is crucial for traversing the page table hierarchy efficiently and accurately.
typedef union _ADDRESS_TRANSLATION_HELPER {
    UINT64 AsUInt64;
    struct {
        UINT64 Offset : 12;   // Bits 0-11: page offset within 4KB
        UINT64 Pt : 9;        // Bits 12-20: Page Table index
        UINT64 Pd : 9;        // Bits 21-29: Page Directory index
        UINT64 Pdpt : 9;      // Bits 30-38: Page Directory Pointer Table index
        UINT64 Pml4 : 9;      // Bits 39-47: PML4 index
        UINT64 Reserved : 16; // Bits 48-63: must be zero for valid addresses
    } AsIndex;
} ADDRESS_TRANSLATION_HELPER;
Memory management safety is enforced through rigorous boundary checking. The function validates the target address by examining its PML4 index: since the passthrough translations only populate the first PML4 entry, any address with a non-zero PML4 index lies beyond the 512GB covered by the tables and triggers an immediate failure with STATUS_UNSUCCESSFUL. This validation prevents walks into unpopulated page table structures.
The function handles the complex scenario of 2MB large pages through a sophisticated page splitting mechanism. When encountering a page directory entry (PDE) marked with PageSize=TRUE (indicating a 2MB page), it invokes the Split2MbPage function. This operation atomically converts a single 2MB mapping into 512 individual 4KB page mappings, maintaining the original permissions while enabling fine-grained control. The splitting operation must carefully manage memory allocation, permission inheritance, and cache coherency to prevent any temporal vulnerabilities during the transition.
The permission modification process involves intricate physical memory manipulation. The function reconstructs the physical address of the target page table by combining the split address fields (AddressHi and AddressLo) from the page directory entry. This address is then temporarily mapped into the kernel's virtual address space using MmMapIoSpaceEx with specific caching attributes (PAGE_READWRITE | PAGE_NOCACHE) to ensure direct hardware access, which prevents stale cache lines from interfering with IOMMU operations.
We then need to write back modified page table entries using CPU::WriteBackDataCacheRange to ensure that all CPU caches are flushed to memory before the IOMMU hardware accesses the modified entries. Without proper cache management, there exists a race condition where the IOMMU might use stale permissions from its internal caches or encounter inconsistent memory state due to un-written CPU cache lines.
The real-time permission modification occurs through direct manipulation of the page table entry's permission bits. The function updates both Read and Write bits atomically based on the AllowReadWrite parameter. When these bits are cleared, the IOMMU hardware will actively block any DMA operations targeting the corresponding 4KB page, raising a remapping fault that can be logged and handled by the system software. This hardware-enforced blocking occurs without any runtime overhead once the permissions are set.
// Locate the PDE for our target address
pde = &Translations->SlPd[helper.AsIndex.Pml4][helper.AsIndex.Pdpt][helper.AsIndex.Pd];

// If this is a 2MB page, split it into 4KB pages
if (pde->Bits.PageSize != FALSE) {
    *AllocatedPageTable = Split2MbPage(pde);
    if (*AllocatedPageTable == NULL) {
        status = STATUS_RESOURCE_NOT_OWNED;
        goto Exit;
    }
}

// Get the physical address of the page table
pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)(((UINT64)pde->Bits.AddressLo << 12) |
                                      ((UINT64)pde->Bits.AddressHi << 32));

// Map the page table into kernel virtual address space
pa.QuadPart = (ULONGLONG)pt;
pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)MmMapIoSpaceEx(
    pa,
    PAGE_SIZE,
    PAGE_READWRITE | PAGE_NOCACHE);

// Update the specific PTE's permissions
pte = &pt[helper.AsIndex.Pt];
pte->Bits.Read = AllowReadWrite;
pte->Bits.Write = AllowReadWrite;

// Ensure changes are written to memory
CPU::WriteBackDataCacheRange(pte, sizeof(*pte));
The IOMMU hardware maintains its own Translation Lookaside Buffer (IOTLB), which caches address translations and permissions for performance. When page table entries are modified, these cached values must be explicitly invalidated so that the new permissions take immediate effect. The invalidation is carried out through specific MMIO registers in the IOMMU, particularly the IOTLB Invalidate Register. This register not only triggers the invalidation but also provides granular control over the scope and type of invalidation performed. The invalidation command includes flags for draining both read (DR) and write (DW) requests, ensuring that all in-flight DMA operations complete before the new permissions take effect.
The IIRG (Invalidation Request Granularity) field allows for selective invalidation targeting specific domains or global invalidation of all entries. Furthermore, the hardware implements a handshake mechanism where software must wait for the IVT (Invalidate IOTLB) bit to clear, indicating completion of the invalidation process, before proceeding. This synchronization is crucial because without it, there would be a race condition where DMA operations might still use cached permissions from the old page table entries, potentially bypassing the intended access restrictions.
typedef struct _VTD_IOTLB_INVALIDATE_REG {
    union {
        struct {
            UINT64 Reserved1 : 32; // Bits 0-31: reserved
            UINT64 Domain_Id : 16; // Bits 32-47: domain ID for selective invalidation
            UINT64 DW : 1;         // Bit 48: Drain Writes
            UINT64 DR : 1;         // Bit 49: Drain Reads
            UINT64 Reserved2 : 7;  // Bits 50-56: reserved
            UINT64 IAIG : 2;       // Bits 57-58: actual invalidation granularity
            UINT64 Reserved3 : 1;  // Bit 59: reserved
            UINT64 IIRG : 2;       // Bits 60-61: invalidation request granularity
            UINT64 Reserved4 : 1;  // Bit 62: reserved
            UINT64 IVT : 1;        // Bit 63: Invalidate IOTLB
        } Bits;
        UINT64 Uint64;
    };
} VTD_IOTLB_INVALIDATE_REG, *PVTD_IOTLB_INVALIDATE_REG;

// Invalidate the IOTLB and wait for completion
reg.InvalidateIoTlb(B_IOTLB_REG_IVT | V_IOTLB_REG_IIRG_GLOBAL | V_IOTLB_REG_DR | V_IOTLB_REG_DW);
while ((CPU::MmIoRead<UINT32>(reg.RegisterBase + R_IOTLB_REG) & B_IOTLB_REG_IVT) != 0) {
    _mm_pause(); // wait for the invalidation to complete
}
After ChangePermissionOfPageForAllDevices controls the read/write access rights through the page table entry's permission bits, ChangePointerOfPageForAllDevices extends this protection by actually redirecting the physical address translation in the IOMMU page tables.
ChangePointerOfPageForAllDevices implements DMA remapping by modifying the physical address translation within the IOMMU page tables, which allows for transparent redirection of DMA operations from one physical page to another.
EFI_STATUS ChangePointerOfPageForAllDevices(
    IN OUT DMAR_TRANSLATIONS* Translations,
    IN UINT64 Address,
    IN UINT64 SubstituteAddress,
    OUT VTD_SECOND_LEVEL_PAGING_ENTRY** AllocatedPageTable)
{
    PHYSICAL_ADDRESS pa = { 0 };
    EFI_STATUS status;
    ADDRESS_TRANSLATION_HELPER helper;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pde;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pt;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pte;
The address validation mechanism implements a dual-layer verification process. It first checks if the provided address is a valid virtual address using MmIsAddressValid, and if so, performs the virtual-to-physical translation through Memory::VirtToPhy. This flexibility allows the function to handle both virtual and physical addresses seamlessly while maintaining system security. The ADDRESS_TRANSLATION_HELPER union then decomposes the resulting physical address into its constituent page table indices.
if (MmIsAddressValid((PVOID)Address))
    Address = Memory::VirtToPhy((PVOID)Address);

helper.AsUInt64 = Address;
DbgMsg("[VT-d] Target 0x%llx at pml4: 0x%llx, pdpt: 0x%llx, pdt: 0x%llx, pt: 0x%llx",
       helper.AsUInt64,
       helper.AsIndex.Pml4, helper.AsIndex.Pdpt, helper.AsIndex.Pd, helper.AsIndex.Pt);
The page table manipulation process begins with locating the appropriate Page Directory Entry (PDE) through the calculated indices. When encountering a 2MB large page, indicated by the PageSize bit in the PDE, the function invokes the specialized Split2MbPage operation. This complex procedure allocates a new page table and redistributes the large page mapping into 512 individual 4KB page entries, maintaining consistency throughout the transition.
pde = &Translations->SlPd[helper.AsIndex.Pml4][helper.AsIndex.Pdpt][helper.AsIndex.Pd];
if (pde->Bits.PageSize != FALSE)
{
    *AllocatedPageTable = Split2MbPage(pde);
    if (*AllocatedPageTable == NULL)
    {
        status = STATUS_RESOURCE_NOT_OWNED;
        goto Exit;
    }
}
The actual address remapping reconstructs the physical address of the page table from the PDE's split address fields, maps it into the kernel's address space using MmMapIoSpaceEx with specific caching attributes, and modifies the target PTE. The substitute address undergoes similar virtual-to-physical translation if necessary, and its lower and upper portions are stored in the PTE's address fields.
pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)(((UINT64)pde->Bits.AddressLo << 12) |
                                      ((UINT64)pde->Bits.AddressHi << 32));
pa.QuadPart = (ULONGLONG)pt;
pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)MmMapIoSpaceEx(pa, PAGE_SIZE, PAGE_READWRITE | PAGE_NOCACHE);

pte = &pt[helper.AsIndex.Pt];
if (MmIsAddressValid((PVOID)SubstituteAddress))
    SubstituteAddress = Memory::VirtToPhy((PVOID)SubstituteAddress);

pte->Bits.AddressLo = SubstituteAddress >> 12;
pte->Bits.AddressHi = SubstituteAddress >> 32;
While some anticheat vendors have used interesting methods to limit attacker visibility into game-related memory regions, such as CPU paging-based guarded regions (which Vanguard famously already implements), those methods can easily be bypassed using tools like MemProcFS. This method, though, is really a proof-of-concept shitpost rather than anything that could actually be deployed in a production environment today; a lot of driver interactions need to be taken into account before one could truly implement it. Still, it would be nice if anticheats could stop expecting the worst of people who just want to play fairly.






