The Elusive Apple Matrix Coprocessor (AMX)

I paid for the entire computer, so I'm gonna use the entire computer

Cover Illustration by ochxzuke

I was heading out for a trip over the weekend, where I would be far from my main workstation, and at the time I was looking to continue my current run of Persona 5 Royal on the PC. Recently, with the launch of macOS Sonoma, Apple added support for the translation of AVX2 instructions through Game Porting Toolkit 2.

But I quickly found that while there are translations for AVX2 instructions, they’re not very good. This is because many of the AVX2 instructions are mapped to Arm NEON equivalents, and NEON has a relatively limited and less flexible instruction set compared to its x86 counterparts like Intel’s AVX2. NEON lacks certain advanced operations, such as comprehensive shuffle and permutation instructions, which are readily available in AVX.

While Rosetta emulates other Intel SIMD technologies like SSE super fast, it skips over support for AVX because the patents on AVX are still valid, which Intel is very eager to remind its competitors about. Emulating the Intel ISA without authorization would likely land Apple in legal trouble, which is why they would likely need to license the ISA in order to incorporate it into Rosetta.

This is why workarounds often involve translating one AVX instruction into a multi-step NEON sequence to achieve what AVX can accomplish with a single operation, creating bottlenecks in translation speed. This is where I fell into a rabbit hole of Arm NEON performance on macOS and also the mysterious Apple Matrix Extensions (AMX) coprocessor.

Brief history of AMX

In 2019, at Apple’s fall keynote introducing new iPads, Apple Watches, and iPhones, there was a brief 30-second moment on stage where they introduced a new coprocessor inside the A13 Bionic chip. What was vaguely called “Machine Learning Accelerators” added two SIMD units that perform accelerated matrix operations to the CPU, which Apple claimed increased the matrix multiplication performance of the A13 Bionic by six times, allowing the CPU to achieve one trillion operations per second.

Many thought that this was an addition to the Apple Neural Engine (ANE) accelerators previously introduced in the A11 Bionic, but it seems these dedicated accelerators are separate from the ANE.

The Apple Matrix Coprocessor (AMX) operates as a specialized coprocessor that interfaces directly with the instruction stream rather than functioning as a discrete accelerator. Unlike traditional accelerators such as GPUs or Apple's Neural Engine, AMX implements instruction stream monitoring capabilities, allowing it to intercept and process specific matrix operations embedded within the standard instruction flow.

The AMX's implementation diverges from conventional accelerator designs by eliminating the need for explicit memory management and instruction queue population typically associated with GPU-style accelerators. This design choice significantly reduces operational overhead, particularly beneficial for smaller matrix computations where traditional accelerator setup costs would be prohibitive. The coprocessor achieves this efficiency by monitoring the instruction cache feed to the CPU cores, identifying and intercepting matrix-specific instructions for specialized processing.

From an instruction set architecture (ISA) perspective, the AMX represents a non-standard extension to ARM's base instruction set. This implementation required special dispensation from ARM Ltd., which historically restricted custom instruction set extensions until their 2019 policy revision. The AMX instructions are interleaved with standard ARM instructions but remain undocumented in official specifications, accessible primarily through Apple's Accelerate framework components including vImage, BLAS, BNNS, vDSP, and LAPACK.

Performance analysis conducted by Nod Labs (now acquired by AMD) demonstrates that AMX achieves approximately twice the computational throughput compared to ARM's native NEON SIMD instructions for matrix operations. This performance differential is particularly significant for machine learning workloads and high-performance computing applications that heavily utilize matrix computations. The coprocessor's efficiency stems from its tight integration with the CPU's instruction pipeline, enabling lower-latency operation compared to discrete accelerators.

Technicals

These cores are more commonly known as the AMX: undocumented arm64 ISA extensions present on Apple Silicon chips. In the M-series chips, the class of Apple Silicon processors for desktop use, the number of AMX cores was bumped from 2 to 4. The existence of these instructions has been reverse-engineered by Dougall Johnson.

Thanks to abandoned patent US20180074824A1 from Apple we can see that the AMX instructions operate on a 32x32 grid of compute units. Each unit is capable of performing 16-bit multiply-accumulate operations. The architecture allows for flexibility in data width, as 2x2 subgrids of units can perform 32-bit multiply-accumulate operations, and 4x4 subgrids can handle 64-bit multiply-accumulate operations. To feed this computational grid, AMX utilizes a pool of X registers and a pool of Y registers. Each of these registers contains 32 16-bit elements, 16 32-bit elements, or 8 64-bit elements. This structure enables a single instruction to perform a full outer product operation, multiplying every element of an X register with every element of a Y register and accumulating the results with the corresponding Z element.
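
To make the outer-product semantics concrete, here is a scalar sketch of what a single 16-bit outer-product accumulate amounts to (illustrative C only, not actual AMX code; the hardware does this in one instruction):

#include <stdint.h>

// Z[i][j] += Y[i] * X[j] for every pair of lanes: a full 32x32 outer product
// per instruction, shown here with 16-bit multiplicands accumulating onto 32 bits.
static void outer_product_accumulate(const int16_t x[32], const int16_t y[32],
                                     int32_t z[32][32])
{
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < 32; j++)
            z[i][j] += (int32_t)y[i] * (int32_t)x[j];
}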

The AMX architecture supports various data types for computation. It can handle IEEE754 floating-point numbers in f16, f32, or f64 formats, with the same width used for all three operands in fused-multiply-add operations. Additionally, it supports operations using f16 multiplicands while accumulating onto f32. Integer operations are also supported, with 8-bit or 16-bit multiplicands accumulating onto 16 or 32 bits in various signedness configurations. The M2 hardware introduced support for bfloat16 (bf16) multiplicands, which can accumulate onto either bf16 or IEEE754 f32.

While the A-series chips contained version 1 (marked by the existence of 7-bit writemasks), the M1 chip is believed to contain version 2 of the AMX instructions that contained 9-bit writemasks instead. The transition from M1 to M2 brought bf16 support along with other minor adjustments. The M2 to M3 transition further expanded capabilities by adding an extra mode to each of the ldx, ldy, and matint instructions.

The AMX coprocessor's state comprises two 0x200 byte registers, amx0 ("x") and amx1 ("y"), and one 0x1000 byte register, amx2 ("z"). Apple's documentation describes x, y, and z as register groups, with each 64-byte row considered a "register". The z register group is specifically described as "64 registers in an M-by-N matrix". Additionally, the AMX configuration happens at the level of the AMX_CONFIG_EL1/EL12/EL2/EL21 registers, with AMX_STATE_T_EL1 and AMX_CONTEXT_EL1 also present.
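
As a mental model, the per-core architectural state described above can be pictured as a plain struct (my own sketch; the field names are not Apple's):

#include <stdint.h>

// 0x200 + 0x200 + 0x1000 bytes of AMX state, viewed as 64-byte rows.
struct amx_state {
    uint8_t x[8][64];   // amx0 "x": 8 rows of 64 bytes (0x200 bytes)
    uint8_t y[8][64];   // amx1 "y": 8 rows of 64 bytes (0x200 bytes)
    uint8_t z[64][64];  // amx2 "z": 64 rows of 64 bytes (0x1000 bytes)
};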

AMX instructions follow a specific format:

0x00201000 | ((op & 0x1F) << 5) | (operand & 0x1F)

The coprocessor must be explicitly enabled using op=17, operand=0 and disabled using op=17, operand=1. In the Accelerate framework, these instructions are consistently prefixed by three NOPs. Executing non-enable instructions when AMX is disabled results in illegal instruction exceptions.
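
In practice these raw encodings are emitted from C with inline assembly, in the style of the publicly available reverse-engineered headers (the macro names below are mine, not Apple's):

// Three NOPs followed by the raw 32-bit encoding, as observed in Accelerate.
#define AMX_NOP_OP_IMM5(op, imm5)                                   \
    __asm__ volatile("nop\nnop\nnop\n"                              \
                     ".word (0x00201000 + (%0 << 5) + %1)"          \
                     :: "i"(op), "i"(imm5) : "memory")

#define AMX_ENABLE()  AMX_NOP_OP_IMM5(17, 0)  // op=17, operand=0
#define AMX_DISABLE() AMX_NOP_OP_IMM5(17, 1)  // op=17, operand=1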

Most AMX operations (op=0-16 and op=18-22) use a 64-bit register number (X0-X30 or 31=XZR) as the operand. This register typically contains a bitfield with additional operation parameters. For instance, load and store operations use a 56-bit address in bits 0-55, a register offset (in 0x40-byte units) in bits 56-61, and a flag in bit 62 (0 for 0x40-byte operations, 1 for 0x80-byte aligned operations).
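
A small helper makes that bitfield layout explicit (a sketch of the Z-row load/store operand; the X and Y loads use a narrower offset field, as in the wrapper code further down):

#include <stdint.h>

// Pack a load/store operand: address in bits 0-55, row offset (0x40-byte units)
// in bits 56-61, and the 0x80-byte "pair" flag in bit 62.
static inline uint64_t amx_ldst_operand(const void *addr, uint64_t row, int pair)
{
    return ((uint64_t)(uintptr_t)addr & ((1ull << 56) - 1))
         | ((row & 0x3F) << 56)
         | ((uint64_t)(pair & 1) << 62);
}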

The AMX instruction set includes various operations:

  1. Load/store operations (ops 0-7)

  2. Extract and move operations (ops 8-9)

  3. Floating-point multiply-add operations (ops 10-13, 15-16)

  4. Integer multiply-add operations (op 14)

  5. Vector and matrix operations (ops 18-21)

  6. Lookup table generation (op 22)

The AMX instruction set includes LDX/LDY for loading data, FMA for multiply-accumulate operations, and LDZ/STZ for loading and storing results. The FMA instruction operates in two modes: Matrix Mode for outer product computations and Vector Mode for inner product calculations. The outer product mode offers higher hardware parallelism compared to the inner product mode.

The z register group occupies 0x1000 bytes (4096 bytes) of state. This larger group is divided into 64 registers, each also being 64 bytes wide. The z group is primarily used to store the results of operations performed on the x and y registers. This arrangement means that outer product results are not stored contiguously in registers, which has implications for how data is accessed and processed.

AMX's performance characteristics are noteworthy. It can achieve different levels of parallelism depending on how many Arithmetic Logic Units (ALUs) are enabled. Tests show that enabling all 4 ALUs can reach up to 2k GFLOP/s, demonstrating that ALUs can execute in parallel even when configured and emitted individually.

Load performance varies based on factors such as whether data is loaded into X and Y registers simultaneously, whether memory accesses are consecutive, and how many registers are used for reading. The highest bandwidth (219.201 GB/s) was achieved when loading 4 registers each from X and Y simultaneously with consecutive memory access.

Store performance is generally slower than compute and load operations, indicating that frequent loading and storing of Z registers should be avoided for optimal performance. The maximum bandwidth achieved for store operations was around 13 GB/s.

When designing kernels to leverage AMX, it's crucial to balance computation and data loading. One effective strategy involves loading 2 groups each of M and N data, allowing for 8 computation cycles that maximize ALU utilization. This approach allows for streaming computation, achieving near-peak performance.

Performance-wise, AMX functions as a non-speculative coprocessor, with operations posted via the CPU cores' store units. The M1 chip features two AMX coprocessors: one for the four Firestorm cores and another for the four Icestorm cores. Each coprocessor maintains four copies of the architectural register state, one per core. They access memory through the same L2 cache as the cores and operate at similar clock speeds.

The performance variant of AMX consists of an array of 4-cycle latency, pipelined FMAs, achieving a throughput of one FMA32 or FMA64 instruction per cycle, but only one FMA16 instruction every two cycles. The efficiency variant maintains the 4-cycle FMA32/FMA64 latency but performs one FMA32 or FMA64 instruction every four cycles, or one FMA16 instruction every eight cycles.

To achieve 1-cycle throughput from a single core, destinations must be independent (using a Z offset). Operations utilizing too much of the Z register will experience lower throughput. Throughput can be improved by distributing operations across different cores, thus leveraging entirely different Z registers.

Performance Characteristics

We can compare the performance of SIMD operations performed by Arm NEON and Apple’s AMX by testing a few different strategies: first using OpenBLAS, then using Apple AMX directly, and finally through the official route, Apple’s Accelerate library. These tests were run on a 14-inch MacBook Pro with the M1 Pro chip and 16 GB of RAM.

OpenBLAS (Arm NEON) Performance

For a baseline comparison we can use OpenBLAS, an open-source implementation of the BLAS and LAPACK APIs with many optimizations for specific processor types, including Arm NEON and SVE (of note, Apple Silicon up to the M3 does not support the Arm SVE ISA, as those chips are still based on the Armv8 ISA).
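
The benchmark classes below derive from a small GEMM<T> interface that isn't shown here; a plausible minimal version looks like this:

#pragma once
#include <cstddef>

// Minimal base interface assumed by the benchmark classes that follow.
template <typename T>
class GEMM
{
public:
    virtual ~GEMM() = default;
    virtual void init_matrices() = 0;  // fill A, B, and C with test data
    virtual void run() = 0;            // compute C = A * B
};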

#pragma once
#include <cblas.h>
#include <vector>
#include <algorithm>
#include <cstddef>      // size_t
#include <type_traits>  // std::is_same_v

template <typename T>
class OpenBLASGEMM : public GEMM<T>
{
private:
    size_t n;
    std::vector<T> a, b, c;

public:
    explicit OpenBLASGEMM(size_t n) : n(n), a(n * n), b(n * n), c(n * n) {}

    void run() override
    {
        if constexpr (std::is_same_v<T, float>) {
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0f,
                        a.data(), n, b.data(), n, 0.0f, c.data(), n);
        } else if constexpr (std::is_same_v<T, double>) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0,
                        a.data(), n, b.data(), n, 0.0, c.data(), n);
        }
    }

    void init_matrices() override
    {
        std::fill(a.begin(), a.end(), T(1));
        std::fill(b.begin(), b.end(), T(1));
        std::fill(c.begin(), c.end(), T(0));
    }
};
N       GFLOP/s    mean(rt, s)   min(rt, s)   max(rt, s)
64      79.643     0.00001       0.00001      0.00001
128     63.995     0.00040       0.00007      0.00599
256     212.650    0.00087       0.00016      0.00705
512     371.408    0.00087       0.00072      0.00130
1024    423.668    0.00759       0.00507      0.02126
2048    448.801    0.04301       0.03828      0.04739
4096    570.152    0.27678       0.24106      0.32018
8192    574.347    2.22609       1.91437      2.91120

OpenBLAS demonstrates varying performance characteristics across different matrix dimensions. For compact matrices, specifically at N=64, OpenBLAS achieves 79.6 GFLOP/s. This performance gradually improves as matrix sizes increase, reaching 371 GFLOP/s at N=512. The library's response times for these smaller matrices, while relatively brief, show a noticeable increase starting from N=128. As matrix dimensions grow, OpenBLAS exhibits improved performance metrics. At N=1024, for instance, OpenBLAS reaches 423 GFLOP/s, marking a substantial improvement from smaller sizes.

The performance characteristics of OpenBLAS undergo further changes as matrix sizes grow beyond N=1024. In this range, the library exhibits longer response times, with this trend becoming particularly pronounced for large matrices such as N=4096 and N=8192. At these dimensions, OpenBLAS's mean response times increase significantly. The largest tested matrix size, N=8192, sees OpenBLAS achieving 574 GFLOP/s, but comes at the cost of increased computational time, with mean response times surpassing 2.2 seconds.

Rawdogging Apple AMX

Now on to something more exotic: interacting with Apple AMX directly through a header file built by Dougall Johnson. We're going to have to call a lot of AMX-related instructions through this wrapper, but in exchange we get direct control over the coprocessor.

We begin by writing a micro-kernel that computes a 32x32 tile of the output matrix C, leveraging the AMX's ability to work with large data blocks efficiently. The loop below loads four 16x32 blocks of C into the Z registers. The i << 2 in the offset argument selects a different group of Z registers for each block, effectively utilizing the full 64 Z registers available in the AMX.

for (int i = 0; i < 4; i++) {
    amx_ldz((uint8_t *)(C + i * 16 * ldc), i << 2, 1);
}

The main computation loop iterates over the K dimension, loading a column of A and a row of B in each iteration. These instructions load 32 elements from A into a Y register and 32 elements from B into an X register. The AMX can then perform an outer product of these vectors, accumulating the result into the Z registers.

amx_ldy((uint8_t *)(A + k), 0, 1);
amx_ldx((uint8_t *)(B + k * ldb), 0, 1);

Next we create a nested loop that performs four FMA32 operations, each computing a 16x16 block of the 32x32 output tile. The arguments to amx_fma32 select different portions of the X and Y registers and different groups of Z registers for each operation.

for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 2; j++) {
        amx_fma32(j, i, i * 2 + j, 0);
    }
}

To maximize performance, we can further employ matrix-packing strategies to rearrange the input matrices into a more cache-friendly layout. We do this by packing 32-element column segments of A into contiguous memory, improving spatial locality and reducing cache misses during the micro-kernel's execution.

static void pack_matrix_a(float *dest, const float *src, uint64_t M, uint64_t K, uint64_t lda) {
    for (uint64_t i = 0; i < M; i += 32) {
        for (uint64_t k = 0; k < K; k++) {
            for (uint64_t ii = 0; ii < 32 && i + ii < M; ii++) {
                dest[k * 32 + ii] = src[(i + ii) * lda + k];
            }
        }
        dest += K * 32;
    }
}

The main amx_sgemm function orchestrates the overall computation, employing a cache blocking strategy to optimize performance. It defines block sizes MC, KC, and NC (all set to 384 in this implementation) to ensure that the working set fits within the processor's cache hierarchy. The function allocates aligned memory for packed versions of A and B, ensuring optimal memory access patterns for the AMX instructions.

void amx_sgemm(float *A, float *B, float *C, const uint64_t size) {
    const uint64_t M = size, N = size, K = size;
    const uint64_t MC = 384, KC = 384, NC = 384;  // Cache blocking parameters

    float *packed_A = (float *)aligned_alloc(64, MC * KC * sizeof(float));
    float *packed_B = (float *)aligned_alloc(64, KC * NC * sizeof(float));

The computation is then structured as nested loops over blocks of the matrices, with the innermost loop invoking the micro-kernel.

for (uint64_t mc = 0; mc < M; mc += MC) {
    uint64_t mc_end = MIN(mc + MC, M);
    for (uint64_t nc = 0; nc < N; nc += NC) {
        uint64_t nc_end = MIN(nc + NC, N);

        // Pack B panel
        pack_matrix_b(packed_B, B + nc, K, nc_end - nc, N);

        for (uint64_t kc = 0; kc < K; kc += KC) {
            uint64_t kc_end = MIN(kc + KC, K);

            // Pack A panel
            pack_matrix_a(packed_A, A + mc * K + kc, mc_end - mc, kc_end - kc, K);

            // Micro-kernel calls
            for (uint64_t m = 0; m < mc_end - mc; m += 32) {
                for (uint64_t n = 0; n < nc_end - nc; n += 32) {
                    amx_gemm_micro_kernel(
                        packed_A + m * (kc_end - kc),
                        packed_B + n * (kc_end - kc),
                        C + (mc + m) * N + (nc + n),
                        kc_end - kc, 32, 32, N
                    );
                }
            }
        }
    }
}

Finally, we arrive at the complete implementation.

#pragma once

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include "amx.h"

// AMX wrapper functions
static inline void amx_ldz(const uint8_t *addr, uint64_t offset, uint64_t mode)
{
    AMX_LDZ(((mode & 1ull) << 62) |
            ((offset & ((1ull << 6) - 1)) << 56) |
            (((uint64_t)addr & ((1ull << 56) - 1))));
}

static inline void amx_stz(uint8_t *addr, uint64_t offset, uint64_t mode)
{
    AMX_STZ(((mode & 1ull) << 62) |
            ((offset & ((1ull << 6) - 1)) << 56) |
            (((uint64_t)addr & ((1ull << 56) - 1))));
}

static inline void amx_ldx(uint8_t *addr, uint64_t offset, uint64_t mode)
{
    AMX_LDX(((mode & 1ull) << 62) |
            ((offset & ((1ull << 3) - 1)) << 56) |
            (((uint64_t)addr & ((1ull << 56) - 1))));
}

static inline void amx_ldy(uint8_t *addr, uint64_t offset, uint64_t mode)
{
    AMX_LDY(((mode & 1ull) << 62) |
            ((offset & ((1ull << 3) - 1)) << 56) |
            (((uint64_t)addr & ((1ull << 56) - 1))));
}

static inline void amx_stx(uint8_t *addr, uint64_t offset, uint64_t mode)
{
    AMX_STX(((mode & 1ull) << 62) |
            ((offset & ((1ull << 3) - 1)) << 56) |
            (((uint64_t)addr & ((1ull << 56) - 1))));
}

static inline void amx_sty(uint8_t *addr, uint64_t offset, uint64_t mode)
{
    AMX_STY(((mode & 1ull) << 62) |
            ((offset & ((1ull << 3) - 1)) << 56) |
            (((uint64_t)addr & ((1ull << 56) - 1))));
}

static inline void amx_fma32(uint64_t xoffset, uint64_t yoffset, uint64_t zoffset, uint64_t zignore)
{
    AMX_FMA32((yoffset << 6) |
              (xoffset << 6 << 10) |
              (zoffset << 20) |
              (zignore << 27));
}

#define MIN(a,b) ((a) < (b) ? (a) : (b))

// Micro-kernel for 32x32 tiles
static void amx_gemm_micro_kernel(const float *A, const float *B, float *C, 
                                  uint64_t K, uint64_t lda, uint64_t ldb, uint64_t ldc) {
    // Load C tile
    for (int i = 0; i < 4; i++) {
        amx_ldz((uint8_t *)(C + i * 16 * ldc), i << 2, 1);
    }

    for (uint64_t k = 0; k < K; k++) {
        // Load A column (32x1)
        amx_ldy((uint8_t *)(A + k), 0, 1);

        // Load B row (1x32)
        amx_ldx((uint8_t *)(B + k * ldb), 0, 1);

        // Perform FMA for 32x32 tile
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                amx_fma32(j, i, i * 2 + j, 0);
            }
        }
    }

    // Store C tile
    for (int i = 0; i < 4; i++) {
        amx_stz((uint8_t *)(C + i * 16 * ldc), i << 2, 1);
    }
}

static void pack_matrix_a(float *dest, const float *src, uint64_t M, uint64_t K, uint64_t lda) {
    for (uint64_t i = 0; i < M; i += 32) {
        for (uint64_t k = 0; k < K; k++) {
            for (uint64_t ii = 0; ii < 32 && i + ii < M; ii++) {
                dest[k * 32 + ii] = src[(i + ii) * lda + k];
            }
        }
        dest += K * 32;
    }
}

static void pack_matrix_b(float *dest, const float *src, uint64_t K, uint64_t N, uint64_t ldb) {
    for (uint64_t j = 0; j < N; j += 32) {
        for (uint64_t k = 0; k < K; k++) {
            for (uint64_t jj = 0; jj < 32 && j + jj < N; jj++) {
                dest[k * 32 + jj] = src[k * ldb + (j + jj)];
            }
        }
        dest += K * 32;
    }
}

void amx_sgemm(float *A, float *B, float *C, const uint64_t size) {
    const uint64_t M = size, N = size, K = size;
    const uint64_t MC = 384, KC = 384, NC = 384;  // Cache blocking parameters

    // Allocate packed buffers
    float *packed_A = (float *)aligned_alloc(64, MC * KC * sizeof(float));
    float *packed_B = (float *)aligned_alloc(64, KC * NC * sizeof(float));

    if (!packed_A || !packed_B) {
        free(packed_A);
        free(packed_B);
        return;
    }

    AMX_START();

    for (uint64_t mc = 0; mc < M; mc += MC) {
        uint64_t mc_end = MIN(mc + MC, M);
        for (uint64_t nc = 0; nc < N; nc += NC) {
            uint64_t nc_end = MIN(nc + NC, N);

            // Pack B panel
            pack_matrix_b(packed_B, B + nc, K, nc_end - nc, N);

            for (uint64_t kc = 0; kc < K; kc += KC) {
                uint64_t kc_end = MIN(kc + KC, K);

                // Pack A panel
                pack_matrix_a(packed_A, A + mc * K + kc, mc_end - mc, kc_end - kc, K);

                // Micro-kernel calls
                for (uint64_t m = 0; m < mc_end - mc; m += 32) {
                    for (uint64_t n = 0; n < nc_end - nc; n += 32) {
                        amx_gemm_micro_kernel(
                            packed_A + m * (kc_end - kc),
                            packed_B + n * (kc_end - kc),
                            C + (mc + m) * N + (nc + n),
                            kc_end - kc, 32, 32, N
                        );
                    }
                }
            }
        }
    }

    AMX_STOP();

    free(packed_A);
    free(packed_B);
}
N       GFLOP/s     mean(rt, s)   min(rt, s)   max(rt, s)
64      636.304     0.00000       0.00000      0.00000
128     1006.633    0.00000       0.00000      0.00000
256     1182.901    0.00003       0.00003      0.00003
512     2178.736    0.00014       0.00012      0.00024
1024    2365.431    0.00090       0.00083      0.00108
2048    1737.817    0.01026       0.00990      0.01087
4096    1654.705    0.08087       0.07996      0.08243
8192    1566.844    0.71723       0.68770      0.72344

At the smallest matrix size (N = 64), the GFLOP/s is 636.304, but the response times (mean, min, max) are effectively zero, indicating that the operations are completed extremely quickly, likely within the hardware’s noise level or precision limits of measurement. As the matrix size increases, the GFLOP/s gradually increases, reaching a peak of 2365.431 GFLOP/s at N = 1024. This shows that the AMX hardware is optimized to handle matrix sizes around 1024, providing the highest throughput at this size. However, as the matrix size continues to increase beyond 1024, the GFLOP/s begins to decline, likely due to limitations in memory bandwidth or cache usage that cannot keep pace with the increased demand for data.

The response times also tell a story of how performance scales with matrix size. For small matrices, the response time is negligible, but it becomes progressively more significant as matrix size increases. For example, at N = 2048, the mean response time jumps to 0.01026 seconds, and at N = 4096, the response time escalates further to 0.08087 seconds. The trend continues, with N = 8192 showing the highest response time at 0.71723 seconds.

Interacting with Apple Accelerate

Apple’s Accelerate is a library for high-performance yet energy-efficient computation on the CPU, leveraging its vector-processing capability. The library contains functions for:

  • Image processing, such as converting between formats and image manipulation (vImage)

  • Low-level routines for accelerating common linear algebra operations on matrices and vectors (BLAS)

  • Routines for building and running neural networks, for both training and inference on the CPU (BNNS)

  • Digital signal processing routines for audio and other signals, including operations common in image processing (vDSP)

  • Routines for solving higher-level linear algebra problems such as systems of linear equations and eigenvalue problems (LAPACK)

In more common HPC circles, this would be equivalent to something like cuBLAS or cuSOLVER from NVIDIA. We can use the library by including <Accelerate/Accelerate.h>.

#pragma once
#include <Accelerate/Accelerate.h>
#include <vector>
#include <algorithm>
#include <cstddef>      // size_t
#include <type_traits>  // std::is_same_v

template <typename T>
class AccelerateGEMM : public GEMM<T>
{
private:
    size_t n;
    std::vector<T> a, b, c;

public:
    explicit AccelerateGEMM(size_t n) : n(n), a(n * n), b(n * n), c(n * n) {}

    void run() override
    {
        if constexpr (std::is_same_v<T, float>) {
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0f,
                        a.data(), n, b.data(), n, 0.0f, c.data(), n);
        } else if constexpr (std::is_same_v<T, double>) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0,
                        a.data(), n, b.data(), n, 0.0, c.data(), n);
        }
    }

    void init_matrices() override
    {
        std::fill(a.begin(), a.end(), T(1));
        std::fill(b.begin(), b.end(), T(1));
        std::fill(c.begin(), c.end(), T(0));
    }
};
N       GFLOP/s     mean(rt, s)   min(rt, s)   max(rt, s)
64      699.051     0.00000       0.00000      0.00000
128     1037.937    0.00000       0.00000      0.00000
256     1233.211    0.00003       0.00003      0.00003
512     2274.877    0.00014       0.00012      0.00023
1024    2483.596    0.00088       0.00086      0.00093
2048    1787.963    0.01019       0.00961      0.01196
4096    1727.521    0.08243       0.07956      0.11510
8192    1628.371    0.70349       0.67522      0.77956

The Accelerate framework vastly outperforms OpenBLAS for smaller matrix sizes. For instance, at N=64, Accelerate achieves 699 GFLOP/s, whereas OpenBLAS only achieves 79.6 GFLOP/s. The difference is consistent up to N=512, where Accelerate records 2274 GFLOP/s compared to 371 GFLOP/s from OpenBLAS. The response times (mean, min, max) for Accelerate are near-zero for smaller matrix sizes, showing that AMX instructions are highly efficient in handling small matrix operations. In contrast, OpenBLAS has slightly larger response times, especially for N=128 and beyond.

At larger matrix sizes, the performance gap narrows somewhat, but Accelerate continues to outperform OpenBLAS. For instance, at N=1024, Accelerate reaches 2483 GFLOP/s, whereas OpenBLAS caps at 423 GFLOP/s. Response times for Accelerate remain competitive but grow slightly as the matrix sizes increase. OpenBLAS exhibits much longer response times, especially at large matrix sizes (e.g., N=4096 and N=8192), where its mean response times are several times longer than Accelerate’s.

At the largest matrix size tested, N=8192, the performance difference shrinks further. Accelerate achieves 1628 GFLOP/s, while OpenBLAS achieves 574 GFLOP/s. This suggests that the performance efficiency of both libraries stabilizes at larger matrix sizes, though Accelerate is still around 3x faster. OpenBLAS shows significant increases in response times at these large matrix sizes, particularly with mean response times exceeding 2.2 seconds, while Accelerate's mean response time is around 0.7 seconds.

Why did Apple build AMX?

The fact of the matter is that Apple was waiting for SVE to mature before fully making the jump to adopting it, and AMX was simply a stopgap to address the shortcomings of Arm NEON.

Many have critiqued NEON for various quirks, but one of the more obscure yet important critiques concerns NEON’s lack of generic shuffle instructions like those found in x86 SSE and AVX. In AVX, shuffle operations are typically low-latency, single-cycle instructions. For instance, the VSHUFPS instruction, which _mm256_shuffle_ps maps to, has a latency of 1 cycle and a throughput of 1 instruction per cycle on many Intel architectures.

#include <immintrin.h>

__m128 shuffle_sse(__m128 a, __m128 b) {
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 2, 0));
}
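
The 256-bit intrinsic mentioned above behaves the same way, applying the 4-element control pattern independently within each 128-bit lane:

#include <immintrin.h>

__m256 shuffle_avx(__m256 a, __m256 b) {
    // VSHUFPS on 256-bit registers: the same pattern is applied to both lanes.
    return _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 2, 0));
}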

But many who had experience with porting AVX code to NEON (especially with shuffle instructions) found that it often involves reimagining the data flow of the algorithm. Instead of relying on flexible, single-instruction shuffles, NEON code might use a combination of vector extracts, reverses, and interleaving operations to achieve the desired data arrangement.

The most straightforward way to do this in NEON might involve using a combination of vgetq_lane_f32 to extract individual lanes, vsetq_lane_f32 to set lanes, and vcopyq_lane_f32 to move data between vectors. For broadcasting a single lane to all elements, vdupq_lane_f32 can be used. However, this lane-by-lane approach can lead to suboptimal performance on ARM hardware.

#include <arm_neon.h>

float32x4_t shuffle_neon(float32x4_t a, float32x4_t b) {
    float32x2_t low_a  = vget_low_f32(a);   // {a0, a1}
    float32x2_t high_a = vget_high_f32(a);  // {a2, a3}
    float32x2_t high_b = vget_high_f32(b);  // {b2, b3}

    // Build {a0, a2} by overwriting lane 1 of low_a with lane 0 of high_a
    float32x2_t shuffled_a = vset_lane_f32(vget_lane_f32(high_a, 0), low_a, 1);
    // {a0, a2, b2, b3}
    float32x4_t result = vcombine_f32(shuffled_a, high_b);
    // Patch lane 2 with b1 to match _MM_SHUFFLE(3, 1, 2, 0): {a0, a2, b1, b3}
    result = vsetq_lane_f32(vgetq_lane_f32(b, 1), result, 2);

    return result;
}

To achieve better performance with NEON, programmers can leverage some instructions that can operate on multiple lanes simultaneously. One such instruction is vextq_f32, which allows extraction of components from two vectors, combining them based on a specified starting index. NEON also provides a family of reverse instructions, including REV16, REV32, and REV64, which can efficiently reverse the order of elements within specific portions of a vector. While each of these instructions typically has a 2 cycle latency, they can potentially replace multiple lane-specific operations, leading to overall better performance.

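A sketch of what such a rewrite can look like for the shuffle above. Here I reach for the UZP1/UZP2 whole-vector permutes (my choice for this particular pattern; EXT- and REV-based sequences suit other arrangements) to produce the same {a0, a2, b1, b3} result without touching individual lanes:

#include <arm_neon.h>

float32x4_t shuffle_neon_permute(float32x4_t a, float32x4_t b) {
    float32x4_t even_a = vuzp1q_f32(a, a);  // {a0, a2, a0, a2}
    float32x4_t odd_b  = vuzp2q_f32(b, b);  // {b1, b3, b1, b3}
    // Combine the low halves: {a0, a2, b1, b3}
    return vcombine_f32(vget_low_f32(even_a), vget_low_f32(odd_b));
}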

With Apple's AMX, there exists an instruction called genlut, which has two primary modes: lookup and generate. In lookup mode, it can perform operations similar to a combination of x86's vpshufb and vgatherdps instructions. This allows for complex data rearrangement and gathering operations within a single instruction. The lookup mode supports various data types and lane counts, ranging from 8-bit to 64-bit elements, with lane counts varying from 8 to 64 depending on the data size.

The generate mode of Apple AMX's genlut instruction functions as an inverse operation to vpshufb, primarily used for quantization. In this mode, genlut reads a full 512-bit (64-byte) vector of data from the source and produces a packed vector of indices. The process involves searching the table register for each lane in the source vector. It finds the minimum value 'v' such that table[v] > source_lane, and then writes the index value 'v - 1', or '-1' if no such 'v' is found.

This operation supports arbitrary 2, 3, 4, or 5-bit quantization, depending on the mode selected. Modes 0, 2, 3, and 5 perform 4-bit quantization, allowing for 16 levels, while modes 1, 4, and 6 use 5-bit quantization, providing 32 levels. The resulting quantized indices are densely packed into the low bytes of an X or Y register, with the remaining bytes cleared to zero. This packed result can be used directly in subsequent lookup operations, enabling efficient piecewise linear approximations of complex functions.
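
As a rough scalar model of the generate mode described above (my own illustration, not Apple's specification), each source lane is quantized against the table like this:

#include <stdint.h>

// Find the smallest v with table[v] > x and emit v - 1, or -1 if no table
// entry exceeds x. levels is 16 for the 4-bit modes and 32 for the 5-bit modes.
static int8_t genlut_generate_lane(float x, const float *table, int levels)
{
    for (int v = 0; v < levels; v++) {
        if (table[v] > x)
            return (int8_t)(v - 1);
    }
    return -1;
}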

The Future of AMX

Before Armv9 was even on the horizon, the use of AMX was itself controversial. ARM typically does not allow custom ISA extensions, and the main legal concern is that while developers can use AMX, ARM might take issue if non-Apple entities start using it in production, potentially leading to ARM restricting future chip functionality.

AMX also introduces a new EL0 state, requiring changes to the ARM64 kernel’s context-switching mechanisms to handle additional register states. However, integrating these changes directly into the core ARM64 architecture code is complex and may not be welcomed upstream due to the proprietary nature of AMX and potential objections from ARM.

With the introduction of the Apple M4 chip, Apple has finally adopted the Armv9 ISA, which also brings SVE support to Apple platforms.