
Messing Around with GPUs Again

Because I'm bored, again


Cover Illustration by atomic_arctic

While writing matrix multiplication kernels for GPUs is kind of overplayed, matmuls still represent one of the most important computations for modern AI workloads, making up the vast majority of FLOPS during both training and inference of deep learning models. Even if your kernels don’t get near cuBLAS performance, writing them will still teach you a lot about low-level GPU architecture and its performance characteristics.

NVIDIA introduced Tensor Cores starting with the Volta generation of GPUs to accelerate AI inference and training, which typically involve large-scale matrix multiplication or matrix-multiplication-like workloads. CUDA Cores alone have limited computing power for such workloads, so matrix multiplication, a typically compute-intensive workload, would leave much of the memory bandwidth wasted. The addition of Tensor Cores allows GPUs to keep their several hundred GB/s of memory bandwidth fed during matrix multiplication. Tensor Cores were later added to the RTX series of graphics cards, as AI workloads became more important for prosumers and gamers (technologies like DLSS and frame generation).

Each Streaming Multiprocessor (SM) in the Hopper architecture is a powerful compute unit containing both traditional CUDA cores and specialized Tensor Cores. In the full GH100 “Hopper” GPU (which powers the H100), there are 144 SMs arranged across 8 GPCs (Graphics Processing Clusters), with 2 SMs per TPC (Texture Processing Cluster). The H100 SXM5 variant has 132 SMs enabled (the PCIe version has 114). Within each SM, there are 128 FP32 CUDA cores and four 4th-generation Tensor Cores. The CUDA cores handle standard scalar/thread-parallel operations (integer and floating-point ALU tasks), while the Tensor Cores are dedicated matrix-math units designed for massively parallel multiply-accumulate operations on matrices. These Tensor Cores are tightly integrated into the SM’s datapath and scheduling fabric, allowing warp-level matrix operations to execute alongside normal instructions.

Hopper Tensor Cores support a broad range of numerical formats including FP64, TF32, BF16, FP16, INT8, and most notably FP8, a new 8-bit format (E4M3 and E5M2) tailored for AI workloads. These data types operate under mixed-precision accumulation, e.g., FP8 inputs accumulating in FP16 or FP32, optimizing both performance and accuracy. Hybrid-precision schemes are not new (Micikevicius et al. (2022) use E4M3 in Fprop and E5M2 in Dgrad and Wgrad), but newer models like DeepSeek adopt the E4M3 format universally. By operating on smaller element groups, their methodology effectively shares exponent bits among grouped elements, mitigating the impact of limited dynamic range.

Each Tensor Core executes matrix multiplications using warp-level HMMA instructions like HMMA.16x8x16 or HMMA.8x8x8, denoting matrix tile sizes and operand types. In Hopper, an SM can issue one such operation per cycle per Tensor Core, with four Tensor Cores per SM. Pipelining and warp scheduling allow latency hiding and high occupancy, especially when paired with TMA’s low-latency data loads and async barriers. This design minimizes idle cycles and register pressure while sustaining high throughput.

Programming Tensor Cores

When you're writing Tensor Core code, you're really working with three different levels at once. The first layer is the C++ API that wraps everything in a clean abstraction, the next is the PTX instructions that give you precise control over what the hardware does, and then there's the actual SASS machine code that runs on the silicon.

At the top level, NVIDIA gives us the WMMA (Warp Matrix Multiply Accumulate) API in CUDA C++. It's defined in mma.h under your CUDA include directory (typically CUDA/v12.0/include/crt/mma.h). The basic building block here is something called a "fragment" - think of it as a chunk of a matrix that maps nicely to what the Tensor Cores can process. Each fragment represents a warp-collective operation where all 32 threads in a warp work together, each holding specific matrix elements in their registers.

// First we declare our fragments - these map to registers in a warp
nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, 16,16,16, half, nvcuda::wmma::row_major> frag_a;
nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, 16,16,16, half, nvcuda::wmma::row_major> frag_b;
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16,16,16, half> frag_c;

// Initialize the accumulator fragment to zero
nvcuda::wmma::fill_fragment(frag_c, 0.0f);

// Load matrix data from memory into fragments
nvcuda::wmma::load_matrix_sync(frag_a, ptrA, strideA);
nvcuda::wmma::load_matrix_sync(frag_b, ptrB, strideB);

// Do the matrix multiply-accumulate
nvcuda::wmma::mma_sync(frag_c, frag_a, frag_b, frag_c);

// Store results back to memory
nvcuda::wmma::store_matrix_sync(ptrC, frag_c, strideC, nvcuda::wmma::mem_row_major);

Those template parameters tell the compiler exactly what size of matrix operation we want. On Hopper, we can do 16x16x16 for FP16/BF16, 16x8x32 for the new FP8 format, and 16x8x8 for TF32, among several other combinations. The available sizes have evolved: Volta introduced 16x16x16, Turing added 16x8x8, Ampere brought 16x8x16 with sparsity support, and Hopper expanded to include FP8 configurations.

When we compile that C++ code, the journey begins with NVCC. First, your .cu file gets preprocessed using cl.exe on Windows or gcc on Linux, expanding macros and handling includes, creating a .cpp4.ii file. Then cudafe++ analyzes this preprocessed code to separate device and host portions, generating .cudafe1.cpp for device code. This separation is crucial - host code will be compiled by your normal C++ compiler, while device code needs special handling.

Now we enter the PTX phase. PTX is NVIDIA's intermediate representation - kind of like GPU assembly but not quite machine code yet. For those familiar with LLVM, PTX shares similarities with LLVM's Intermediate Representation (IR). While LLVM's project scope has expanded beyond its original "virtual machine" naming, its core concept of IR remains analogous to PTX. IR acts as a bridge between frontend programming languages and backend machine code, simplifying support for new languages and hardware targets while enabling cross-platform optimizations. PTX serves as NVIDIA's "CUDA IR," connecting high-level CUDA C++ code with low-level GPU SASS instructions. This abstraction allows NVIDIA to implement runtime optimizations via tools like NVRTC and generate device-agnostic code.

The cicc compiler takes our separated device code and generates PTX, using virtual architecture flags (like -arch compute_70) to determine what features are available. This creates a .ptx file containing instructions that are still somewhat readable but much closer to the hardware:

// FP8 (new in Hopper!)
mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 d, a, b, c;  // 4-bit exp, 3-bit mantissa
mma.sync.aligned.m16n8k32.row.col.f32.e5m2.e5m2.f32 d, a, b, c;  // 5-bit exp, 2-bit mantissa

// FP16
mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 d, a, b, c;

// TF32
mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 d, a, b, c;

Each part of these PTX instructions has meaning: 'sync' means all threads participate, 'aligned' indicates memory alignment requirements, the shape (m16n8k16) defines matrix dimensions, and the data types specify the precision for each operand.

The ldmatrix instruction is fascinating - it's way more flexible than the old wmma.load. Instead of being locked into loading data with a fixed stride, we can specify exactly where each group of 4 threads should load from. This is crucial for avoiding shared memory bank conflicts:

ldmatrix.sync.aligned.x4.m8n8.shared.b16 {d0, d1, d2, d3}, [addr];

Though CUDA developers may not directly interact with PTX, it plays a critical role under the hood. When compiling CUDA code with NVCC, .ptx files are generated during the device code compilation phase. These files represent the optimized intermediate code before final translation to SASS (the GPU's native instruction set). PTX also enables the magic of forward compatibility - code compiled today can run on future GPUs through JIT compilation.

The next step is where ptxas comes in. This assembler takes PTX and generates actual machine code (SASS) for specific GPU architectures, using physical architecture flags like -arch=sm_90. The output is a .cubin file containing binary code that can run directly on the target GPU.

Finally, we get to what actually runs on the hardware: SASS instructions. On Hopper (compute capability 9.0), there are specialized HMMA (Half-precision Matrix Multiply-Accumulate) instructions for each data format:

HMMA.1616.F16  // for FP16/BF16
HMMA.1632.F8   // for the new FP8 format
HMMA.168.F32   // for TF32
LDSM.16.M88.4  // Load from Shared Memory for matrix operations

What's cool is each encoding is tuned for its specific format. The FP8 instruction (HMMA.1632.F8) can process twice as much data per instruction because the numbers are half the size. This is how Hopper achieves that 4x speedup for FP8 operations compared to FP16 on previous generations.

The journey doesn't end with a single cubin file. fatbinary packages together multiple architecture versions (like sm_70, sm_80, sm_90 cubins) along with PTX code for forward compatibility, creating a .fatbin file. This gets embedded into your host executable, allowing the CUDA runtime to select the best version at runtime based on the actual GPU present. The host code, meanwhile, has been compiled by your regular C++ compiler and linked with CUDA runtime libraries. The final executable contains both host code and the embedded fatbinary, ready to run on a variety of GPU architectures.

For the fp16 type, the matrix load and matrix multiplication operations mentioned above are compiled into LDSM instructions and HMMA instructions. These SASS instructions directly map to the physical Tensor Core hardware units, with each SM containing multiple Tensor Cores that can execute these specialized matrix operations in parallel.

Implementations

Naive Kernel

The baseline implementation uses the WMMA C++ API for Tensor Cores, but with alternative tiling parameters BM = 64, BN = 128, BK = 64, and 256 threads per block. This configuration provides different memory access patterns and computational balance compared to the more common BM = 128, BN = 256, BK = 32 approach.

__global__ void hgemm_naive(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    const int M, const int N, const int K) {
    const int BM = 64;
    const int BN = 128; 
    const int BK = 64;

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tid = threadIdx.x;
    int wid = tid >> 5;

    const int APAD = 8;
    const int BPAD = 8;

    __shared__ half s_a[BM][BK + APAD];
    __shared__ half s_b[BK][BN + BPAD];

    // Using a 3x2 grid of fragments per warp instead of 4x4
    // This maps differently to the output matrix based on the new tiling dimensions
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> frag_a[2][3];
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> frag_b[2][2];
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> frag_c[3][2];

    // Initialize accumulator fragments to zero
    #pragma unroll
    for (int i = 0; i < 3; i++) {
        #pragma unroll
        for (int j = 0; j < 2; j++) {
            wmma::fill_fragment(frag_c[i][j], 0.0);
        }
    }

    // Thread mapping with BM=64, we need different load calculations
    int load_a_smem_m = (tid % 16) * 4; // Interleaved row assignment
    int load_a_smem_k = (tid / 16) * 4; // Distribute across K dimension
    int load_b_smem_k = (tid / 32) * 8; // Different mapping for larger BK
    int load_b_smem_n = (tid % 32) * 4; // Distribute within a warp across N

    int load_a_gmem_m = by * BM + load_a_smem_m;
    int load_b_gmem_n = bx * BN + load_b_smem_n;

    int load_a_gmem_addr = load_a_gmem_m * K + load_a_smem_k;
    int load_b_gmem_addr = load_b_smem_k * N + load_b_gmem_n;

    // Modified warp assignment for 3x2 grid
    // Each warp computes a 48x32 output tile (vs 64x64 in the original)
    int comp_c_frag_m = (wid / 2) * 3; // Maps to beginning of vertical fragment group
    int comp_c_frag_n = (wid % 2) * 2; // Maps to beginning of horizontal fragment group

    for (int bk = 0; bk < K / BK; bk++) {
        // Load A tile into shared memory (FLOAT4 moves 16 bytes = 8 halves per thread)
        if (load_a_gmem_m < M && load_a_smem_k < BK) {
            FLOAT4(s_a[load_a_smem_m][load_a_smem_k]) = 
                FLOAT4(a[load_a_gmem_addr]);
        }

        // Load B tile into shared memory (FLOAT4 moves 16 bytes = 8 halves per thread)
        if (load_b_smem_k < BK && load_b_gmem_n < N) {
            FLOAT4(s_b[load_b_smem_k][load_b_smem_n]) = 
                FLOAT4(b[load_b_gmem_addr]);
        }

        load_a_gmem_addr += BK;
        load_b_gmem_addr += BK * N;

        __syncthreads();

        // Load matrix data from shared memory into Tensor Core fragments
        // 3x2 fragment grid requires different addressing pattern
        #pragma unroll
        for (int i = 0; i < 3; i++) {
            wmma::load_matrix_sync(frag_a[0][i], 
                &s_a[(comp_c_frag_m + i) * 16][0], BK + APAD);
            wmma::load_matrix_sync(frag_a[1][i], 
                &s_a[(comp_c_frag_m + i) * 16][32], BK + APAD);
        }

        #pragma unroll
        for (int i = 0; i < 2; i++) {
            wmma::load_matrix_sync(frag_b[0][i], 
                &s_b[0][(comp_c_frag_n + i) * 16], BN + BPAD);
            wmma::load_matrix_sync(frag_b[1][i], 
                &s_b[32][(comp_c_frag_n + i) * 16], BN + BPAD);
        }

        // Execute matrix multiply-accumulate operations
        #pragma unroll
        for (int i = 0; i < 3; i++) {
            #pragma unroll
            for (int j = 0; j < 2; j++) {
                wmma::mma_sync(frag_c[i][j], frag_a[0][i], frag_b[0][j], frag_c[i][j]);
                wmma::mma_sync(frag_c[i][j], frag_a[1][i], frag_b[1][j], frag_c[i][j]);
            }
        }

        __syncthreads();
    }

    // Store results back to global memory
    int store_c_gmem_m = by * BM + (comp_c_frag_m * 16);
    int store_c_gmem_n = bx * BN + (comp_c_frag_n * 16);

    #pragma unroll
    for (int i = 0; i < 3; i++) {
        #pragma unroll
        for (int j = 0; j < 2; j++) {
            if (store_c_gmem_m + i * 16 < M && store_c_gmem_n + j * 16 < N) {
                wmma::store_matrix_sync(
                    &c[(store_c_gmem_m + i * 16) * N + (store_c_gmem_n + j * 16)], 
                    frag_c[i][j], N, wmma::mem_row_major);
            }
        }
    }
}

The kernel implements a 3×2 fragment grid per warp instead of the traditional 4×4, resulting in 48×32 output tiles per warp. It also uses interleaved row assignment for matrix A to better balance the workload across threads, and includes bounds checking to handle edge cases, which matters for matrices that aren't neatly divisible by the tile dimensions. Finally, it implements a different thread-to-data mapping to accommodate the modified tile dimensions.

The increased BK dimension (64 vs 32) reduces the number of iterations in the main loop, potentially increasing computational intensity at the expense of requiring more shared memory per iteration. This tradeoff can be beneficial depending on the matrix dimensions and hardware characteristics.

Asynchronous Copy (cp.async) and Pointer Conversions

The first significant optimization introduces asynchronous memory copy operations using PTX inline assembly. The key addition here is the use of __cvta_generic_to_shared() to convert generic pointers to shared memory addresses required by the PTX inline assembly.

This conversion is a critical step because the generic pointer to shared memory is 8 bytes wide, while cp.async expects a 32-bit address in the shared state space; using the 8-byte generic value directly as a shared memory address may fall outside the shared memory address range.

__global__ void hgemm_async(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    const int M, const int N, const int K) {
    // Initial setup and declarations, same as naive kernel
    // ...
    // OFFSET(row, col, ld) expands to ((row) * (ld) + (col));
    // __cvta_generic_to_shared converts a generic pointer into a
    // shared-state-space address that cp.async can consume
    int s_a_base_addr = __cvta_generic_to_shared(s_a[0]);
    int s_b_base_addr = __cvta_generic_to_shared(s_b[0]);
    int load_a_smem_addr_0 = s_a_base_addr + OFFSET(load_a_smem_m, load_a_smem_k, BK + APAD) * sizeof(half);
    int load_a_smem_addr_1 = load_a_smem_addr_0 + (BK + APAD) * sizeof(half);
    int load_b_smem_addr_0 = s_b_base_addr + OFFSET(load_b_smem_k, load_b_smem_n, BN + BPAD) * sizeof(half);
    int load_b_smem_addr_1 = load_b_smem_addr_0 + (BN + BPAD) * sizeof(half);
    int load_b_smem_addr_2 = load_b_smem_addr_0 + 2 * (BN + BPAD) * sizeof(half);
    int load_b_smem_addr_3 = load_b_smem_addr_0 + 3 * (BN + BPAD) * sizeof(half);

    for (int bk = 0; bk < K / BK; bk++) {
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_a_smem_addr_0), "l"(&a[load_a_gmem_addr        ]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_a_smem_addr_1), "l"(&a[load_a_gmem_addr +     K]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_0), "l"(&b[load_b_gmem_addr        ]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_1), "l"(&b[load_b_gmem_addr +     N]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_2), "l"(&b[load_b_gmem_addr + 2 * N]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_3), "l"(&b[load_b_gmem_addr + 3 * N]));

        load_a_gmem_addr += BK;
        load_b_gmem_addr += BK * N;

        asm ("cp.async.commit_group;\n" ::);
        asm ("cp.async.wait_group 0;\n" ::);

        __syncthreads();

        // Matrix loading and multiplication code remains the same as last one
        // ...
    }
}

Instead of using the FLOAT4 macro to load data, we now use the cp.async.ca.shared.global PTX instruction. This instruction initiates asynchronous memory transfers from global memory to shared memory. The transfers are grouped with cp.async.commit_group and then we wait for completion with cp.async.wait_group 0.

The implementation of asynchronous memory transfers alters the data movement pipeline in the GPU by leveraging specialized hardware capabilities available in Ampere and later architectures. The cp.async.ca.shared.global PTX instruction establishes a direct hardware path between global memory and shared memory, bypassing the register file and eliminating the need for temporary register storage. This dedicated hardware path, introduced in Ampere, reduces the memory latency and power consumption associated with transfers by removing the intermediate register footprint and allocation overhead.

The async copy operations introduce a relaxed memory coherence model managed through explicit synchronization primitives. The cp.async.commit_group instruction bundles issued memory operations into a transaction group that can be collectively tracked. The cp.async.wait_group 0 instruction represents a strict synchronization point, forcing execution to stall until the most recently committed group completes. This waiting mechanism ensures correct access ordering but doesn't permit significant asynchronous overlap in this implementation.

Each thread issues fixed-width 16-byte (8 half-precision elements) transfers through the async copy instructions. With 256 threads, a single round of these instructions moves 4 KB per block. The memory coalescing hardware combines multiple adjacent 16-byte requests into larger memory transactions when possible, reducing the number of actual DRAM operations required.

The "ca" (cache all) hint in cp.async.ca.shared.global indicates that data should be cached at all applicable levels of the memory hierarchy. This maximizes temporal locality benefits for this data pattern which features repeated access to the same memory regions across thread blocks.

Double Buffering

The next optimization implements double buffering to more effectively overlap computation with memory transfers. Double buffering represents a classical software pipelining technique that dramatically improves the overlap between computation and memory operations. By maintaining two buffers in shared memory for each input matrix, the kernel can simultaneously compute using data from one buffer while loading the next iteration's data into the alternate buffer. This effectively creates a two-stage pipeline where memory operations and computations occur in parallel.

The core idea is to use two separate buffers for the input matrices, alternating between them for loading and computing. The key change here is the use of dynamic shared memory with extern __shared__ half smem[];. We partition this memory to create two buffers each for matrices A and B. The offset variables s_a_db_offset and s_b_db_offset will be used to select the appropriate buffer.

__global__ void hgemm_doublebuff(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    const int M, const int N, const int K) {
    // similar initial setup as last one
    // ...
    extern __shared__ half smem[];
    half *s_a = smem;
    half *s_b = smem + 2 * BM * (BK + APAD);
    int s_a_db_offset = BM * (BK + APAD);
    int s_b_db_offset = BK * (BN + BPAD);

    // initial load into first buffer
    {
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_a_smem_addr_0), "l"(&a[load_a_gmem_addr        ]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_a_smem_addr_1), "l"(&a[load_a_gmem_addr +     K]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_0), "l"(&b[load_b_gmem_addr        ]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_1), "l"(&b[load_b_gmem_addr +     N]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_2), "l"(&b[load_b_gmem_addr + 2 * N]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_b_smem_addr_3), "l"(&b[load_b_gmem_addr + 3 * N]));

        asm ("cp.async.commit_group;\n" ::);
        asm ("cp.async.wait_group 0;\n" ::);
        __syncthreads();
    }

    for (int bk = 1; bk < K / BK; bk++) {
        int smem_sel = (bk & 1) ^ 1;
        int smem_sel_next = ((bk - 1) & 1) ^ 1;

        load_a_gmem_addr += BK;
        load_b_gmem_addr += BK * N;

        // load next iteration's data into alternate buffer
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_a_smem_addr_0 + smem_sel_next * s_a_db_offset * (int)sizeof(half)), "l"(&a[load_a_gmem_addr        ]));
        asm ("cp.async.ca.shared.global [%0], [%1], 16;\n" :
            : "r"(load_a_smem_addr_1 + smem_sel_next * s_a_db_offset * (int)sizeof(half)), "l"(&a[load_a_gmem_addr +     K]));
        // ... more async copy instructions for the next buffer

        // compute using the current buffer
        wmma::load_matrix_sync(frag_a[0][0], &s_a[smem_sel * s_a_db_offset + (comp_c_frag_m * 64     ) * (BK + APAD) +  0], BK + APAD);
        wmma::load_matrix_sync(frag_a[0][1], &s_a[smem_sel * s_a_db_offset + (comp_c_frag_m * 64 + 16) * (BK + APAD) +  0], BK + APAD);
        // ... more Tensor Core load operations

        // matrix multiply operations
        #pragma unroll
        for (int i = 0; i < 4; i++) {
            #pragma unroll
            for (int j = 0; j < 4; j++) {
                wmma::mma_sync(frag_c[i][j], frag_a[0][i], frag_b[0][j], frag_c[i][j]);
                wmma::mma_sync(frag_c[i][j], frag_a[1][i], frag_b[1][j], frag_c[i][j]);
            }
        }

        asm ("cp.async.commit_group;\n" ::);
        asm ("cp.async.wait_group 0;\n" ::);
        __syncthreads();
    }

    // Process final buffer
    // ...
}

The main computation loop now includes buffer selection logic. The variables smem_sel and smem_sel_next toggle between 0 and 1 to select the appropriate buffer for the current and next iterations. While we're computing using the current buffer (smem_sel), we're loading data into the next buffer (smem_sel_next).

The implementation uses dynamic shared memory (extern __shared__ half smem[]) to accommodate twice the storage required for each matrix tile. The total shared memory consumption is 2 * (BM * (BK + APAD) + BK * (BN + BPAD)) * sizeof(half), roughly 53 KB for BM = 128, BN = 256, BK = 32, which exceeds the default 48 KB shared memory limit per block. To enable this larger allocation, the kernel launch requires cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 98304).


The buffer management logic employs a clever bit manipulation technique to toggle between the buffers. The expressions (bk & 1) ^ 1 and ((bk - 1) & 1) ^ 1 alternate between 0 and 1 on consecutive iterations, providing the offset multipliers for addressing the appropriate buffer. This technique avoids branching and ensures deterministic access patterns.

From a performance modeling perspective, double buffering transforms the execution time from a sequential model T_total = T_memory + T_compute to an overlapped model T_total = max(T_memory, T_compute) + T_setup, where T_setup represents the initial loading phase. This optimization is particularly effective when T_memory and T_compute are similar in magnitude, which is often the case in matrix multiplication operations at this scale.

Cache Locality

The next optimization aims to improve locality in the L2 cache by changing how thread blocks are scheduled. This modification is particularly significant for large matrix dimensions:

__global__ void hgemm_localcache(
    half * __restrict__ a, half * __restrict__ b, half * __restrict__ c,
    const int M, const int N, const int K) {
    // ... (most of the code is same as before)
    // int bx = blockIdx.x; // Original
    int bx = blockIdx.z * gridDim.x + blockIdx.x; // New version
    if (bx >= N / BN || by >= M / BM)
        return;
    // ... (rest of the kernel remains the same)
}

And when launching the kernel:

const int BM = 128, BN = 256, BK = 32;
dim3 blockDim(256);
int BX = (N + BN - 1) / BN;
int BY = (M + BM - 1) / BM;
const int NSPLIT = 4096;
int split_num = (N + NSPLIT - 1) / NSPLIT;
dim3 gridDim((BX + split_num - 1) / split_num, BY, split_num);
cudaFuncSetAttribute(hgemm_localcache, cudaFuncAttributeMaxDynamicSharedMemorySize, 98304);
unsigned int dsmem = 2 * (BM * (BK + 8) + BK * (BN + 8)) * sizeof(half);
hgemm_localcache<<<gridDim, blockDim, dsmem>>>(a, b, c, M, N, K);

This modification changes the order in which thread blocks are scheduled, promoting better locality for matrices A and B in the L2 cache. The L2 cache locality optimization addresses a fundamental scheduling challenge in massively parallel matrix multiplication. By default, CUDA schedules thread blocks in a grid by incrementing the x-dimension first, then y, and finally z. This pattern works well for matrix A's spatial locality but leads to poor locality for matrix B when processing large matrices.

Consider a large matrix multiplication where C(16384×16384) = A(16384×16384) × B(16384×16384). With BM=128 and BN=256, the output matrix C is divided into a 128×64 grid of tiles. Traditional scheduling would process all the tiles in the first row of C (accessing corresponding rows of A and columns of B), then move to the second row, and so on. While this maintains good spatial locality for A (we reuse the same rows), we rapidly traverse different columns of B, leading to cache thrashing.

The modified scheduling introduces a parameter NSPLIT that changes how tiles are processed. Instead of traversing all N columns before moving to the next row, we process only NSPLIT/BN columns for each row before moving to the next row. This creates a more balanced access pattern that improves the spatial and temporal locality for both input matrices.

The optimal value for NSPLIT depends on hardware characteristics, particularly L2 cache size and matrix dimensions. Through empirical testing, NSPLIT=4096 was found to provide the best performance, likely because it balances the working set size to fit optimally in the L2 cache while maintaining sufficient parallelism.

This optimization is particularly effective for large matrices (approaching or exceeding 16384×16384 dimensions) where L2 cache misses become a significant performance bottleneck. For smaller matrices, the benefit is less pronounced as the working set already fits well in the cache hierarchy.

Vectorized Memory Access

Next, we'll tackle two optimizations: transposing matrix A in shared memory to enable auto-vectorization of SMEM loads, and vectorizing all global memory accesses using explicit vector datatypes. First, let's look at the transposition of matrix A. By transposing A in shared memory, we can load data using vectorized SMEM loads (LDS.128 in SASS), which provides better memory access performance:

// Transpose A during the GMEM to SMEM transfer
float4 tmp = reinterpret_cast<float4 *>(&A[innerRowA * K + innerColA * 4])[0];
As[(innerColA * 4 + 0) * BM + innerRowA] = tmp.x;
As[(innerColA * 4 + 1) * BM + innerRowA] = tmp.y;
As[(innerColA * 4 + 2) * BM + innerRowA] = tmp.z;
As[(innerColA * 4 + 3) * BM + innerRowA] = tmp.w;

// B doesn't need to be transposed, just vectorize the access
reinterpret_cast<float4 *>(&Bs[innerRowB * BN + innerColB * 4])[0] = 
    reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];

Looking at the assembly, we see that loading A into the registers, which used to be a 32b LDS load, is now also a 128b LDS.128 load, just like it had already been for B. This gives us approximately a 3% performance improvement.

Next, we vectorize all loads and stores from/to global memory using vector datatypes, specifically float4, with the same reinterpret_cast pattern shown above.

This leads to the 32b global memory load instructions (LDG.E) being replaced with 128b counterparts (LDG.E.128). Similarly, store operations also get vectorized.

The reinterpret_cast is used to promise the compiler that the float* pointers are 128b aligned, which is a requirement for using LDG.E.128. This is more efficient than manually unrolling the accesses because the compiler doesn't know that the pointer is aligned and can't use 128b loads otherwise.

Warp-Level Computation

Our final optimization focuses on warptiling, which adds another level of hierarchy between blocktiling and threadtiling. While blocks and threads are explicit in CUDA, warps are an implicit hardware concept. A warp consists of 32 threads with consecutive thread IDs that execute in lockstep.

Warptiling explicitly organizes computation around the warp structure, aligning better with the GPU's execution model:

__shared__ half s_a[BM][BK + APAD];
__shared__ half s_b[BK][BN + BPAD];

// Accumulator fragments: 3×2 grid per warp; single A/B fragments are reused
using namespace nvcuda::wmma;
fragment<matrix_a, 16,16,16, half, row_major> frag_a;
fragment<matrix_b, 16,16,16, half, row_major> frag_b;
fragment<accumulator,16,16,16, half> frag_c[3][2];

// Zero out all accumulator fragments
#pragma unroll
for (int i = 0; i < 3; i++) {
    #pragma unroll
    for (int j = 0; j < 2; j++) {
        fill_fragment(frag_c[i][j], 0.0f);
    }
}

// Compute thread-specific load offsets (example values shown)
int load_a_smem_m = (tid % 16) * 4; // A: row offset in shared memory
int load_a_smem_k = (tid / 16) * 4; // A: col offset
int load_b_smem_k = (tid / 32) * 8; // B: row offset in shared mem
int load_b_smem_n = (tid % 32) * 4; // B: col offset
int load_a_gmem_m = by * BM + load_a_smem_m;
int load_b_gmem_n = bx * BN + load_b_smem_n;
// (compute addresses in global memory for A and B here...)
// For brevity, assume we compute load_a_gmem_addr and load_b_gmem_addr properly
for (int bk = 0; bk < K / BK; bk++) {
    // Load a 64×BK tile of A into shared memory (each thread 4 elements)
    if (load_a_gmem_m < M && load_a_smem_k < K) {
        // e.g., s_a[load_a_smem_k][load_a_smem_m + (bk*BK)] = A[load_a_gmem_addr];
    }
    // Similarly load B tile into s_b...
    __syncthreads(); 
    // Compute the 3×2 grid of WMMA multiplies
    #pragma unroll
    for (int i = 0; i < 3; i++) {
        load_matrix_sync(frag_a, &s_a[i * 16][0], SA_STRIDE);
        #pragma unroll
        for (int j = 0; j < 2; j++) {
            load_matrix_sync(frag_b, &s_b[0][j * 16], SB_STRIDE);
            mma_sync(frag_c[i][j], frag_a, frag_b, frag_c[i][j]);
        }
    }
    __syncthreads();
    // (swap buffers for double buffering, etc.)
}

Warptiling is beneficial because it aligns with the warp scheduling unit of the GPU, improving utilization of the warp schedulers. It also mitigates shared memory bank conflicts by organizing memory accesses along warp boundaries, improves register cache locality (especially on recent GPU architectures), and provides a better mapping to warp-level matrix operations in current and future Tensor Core instructions.

The warptiling implementation makes explicit all levels of parallelism:

  • Blocktiling: Different blocks can execute in parallel on different SMs

  • Warptiling: Different warps can execute in parallel on different warp schedulers

  • Threadtiling: Instructions can execute in parallel via instruction-level parallelism

To understand the exact benefits of our PTX optimizations, let's examine the assembly code for our key kernels. When profiling the naive implementation, we find that most executed instructions are memory loads:

ld.shared.f32 %f91, [%r8+3456];
ld.shared.f32 %f92, [%r7+108];
fma.rn.f32 %f93, %f92, %f91, %f90;

In our PTX-optimized kernels, the shared memory reads that feed the Tensor Cores compile into vectorized loads at the SASS level (the cp.async.ca.shared.global instruction itself lowers to asynchronous copy instructions):

LDS R26, [R35.X4+0x800] // a 32b load from As
LDS.128 R8, [R2]        // a 128b load from Bs
LDS.128 R12, [R2+0x20]
LDS R24, [R35.X4+0x900]

The vectorized shared memory access pattern is a key advantage of using PTX inline assembly, as it allows the compiler to generate more efficient SASS code. Similarly, our global memory access optimizations result in LDG.E.128 instructions that fully utilize memory bandwidth.

Conclusions

This is more of a basic overview of some low-level optimizations built on the inline PTX ISA. But using PTX doesn’t necessarily mean that your code will automatically be supercharged. The modern NVCC compiler is advanced enough that its generated SASS will likely be on par with handwritten PTX, unless you’re really good at writing it. The preceding HGEMM implementation didn’t really surpass cuBLAS in any way, but it was a good thought exercise.