Deepseek's Low Level Hardware Magic
I was doing Chinese mode before it was cool

Cover Illustration by onigiriice
There has been a lot of copium about Deepseek-R1 leapfrogging ChatGPT-o1 in benchmarks, with many accusing Deepseek of either lying about its capabilities or busting US export controls. Moreover, the whole panic is making people believe that NVIDIA no longer has a technical moat and that we will all be running Chinese GPUs soon.
Drawing from my unpublished article about crafting different SGEMM implementations, plus some light reading on Zhihu and CSDN (frankly, I was doing Chinese mode before it was cool), I want to quickly compile the ways Deepseek manages to carve efficiency gains out of last-gen NVIDIA hardware. This will mainly focus on hardware optimizations, not the model's architectural efficiencies.
American Hardware Restrictions
A lot has been said about American export restrictions on AI chips, but what do they actually restrict? When people talk about export restrictions, they usually mean Export Control Classification Number (ECCN) 3A090, introduced in the October 2022 BIS export control rules. This rule specifically restricts the export of datacenter chips that meet certain performance thresholds and features to Chinese and Russian entities. Skipping over the government techno-babble, ECCN 3A090 contains a few restrictions.
Total Processing Performance (TPP)
The primary metric for controlling AI chips is defined as TPP = 2 × MacTOPS × bit_length_operation, where MacTOPS measures multiply-accumulate operations per second and bit length refers to the numerical precision of operations (e.g., FP16, FP32).
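As a sanity check, the TPP formula is easy to apply to a published spec sheet. A minimal sketch, assuming the A100's published 312 TFLOPS dense FP16 figure and the commonly cited 4800 TPP control threshold (both figures are my additions, not from this article's sources):

```python
# A minimal sketch applying the TPP formula to a spec sheet. Assumptions:
# the A100's published 312 TFLOPS dense FP16 figure, and the commonly cited
# 4800 TPP control threshold.

def tpp(dense_tflops: float, bit_length: int) -> float:
    # a TFLOPS figure counts two operations (multiply + add) per MAC,
    # so MacTOPS = dense TFLOPS / 2, and TPP = 2 * MacTOPS * bit_length
    mac_tops = dense_tflops / 2
    return 2 * mac_tops * bit_length

a100_tpp = tpp(312, 16)
print(a100_tpp)   # 4992.0 -- just over the 4800 threshold, hence restricted
```

This is why the A100 and H100 are controlled while deliberately nerfed variants can slip under the thresholds.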
Performance Density
Performance density is calculated as TPP divided by die area in square millimeters. This metric prevents circumvention through die size manipulation and accounts for miniaturization and efficiency improvements.
Memory Bandwidth
Memory bandwidth controls specifically target implementations using High Bandwidth Memory (HBM) and similar advanced memory architectures. These restrictions focus on memory bandwidth density thresholds, which are particularly relevant for AI accelerators utilizing stacked memory configurations. This category recognizes that memory bandwidth is often a key performance bottleneck in AI computation.
Transfer Rate Controls
Transfer rate restrictions apply to chips with an aggregate bidirectional transfer rate of ≥600 GB/s across all inputs and outputs, excluding volatile memory. This threshold covers both actual and programmable capabilities, including interfaces like PCIe and NVLink. The controls apply regardless of whether the transfer rate is achieved through a single interface or multiple combined interfaces.
So Deepseek needs to work not only through limitations in raw chip processing power, but also through feature-set limitations specifically designed to prevent the aggregation of chips by capping inter-chip bandwidth and networking capabilities.
There is a lot of speculation (and even misinformation) about how Deepseek actually managed to squeeze performance out of the NVIDIA chips it still has, but much of the information is laid out plainly in the DeepSeek-V3 technical report, V3 being the base model for DeepSeek-R1.
Mixed Precision Training
Mixed precision training has been a popular way for Chinese LLM developers to work within chip restrictions. A notable implementation is Tencent's Hunyuan-Large model, which utilized mixed precision training with the bfloat16 format developed by Google Brain. bfloat16 (BF16) is a 16-bit variant of the conventional IEEE 754 single-precision floating-point format (FP32).
While maintaining the dynamic range of FP32, BF16 has a truncated significand (even shorter than FP16's), enabling both memory savings and accelerated computation. Papers have shown that mixed precision training can achieve up to 2.5x acceleration compared to full-precision FP32 training on high-performance GPU architectures such as the NVIDIA A100. But the key advancement in DeepSeek-V3 is making FP8 mixed precision work for large-scale model training, which has been notoriously unstable.
Their implementation is a fine-grained quantization strategy that works at both tile and block levels to extend the dynamic range of the FP8 format. For activations, they implement tile-wise grouping with 1 × Nc elements, while for weights they use block-wise grouping with Nc × Nc elements. This granular approach to quantization helps mitigate the impact of outliers by adapting the scale according to smaller groups of elements, rather than using a global scaling factor.
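To see why smaller groups help, here is an illustrative sketch (not DeepSeek's code), using a crude 255-level symmetric quantizer as a stand-in for FP8: a single outlier poisons a globally shared scale for every element, while per-128-element tiles confine the damage.

```python
# Illustrative sketch (not DeepSeek's kernel): why per-group scales tame
# outliers. A crude 255-level symmetric quantizer stands in for FP8 -- real
# E4M3 spacing is non-uniform, but a shared scale fails in the same way.

def quantize(values, scale, half=127):
    # map each value onto 'half' uniform steps per side of zero
    return [round(v / scale * half) / half * scale for v in values]

def mean_abs_error(values, group_size):
    total = 0.0
    for g in range(0, len(values), group_size):
        group = values[g:g + group_size]
        scale = max(abs(v) for v in group)  # per-group max-abs scaling factor
        total += sum(abs(v - q) for v, q in zip(group, quantize(group, scale)))
    return total / len(values)

acts = [0.01] * 255 + [100.0]              # one outlier among small activations

global_err = mean_abs_error(acts, group_size=256)  # single global scale
tile_err = mean_abs_error(acts, group_size=128)    # per-128-element "tile" scales
assert tile_err < global_err               # the outlier only poisons its own tile
```

With one global scale, every small activation quantizes to zero; with tile-wise scales, only the outlier's own tile suffers.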

The framework maintains most compute-dense operations in FP8, particularly the General Matrix Multiplication (GEMM) operations. These GEMM operations accept FP8 tensors as inputs and produce outputs in either BF16 or FP32 format. All three GEMMs associated with the Linear operator - forward pass (Fprop), activation backward pass (Dgrad), and weight backward pass (Wgrad) - are executed in FP8. This design theoretically doubles the computational speed compared to the original BF16 method.
DeepSeek recognizes that certain operators require higher precision due to their sensitivity to low-precision computations. They maintain the original precision (BF16 or FP32) for several components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators. This targeted retention of high precision ensures stable training dynamics.
To address the limited accumulation precision of FP8 GEMM on NVIDIA H800 GPUs (around 14 bits), DeepSeek implements a promotion to CUDA cores for higher precision. During Matrix Multiply-Accumulate (MMA) execution on Tensor Cores, intermediate results are accumulated using the limited bit width. Once an interval of Nc is reached, these partial results are copied to FP32 registers on CUDA cores, where full-precision FP32 accumulation is performed. Setting Nc = 128 elements, equivalent to 4 Warpgroup-level Matrix Multiply-Accumulates (WGMMAs), represents the minimal accumulation interval that significantly improves precision without substantial overhead.
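The effect of periodic promotion can be simulated in plain Python. This is a sketch under exaggerated assumptions: the narrow accumulator keeps only 10 mantissa bits (cruder than the H800's ~14, so stagnation appears at small K), and `round_mantissa` is my own stand-in for Tensor Core accumulation rounding.

```python
# Sketch of periodic FP32 promotion, under assumptions: the low-precision
# accumulator keeps only 10 mantissa bits (exaggerating the H800's ~14 so the
# effect shows at small K), and round_mantissa() is a stand-in for Tensor
# Core accumulation rounding.
import math

def round_mantissa(x: float, bits: int) -> float:
    # keep only 'bits' mantissa bits of x
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def accumulate(values, acc_bits=10, promote_every=None):
    total = 0.0     # wide FP32-register accumulator on the CUDA cores
    partial = 0.0   # narrow Tensor Core accumulator
    for i, v in enumerate(values, 1):
        partial = round_mantissa(partial + v, acc_bits)
        if promote_every and i % promote_every == 0:
            total += partial              # promote the partial sum, then reset
            partial = 0.0
    return total + partial

vals = [1.0] * 4096                       # a long inner dimension K
naive = accumulate(vals)                  # stagnates once the sum outgrows the mantissa
promoted = accumulate(vals, promote_every=128)  # promote every N_c = 128 elements
print(naive, promoted)                    # 1024.0 4096.0
```

Once the running sum outgrows the narrow mantissa, further additions round away entirely; promoting partial sums every 128 elements keeps each partial small enough to stay exact.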

In contrast to hybrid FP8 formats used in Micikevicius et al. (2022) (which use E4M3 in Fprop and E5M2 in Dgrad and Wgrad), DeepSeek adopts the E4M3 format universally. This is made possible by their fine-grained quantization strategy - by operating on smaller element groups, their methodology effectively shares exponent bits among grouped elements, mitigating the impact of limited dynamic range.
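The trade-off between the two formats comes down to dynamic range. A quick derivation of their maximum normal magnitudes (using the "fn" convention for E4M3, as on NVIDIA hardware, where the all-ones exponent field is still usable but the all-ones mantissa there encodes NaN):

```python
# Quick derivation of FP8 maximum normal magnitudes. E4M3 here is the "fn"
# flavor: the all-ones exponent field (1111 = 15) is usable, but combined
# with an all-ones mantissa it encodes NaN, so the largest mantissa is
# binary 1.110 = 1.75. E5M2 reserves its all-ones exponent for inf/NaN,
# leaving 11110 (= 30) as the top exponent field, with full mantissa 1.75.

e4m3_max = 1.75 * 2 ** (15 - 7)    # exponent field 1111 = 15, bias 7 -> 448.0
e5m2_max = 1.75 * 2 ** (30 - 15)   # exponent field 11110 = 30, bias 15 -> 57344.0
print(e4m3_max, e5m2_max)          # 448.0 57344.0
```

This 448 ceiling is why the quantization kernel below scales activations by max|x| / 448 before casting to E4M3.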
For calculating scale factors, DeepSeek employs online quantization rather than delayed quantization frameworks that maintain historical maximum absolute values. They calculate the maximum absolute value online for each 1x128 activation tile or 128x128 weight block, derive the scaling factor, and quantize to FP8 format immediately.
from typing import Tuple

import torch
import triton
import triton.language as tl


@triton.jit
def act_quant_kernel(x_ptr, y_ptr, s_ptr, BLOCK_SIZE: tl.constexpr):
    """
    Quantizes the input tensor `x_ptr` and stores the result in `y_ptr` and the scaling factor in `s_ptr`.

    Args:
        x_ptr (triton.Pointer): Pointer to the input tensor.
        y_ptr (triton.Pointer): Pointer to the output tensor where quantized values will be stored.
        s_ptr (triton.Pointer): Pointer to the output tensor where scaling factors will be stored.
        BLOCK_SIZE (tl.constexpr): The size of the block to be processed by each program instance.

    Returns:
        None
    """
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offs).to(tl.float32)
    s = tl.max(tl.abs(x)) / 448.  # 448 is the maximum representable value of FP8 E4M3
    y = x / s
    y = y.to(y_ptr.dtype.element_ty)
    tl.store(y_ptr + offs, y)
    tl.store(s_ptr + pid, s)


def act_quant(x: torch.Tensor, block_size: int = 128) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Quantizes the input tensor `x` using block-wise quantization.

    Args:
        x (torch.Tensor): The input tensor to be quantized. Must be contiguous and its last dimension size must be divisible by `block_size`.
        block_size (int, optional): The size of the blocks to be used for quantization. Default is 128.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: A tuple containing:
            - The quantized tensor with dtype `torch.float8_e4m3fn`.
            - A tensor of scaling factors with dtype `torch.float32`.
    """
    assert x.is_contiguous(), 'Input tensor must be contiguous'
    assert x.size(-1) % block_size == 0, f'Last dimension size must be divisible by block_size (block_size={block_size})'
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    s = x.new_empty(*x.size()[:-1], x.size(-1) // block_size, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']), )
    act_quant_kernel[grid](x, y, s, BLOCK_SIZE=block_size)
    return y, s
For optimizer states, they adopt BF16 instead of FP32 to track first and second moments in the AdamW optimizer. However, master weights and gradients are retained in FP32 to ensure numerical stability throughout training. Similarly, for activation caching during backward passes, inputs of the Linear after the attention operator use a custom E5M6 data format, while inputs of the SwiGLU operator in MoE are stored in FP8 with their fine-grained quantization method.
For communication in MoE operations, activations are quantized to FP8 before MoE up-projections, compatible with FP8 Fprop in MoE up-projections. A similar strategy applies to activation gradients before MoE down-projections. Forward and backward combine components are maintained in BF16 to preserve training precision in critical parts of the pipeline.
Bidirectional Pipeline Scheduling
Bidirectional pipeline parallelism can be traced back to the 2021 paper "Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines" by Torsten Hoefler and Shigang Li from ETH Zurich. The bidirectional pipeline with cross-arrangement can reduce the bubble rate but doubles the memory usage for weights.
Despite the efficiency gains, major parallel training libraries like Megatron-LM, DeepSpeed, and Colossal-AI haven't implemented it; they mostly stick to the simpler 1F1B (one forward, one backward) approach. Moreover, it was later superseded by other pipeline parallelism improvements, and in 2021-2022 few organizations were training models at such large scales.
Doubling model weights in memory was impractical when attention training wasn't bottlenecked by sequence length. Unlike today's common 8k+ token pretraining, activation memory was a smaller concern then. With limited GPU resources, increasing batch size for better throughput was preferable to doubling memory usage.
But unlike other pipeline parallel approaches, DualPipe employs a bidirectional pipeline scheduling strategy that feeds micro-batches simultaneously from both ends of the pipeline. This approach significantly reduces pipeline bubbles - periods where hardware goes unused. The algorithm divides each chunk into four primary components: attention, all-to-all dispatch, MLP, and all-to-all combine. For backward chunks, both attention and MLP are further split into two parts: backward for input and backward for weights. The boundaries of transformer blocks in these chunks are intentionally misaligned to enable optimal overlapping.

The pipeline bubbles in DualPipe are mathematically expressed as:
$$\left(\frac{PP}{2} - 1\right)(F\&B + B - 3W)$$
where PP represents pipeline parallel size, F&B denotes the execution time of two mutually overlapped forward and backward chunks, B represents the execution time of a full backward chunk, and W denotes the execution time of a "backward for weights" chunk.
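Plugging assumed chunk timings into the bubble formulas shows the gap versus vanilla 1F1B, whose bubble is (PP − 1)(F + B). The timings below are made up purely for illustration:

```python
# Plugging assumed timings into the pipeline bubble formulas. All numbers
# are made up for illustration; 1F1B's bubble (PP - 1)(F + B) is shown for
# comparison against DualPipe's (PP/2 - 1)(F&B + B - 3W).

PP = 8                    # pipeline parallel size
F, B, W = 1.0, 2.0, 1.0   # forward, full backward, backward-for-weights times
FB = 2.5                  # two mutually overlapped forward+backward chunks

bubble_1f1b = (PP - 1) * (F + B)
bubble_dualpipe = (PP / 2 - 1) * (FB + B - 3 * W)
print(bubble_1f1b, bubble_dualpipe)   # 21.0 4.5
```

Under these toy numbers, DualPipe's bubble is a fraction of 1F1B's, because only half the pipeline depth enters the formula and the weight-gradient work fills most of the remaining idle time.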
DualPipe's bidirectional pipeline scheduling feeds micro-batches simultaneously from both ends of the pipeline. In a system with 8 pipeline ranks and 20 micro-batches, the scheduling creates symmetrical batch processing patterns. The micro-batches in the reverse direction mirror those in the forward direction, with shared black borders indicating mutually overlapped computation and communication.

On memory, DualPipe stores PP + 1 micro-batches of activations, versus PP for traditional 1F1B scheduling. While DualPipe does maintain two copies of model parameters, this overhead is minimized in the context of large expert parallelism. Unlike other approaches like Chimera, DualPipe only requires pipeline stages and micro-batches to be divisible by 2, not requiring micro-batches to be divisible by pipeline stages.
The scheduling mechanism ensures that communication operations (dispatch and combine) for one micro-batch overlap with computation operations (attention and MLP) of another. This overlap extends to both all-to-all communication for expert parallelism and pipeline parallel communication. By maintaining computation-communication overlap at scale, DualPipe enables DeepSeek to employ fine-grained experts across nodes while effectively eliminating all-to-all communication overhead.
In the paper, Deepseek mentioned that
Although DualPipe requires keeping two copies of the model parameters, this does not significantly increase the memory consumption since we use a large EP size during training. Compared with Chimera (Li and Hoefler, 2021), DualPipe only requires that the pipeline stages and micro-batches be divisible by 2, without requiring micro-batches to be divisible by pipeline stages. In addition, for DualPipe, neither the bubbles nor activation memory will increase as the number of micro-batches grows.
DeepSeek's DualPipe innovates by combining Zero Bubble Pipeline Parallelism, which splits backward passes into separate input gradient and weight gradient computations, with Chimera's approach of using two parallel streams of computation. This combination allows more efficient scheduling as while one stream is doing forward computations, the other stream can simultaneously perform backward computations.
The real breakthrough comes with how this dual-stream approach handles the all-to-all communication needed for Mixture-of-Experts (MoE) models. When using just a single stream, all-to-all communication for forward and backward passes must happen sequentially. But with dual streams, DualPipe can perform forward pass communication in one stream at the same time as backward pass communication in the other stream. This overlapping of communication dramatically improves efficiency, as the GPU isn't left waiting for one communication phase to finish before starting another.
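A toy timing model (all numbers assumed) makes the benefit concrete: with a single stream, each chunk's all-to-all blocks its compute; with dual streams, one stream's all-to-all hides under the other stream's compute, so only the slower of the two phases paces the pipeline.

```python
# Toy timing model (all numbers assumed) of single- vs dual-stream MoE
# communication. Single stream: each chunk's all-to-all serializes with its
# compute. Dual stream: one stream's all-to-all hides under the other
# stream's compute, so only the slower phase paces the pipeline.

compute, comm = 3.0, 2.0   # assumed per-chunk compute and all-to-all times
chunks = 10

single_stream = chunks * (compute + comm)
dual_stream = chunks * max(compute, comm) + min(compute, comm)  # drain last phase
print(single_stream, dual_stream)   # 50.0 32.0
assert dual_stream < single_stream
```

The real scheduler is far more intricate, but the principle is the same: as long as communication is no slower than computation, it effectively disappears from the critical path.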
The Magic PTX Instruction
In Deepseek’s recent open source week, it released its implementation of its inter-GPU communication kernels tailored for MoE training and inference. In the paper, Deepseek mentions
Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
And it seems that Deepseek finally revealed the “customized PTX instruction” mentioned in the Deepseek-V3 paper, saying that it's an “out-of-doc” instruction.

But a later edit to the README corrected this, elaborating that it's a “behavior-out-of-doc” instruction, not an undocumented one.

This was later again edited to provide more clarity.


But what is PTX? For those familiar with LLVM, PTX shares similarities with LLVM’s Intermediate Representation (IR). While LLVM’s project scope has expanded beyond its original “virtual machine” naming, its core concept of IR remains analogous to PTX. IR acts as a bridge between frontend programming languages and backend machine code, simplifying support for new languages and hardware targets while enabling cross-platform optimizations. PTX serves as NVIDIA’s "CUDA IR," connecting high-level CUDA C++ code with low-level GPU SASS instructions. This abstraction allows NVIDIA to implement runtime optimizations via tools like NVRTC and generate device-agnostic code.
Though CUDA developers may not directly interact with PTX, it plays a critical role under the hood. When compiling CUDA code with NVCC, .ptx files are generated during the device code compilation phase. These files represent the optimized intermediate code before final translation to SASS (the GPU’s native instruction set).

Many people touted Deepseek’s use of PTX as game-changing, but using PTX doesn’t automatically supercharge your code. The modern NVCC compiler is advanced enough that the SASS it generates from plain CUDA will likely perform on par with handwritten PTX, unless you’re really good at writing it.
MoE models have unique communication patterns because they dynamically route tokens to different experts that may be distributed across multiple GPUs and nodes. This creates an intense all-to-all communication pattern that can become a performance bottleneck. This is further complicated by sanctions, where the interconnect speeds are severely nerfed in H800 cards.
| Specs | H100 SXM5 | H800 SXM5 |
| --- | --- | --- |
| FP64 | 34 TFLOPS | 1 TFLOPS |
| FP64 Tensor Core | 67 TFLOPS | 1 TFLOPS |
| FP32 | 67 TFLOPS | 67 TFLOPS |
| Memory Bandwidth | 3.35 TB/s | 3.35 TB/s |
| Interconnect Bandwidth | 900 GB/s | 400 GB/s |
| NVLink Links | 18 | 8 |
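Back-of-envelope arithmetic on the interconnect row shows what the cap costs. The payload size below is an assumption purely for illustration; only the ratio matters.

```python
# Back-of-envelope cost of the capped interconnect. The payload size is an
# assumption purely for illustration; only the ratio matters.

payload_gb = 8                     # hypothetical per-step all-to-all traffic
t_h100 = payload_gb / 900          # seconds at 900 GB/s
t_h800 = payload_gb / 400          # seconds at 400 GB/s
print(round(t_h800 / t_h100, 2))   # 2.25 -- same traffic takes 2.25x longer
```

Every byte of expert routing traffic takes 2.25x longer on the H800, which is exactly why hiding that traffic behind computation matters so much.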
The __ldg intrinsic in CUDA maps to the PTX instruction ld.global.nc, which loads data through the non-coherent texture cache path. This was originally introduced in the Kepler architecture to provide a higher bandwidth, read-only path for accessing global memory:
// Standard usage of __ldg
template <typename T>
__device__ __forceinline__ T read_with_ldg(const T* ptr) {
    return __ldg(ptr);
}
// Compiled to something like:
// ld.global.nc.f32 %f1, [%rd1];
The texture cache has historically been optimized for spatially local access patterns and offers higher bandwidth than the standard L1/L2 cache path. However, it comes with a crucial limitation: by design, it is non-coherent with global memory stores. The NVIDIA programming guide explicitly warns that the texture cache isn't kept coherent with respect to global memory writes within the same kernel execution.
When examining the actual kernel code in internode.cu, Deepseek needed to efficiently move data between GPUs in the form of hidden states and routing information.
// From the combine_token function, they need to read data sent by other GPUs
auto recv_fn = [&](int src_rdma_rank, int slot_idx, int hidden_int4_idx) -> int4 {
    // If using __ldg, this could be:
    return __ldg(reinterpret_cast<const int4*>(rdma_channel_data.recv_buffer(src_rdma_rank) +
                                               slot_idx * num_bytes_per_rdma_token) +
                 hidden_int4_idx);
};
DeepSeek's initial attempt with __ldg was problematic for their MoE communication patterns. The data being communicated between experts on different GPUs needs to be immediately visible, but the non-coherent nature of __ldg meant that some reads could return stale data, leading to incorrect computation results.
When the writer thread on GPU 1 executes global_buffer[my_idx] = data, it writes a value (42 in this example) to global memory. This write operation goes through the normal global memory access path, and the value is eventually updated in the device's global memory.
Meanwhile, on GPU 2, the reader thread tries to read this value using __ldg(&global_buffer[idx]). The crucial point is that __ldg accesses memory through the texture cache, which is optimized for read-only data patterns. By design, the texture cache is not kept coherent with global memory writes from other threads or GPUs.
Even though the value has been updated in global memory, the reader's texture cache may still contain a stale value (0 in this example) from before the write occurred. The NVIDIA programming model doesn't guarantee that the texture cache will be automatically refreshed to see updates made by other threads or GPUs without explicit synchronization.
CUDA uses a relaxed memory model where, without explicit synchronization, threads may see writes from other threads in an order different from what was executed. For most operations, developers use mechanisms like __syncthreads(), atomic operations, or memory fences to establish proper ordering.
For inter-GPU communication, which is DeepSeek's use case, memory ordering becomes even more complex. They use NVSHMEM (NVIDIA's implementation of OpenSHMEM) for communication.
// Example of NVSHMEM put operation in internode.cu
nvshmemx_int8_put_nbi_warp(rdma_channel_data.recv_buffer(rdma_rank) + rdma_slot_idx * num_bytes_per_rdma_token,
                           rdma_channel_data.send_buffer(dst_rdma_rank) + rdma_slot_idx * num_bytes_per_rdma_token,
                           num_bytes_per_rdma_token * num_chunked_tokens,
                           translate_dst_rdma_rank<kLowLatencyMode>(dst_rdma_rank, nvl_rank));
nvshmem_fence();
In this example, after this operation they need to read the data on the receiving GPU. If they use __ldg for these reads, they might get stale data from the texture cache even if the data has been properly transferred via NVSHMEM.
But DeepSeek discovered that using ld.global.nc.L1::no_allocate.L2::256B instead of the standard ld.global.nc instruction solves the coherency issue on Hopper architecture while maintaining performance benefits. Despite initially being called an “out-of-doc” instruction, this instruction is documented in NVIDIA's PTX ISA documentation.

The instruction's components:
ld.global: Loads data from the global memory space.
.nc: Uses a non-coherent cache for the load (typically the texture cache).
.L1::no_allocate: Prevents the loaded data from being allocated in the L1 cache.
.L2::256B: Prefetches 256 bytes of data into the L2 cache.
#ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
#define LD_NC_FUNC "ld.global.nc.L1::no_allocate.L2::256B"
#else
#define LD_NC_FUNC "ld.volatile.global"
#endif

template <>
__device__ __forceinline__ int ld_nc_global(const int *ptr) {
    int ret;
    asm volatile(LD_NC_FUNC ".s32 %0, [%1];" : "=r"(ret) : "l"(ptr));
    return ret;
}
The “undefined behavior” comes from using ld.global.nc to read volatile data. The .nc qualifier indicates that a non-coherent cache is used, which means the load operation might not see the most recent writes to the memory location. According to the PTX documentation, the texture cache (accessed via .nc) is designed for read-only data that doesn't change during a kernel's execution. Using it for volatile data (data that could be modified by other threads or processes) violates its intended use case.

Using .nc alters the memory coherence behavior. By design, the non-coherent cache does not maintain coherence with other memory accesses. Per the PTX memory model documentation, loads via .nc (which use the texture cache path) are not guaranteed to see updates made by normal global memory operations in a timely or consistent fashion. Within the same kernel execution, if one thread writes to a global memory address and another thread (or the same thread later) reads that address using ld.global.nc, the read might fetch a stale value from the non-coherent cache.

The underlying reason is that global L1 caches (and the texture cache) are not coherent with each other for global memory updates. Official documentation notes that global memory is coherent at the L2 level only; multiple SMs’ L1 caches are not kept coherent for global data. So if one SM writes to a location (bypassing or evicting from its L1) and another SM has that location cached in its texture/L1 cache, the second SM can read a stale cached value. The driver/hardware will invalidate such caches only between kernel launches, not within a single kernel.
But DeepSeek discovered that despite this violation, adding the .L1::no_allocate qualifier makes this work correctly on Hopper. Their hypothesis is that on Hopper, the non-coherent cache is unified with L1, and using .L1::no_allocate prevents any stale data from persisting in cache. By bypassing L1 cache, each load must fetch fresh data from either L2 cache or global memory.
This specialized instruction provides several significant performance advantages. The non-coherent cache typically offers higher bandwidth than the standard global memory cache path. The instruction also prefetches 256 bytes of data into the L2 cache, which significantly improves sequential access patterns. Furthermore, by avoiding L1 cache pollution for data that won't be reused, it preserves L1 cache capacity for other frequently accessed data.
Where Deepseek Thinks Hardware Should Go
When news of R1 went mainstream, investors went into panic mode. But the fact of the matter is, R1 represents Deepseek finding that NVIDIA hardware can still be pushed far further than what US and Chinese labs are currently pushing it. Yet they will run out of chips, and they will run out of clever tricks for their current hardware. Given the pace at which the Chinese semiconductor industry is moving, coupled with the pure chokehold that NVIDIA’s CUDA has on the AI/ML industry, without new hardware they’re gonna hit a wall soon.
This is easily found in their own paper (which I will assume in good faith all of you have read in full, not just the snippets from Twitter you put into your bookmarks to never revisit again).

But what is the “development of more advanced hardware” Deepseek is looking for?
Higher FP8 GEMM Accumulation Precision
DeepSeek-V3's groundbreaking achievement is the first successful implementation of FP8 training at extreme scale (671B parameters), achieving a relative loss error below 0.25% compared to BF16 baselines. But currently, FP8 GEMM operations on NVIDIA H800 GPUs limit accumulation precision to approximately 14 bits, significantly below FP32. This becomes particularly problematic with large inner dimensions (K), common in large-scale model training where batch size and model width are increased. Their testing shows GEMM operations with K=4096 can suffer maximum relative errors approaching 2% due to limited accumulation precision in Tensor Cores.
They propose future hardware should either support full-precision accumulation in Tensor Cores or implement an appropriate accumulation bit-width based on training and inference accuracy requirements, eliminating the need for frequent data movement between Tensor and CUDA cores.
Tile and Block-Wise Quantization
DeepSeek's tile and block-wise quantization strategy directly addresses the primary challenge of FP8 training: managing outliers in activations and weights that can destabilize training due to FP8's limited dynamic range. Their innovation applies different quantization strategies for activations versus weights. For activations, they implement 1x128 tile-wise grouping (per token per 128 channels), while weights use 128x128 block-wise grouping (per 128 input/output channels). This granular approach allows better accommodation of outliers by adapting scaling factors to smaller groups of elements.
A key aspect of their implementation is the introduction of per-group scaling factors along the inner dimension of GEMM operations. While this functionality isn't directly supported in standard FP8 GEMM, they combine it with their precise FP32 accumulation strategy for efficient implementation. Their approach aligns with emerging hardware trends, as NVIDIA's next-generation Blackwell GPUs will support microscaling formats with smaller quantization granularity.
Transposed GEMM Operations
DeepSeek's current architecture faces inefficiencies in matrix transposition operations during training. During forward pass, activations are quantized into 1x128 FP8 tiles and stored. The backward pass requires reading out these matrices, dequantizing them, transposing them, re-quantizing into 128x1 tiles, and storing in HBM. This multi-step process creates significant memory operation overhead.
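The round trip just described can be sketched in a few lines of plain Python (illustrative only, with a 127-level integer quantizer standing in for FP8):

```python
# Illustrative sketch (not DeepSeek's code) of the transpose round trip:
# forward stores row-wise quantized tiles, but backward needs column-wise
# tiles, forcing dequantize -> transpose -> requantize passes over memory.
# A 127-level integer quantizer stands in for FP8.

def quant_rows(mat):
    scales = [max(abs(v) for v in row) for row in mat]            # per-row scale
    q = [[round(v / s * 127) for v in row] for row, s in zip(mat, scales)]
    return q, scales

def dequant_rows(q, scales):
    return [[v * s / 127 for v in row] for row, s in zip(q, scales)]

def transpose(mat):
    return [list(col) for col in zip(*mat)]

act = [[0.5, -1.0, 2.0], [4.0, 0.25, -0.125]]

q, s = quant_rows(act)                        # pass 1: forward-pass storage
restored = dequant_rows(q, s)                 # pass 2: read back and dequantize
qt, st = quant_rows(transpose(restored))      # pass 3: transpose and requantize
# a hardware-side transposed read from shared memory would collapse these
# three memory passes into a single load before the MMA
```

Each pass touches the full tensor in HBM, which is exactly the overhead direct transposed reads would eliminate.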
They propose enabling direct transposed reads of matrices from shared memory before MMA operations for precisions required in both training and inference. When combined with their proposed fusion of FP8 format conversion and TMA access, this would eliminate the current need for multiple memory operations and re-quantization steps. The optimization would be particularly impactful for their mixed-precision training framework where multiple matrix transformations are required between forward and backward passes.
Dedicated Inter-GPU Link Co-Processors
The need for specialized co-processors to handle inter-GPU communications arises from their current requirement to dedicate 20 out of 132 Streaming Multiprocessors (SMs) solely for communication tasks in their distributed MoE architecture. These SMs manage a complex communication system spanning InfiniBand for inter-node and NVLink for intra-node communication. This system, while effective, represents a 15% reduction in available computing power that could otherwise be used for model training.

A specialized co-processor would unify these networking domains, similar to NVIDIA's Scalable Hierarchical Aggregation Protocol (SHARP), which serves as a network co-processor (currently found inside certain Mellanox InfiniBand switches, yes, network switches).
The co-processor could provide a unified interface for read, write, multicast, and reduce operations across the entire IB-NVLink-unified domain while maintaining near-zero all-to-all communication overhead. This would be particularly crucial for maintaining efficiency in their MoE architecture, where each token must be routed to up to 4 nodes without blocking subsequent token operations.
Conclusions
A lot has been said on both sides of this discussion: China on the rise, the West has fallen. But almost all of the methods Deepseek uses are built upon research done by Western teams, and trained on American technology. If Deepseek had used Huawei Ascend chips, I’d be singing a different tune.
I don’t understand why people thought this punctured NVIDIA’s dominance; PTX has public ISA documentation, and some of the optimization methods are not entirely alien. People think hardware and compilers are now so fast that doing low-level optimizations is no longer worth it, when we know this is not the case.
People think that Deepseek somehow discovered PTX and can now bypass “NVIDIA’s CUDA Monopoly,” when the PTX ISA is a domain-specific compiler IR connecting high-level CUDA C++ code with low-level GPU SASS instructions. These low-level optimizations are not transferable to AMD or even Huawei Ascend cards, as those run entirely different architectures. NVIDIA obscures its hardware's actual implementation details, even in absurd ways: technical diagrams in NVIDIA’s documents only aim to explain the general structure of the architecture and may not accurately depict the exact implementation details of the hardware.

This is genius on NVIDIA’s part, because it means firms either have to invest in low-level engineers who dig through these GPUs to fully utilize them, or buy more chips. In scenario one, firms sink precious time and engineering resources into hyper-optimizing their code for NVIDIA’s stack, which cements its vendor lock-in. In scenario two, NVIDIA gets more money from batch orders. NVIDIA wins either way.
Most of the people talking about Deepseek on Twitter only saw the benchmarks or small snippets of the paper and jizz themselves over the LaTeX math formulas, pretending they understand a single word of what’s being said. Deepseek did what American teams didn’t: they don’t leave papers in bookmarks.
The conclusion is, less bookmarking and more reading, please.





